- One of the most important things you must do when benchmarking is validate that the results are accurate. I cannot stress this enough. Use your gut if you have to, but you can't just blindly trust what you're seeing. Remember, we were just talking about how your machine must be idle, and there are lots of reasons why your machine might not be idle. I'm on my MacBook. Even though I've got most things turned off, there are still background programs running and the network is still being used. Then, a lot of times, there are your CPUs: they may have thermal scaling, or do other things so that they don't run full throttle all the time. All these little things are going to matter when you're running benchmarks, and you've got to validate your results at all times.

There was an article I read one time that was using merge sort. Merge sort is about the worst sorting algorithm you could use in Go, because of the way it works: you take a list of, let's say, a million integers, you cut it in half, you cut each half in half, and you keep cutting every half in half until you get to the bottom. Then you do your compares.
You might swap, but after that you're merging: merging lists back together, merging and merging. It's really an allocation-hell type of algorithm, and one of those algorithms you don't want to be using in Go unless you're going to write some very, very clever code.

The author said, "We want to use merge sort. What I want to see is: if I do a merge sort over a million integers and I use one goroutine, how fast is that? What if, every time I split the list in half, I throw a different goroutine at each half, an unlimited number of goroutines depending on the number of things we're sorting; is that faster? Or what if I use just enough goroutines to match my cores; I have an eight-core machine, so never use more than eight goroutines. Is that the fastest?" The author wanted to know which was the fastest way to go, and the conclusion of the post was that, at least when it came to merge sort, only one goroutine was the fastest way to do the sorting.
Now, I love the messaging, because the message is that just because you throw more goroutines at a problem doesn't necessarily mean it's going to run faster, especially if you're not leveraging mechanical sympathy, some of the things that we talked about before. But it hurt my brain, because I could not fathom how one goroutine could be faster than eight goroutines on my machine. My brain just said there's no way. Some level of parallel processing for what is really going to be CPU-bound work had to be faster.

So I said, you know what, I don't get it, and I went ahead and wrote merge sort. Now, I'm not going to go through the merge sort code here; that's not important. What's important is that I have an implementation of merge sort, just like the author's, and we're going to start with that million integers. I wrote three benchmarks: Single, Unlimited, and NumCPU. One goroutine; an unlimited number of goroutines, a goroutine for every split; and then just as many goroutines as the number of cores that I have.
I reconstructed everything that the author did, and I'm going to go ahead and run this benchmark just like the author did in the article. I'm in the right place: go test, run no test functions, run all the benchmarks, and increase the bench time to three seconds. We're not looking at allocations, because that's going to be nasty; we're looking at CPU performance here.

There I am. We've got a single goroutine sorting a million integers. The benchmark reports nanoseconds per operation, so counting off groups of three digits, this run took 82 milliseconds, and we were only able to do it 50 times. 82 milliseconds on a single goroutine to sort those million integers. It was 635 milliseconds to do the same sort with an unlimited number of goroutines. Look: when I just kept throwing goroutines at the problem, it ran significantly slower. Then, when I gave it eight goroutines (the dash eight on the benchmark name tells us we've got eight cores), it actually took longer than just using one. This was the conclusion that the author had in the post as well.
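The invocation described above would look something like this. The exact flags are my reconstruction of what's said: `-run none` is a pattern that matches no test function names, so only benchmarks execute.

```shell
# Match no test functions, run every benchmark for at least
# three seconds each. No -benchmem: we only care about CPU
# time here, not the (nasty) allocation numbers.
go test -run none -bench . -benchtime 3s
```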
So I guess we should just walk away. My gut is wrong. I guess, truly, when it comes to merge sort, those allocations are causing so much pain that running it even in parallel is worse than running a single goroutine. OK, we're done. No, no, no, no, no. We're not done. I still don't believe it. This is what I'm talking about with validation: you must validate your results, and this was killing me. This had killed me for about two hours at that point, and I wouldn't let it go. I've done enough multi-threaded software development, I've written the algorithms, and I know I've got the mechanical sympathies in place, to the extent that I can. This was impossible for me, and I wasn't willing to let it go.

So then I did the following: let's just isolate running NumCPU by itself. Look at the result we get. Suddenly, we went from 95 milliseconds down to 55 milliseconds. Look, NumCPU is now faster, by roughly 30 milliseconds, significantly faster than a single goroutine, which is exactly what I expected to see. What happened here?
Why is NumCPU suddenly faster? Because we're going to go back to the idea that your machine must be idle when you're running these benchmarks. Think about it. We ran Single on my machine while it was as idle as it could be. Then we ran Unlimited on my machine while it was as idle as it could be. But when we started running NumCPU in that batch, the runtime and the machine were still busy cleaning up the mess we made with the Unlimited test. In other words, we had created so many goroutines that hadn't died yet that the scheduler and the runtime were still dealing with all of that when we started the NumCPU benchmark. Suddenly the cost of that cleanup was affecting our performance numbers for NumCPU. When I bring the machine back to an idle state and just run NumCPU, suddenly we get a much more accurate result.

I like the idea that throwing goroutines at a problem doesn't necessarily mean it runs any faster. But when we can run things in parallel efficiently, with mechanical sympathies, it really should be faster than just using a single goroutine on one core, and now we've really proven that.
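The isolated rerun described above amounts to narrowing the `-bench` pattern to a single benchmark. The benchmark name here is an assumption on my part, following the usual `Benchmark` prefix convention.

```shell
# Run only the NumCPU benchmark, on an otherwise idle machine,
# so the Unlimited test's leftover goroutines can't skew it.
go test -run none -bench NumCPU -benchtime 3s
```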
My point here is that we must always validate our results. We should have some kind of understanding, or be able to predict in some way, what we expect to see, and when we don't see it, we've got to make sure that we understand why, and that it's not us doing something wrong, but that the result is really accurate.