- One of the most important things you must do when benchmarking is validate that the results are accurate. I cannot stress this enough. Use your gut if you have to, but you can't just blindly trust what you're seeing. Remember, we were just talking about how your machine must be idle, and there are lots of reasons why your machine might not be idle. I'm on my MacBook. Even though I've got most things turned off, there are still background programs running and the network is still being used. Then, a lot of times, there are your CPUs: they may have thermal scaling, or do other things so that they don't run full throttle all the time. All these little things are going to matter when you're running benchmarks, and you've got to validate your results at all times.

There was an article I read one time that was using merge sort. Merge sort is about the worst sorting algorithm you could use in Go, because of the way it works: you take a list of, let's say, a million integers, you cut it in half, you cut each half in half, and you keep cutting every half in half until you get to the bottom. Then you do your compares.
You might swap, but after that you're merging: merging lists back together, merging and merging. It's really an allocation-hell type of algorithm, and one of those algorithms you don't want to be using in Go unless you're going to write some very, very clever code.

The author said, "We want to use merge sort. What I want to see is: if I do a merge sort over a million integers and I use one goroutine, how fast is that? What if, every time I split the list in half, I throw a different goroutine at each half, an unlimited number of goroutines depending on the number of things we're sorting; is that faster? Or what if I use just enough goroutines to match my cores; I have an eight-core machine, so never use more than eight goroutines. Is that the fastest?" The author wanted to know which was the fastest way to go, and the conclusion of the post was that, at least when it came to merge sort, only one goroutine was the fastest way to do the sorting.
Now, I love the messaging, because the message is that just because you throw more goroutines at a problem doesn't necessarily mean it's going to run faster, especially if you're not leveraging mechanical sympathy, some of the things that we talked about before. But it hurt my brain, because I could not fathom how one goroutine could be faster than eight goroutines on my machine. My brain just said there's no way. Some level of parallel processing for what is really going to be CPU-bound work had to be faster.

So I said, you know what, I don't get it, and I went ahead and wrote merge sort. Now, I'm not going to go through the merge sort code here; that's not important. What's important is that I have an implementation of merge sort, just like the author's, and we're going to start with that million integers. I wrote three benchmarks: Single, Unlimited, and NumCPU. One goroutine; an unlimited number of goroutines, a goroutine for every split; and then just as many goroutines as the number of cores that I have.
I reconstructed everything that the author did, and I'm going to go ahead and run this benchmark just like the author did in the article. I'm in the right place: go test, run no test functions, run all the benchmarks, and increase the bench time to three seconds. We're not looking at allocations, because that's going to be nasty; we're looking at CPU performance here.

There I am. We've got a single goroutine sorting a million integers. The benchmark reports nanoseconds per operation, so counting off groups of three digits, this run took 82 milliseconds, and we were only able to do it 50 times. 82 milliseconds on a single goroutine to sort those million integers. It was 635 milliseconds to do the same sort with an unlimited number of goroutines. Look: when I just kept throwing goroutines at the problem, it ran significantly slower. Then, when I gave it eight goroutines (the dash eight on the benchmark name tells us we've got eight cores), it actually took longer than just using one. This was the conclusion that the author had in the post as well.
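The invocation described above would look something like this. The exact flags are my reconstruction of what's said: `-run none` is a pattern that matches no test function names, so only benchmarks execute.

```shell
# Match no test functions, run every benchmark for at least
# three seconds each. No -benchmem: we only care about CPU
# time here, not the (nasty) allocation numbers.
go test -run none -bench . -benchtime 3s
```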
So I guess we should just walk away. My gut is wrong. I guess, truly, when it comes to merge sort, those allocations are causing so much pain that running it even in parallel is worse than running a single goroutine. OK, we're done. No, no, no, no, no. We're not done. I still don't believe it. This is what I'm talking about with validation: you must validate your results, and this was killing me. This had killed me for about two hours at that point, and I wouldn't let it go. I've done enough multi-threaded software development, I've written the algorithms, and I know I've got the mechanical sympathies in place, to the extent that I can. This was impossible for me, and I wasn't willing to let it go.

So then I did the following: let's just isolate running NumCPU by itself. Look at the result we get. Suddenly, we went from 95 milliseconds down to 55 milliseconds. Look, NumCPU is now faster, by roughly 30 milliseconds, significantly faster than a single goroutine, which is exactly what I expected to see. What happened here?
Why is NumCPU suddenly faster? Because we're going to go back to the idea that your machine must be idle when you're running these benchmarks. Think about it. We ran Single on my machine while it was as idle as it could be. Then we ran Unlimited on my machine while it was as idle as it could be. But when we started running NumCPU in that batch, the runtime and the machine were still busy cleaning up the mess we made with the Unlimited test. In other words, we had created so many goroutines that hadn't died yet that the scheduler and the runtime were still dealing with all of that when we started the NumCPU benchmark. Suddenly the cost of that cleanup was affecting our performance numbers for NumCPU. When I bring the machine back to an idle state and just run NumCPU, suddenly we get a much more accurate result.

I like the idea that throwing goroutines at a problem doesn't necessarily mean it runs any faster. But when we can run things in parallel efficiently, with mechanical sympathies, it really should be faster than just using a single goroutine on one core, and now we've really proven that.
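The isolated rerun described above amounts to narrowing the `-bench` pattern to a single benchmark. The benchmark name here is an assumption on my part, following the usual `Benchmark` prefix convention.

```shell
# Run only the NumCPU benchmark, on an otherwise idle machine,
# so the Unlimited test's leftover goroutines can't skew it.
go test -run none -bench NumCPU -benchtime 3s
```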
My point here is that we must always validate our results. We should have some kind of understanding, or be able to predict in some way, what we expect to see, and when we don't see it, we've got to make sure that we understand why, and that it's not us doing something wrong, but that the result is really accurate.