- Okay, we're now about to talk about data races. A data race is when you've got at least two paths of execution, like two goroutines, accessing the same memory location at the same time, where one is doing a read and the other is at least doing a write. They both could be doing writes too, and that would still be a data race. And that is bad, okay? You cannot have two paths of execution mutating memory at the same time; we're gonna have data corruption. And this is where synchronization comes in. There's two things here, synchronization and orchestration. We're focusing on synchronization right now in this data race section. The best way to think about synchronization is if you went to Starbucks, okay, and you got in line 'cause you wanna get some coffee. Now you're in line, waiting for your turn to get up to the counter. Any time goroutines have to get in line, that is a synchronization issue. But once you get to the counter and you start talking to the person at the register, you now have an orchestration issue.
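To make that definition concrete, here is a minimal sketch of a data race in Go. The function name `race` and the counts are illustrative, not from the lecture: two goroutines do an unsynchronized read-modify-write on the same counter, so increments can be lost and the final value is unpredictable. Running it under `go run -race` would have the race detector flag it.

```go
package main

import (
	"fmt"
	"sync"
)

// race launches two goroutines that each perform n unsynchronized
// read-modify-write sequences against one shared counter. Because the
// three steps can interleave between goroutines, increments get lost.
func race(n int) int {
	var counter int
	var wg sync.WaitGroup
	for g := 0; g < 2; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				value := counter // read
				value++          // modify
				counter = value  // write
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	// Often prints something less than 2000: writes were lost.
	fmt.Println("final counter:", race(1000))
}
```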
We're having a conversation, we're exchanging money, there's data going back and forth. That is orchestration. So we have these two problems, synchronization and orchestration. And your biggest job is to know when it's a synchronization issue, when do we have to get in line, and when it's an orchestration issue, when is there a workflow going on. Now again, a data race is when you have two or more goroutines where one's doing a read and one is doing a write to the same memory location. That is really bad. We cannot have that. We also have some other really special things happening at the hardware level. We really have value semantics at the hardware level. We've gotta appreciate this when we start writing multi-threaded software, because it can really hurt our performance. Our caching systems, though they're helping us reduce the latency of main memory access, can actually leave us thrashing memory if we're not careful.
Now, I am praying that you went back and watched this entire video, that you didn't start here in concurrency. Because in the array section we really talk about cache lines and how these things work, and it's gonna be very important for you to watch that stuff before you come here. I'm gonna assume that you've seen all the array and cache line stuff as we get going. Now, let's talk about the cache coherency problem and how these value semantics can come in and hurt us. And then we're gonna talk about something even more interesting to me, which is false sharing, all around this same idea of the cache coherency problem. So let's start with our four-core processor. Here it is, and we've got core one, two, three, and four. And we know that every one of these cores has its own L1 and L2 cache, with a shared L3. Now, I said that the hardware has these value semantics because, as I told you, value semantics means that we're always operating on our own copy of the data. So imagine this.
I've got a global variable, in a sense; we call it counter. Starts at zero. And I decide to launch four goroutines, okay. Let's call them goroutine zero, goroutine one, goroutine two, and goroutine three. And these four goroutines are gonna run in parallel. We're gonna get them to run on their own P, on their own M, and therefore they're gonna execute against their own individual core. Now, if we want every goroutine to be reading, modifying, and writing that counter, we now have a synchronization issue. We've gotta make sure that only one goroutine reads, modifies, and writes at a time. This is where we're gonna need things like atomic instructions, which are at the hardware level, and mutexes, which are just above that, okay. I'm gonna show you soon how to leverage your atomics and your mutexes to make these read-modify-writes atomic, or to make multiple statements atomic in time. But let's get back down to the hardware for a second. I want you to remember something here.
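The two tools just named can be sketched like this. The function names and counts are illustrative: one version serializes the increment with a hardware-level atomic add from `sync/atomic`, the other wraps the read-modify-write in a `sync.Mutex` critical section. Either way, only one goroutine mutates the counter at a time, so four goroutines doing 1000 increments each reliably produce 4000.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAtomic serializes the increment with an atomic instruction.
func countAtomic(goroutines, iterations int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				atomic.AddInt64(&counter, 1) // one indivisible read-modify-write
			}
		}()
	}
	wg.Wait()
	return counter
}

// countMutex serializes the increment with a critical section.
func countMutex(goroutines, iterations int) int {
	var counter int
	var mu sync.Mutex
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				counter++ // read, modify, write inside the lock
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countAtomic(4, 1000)) // 4000
	fmt.Println(countMutex(4, 1000))  // 4000
}
```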
If every goroutine is going to take an opportunity to read, modify, and write the counter, then a copy of the counter has to be brought in to each core. There it is; that's our copy of the counter. The cache line that the counter is on, wherever that is, is gonna be brought in. Now this is where things get interesting. Let's say we're gonna use our atomic instructions, since it's a counter here, to do the synchronization. That's fine. That means that when goroutine zero decides to perform a read, modify, write to turn this from zero to one, then through the magic of hardware, the hardware's gonna say: okay G1, G2, and G3, if you plan on doing this, I'm gonna put you on hold, okay, because G0 on core one is doing its thing, and we're gonna make sure that that happens. But look at how we're also gonna end up thrashing memory here.
If we go ahead and increment this cache line from zero to one, then once we modify it, through the magic of hardware as well, through these snooping protocols, we're gonna be marking the other copies of this cache line dirty. Because we've now just updated this one, those results will probably go back out to main memory. But we've updated this, and this is now really the current, right state of what the counter is. The counter isn't zero anymore; the counter is one. But the other cores' copies of the cache line still have the representation of zero, and we've marked them dirty. So now when goroutine one gets a chance, right, through synchronization, to modify the counter, it's gonna identify that the cache line it has is dirty, it's not the truth, and it's gonna have to go out into main memory to go get a copy of the current one. And now this isn't gonna be dirty anymore, right? And then it's going to increment this from one to two, which is brilliant, right? And we're gonna have synchronization, too.
But once we do that, it's gonna mark that cache line dirty, and it's gonna mark this cache line dirty again. Whoa. Now remember, these increments are probably happening across nanoseconds of time. So even with the synchronization in place, which is still a cost, right, we're gonna be constantly thrashing through memory on every single increment, taking that 107 clock cycle latency hit to bring the cache line back in, because we have shared this value across all four cores. And if you've got a 36-core processor, this is gonna be even worse. Especially if you've got multiple processors in there, each with multiple cores. All of that communication that we were talking about happening internally also has to happen between the processors. This is very, very nasty. So be very careful about global variables and global counters in a multi-threaded situation, because we're not really just referencing this one value and updating it.
Remember, we've got value semantics at the hardware level to reduce the cost of accessing main memory, and therefore as we share data, right, accessing the same data across these cores, we've got some other problems. But there's also something here called false sharing, which is super interesting. False sharing occurs when you don't really have a synchronization problem, but you still have the cache coherency issue. Imagine we said: okay, I don't wanna share, I don't wanna increment the same global variable across my cores. Let's bring our processor back into play here. So here's our processor again, here it is: core one, core two, core three, core four; L1, L2, and our shared L3. And this time what we say is, even though we're gonna be running goroutine zero, goroutine one, goroutine two, and goroutine three across the cores, we're not gonna have them increment the same counter. What we're gonna do is have them each increment their own counter in an array: index zero, one, two, three. Let's just say we did it like that.
Now we don't have a synchronization issue anymore. The address for index zero is completely independent of the address for index one. So I don't need any atomic instructions or mutexes. When G0 wants to increment its counter from zero to one, it shouldn't affect what G1 wants to do, which is also go from zero to one. It won't necessarily affect what this one wants to do, zero to one, because they're all accessing their own independent memory location, and we don't have a data race. We don't have two goroutines trying to read and modify, right, the same memory location at the same time. However, remember our value semantics at the hardware level. Just because we don't have a synchronization issue doesn't mean that our cache coherency problem doesn't exist. This array is still going to fall on the same cache line, that 64-byte cache line. So even though the address of index zero is not the same as the address of index one, the cache line for the entire array is going to be duplicated across all of the cores.
And so even though we don't need an atomic instruction when G1 wants to do a read, modify, write against this, or when G0 wants to do a read, modify, write against that, even though I don't need the atomic instructions anymore, right, or the mutex, whichever one we choose, here's what's gonna happen: when G0 read-modify-writes index zero, it's still going to mark all of these cache lines dirty. In fact, every one of these increments will continue to mark these cache lines dirty. We're gonna have to go back out, then come back in. So we're still gonna be taking the thrashing of memory, because these counters are sitting on the same cache line (laughter). Multi-threaded software is complicated. Especially when you've got data that you're working with that is next to each other, even though you're not necessarily conflicting. That's the false sharing. You've got data access patterns against memory that is next to each other; even though the values are unique, they fall on the same cache line. Look, we essentially are still sharing the data.
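One common way out of this, sketched below with illustrative names, is to pad each counter so that neighbors land on different cache lines. The pad size assumes a 64-byte cache line, which is typical but not universal; each goroutine then owns its counter's whole line, so its increments don't invalidate anyone else's copy.

```go
package main

import (
	"fmt"
	"sync"
)

// paddedCounter occupies a full (assumed) 64-byte cache line:
// 8 bytes of value plus 56 bytes of padding.
type paddedCounter struct {
	value int64
	_     [56]byte // keep the neighbor's counter off this line
}

// countPadded gives each goroutine its own padded slot, so there is
// no data race and, with the padding, no false sharing either.
func countPadded(goroutines, iterations int) int64 {
	counters := make([]paddedCounter, goroutines)
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				counters[id].value++ // each goroutine owns its own line
			}
		}(g)
	}
	wg.Wait()
	var total int64
	for i := range counters {
		total += counters[i].value
	}
	return total
}

func main() {
	fmt.Println(countPadded(4, 1000)) // 4000
}
```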
So as we continue to learn about data races, and synchronization, and orchestration, and as we start moving into the tooling and do some more live coding, I'm gonna bring this stuff back. Because any time you have a global variable, you've gotta worry about synchronization, because if you've got multiple paths of execution, you can't have a read and a write happening at the same time. But we also have to worry about data access patterns, even if the data is unique but next to each other, because we don't wanna deal with false sharing, like this is, and we don't wanna deal with cache coherency problems where all we're doing is thrashing through memory because copies of that data are being read and modified, really, that's our problem, across all these cores. So we're gonna bring back the cache coherency and the false sharing issues as we continue in this class; I'm gonna bring them up.
Because again, when we're writing multi-threaded software, we've got to be able to see that type of linear performance growth as the number of cores increases. We wanna see that type of growth in our performance. If you start seeing the curve flatten out or fall off, you know, whatever that is, that means we're not being mechanically sympathetic. It could be our cache coherency issues, where we're thrashing through memory. So we're gonna keep this stuff in mind as we continue. And one of the first things we're gonna do is write a very simple data race, and look at ways that we can correct the code to make sure that we have no synchronization issues.