- Okay, we're now about to talk about data races. A data race is when you've got at least two paths of execution, like two goroutines, accessing the same memory location at the same time, where one is doing a read and the other is at least doing a write. They both could be doing writes too, and that would still be a data race. And that is bad, okay? You cannot have two paths of execution mutating memory at the same time; we're gonna have data corruption. And this is where synchronization comes in. There's two things here, synchronization and orchestration. We're focusing on synchronization right now in this data race section. The best way to think about synchronization is if you went to Starbucks, okay, and you got in line 'cause you wanna get some coffee. Now you're in line, waiting for your turn to get up to the counter. Any time goroutines have to get in line, that is a synchronization issue. But once you get to the counter and you start talking to the person at the register, you now have an orchestration issue.
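To make that definition concrete, here is a minimal sketch of a data race in Go. The function name `race` and the counts are illustrative, not from the lecture: two goroutines do an unsynchronized read-modify-write on the same counter, so increments can be lost and the final value is unpredictable. Running it under `go run -race` would have the race detector flag it.

```go
package main

import (
	"fmt"
	"sync"
)

// race launches two goroutines that each perform n unsynchronized
// read-modify-write sequences against one shared counter. Because the
// three steps can interleave between goroutines, increments get lost.
func race(n int) int {
	var counter int
	var wg sync.WaitGroup
	for g := 0; g < 2; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				value := counter // read
				value++          // modify
				counter = value  // write
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	// Often prints something less than 2000: writes were lost.
	fmt.Println("final counter:", race(1000))
}
```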
We're having a conversation, we're exchanging money, there's data going back and forth. That is orchestration. So we have these two problems, synchronization and orchestration. And your biggest job is to know when it's a synchronization issue, when do we have to get in line, and when it's an orchestration issue, when is there a workflow going on. Now again, a data race is when you have two or more goroutines where one's doing a read and one is doing a write to the same memory location. That is really bad. We cannot have that. We also have some other really special things happening at the hardware level. We really have value semantics at the hardware level. We've gotta appreciate this when we start writing multi-threaded software, because it can really hurt our performance. Our caching systems, though they're helping us reduce the latency of main memory access, can actually leave us thrashing memory if we're not careful.
Now, I am praying that you went back and watched this entire video, that you didn't start here in concurrency. Because in the array section we really talk about cache lines and how these things work, and it's gonna be very important for you to watch that stuff before you come here. I'm gonna assume that you've seen all the array and cache line stuff as we get going. Now, let's talk about the cache coherency problem and how these value semantics can come in and hurt us. And then we're gonna talk about something even more interesting to me, which is false sharing, all around this same idea of the cache coherency problem. So let's start with our four-core processor. Here it is, and we've got core one, two, three, and four. And we know that every one of these cores has its own L1 and L2 cache, with a shared L3. Now, I said that the hardware has these value semantics because, as I told you, value semantics means that we're always operating on our own copy of the data. So imagine this.
I've got a global variable, in a sense; we call it counter. Starts at zero. And I decide to launch four goroutines, okay. Let's call them goroutine zero, goroutine one, goroutine two, and goroutine three. And these four goroutines are gonna run in parallel. We're gonna get them to run on their own P, on their own M, and therefore they're gonna execute against their own individual core. Now, if we want every goroutine to be reading, modifying, and writing that counter, we now have a synchronization issue. We've gotta make sure that only one goroutine reads, modifies, and writes at a time. This is where we're gonna need things like atomic instructions, which are at the hardware level, and mutexes, which are just above that, okay. I'm gonna show you soon how to leverage your atomics and your mutexes to make these read-modify-writes atomic, or to make multiple statements atomic in time. But let's get back down to the hardware for a second. I want you to remember something here.
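The two tools just named can be sketched like this. The function names and counts are illustrative: one version serializes the increment with a hardware-level atomic add from `sync/atomic`, the other wraps the read-modify-write in a `sync.Mutex` critical section. Either way, only one goroutine mutates the counter at a time, so four goroutines doing 1000 increments each reliably produce 4000.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAtomic serializes the increment with an atomic instruction.
func countAtomic(goroutines, iterations int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				atomic.AddInt64(&counter, 1) // one indivisible read-modify-write
			}
		}()
	}
	wg.Wait()
	return counter
}

// countMutex serializes the increment with a critical section.
func countMutex(goroutines, iterations int) int {
	var counter int
	var mu sync.Mutex
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				counter++ // read, modify, write inside the lock
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countAtomic(4, 1000)) // 4000
	fmt.Println(countMutex(4, 1000))  // 4000
}
```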
If every goroutine is going to take an opportunity to read, modify, and write the counter, then a copy of the counter has to be brought in to each core. There it is; that's our copy of the counter. The cache line that the counter is on, wherever that is, is gonna be brought in. Now this is where things get interesting. Let's say we're gonna use our atomic instructions, since it's a counter here, to do the synchronization. That's fine. That means that when goroutine zero decides to perform a read, modify, write to turn this from zero to one, then through the magic of hardware, the hardware's gonna say: okay G1, G2, and G3, if you plan on doing this, I'm gonna put you on hold, okay, because G0 on core one is doing its thing, and we're gonna make sure that that happens. But look at how we're also gonna end up thrashing memory here.
If we go ahead and increment this cache line from zero to one, then once we modify it, through the magic of hardware as well, through these snooping protocols, we're gonna be marking the other copies of this cache line dirty. Because we've now just updated this one, those results will probably go back out to main memory. But we've updated this, and this is now really the current, right state of what the counter is. The counter isn't zero anymore; the counter is one. But the other cores' copies of the cache line still have the representation of zero, and we've marked them dirty. So now when goroutine one gets a chance, right, through synchronization, to modify the counter, it's gonna identify that the cache line it has is dirty, it's not the truth, and it's gonna have to go out into main memory to go get a copy of the current one. And now this isn't gonna be dirty anymore, right? And then it's going to increment this from one to two, which is brilliant, right? And we're gonna have synchronization, too.
But once we do that, it's gonna mark that cache line dirty, and it's gonna mark this cache line dirty again. Whoa. Now remember, these increments are probably happening across nanoseconds of time. So even with the synchronization in place, which is still a cost, right, we're gonna be constantly thrashing through memory on every single increment, taking that 107 clock cycle latency hit to bring the cache line back in, because we have shared this value across all four cores. And if you've got a 36-core processor, this is gonna be even worse. Especially if you've got multiple processors in there, each with multiple cores. All of that communication that we were talking about happening internally also has to happen between the processors. This is very, very nasty. So be very careful about global variables and global counters in a multi-threaded situation, because we're not really just referencing this one value and updating it.
Remember, we've got value semantics at the hardware level to reduce the cost of accessing main memory, and therefore as we share data, right, accessing the same data across these cores, we've got some other problems. But there's also something here called false sharing, which is super interesting. False sharing occurs when you don't really have a synchronization problem, but you still have the cache coherency issue. Imagine we said: okay, I don't wanna share, I don't wanna increment the same global variable across my cores. Let's bring our processor back into play here. So here's our processor again, here it is: core one, core two, core three, core four; L1, L2, and our shared L3. And this time what we say is, even though we're gonna be running goroutine zero, goroutine one, goroutine two, and goroutine three across the cores, we're not gonna have them increment the same counter. What we're gonna do is have them each increment their own counter in an array: index zero, one, two, three. Let's just say we did it like that.
Now we don't have a synchronization issue anymore. The address for index zero is completely independent of the address for index one. So I don't need any atomic instructions or mutexes. When G0 wants to increment its counter from zero to one, it shouldn't affect what G1 wants to do, which is also go from zero to one. It won't necessarily affect what this one wants to do, zero to one, because they're all accessing their own independent memory location, and we don't have a data race. We don't have two goroutines trying to read and modify, right, the same memory location at the same time. However, remember our value semantics at the hardware level. Just because we don't have a synchronization issue doesn't mean that our cache coherency problem doesn't exist. This array is still going to fall on the same cache line, that 64-byte cache line. So even though the address of index zero is not the same as the address of index one, the cache line for the entire array is going to be duplicated across all of the cores.
And so even though we don't need an atomic instruction when G1 wants to do a read, modify, write against this, or when G0 wants to do a read, modify, write against that, even though I don't need the atomic instructions anymore, right, or the mutex, whichever one we choose, here's what's gonna happen: when G0 read-modify-writes index zero, it's still going to mark all of these cache lines dirty. In fact, every one of these increments will continue to mark these cache lines dirty. We're gonna have to go back out, then come back in. So we're still gonna be taking the thrashing of memory, because these counters are sitting on the same cache line (laughter). Multi-threaded software is complicated. Especially when you've got data that you're working with that is next to each other, even though you're not necessarily conflicting. That's the false sharing. You've got data access patterns against memory that is next to each other; even though the values are unique, they fall on the same cache line. Look, we essentially are still sharing the data.
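One common way out of this, sketched below with illustrative names, is to pad each counter so that neighbors land on different cache lines. The pad size assumes a 64-byte cache line, which is typical but not universal; each goroutine then owns its counter's whole line, so its increments don't invalidate anyone else's copy.

```go
package main

import (
	"fmt"
	"sync"
)

// paddedCounter occupies a full (assumed) 64-byte cache line:
// 8 bytes of value plus 56 bytes of padding.
type paddedCounter struct {
	value int64
	_     [56]byte // keep the neighbor's counter off this line
}

// countPadded gives each goroutine its own padded slot, so there is
// no data race and, with the padding, no false sharing either.
func countPadded(goroutines, iterations int) int64 {
	counters := make([]paddedCounter, goroutines)
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				counters[id].value++ // each goroutine owns its own line
			}
		}(g)
	}
	wg.Wait()
	var total int64
	for i := range counters {
		total += counters[i].value
	}
	return total
}

func main() {
	fmt.Println(countPadded(4, 1000)) // 4000
}
```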
So as we continue to learn about data races, and synchronization, and orchestration, and as we start moving into the tooling and do some more live coding, I'm gonna bring this stuff back. Because any time you have a global variable, you've gotta worry about synchronization, because if you've got multiple paths of execution, you can't have a read and a write happening at the same time. But we also have to worry about data access patterns, even if the data is unique but next to each other, because we don't wanna deal with false sharing, like this is, and we don't wanna deal with cache coherency problems where all we're doing is thrashing through memory because copies of that data are being read and modified, really, that's our problem, across all these cores. So we're gonna bring back the cache coherency and the false sharing issues as we continue in this class; I'm gonna bring them up.
Because again, when we're writing multi-threaded software, we've got to be able to see that type of linear performance growth as the number of cores increases. We wanna see that type of growth in our performance. If you start seeing the curve flatten out or fall off, you know, whatever that is, that means we're not being mechanically sympathetic. It could be our cache coherency issues, where we're thrashing through memory. So we're gonna keep this stuff in mind as we continue. And one of the first things we're gonna do is write a very simple data race, and look at ways that we can correct the code to make sure that we have no synchronization issues.