1 00:00:01,170 --> 00:00:02,890 - [Instructor] In this video, we're going to talk 2 00:00:02,890 --> 00:00:07,083 about simulation techniques with random number generation. 3 00:00:08,360 --> 00:00:11,960 As you work in data science and you start working 4 00:00:11,960 --> 00:00:15,450 in data in general, often one of the things that you want 5 00:00:15,450 --> 00:00:19,920 to do is simulate data and random number generation 6 00:00:19,920 --> 00:00:21,410 can be really handy for that 7 00:00:21,410 --> 00:00:24,710 and it can also be handy for game playing as well. 8 00:00:24,710 --> 00:00:27,960 For demonstration purposes here, we're going to simulate 9 00:00:27,960 --> 00:00:31,750 rolling a six-sided die to get started. 10 00:00:31,750 --> 00:00:34,340 And part of what we're going to want to talk 11 00:00:34,340 --> 00:00:38,223 about is a concept called reproducibility. 12 00:00:39,190 --> 00:00:42,420 Basically, how is it that we can ensure 13 00:00:42,420 --> 00:00:45,230 that somebody executing our code 14 00:00:45,230 --> 00:00:49,383 is going to be able to get the same results that we got. 15 00:00:50,670 --> 00:00:53,990 Let's go ahead and introduce random number generation. 16 00:00:53,990 --> 00:00:56,510 In order to do that, we're going to import 17 00:00:56,510 --> 00:00:59,030 a module called random. 18 00:00:59,030 --> 00:01:01,440 The random module as you might guess 19 00:01:01,440 --> 00:01:05,560 provides capabilities for random number generation. 20 00:01:05,560 --> 00:01:08,410 In order to use it, we need to import it first. 21 00:01:08,410 --> 00:01:10,450 This is similar to what we did earlier 22 00:01:10,450 --> 00:01:14,400 with the decimal module and the statistics module. 
23 00:01:14,400 --> 00:01:18,160 In this case, rather than importing a specific function 24 00:01:18,160 --> 00:01:21,880 from the random module, we simply import the module itself 25 00:01:21,880 --> 00:01:24,680 and the reason we chose that path is we're going to use 26 00:01:24,680 --> 00:01:27,993 a couple of different functions from that module. 27 00:01:29,250 --> 00:01:33,870 Let's go ahead now and try rolling a die a few times here. 28 00:01:33,870 --> 00:01:36,070 We're going to use a for loop to do that 29 00:01:36,070 --> 00:01:40,660 and we're going to say that we want to loop 10 times. 30 00:01:40,660 --> 00:01:43,490 We've done a number of for loops like this earlier 31 00:01:43,490 --> 00:01:46,650 and for each iteration of the loop, we're going to print out 32 00:01:46,650 --> 00:01:49,860 a randomly generated integer so we're going to use 33 00:01:49,860 --> 00:01:54,810 the random module's randrange function to do that. 34 00:01:54,810 --> 00:01:57,210 You might be able to guess that randrange 35 00:01:57,210 --> 00:01:58,650 is kind of like range. 36 00:01:58,650 --> 00:02:01,080 It produces a range of values, 37 00:02:01,080 --> 00:02:03,770 but in the case of random number generation, 38 00:02:03,770 --> 00:02:07,520 it produces one value in that range. 39 00:02:07,520 --> 00:02:11,260 If I say one and seven, what I'm going to get back 40 00:02:11,260 --> 00:02:14,470 every time I call randrange with these two arguments 41 00:02:14,470 --> 00:02:17,530 is a number in the range one through six. 42 00:02:17,530 --> 00:02:21,310 That is starting from one, up to but not including seven, 43 00:02:21,310 --> 00:02:24,834 and that's going to be an integer value that's returned. 44 00:02:24,834 --> 00:02:27,690 Let's go ahead and display these on one line 45 00:02:27,690 --> 00:02:29,940 like we've done a number of times. 46 00:02:29,940 --> 00:02:33,370 You can see here that we get a certain set of values.
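For readers following along without the video, the snippet being typed at this point probably looks something like the sketch below. It is reconstructed from the narration rather than copied from the screen, so the loop variable name `roll` and the use of `end=' '` to keep the output on one line are assumptions.

```python
import random

# Roll a six-sided die 10 times and show the faces on one line.
# randrange(1, 7) returns an integer from 1 up to, but not including, 7.
for roll in range(10):
    print(random.randrange(1, 7), end=' ')
print()  # finish the output line
```

Because the generator has not been seeded, each run of this loop will most likely print a different set of ten faces.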
47 00:02:33,370 --> 00:02:35,920 Not all of the values in the range were produced. 48 00:02:35,920 --> 00:02:39,360 I got a couple of sixes, a few fives, 49 00:02:39,360 --> 00:02:41,350 a few ones, and a three. 50 00:02:41,350 --> 00:02:44,650 Looks like I'm missing all the twos and fours at this point, 51 00:02:44,650 --> 00:02:49,120 but if I recall that snippet and execute it again, 52 00:02:49,120 --> 00:02:50,960 I'm going to get different results. 53 00:02:50,960 --> 00:02:53,000 This time we have a two and a four, 54 00:02:53,000 --> 00:02:54,770 actually we have several twos, 55 00:02:54,770 --> 00:02:58,170 and we got a little bit better distribution of values 56 00:02:58,170 --> 00:03:01,293 the second time we executed that statement. 57 00:03:02,420 --> 00:03:05,640 Each time I execute that loop, it's going to give me 58 00:03:05,640 --> 00:03:07,670 a new set of values. 59 00:03:07,670 --> 00:03:11,910 However, there is a capability for forcing the random 60 00:03:11,910 --> 00:03:16,180 number generator to give me the same sequence of values 61 00:03:16,180 --> 00:03:17,290 every single time, 62 00:03:17,290 --> 00:03:20,800 which would be important for reproducibility, 63 00:03:20,800 --> 00:03:24,290 and again, that is a key concept as you start working 64 00:03:24,290 --> 00:03:26,850 with data and you're doing studies and you want 65 00:03:26,850 --> 00:03:30,703 other people to be able to reproduce your results. 66 00:03:31,769 --> 00:03:34,710 There are a number of cases throughout later lessons 67 00:03:34,710 --> 00:03:39,270 where you're going to need to seed a random number generator 68 00:03:39,270 --> 00:03:42,630 to ensure that you get the same sequence of values each time 69 00:03:42,630 --> 00:03:45,163 and we'll talk more about that momentarily. 
70 00:03:48,238 --> 00:03:51,670 In this video, one of the things I want to do is show 71 00:03:51,670 --> 00:03:53,691 you the script, 72 00:03:53,691 --> 00:03:57,210 fig04_01.py 73 00:03:57,210 --> 00:03:59,500 which is going to take what we did here 74 00:03:59,500 --> 00:04:01,330 with random number generation 75 00:04:01,330 --> 00:04:06,110 and we're going to roll a die six million times. 76 00:04:06,110 --> 00:04:09,280 Now, if we roll a die six million times 77 00:04:09,280 --> 00:04:14,230 and it's a well-balanced die, we should get approximately 78 00:04:14,230 --> 00:04:16,180 one million of each face. 79 00:04:16,180 --> 00:04:19,400 Similarly, if it's a good random number generator, 80 00:04:19,400 --> 00:04:21,980 we should get approximately one million 81 00:04:21,980 --> 00:04:23,543 of each face as well. 82 00:04:25,260 --> 00:04:27,710 Previously, when we've executed scripts 83 00:04:27,710 --> 00:04:30,390 we did that from the command line 84 00:04:30,390 --> 00:04:33,890 not in IPython interactive mode, but it turns out you can 85 00:04:33,890 --> 00:04:37,290 run scripts from interactive mode as well by using 86 00:04:37,290 --> 00:04:38,630 the run command. 87 00:04:38,630 --> 00:04:40,580 Let's go ahead and take a look at that. 88 00:04:41,550 --> 00:04:45,160 If I execute the first of these scripts here, 89 00:04:45,160 --> 00:04:47,093 which is the key one for this example, 90 00:04:48,430 --> 00:04:50,460 you can see that it's hanging a little bit. 91 00:04:50,460 --> 00:04:53,513 It's actually calculating six million rolls at the moment. 92 00:04:54,630 --> 00:04:57,570 In a few moments, there we go, it'll display the results, 93 00:04:57,570 --> 00:04:59,550 so you see the six different faces 94 00:04:59,550 --> 00:05:01,950 and you see that each one of those faces 95 00:05:01,950 --> 00:05:05,480 occurred approximately one million times. 96 00:05:05,480 --> 00:05:06,633 Let's run it again.
97 00:05:08,250 --> 00:05:11,010 In a moment, the results will be displayed. 98 00:05:11,010 --> 00:05:14,100 You may notice if you're coming from another programming 99 00:05:14,100 --> 00:05:17,260 language that because Python is interpreted, 100 00:05:17,260 --> 00:05:20,140 it may not execute certain things as quickly 101 00:05:20,140 --> 00:05:22,810 as compiled languages do. 102 00:05:22,810 --> 00:05:25,350 But it did still come back relatively quickly 103 00:05:25,350 --> 00:05:27,910 and though we got different results this time, 104 00:05:27,910 --> 00:05:31,310 because it is random, we still got approximately 105 00:05:31,310 --> 00:05:33,113 one million of each face. 106 00:05:34,020 --> 00:05:36,310 That demonstrates how to run a script 107 00:05:36,310 --> 00:05:39,370 in the context of IPython. 108 00:05:39,370 --> 00:05:42,120 Let's switch over to the script code for a moment here. 109 00:05:43,560 --> 00:05:47,080 For this example, since we haven't gone in-depth 110 00:05:47,080 --> 00:05:50,730 into some things like lists yet, where we could potentially 111 00:05:50,730 --> 00:05:55,020 use a list to keep track of each die face, 112 00:05:55,020 --> 00:05:58,620 what we're going to do in this example is define 113 00:05:58,620 --> 00:06:02,960 six separate variables that represent the six different 114 00:06:02,960 --> 00:06:05,470 faces of the die and we'll keep track of how many times 115 00:06:05,470 --> 00:06:10,470 we roll each face by using an if elif else statement 116 00:06:10,520 --> 00:06:12,610 nested in a for loop. 117 00:06:12,610 --> 00:06:15,440 Here we have a for loop that's going to iterate 118 00:06:15,440 --> 00:06:16,950 six million times. 119 00:06:16,950 --> 00:06:20,560 Notice the use of underscores in a numeric literal here. 120 00:06:20,560 --> 00:06:23,310 This can be used to improve the readability 121 00:06:23,310 --> 00:06:25,100 of your literal values.
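As a small illustration of the underscore point, the sketch below shows that underscores in a numeric literal only group digits visually and do not change the value. The variable name `rolls` is made up for this example and is not taken from the script.

```python
# Underscores in a numeric literal are only visual separators
# (supported since Python 3.6); the value is unchanged.
rolls = 6_000_000
print(rolls)             # prints 6000000
print(rolls == 6000000)  # prints True
```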
122 00:06:25,100 --> 00:06:27,990 It appears that my editor is not understanding 123 00:06:27,990 --> 00:06:31,020 the fact that this is actually the number six million, 124 00:06:31,020 --> 00:06:34,120 so let me just show you that if I remove those for a moment, 125 00:06:34,120 --> 00:06:39,120 now it's colored all the digits correctly, in that case. 126 00:06:39,240 --> 00:06:41,680 That's a bug in this particular editor. 127 00:06:41,680 --> 00:06:44,620 But in any case, we're going to iterate six million times 128 00:06:44,620 --> 00:06:47,154 and each time through we're going to use that 129 00:06:47,154 --> 00:06:52,000 randrange call we just demonstrated to get one face value 130 00:06:52,000 --> 00:06:54,770 then we're going to check if the face 131 00:06:54,770 --> 00:06:58,790 is a one, two, three, four, five, or six. 132 00:06:58,790 --> 00:07:01,760 And for whichever one of those cases is true, 133 00:07:01,760 --> 00:07:04,533 we will add one to the appropriate counter. 134 00:07:05,670 --> 00:07:07,990 If you look down below, once we've iterated 135 00:07:07,990 --> 00:07:11,520 six million times we then have some print statements 136 00:07:11,520 --> 00:07:14,480 that are going to produce the table that you saw. 137 00:07:14,480 --> 00:07:17,000 First, we're displaying a formatted string that 138 00:07:17,000 --> 00:07:19,880 has the literal characters for the word face 139 00:07:19,880 --> 00:07:22,760 and then a placeholder in which we display the string 140 00:07:22,760 --> 00:07:27,760 frequency right aligned in a field of 13 characters. 141 00:07:29,290 --> 00:07:32,300 Right justified, right aligned in a field of 13 characters, 142 00:07:32,300 --> 00:07:35,110 which gives us the two column format. 143 00:07:35,110 --> 00:07:38,610 For the subsequent lines, we display the value one, 144 00:07:38,610 --> 00:07:40,420 right aligned in a field of four. 145 00:07:40,420 --> 00:07:43,100 Then the value two, right aligned in a field of four. 
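The script being walked through here might be sketched roughly as follows. This is not the actual fig04_01.py listing: the counter names `frequency1` through `frequency6` and the constant `ROLLS` are assumed for illustration, and the roll count is reduced from the six million used in the video so the sketch finishes quickly.

```python
import random

# Assumption: the video rolls 6_000_000 times; a smaller count is used here.
ROLLS = 600_000

# One counter per die face (hypothetical names).
frequency1 = 0
frequency2 = 0
frequency3 = 0
frequency4 = 0
frequency5 = 0
frequency6 = 0

for roll in range(ROLLS):
    face = random.randrange(1, 7)  # integer from 1 through 6

    # Add one to the counter that matches the face rolled.
    if face == 1:
        frequency1 += 1
    elif face == 2:
        frequency2 += 1
    elif face == 3:
        frequency3 += 1
    elif face == 4:
        frequency4 += 1
    elif face == 5:
        frequency5 += 1
    else:
        frequency6 += 1

# Header: the literal text Face, then Frequency right aligned in 13 characters.
print(f'Face{"Frequency":>13}')
# Each row: face right aligned in 4 characters, count right aligned in 13.
print(f'{1:>4}{frequency1:>13}')
print(f'{2:>4}{frequency2:>13}')
print(f'{3:>4}{frequency3:>13}')
print(f'{4:>4}{frequency4:>13}')
print(f'{5:>4}{frequency5:>13}')
print(f'{6:>4}{frequency6:>13}')
```

With a reasonable generator, each count should come out close to `ROLLS / 6`, and the six counts always add up to `ROLLS`.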
146 00:07:43,100 --> 00:07:45,050 For the numbers from one through six, 147 00:07:45,050 --> 00:07:49,170 and for each of the frequencies, we also display those 148 00:07:49,170 --> 00:07:52,610 such that they'll appear right aligned under frequency, 149 00:07:52,610 --> 00:07:56,140 so each of these is also displayed in a right aligned 150 00:07:56,140 --> 00:07:58,850 field of 13 characters. 151 00:07:58,850 --> 00:08:02,120 That's what produced the output that you saw 152 00:08:02,120 --> 00:08:05,713 in my interactive IPython session here. 153 00:08:06,770 --> 00:08:08,890 I did promise you that we would talk a little bit more 154 00:08:08,890 --> 00:08:11,810 about this reproducibility issue. 155 00:08:11,810 --> 00:08:14,840 It turns out that the random module has built 156 00:08:14,840 --> 00:08:18,260 into it the capability of doing 157 00:08:18,260 --> 00:08:21,660 what we call seeding the random number generator. 158 00:08:21,660 --> 00:08:24,390 There is a seed function which, when you call 159 00:08:24,390 --> 00:08:26,470 it with the same value, 160 00:08:26,470 --> 00:08:30,340 is going to restart the random number generation 161 00:08:30,340 --> 00:08:33,510 from that particular seed value. 162 00:08:33,510 --> 00:08:35,930 Basically, it's not true random numbers 163 00:08:35,930 --> 00:08:38,840 that you're getting but rather pseudo-random numbers 164 00:08:38,840 --> 00:08:41,780 that are produced by a complex calculation 165 00:08:41,780 --> 00:08:43,050 underneath the hood. 166 00:08:43,050 --> 00:08:46,070 That complex calculation by default starts 167 00:08:46,070 --> 00:08:48,270 with a seed based on the system clock 168 00:08:48,270 --> 00:08:50,170 in most implementations. 169 00:08:50,170 --> 00:08:54,860 But if you seed it with a fixed value, then you can get 170 00:08:54,860 --> 00:08:58,010 the same sequence of values each time. 171 00:08:58,010 --> 00:09:00,590 Let's demonstrate that capability.
172 00:09:00,590 --> 00:09:04,283 Let's call random.seed and we'll give it the value 32. 173 00:09:05,356 --> 00:09:10,113 Let's recall the for loop that we did up above. 174 00:09:11,230 --> 00:09:15,100 Anything that we do from this point forward 175 00:09:15,100 --> 00:09:18,260 is going to start with the sequence 176 00:09:18,260 --> 00:09:20,380 seeded at the number 32. 177 00:09:20,380 --> 00:09:22,690 If I execute this, I'll get 10 values 178 00:09:22,690 --> 00:09:25,010 and they'll be different from what we had up above 179 00:09:25,010 --> 00:09:28,310 because these started from a different seed point. 180 00:09:28,310 --> 00:09:31,240 If I execute the same loop again, 181 00:09:31,240 --> 00:09:32,790 I'm going to get different values 182 00:09:32,790 --> 00:09:37,060 because it's going to continue from wherever we left off 183 00:09:37,060 --> 00:09:39,690 in the preceding randrange call. 184 00:09:39,690 --> 00:09:43,160 So we do in fact get different values here down below, 185 00:09:43,160 --> 00:09:46,030 so you're not seeing the reproducibility yet. 186 00:09:46,030 --> 00:09:49,650 The way you see the reproducibility is by recalling 187 00:09:49,650 --> 00:09:53,910 that snippet that seeds the random number generator. 188 00:09:53,910 --> 00:09:57,210 Now, we're starting over from the same point we started 189 00:09:57,210 --> 00:09:59,610 at back in snippet number six. 190 00:09:59,610 --> 00:10:02,440 And now if I recall that for loop, 191 00:10:02,440 --> 00:10:06,230 you can see that the values produced here are identical 192 00:10:06,230 --> 00:10:10,410 to the values produced back in snippet number seven.
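The seeding demonstration above can be condensed into one runnable sketch. Collecting the rolls into lists (named `first`, `second` and `repeat` here, all hypothetical) is a convenience for comparing the sequences and is not how the video does it; the video simply reruns the printing loop.

```python
import random

random.seed(32)  # start the sequence from seed 32
first = [random.randrange(1, 7) for _ in range(10)]

# No reseeding: these rolls continue the sequence where it left off.
second = [random.randrange(1, 7) for _ in range(10)]

random.seed(32)  # reseed with the same value to restart the sequence
repeat = [random.randrange(1, 7) for _ in range(10)]

print(first)
print(second)
print(repeat)
print(first == repeat)  # prints True: same seed, same sequence
```

The exact faces printed may depend on the Python version, but within one run the first and third lists should always match.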
193 00:10:10,410 --> 00:10:13,840 If you want to produce 194 00:10:13,840 --> 00:10:18,710 random data for simulation purposes and you need the ability 195 00:10:18,710 --> 00:10:22,170 for somebody to get the same set of random 196 00:10:22,170 --> 00:10:24,420 values for testing purposes, 197 00:10:24,420 --> 00:10:27,220 you can seed the random number generator 198 00:10:27,220 --> 00:10:31,180 to a specific value, then produce all your random numbers, 199 00:10:31,180 --> 00:10:32,570 then somebody else coming along 200 00:10:32,570 --> 00:10:35,630 and running your same code with the same seed 201 00:10:35,630 --> 00:10:38,623 should get the exact same results.