1 00:00:01,070 --> 00:00:02,810 - [Instructor] In the preceding lesson's Intro 2 00:00:02,810 --> 00:00:05,720 to Data Science section, we began talking a little 3 00:00:05,720 --> 00:00:08,400 bit about basic descriptive statistics, 4 00:00:08,400 --> 00:00:10,880 and I mentioned the concept of getting 5 00:00:10,880 --> 00:00:12,580 to know your data. 6 00:00:12,580 --> 00:00:16,370 So a lot of the basic descriptive statistics 7 00:00:16,370 --> 00:00:18,550 are for that particular purpose. 8 00:00:18,550 --> 00:00:20,700 And another way to kind of get to know 9 00:00:20,700 --> 00:00:24,060 the data that you're working with is with measures 10 00:00:24,060 --> 00:00:26,270 of central tendency. 11 00:00:26,270 --> 00:00:29,580 Specifically, the mean, the median, and the mode. 12 00:00:29,580 --> 00:00:33,260 A measure of central tendency is some sort of individual 13 00:00:33,260 --> 00:00:37,320 value representing a central value 14 00:00:37,320 --> 00:00:39,410 in your set of values. 15 00:00:39,410 --> 00:00:42,320 Some way to think about the center point, 16 00:00:42,320 --> 00:00:44,920 or something that's in some way typical 17 00:00:44,920 --> 00:00:47,530 of the values in your data set. 18 00:00:47,530 --> 00:00:49,950 And as we work through these videos 19 00:00:49,950 --> 00:00:52,920 you're going to work with much larger data sets 20 00:00:52,920 --> 00:00:55,860 and you'll want to kind of explore the data 21 00:00:55,860 --> 00:00:58,570 and get a sense of the data before you actually 22 00:00:58,570 --> 00:01:00,180 work with it, so we'll start talking 23 00:01:00,180 --> 00:01:03,010 about that in some of the later videos. 24 00:01:03,010 --> 00:01:07,140 Now the three measures of central tendency 25 00:01:07,140 --> 00:01:11,060 that we're going to take a look at in this video 26 00:01:11,060 --> 00:01:12,680 are the mean, which is simply 27 00:01:12,680 --> 00:01:15,020 the average of a set of values.
28 00:01:15,020 --> 00:01:17,640 The median, which is the middle value 29 00:01:17,640 --> 00:01:19,850 in a set of values when you've already arranged 30 00:01:19,850 --> 00:01:24,110 those values into ascending order. 31 00:01:24,110 --> 00:01:28,310 And then finally, we're going to talk about mode. 32 00:01:28,310 --> 00:01:32,630 And the mode is simply the most frequently occurring value. 33 00:01:32,630 --> 00:01:35,140 Now there are some problems with mode, 34 00:01:35,140 --> 00:01:37,650 in that it could be that you have exactly 35 00:01:37,650 --> 00:01:40,480 the same number of each value, in which case 36 00:01:40,480 --> 00:01:43,200 there is no mode, or you could have multiple values 37 00:01:44,320 --> 00:01:46,940 that have the most frequently occurring number 38 00:01:46,940 --> 00:01:50,870 of instances, and in those cases the libraries 39 00:01:50,870 --> 00:01:53,930 that you use to calculate the mode 40 00:01:53,930 --> 00:01:56,790 will often simply give you an exception 41 00:01:56,790 --> 00:01:58,640 at execution time. 42 00:01:58,640 --> 00:02:02,000 So let me switch over to my Terminal window 43 00:02:02,000 --> 00:02:06,360 here and let's start up a new IPython session. 44 00:02:06,360 --> 00:02:09,720 And for the purpose of this mean, median, and mode 45 00:02:09,720 --> 00:02:14,200 discussion, let's go ahead and create a list of grades, 46 00:02:14,200 --> 00:02:16,410 and as you saw earlier in this lesson, 47 00:02:16,410 --> 00:02:20,130 a list is simply a square-bracket-delimited 48 00:02:20,130 --> 00:02:21,810 set of values. 49 00:02:21,810 --> 00:02:24,767 For this purpose we're going to use the five values 50 00:02:24,767 --> 00:02:29,767 85, 93, 45, 89, and 85. 51 00:02:31,530 --> 00:02:33,470 So we've got a five-element list. 52 00:02:33,470 --> 00:02:35,410 Now you'll notice, by the way, that one 53 00:02:35,410 --> 00:02:38,500 of those five elements is duplicated, 54 00:02:38,500 --> 00:02:41,251 85 appears twice in the list.
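As a rough sketch of the points above, here is the grades list together with a hand-rolled frequency count. The helper name `all_modes` and the two extra sample lists are assumptions made up for this illustration, not something from the video. It uses `collections.Counter` from the standard library to show the ordinary case (one mode), the tied case (several modes), and the case where every value occurs equally often.

```python
from collections import Counter

# The five grades from the video; 85 appears twice.
grades = [85, 93, 45, 89, 85]


def all_modes(values):
    """Return every value that shares the highest frequency (hypothetical helper)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, n in counts.items() if n == highest]


single_mode = all_modes(grades)               # one clear winner
tied_modes = all_modes([85, 85, 93, 93, 45])  # two values tie for most frequent
no_real_mode = all_modes([1, 2, 3])           # every value occurs once

print(single_mode)
print(tied_modes)
print(no_real_mode)
```

When every value comes back, as in the last call, no value is more typical than the others, which is the "no mode" situation described above.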
55 00:02:41,251 --> 00:02:44,020 So, we've got this grades list 56 00:02:44,020 --> 00:02:48,730 and before we get into pre-defined libraries 57 00:02:48,730 --> 00:02:53,690 that enable you to do things like mean, median, and mode, 58 00:02:53,690 --> 00:02:56,240 let's go ahead and do the mean on our own 59 00:02:56,240 --> 00:02:59,350 using some built-in functions that are part 60 00:02:59,350 --> 00:03:01,260 of Python itself. 61 00:03:01,260 --> 00:03:03,870 So I think I mentioned this earlier in this lesson, 62 00:03:03,870 --> 00:03:06,450 there's a sum function that can be used 63 00:03:06,450 --> 00:03:11,210 to total up the values in a sequence of values, 64 00:03:11,210 --> 00:03:14,540 and there's also a function that helps you determine 65 00:03:14,540 --> 00:03:17,050 the length of a sequence. 66 00:03:17,050 --> 00:03:19,510 And if I know the total number of grades 67 00:03:19,510 --> 00:03:21,320 and I know the total of those grades, 68 00:03:21,320 --> 00:03:25,130 I can use those values to calculate the average 69 00:03:25,130 --> 00:03:27,380 grade, which is the mean grade. 70 00:03:27,380 --> 00:03:30,370 So let's do that, we're gonna use the sum function. 71 00:03:30,370 --> 00:03:32,920 And the sum function is interesting because you can 72 00:03:32,920 --> 00:03:35,700 simply hand it an entire list of values 73 00:03:35,700 --> 00:03:38,610 no matter how many elements are in it, and it will 74 00:03:38,610 --> 00:03:41,870 give you back the total of the items 75 00:03:41,870 --> 00:03:44,170 in that list. 76 00:03:44,170 --> 00:03:46,810 And I want to divide that by the number 77 00:03:46,810 --> 00:03:49,160 of elements in the list, and it turns out 78 00:03:49,160 --> 00:03:52,700 that there is a len, for length, function 79 00:03:52,700 --> 00:03:56,010 built in to Python that can give you 80 00:03:56,010 --> 00:03:59,050 the total number of elements in the sequence 81 00:03:59,050 --> 00:04:00,960 you pass it as an argument.
82 00:04:00,960 --> 00:04:04,610 So in this case, the sequence is the grades list 83 00:04:04,610 --> 00:04:06,270 that we defined up above. 84 00:04:06,270 --> 00:04:08,660 So sum is going to calculate the total 85 00:04:08,660 --> 00:04:10,780 of these five values, and this is why 86 00:04:10,780 --> 00:04:13,190 I have not been declaring variables named 87 00:04:13,190 --> 00:04:15,500 sum all along. 88 00:04:15,500 --> 00:04:17,180 I've been calling them total. 89 00:04:17,180 --> 00:04:19,410 And then separately the len function 90 00:04:19,410 --> 00:04:21,360 will look at the list called grades, 91 00:04:21,360 --> 00:04:24,460 and every list knows how many elements it has, 92 00:04:24,460 --> 00:04:26,680 and the len function simply accesses 93 00:04:26,680 --> 00:04:29,420 that piece of data and gives it back. 94 00:04:29,420 --> 00:04:32,710 So len with grades as an argument is going to simply 95 00:04:32,710 --> 00:04:34,510 give me the value five. 96 00:04:34,510 --> 00:04:37,830 And when I press Enter we can see that the mean 97 00:04:37,830 --> 00:04:40,460 grade, or the average grade based on the numbers 98 00:04:40,460 --> 00:04:43,300 up above, is 79.4. 99 00:04:43,300 --> 00:04:46,840 Now this is actually a demonstration 100 00:04:46,840 --> 00:04:51,220 of a little bit of functional-style programming in Python. 101 00:04:51,220 --> 00:04:55,540 These two functions are examples of reductions. 102 00:04:55,540 --> 00:04:59,190 They look at a collection of values and give you back 103 00:04:59,190 --> 00:05:02,630 a single value that represents that collection 104 00:05:02,630 --> 00:05:03,620 in some way. 105 00:05:03,620 --> 00:05:06,280 In the first case, sum gives you the total 106 00:05:06,280 --> 00:05:10,160 of the items in the collection, and in the second case 107 00:05:10,160 --> 00:05:13,110 len just gives you the number of items, the count, 108 00:05:13,110 --> 00:05:15,470 if you will, of items in the collection.
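The calculation just described can be sketched as follows. The variable names `total`, `count`, and `mean_grade` are chosen here for illustration; the video types the expression directly at the IPython prompt.

```python
# The five grades from the video.
grades = [85, 93, 45, 89, 85]

# Two reductions: each turns the whole list into a single value.
total = sum(grades)   # 85 + 93 + 45 + 89 + 85
count = len(grades)   # number of elements in the list

# The mean is the total divided by the count.
mean_grade = total / count
print(mean_grade)  # 79.4
```

Naming the result `total` rather than `sum` keeps the built-in `sum` function from being shadowed, which is the reason the instructor gives for avoiding variables named sum.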
109 00:05:15,470 --> 00:05:18,460 And count was one of the descriptive statistics 110 00:05:18,460 --> 00:05:22,010 that we mentioned in the previous lesson's 111 00:05:22,010 --> 00:05:24,180 Intro to Data Science section. 112 00:05:24,180 --> 00:05:27,650 So len is basically the count statistic. 113 00:05:27,650 --> 00:05:31,270 Now of course, one thing that's nice 114 00:05:31,270 --> 00:05:34,090 about using these pre-defined functions 115 00:05:34,090 --> 00:05:37,560 is we don't have to define our own looping structure here, 116 00:05:37,560 --> 00:05:40,510 and by not defining our own looping structure 117 00:05:40,510 --> 00:05:44,050 we're avoiding the common errors associated 118 00:05:44,050 --> 00:05:46,550 with looping where we might do something wrong, 119 00:05:46,550 --> 00:05:49,270 in terms of how we implement the loop 120 00:05:49,270 --> 00:05:50,470 in the first place. 121 00:05:50,470 --> 00:05:54,400 Here we're using what's known as internal iteration. 122 00:05:54,400 --> 00:05:57,250 The sum function already knows how to walk 123 00:05:57,250 --> 00:05:59,980 its way through a sequence of numbers 124 00:05:59,980 --> 00:06:02,670 and calculate their total. We don't have to say 125 00:06:02,670 --> 00:06:05,320 how to do that, we just have to tell the function 126 00:06:05,320 --> 00:06:06,370 what we want. 127 00:06:06,370 --> 00:06:08,510 We want the total of all those grades, 128 00:06:08,510 --> 00:06:10,840 and it gives us back a result. 129 00:06:10,840 --> 00:06:12,980 So a lot of functional-style programming 130 00:06:12,980 --> 00:06:17,040 in Python is done by delegating to functions 131 00:06:17,040 --> 00:06:19,870 built in to the language, and delegating 132 00:06:19,870 --> 00:06:23,450 to functions or methods that are part 133 00:06:23,450 --> 00:06:26,100 of the standard library.
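To make the contrast concrete, here is a small sketch that totals the grades both ways. The explicit loop is written for this comparison only and does not appear in the video; `loop_total` and `builtin_total` are illustrative names.

```python
# The five grades from the video.
grades = [85, 93, 45, 89, 85]

# External iteration: we spell out how to walk the list and accumulate.
loop_total = 0
for grade in grades:
    loop_total += grade

# Internal iteration: we say what we want and sum() does the walking.
builtin_total = sum(grades)

print(loop_total, builtin_total)
```

Both approaches give the same total, but the loop version has more places to go wrong, such as forgetting to initialize the accumulator or accumulating the wrong variable.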
134 00:06:26,100 --> 00:06:29,380 Now along those lines I want to introduce 135 00:06:29,380 --> 00:06:32,980 the statistics library, which is one of many 136 00:06:32,980 --> 00:06:37,400 Python standard library modules that are available 137 00:06:37,400 --> 00:06:40,450 to you in your Python installation. 138 00:06:40,450 --> 00:06:42,930 And you can access its capabilities 139 00:06:42,930 --> 00:06:46,980 simply by importing the statistics module. 140 00:06:46,980 --> 00:06:50,490 Now previously when I did an import, 141 00:06:50,490 --> 00:06:55,490 I used the from module name import piece of the module 142 00:06:58,280 --> 00:07:00,350 format, so from import. 143 00:07:00,350 --> 00:07:03,930 Here I'm simply saying import the entire statistics module. 144 00:07:03,930 --> 00:07:07,490 Now when you do that, in order to access its contents, 145 00:07:07,490 --> 00:07:09,720 you must use the name of the module 146 00:07:09,720 --> 00:07:13,130 followed by a dot, and then the name 147 00:07:13,130 --> 00:07:15,920 of whatever you want to use within that module. 148 00:07:15,920 --> 00:07:19,870 And it turns out this module, among its capabilities, 149 00:07:19,870 --> 00:07:24,870 has functions named mean, median, and mode 150 00:07:25,040 --> 00:07:27,010 that we can use to produce the descriptive 151 00:07:27,010 --> 00:07:29,450 statistics we talked about at the beginning 152 00:07:29,450 --> 00:07:30,800 of this video. 153 00:07:30,800 --> 00:07:34,290 So if I go ahead and start to type statistics 154 00:07:34,290 --> 00:07:36,560 and I hit Tab, you can see there's 155 00:07:36,560 --> 00:07:40,510 a couple of things that start with stati, 156 00:07:40,510 --> 00:07:42,480 so I'm gonna select statistics, 157 00:07:42,480 --> 00:07:44,000 and then I'm gonna type a dot.
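The two import forms mentioned above can be sketched side by side. Importing `median` with the from-import form is only an illustration of the syntax here; the video does not say which name was imported that way earlier. The `dir()` call is a rough, non-interactive stand-in for pressing Tab after `statistics.` in IPython, and `public_names` is an illustrative variable name.

```python
# Form 1: import the whole module; its contents need the "statistics." prefix.
import statistics

# Form 2: import one name from the module; it can then be used unqualified.
from statistics import median

grades = [85, 93, 45, 89, 85]

qualified_result = statistics.mean(grades)  # module name, dot, function name
unqualified_result = median(grades)         # no prefix needed

# Names in the module that do not start with an underscore.
public_names = [name for name in dir(statistics) if not name.startswith('_')]
print(public_names)
```

The exact list of names printed depends on the Python version installed.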
158 00:07:44,000 --> 00:07:46,260 And just to show you there's a bunch of stuff 159 00:07:46,260 --> 00:07:48,240 defined in the statistics module, 160 00:07:48,240 --> 00:07:50,990 watch what happens when I press the Tab key here, 161 00:07:50,990 --> 00:07:55,420 it starts displaying a big list of all the different 162 00:07:55,420 --> 00:07:58,660 things that are part of the statistics module, 163 00:07:58,660 --> 00:08:02,880 and among them you can see here mean, median, and mode, 164 00:08:02,880 --> 00:08:04,630 and a lot of other things as well. 165 00:08:04,630 --> 00:08:08,150 Some of which we'll discuss in later examples. 166 00:08:08,150 --> 00:08:10,360 So let's say we wanna use the mean function, 167 00:08:10,360 --> 00:08:13,020 and again, like the sum function 168 00:08:13,020 --> 00:08:16,450 and the len function up above, mean knows 169 00:08:16,450 --> 00:08:19,150 how to walk its way through a collection of items; 170 00:08:19,150 --> 00:08:21,660 you simply have to tell it what you want 171 00:08:21,660 --> 00:08:23,980 to calculate the average of, in this case 172 00:08:23,980 --> 00:08:27,420 the list of grades, and when you execute it, it will 173 00:08:27,420 --> 00:08:30,410 go ahead and figure out not only how 174 00:08:30,410 --> 00:08:32,590 to total up those grades, but how many grades 175 00:08:32,590 --> 00:08:36,120 there were and produce the correct average, 176 00:08:36,120 --> 00:08:38,770 or mean, of those grades in this case. 177 00:08:38,770 --> 00:08:41,120 Let's go ahead and recall that and we'll change 178 00:08:41,120 --> 00:08:42,380 this to median. 179 00:08:42,380 --> 00:08:44,390 Now you may recall that the median 180 00:08:44,390 --> 00:08:47,620 depends on the grades being in sorted order 181 00:08:47,620 --> 00:08:49,630 to figure out the middle element. 182 00:08:49,630 --> 00:08:51,850 Well if you look at the grades up above here, 183 00:08:51,850 --> 00:08:54,260 they are not in sorted order.
184 00:08:54,260 --> 00:08:58,570 So the median function takes care of that for you. 185 00:08:58,570 --> 00:09:01,270 Then it goes and figures out the middle element, 186 00:09:01,270 --> 00:09:03,820 which in our case, because we have five elements, 187 00:09:03,820 --> 00:09:06,910 will be whatever appears in the middle position, 188 00:09:06,910 --> 00:09:08,640 it's an odd number of elements so we'll 189 00:09:08,640 --> 00:09:10,410 always have one middle. 190 00:09:10,410 --> 00:09:12,690 If there were an even number of elements 191 00:09:12,690 --> 00:09:16,570 the two middle elements would be averaged by default. 192 00:09:16,570 --> 00:09:18,080 So let's go ahead and get the median, 193 00:09:18,080 --> 00:09:20,690 and here we can see that the median is 85. 194 00:09:20,690 --> 00:09:23,390 So if we were to reorder these, we'd have 195 00:09:23,390 --> 00:09:28,390 45, 85, 85, 89, and 93, so the second of the 85 values 196 00:09:31,150 --> 00:09:32,620 would be in the middle position, 197 00:09:32,620 --> 00:09:35,550 and that's why we got a median of 85. 198 00:09:35,550 --> 00:09:38,253 And similarly, let's go ahead and do the mode. 199 00:09:41,170 --> 00:09:44,370 And you can see in this case, excuse me, 200 00:09:44,370 --> 00:09:48,800 that the mode also is 85 because there are two 85s 201 00:09:48,800 --> 00:09:52,210 and only one each of the other three values 202 00:09:52,210 --> 00:09:54,750 in our original list. 203 00:09:54,750 --> 00:09:58,530 So those are the basic descriptive statistics, 204 00:09:58,530 --> 00:10:00,320 mean, median, and mode.
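Pulling the three calls together, a minimal sketch of this part of the session might look like the following. The lists `even_grades` and `tied_grades` are extra samples added here to illustrate the even-length and tied cases; they are not part of the video. Note also that `statistics.multimode` and the "return the first mode encountered" behavior of `statistics.mode` were introduced in Python 3.8; in earlier versions `statistics.mode` raises `StatisticsError` when there is more than one mode, so this sketch assumes Python 3.8 or later.

```python
import statistics

# The five grades from the video.
grades = [85, 93, 45, 89, 85]

mean_grade = statistics.mean(grades)      # total of the grades divided by their count
median_grade = statistics.median(grades)  # middle value of the sorted grades
mode_grade = statistics.mode(grades)      # most frequently occurring grade

print(mean_grade, median_grade, mode_grade)

# Even number of elements: the two middle values (85 and 89) are averaged.
even_grades = [85, 93, 45, 89]
even_median = statistics.median(even_grades)
print(even_median)

# Tied data (Python 3.8+): multimode returns every most-frequent value.
tied_grades = [85, 85, 93, 93, 45]
tied_modes = statistics.multimode(tied_grades)
print(tied_modes)
```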
205 00:10:00,320 --> 00:10:01,980 And by the way, as long as we've talked 206 00:10:01,980 --> 00:10:03,890 about the fact that the median function 207 00:10:03,890 --> 00:10:07,070 is sorting the data for you, there is actually, 208 00:10:07,070 --> 00:10:08,770 as you might expect in Python, 209 00:10:08,770 --> 00:10:13,090 a way to sort the data built into the language, 210 00:10:13,090 --> 00:10:15,960 via the function called sorted. 211 00:10:15,960 --> 00:10:17,950 So let's go ahead and try that. 212 00:10:17,950 --> 00:10:20,840 We'll do sorted with grades as an argument. 213 00:10:20,840 --> 00:10:23,160 And what this is going to do, it won't modify 214 00:10:23,160 --> 00:10:26,880 the grades list, what it will do is make a copy 215 00:10:26,880 --> 00:10:28,530 of the list, but give it back 216 00:10:28,530 --> 00:10:31,870 to us in sorted, ascending order. 217 00:10:31,870 --> 00:10:35,980 So you can see this is the set of values sorted, 218 00:10:35,980 --> 00:10:38,380 and by the way now we can easily see 219 00:10:38,380 --> 00:10:42,520 what the median is, which is the middle element 220 00:10:42,520 --> 00:10:45,340 in the sorted set of lists. 221 00:10:45,340 --> 00:10:47,873 The sorted set of values, excuse me. 222 00:10:50,763 --> 00:10:52,250 Oh and by the way, just one last thing 223 00:10:52,250 --> 00:10:55,650 before I forget, if you go and evaluate grades 224 00:10:55,650 --> 00:10:57,630 you can see that the original list 225 00:10:57,630 --> 00:11:00,520 was left unchanged by the use of sorted 226 00:11:00,520 --> 00:11:01,903 back in snippet seven.
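A short sketch of that last step, assuming the same grades list. The names `ordered` and `middle` are illustrative, and picking the middle element by index only works as written for an odd number of elements.

```python
# The five grades from the video, in their original order.
grades = [85, 93, 45, 89, 85]

# sorted() builds a new list in ascending order and leaves grades alone.
ordered = sorted(grades)
print(ordered)  # [45, 85, 85, 89, 93]

# With an odd number of elements, the median is the element at the middle index.
middle = ordered[len(ordered) // 2]
print(middle)   # 85

# The original list is unchanged.
print(grades)   # [85, 93, 45, 89, 85]
```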