1 00:00:00,360 --> 00:00:02,570 - [Narrator] In this intro to Data Science video 2 00:00:02,570 --> 00:00:04,730 we're going to continue talking about 3 00:00:04,730 --> 00:00:07,231 some of the basic descriptive statistics 4 00:00:07,231 --> 00:00:10,035 that can help you get to know the data 5 00:00:10,035 --> 00:00:12,250 that you're working with. 6 00:00:12,250 --> 00:00:15,670 Now, previously we looked at measures of central tendency 7 00:00:15,670 --> 00:00:17,779 specifically the mean, median, and mode 8 00:00:17,779 --> 00:00:22,779 and those helped us categorize typical values in a group. 9 00:00:23,820 --> 00:00:25,930 So for example, if you're trying to figure out 10 00:00:25,930 --> 00:00:28,580 the average height of a bunch of classmates 11 00:00:28,580 --> 00:00:32,250 that would be the mean height of those classmates. 12 00:00:32,250 --> 00:00:37,250 And similarly, if you're determining what car brand 13 00:00:37,490 --> 00:00:40,690 is most frequently purchased in a given country, 14 00:00:40,690 --> 00:00:44,210 we would have to find out all the different car brands 15 00:00:44,210 --> 00:00:45,580 that are sold in that country, 16 00:00:45,580 --> 00:00:47,830 we would have to calculate the totals for each one, 17 00:00:47,830 --> 00:00:49,940 and the most frequently purchased one 18 00:00:49,940 --> 00:00:52,550 would be the mode for that country. 19 00:00:52,550 --> 00:00:56,400 Now, the group of data items that we're working with 20 00:00:56,400 --> 00:01:01,360 is known as a population, and if you work with 21 00:01:01,360 --> 00:01:05,950 a subset of that population, that is known as a sample. 22 00:01:05,950 --> 00:01:08,720 So for the purpose of our example, 23 00:01:08,720 --> 00:01:12,380 we'll be using the term population 24 00:01:12,380 --> 00:01:14,810 for a very small population of data. 
25 00:01:14,810 --> 00:01:16,990 But as you get into big data 26 00:01:16,990 --> 00:01:18,972 and data analytics applications, 27 00:01:18,972 --> 00:01:22,700 your population of data may be billions 28 00:01:22,700 --> 00:01:24,820 or trillions of elements and you may 29 00:01:24,820 --> 00:01:27,038 randomly select a subset of those 30 00:01:27,038 --> 00:01:32,038 and process a sample for analytics purposes. 31 00:01:32,150 --> 00:01:33,510 Now, in this video, we're going to 32 00:01:33,510 --> 00:01:36,300 focus on measures of dispersion, 33 00:01:36,300 --> 00:01:38,900 which are also called measures of variability, 34 00:01:38,900 --> 00:01:41,510 and basically they try to help you 35 00:01:41,510 --> 00:01:45,300 understand how spread out your values are. 36 00:01:45,300 --> 00:01:47,390 Now, you may recall, in an earlier lesson, 37 00:01:47,390 --> 00:01:50,190 we talked about the range of values. 38 00:01:50,190 --> 00:01:52,453 And of course, there's a problem with the range 39 00:01:52,453 --> 00:01:55,945 which is you don't get a sense of how distributed 40 00:01:55,945 --> 00:01:58,880 the values are throughout that range. 41 00:01:58,880 --> 00:02:02,269 And you could have a narrow range of values, 42 00:02:02,269 --> 00:02:05,548 where all the values are at one end of the spectrum, 43 00:02:05,548 --> 00:02:07,690 or you could have a wide range of values, 44 00:02:07,690 --> 00:02:10,010 where all the values are at one end of the spectrum 45 00:02:10,010 --> 00:02:12,290 and just a couple are at the other. 46 00:02:12,290 --> 00:02:16,000 They may not be evenly distributed throughout the range. 47 00:02:16,000 --> 00:02:19,210 So, measures of dispersion give us a better reflection 48 00:02:19,210 --> 00:02:22,760 of how the data is spread out. 49 00:02:22,760 --> 00:02:25,330 Now, we're going to be talking about two measures 50 00:02:25,330 --> 00:02:28,980 of dispersion: the variance and the standard deviation. 
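The limitation of the range described above can be sketched in Python. This is only an illustration: both lists are hypothetical data invented for this note, and `value_range` is a helper name introduced here. It uses `statistics.pstdev`, the population standard deviation covered later in this video, to show that two data sets with the same range can be spread out very differently.

```python
import statistics

# Hypothetical data sets, invented for illustration only.
clustered = [1, 1, 2, 2, 2, 3, 3, 3, 4, 10]  # most values sit at the low end
even = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # values spread across the range


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


range_clustered = value_range(clustered)
range_even = value_range(even)

# A measure of dispersion distinguishes the two data sets; the range does not.
spread_clustered = statistics.pstdev(clustered)
spread_even = statistics.pstdev(even)

print(range_clustered, range_even)    # both ranges are identical
print(spread_clustered, spread_even)  # the spreads differ
```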
51 00:02:28,980 --> 00:02:31,040 And we're going to talk about variance first, 52 00:02:31,040 --> 00:02:33,160 because the standard deviation is simply 53 00:02:33,160 --> 00:02:35,180 the square root of the variance. 54 00:02:35,180 --> 00:02:36,450 So, once we have the variance, 55 00:02:36,450 --> 00:02:39,920 we can calculate standard deviation very quickly. 56 00:02:39,920 --> 00:02:41,530 For the purpose of this discussion, 57 00:02:41,530 --> 00:02:45,330 we're going to use these 10 six-sided die rolls. 58 00:02:45,330 --> 00:02:49,550 And let's talk about how you calculate the variance. 59 00:02:49,550 --> 00:02:50,960 So, to determine the variance, 60 00:02:50,960 --> 00:02:53,940 you start out by summing up these values 61 00:02:53,940 --> 00:02:55,855 and calculating their mean, their average. 62 00:02:55,855 --> 00:02:59,700 And if you add up these values and divide by 10, 63 00:02:59,700 --> 00:03:04,006 you'll see that the average of those values is in fact 3.5. 64 00:03:04,006 --> 00:03:06,590 The next step in determining the variance 65 00:03:06,590 --> 00:03:10,250 is to subtract that average from each individual value, 66 00:03:10,250 --> 00:03:12,840 which is also going to give you 10 values, 67 00:03:12,840 --> 00:03:14,570 some of which will be negative 68 00:03:14,570 --> 00:03:16,510 and some of which will be positive. 69 00:03:16,510 --> 00:03:19,203 So, for example, one minus 3.5 gives you -2.5. 70 00:03:20,864 --> 00:03:23,447 Three minus 3.5 gives you -0.5. 71 00:03:24,640 --> 00:03:29,560 Four minus 3.5 gives you +0.5, etc. 72 00:03:29,560 --> 00:03:31,100 Now, once you have those differences, 73 00:03:31,100 --> 00:03:33,620 you then square those differences. 74 00:03:33,620 --> 00:03:36,700 And squaring the differences has the effect 75 00:03:36,700 --> 00:03:40,920 of emphasizing outliers in your data sets. 76 00:03:40,920 --> 00:03:42,927 We'll talk about that again in a second here. 
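The first steps described above (mean, differences, squared differences) might look like this in Python. The transcript never lists the ten rolls shown on the slide, so the `rolls` list below is an assumption: a set of ten die values chosen because it reproduces the quoted mean of 3.5 and the first three differences mentioned in the narration.

```python
# Assumed data: the slide's actual rolls are not given in the transcript.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Step 1: sum the values and divide by how many there are.
mean = sum(rolls) / len(rolls)

# Step 2: subtract the mean from each value (some results are negative).
differences = [value - mean for value in rolls]

# Step 3: square each difference, which makes every result non-negative.
squared_differences = [difference ** 2 for difference in differences]

print(mean)
print(differences)
print(squared_differences)
```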
77 00:03:42,927 --> 00:03:44,650 So, we square all the values, 78 00:03:44,650 --> 00:03:46,630 which gives us positive results, 79 00:03:46,630 --> 00:03:48,820 and then we sum all those values 80 00:03:48,820 --> 00:03:50,920 and calculate the average, the mean, 81 00:03:50,920 --> 00:03:55,610 to get what's known as the population variance. 82 00:03:55,610 --> 00:03:59,250 So, as a result of the original set of values 83 00:03:59,250 --> 00:04:03,963 that we have here, the variance in those values is 2.25. 84 00:04:05,220 --> 00:04:08,840 Now, when you square those differences, 85 00:04:08,840 --> 00:04:11,450 again, that emphasizes the outliers, 86 00:04:11,450 --> 00:04:14,580 those are the values that are the furthest from 87 00:04:14,580 --> 00:04:19,426 the original average of all of the data in the population. 88 00:04:19,426 --> 00:04:24,426 So, when you start getting into data analytics applications, 89 00:04:24,960 --> 00:04:28,260 sometimes you want to focus on outliers. 90 00:04:28,260 --> 00:04:31,361 So, for example, if you are a company 91 00:04:31,361 --> 00:04:34,790 that's trying to watch for credit card fraud, 92 00:04:34,790 --> 00:04:38,340 outliers would be strange transactions, 93 00:04:38,340 --> 00:04:42,320 and you may want to focus on those in a scenario like that. 94 00:04:42,320 --> 00:04:44,485 In other data analytics applications, 95 00:04:44,485 --> 00:04:47,262 an outlier, you may simply want to get rid of 96 00:04:47,262 --> 00:04:51,530 because it's just not a case that is going to occur 97 00:04:51,530 --> 00:04:56,330 and it really has no overall effect on the analytical data 98 00:04:56,330 --> 00:04:59,090 that you want to obtain from your study. 
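As a rough sketch, the whole population variance calculation can be written as one small function. The name `population_variance` and the `rolls` list are assumptions made for this note (the transcript does not list the rolls); the list was chosen because it yields the quoted mean of 3.5 and variance of 2.25. The last few lines show how much of the total the values farthest from the mean contribute once the differences are squared.

```python
def population_variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    mean = sum(values) / len(values)
    squared_differences = [(value - mean) ** 2 for value in values]
    return sum(squared_differences) / len(values)


# Assumed data: the slide's actual rolls are not given in the transcript.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]
variance = population_variance(rolls)
print(variance)

# Share of the total squared difference that comes from the rolls
# farthest from the mean (here, the 1 and the 6).
mean = sum(rolls) / len(rolls)
squared = [(value - mean) ** 2 for value in rolls]
largest = max(squared)
outlier_share = sum(s for s in squared if s == largest) / sum(squared)
print(outlier_share)
```

In this assumed data set, two of the ten rolls account for more than half of the summed squared differences, which is the emphasis on far-away values that the narration describes.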
99 00:04:59,090 --> 00:05:01,360 So, it varies depending on 100 00:05:01,360 --> 00:05:03,800 what you want to do with the data, 101 00:05:03,800 --> 00:05:07,440 but in terms of calculating the variance, 102 00:05:07,440 --> 00:05:12,440 squaring emphasizes those outliers so that you can see 103 00:05:12,440 --> 00:05:16,620 how far they are from the mean in your data set. 104 00:05:16,620 --> 00:05:18,800 So, let's take a moment to switch over 105 00:05:18,800 --> 00:05:23,800 to an IPython terminal where I've already executed some code 106 00:05:24,148 --> 00:05:27,317 that calculates the population variance 107 00:05:27,317 --> 00:05:31,560 for the specific set of values we were just discussing. 108 00:05:31,560 --> 00:05:35,388 So, as you can see here, I imported the statistics module. 109 00:05:35,388 --> 00:05:39,825 It has a population variance function called pvariance 110 00:05:39,825 --> 00:05:42,998 in the module, so statistics.pvariance. 111 00:05:42,998 --> 00:05:47,570 It takes as its argument a sequence of numeric values, 112 00:05:47,570 --> 00:05:50,300 in our case, it happens to be a sequence of integers, 113 00:05:50,300 --> 00:05:53,160 and you can see that indeed, we got the same result 114 00:05:53,160 --> 00:05:57,048 that we discussed in the slide that we just came from. 115 00:05:57,048 --> 00:06:01,700 Now, like I mentioned, the standard deviation 116 00:06:01,700 --> 00:06:02,900 is very straightforward. 117 00:06:02,900 --> 00:06:06,730 It is simply the square root of the variance, 118 00:06:06,730 --> 00:06:10,217 and taking the square root of the variance actually 119 00:06:10,217 --> 00:06:15,217 tones down the effect of the outliers in your data set. 120 00:06:15,710 --> 00:06:19,700 So, I will talk about why standard deviation 121 00:06:19,700 --> 00:06:22,430 might be a better measurement in just a moment.
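The IPython session described above can be approximated as follows. `statistics.pvariance` is the standard library function named in the narration; the `rolls` list is an assumption, since the transcript does not show the values typed into the terminal. It was chosen because it reproduces the quoted result.

```python
import statistics

# Assumed data: the values used on screen are not given in the transcript.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# pvariance takes a sequence of numeric values and returns the
# population variance (it divides by n, the size of the population).
variance = statistics.pvariance(rolls)
print(variance)  # 2.25
```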
122 00:06:22,430 --> 00:06:26,118 So, in terms of calculating the standard deviation, 123 00:06:26,118 --> 00:06:29,049 we can do that a couple of different ways. 124 00:06:29,049 --> 00:06:33,500 We can use the statistics module's function 125 00:06:33,500 --> 00:06:38,500 for population standard deviation, which is pstdev. 126 00:06:40,340 --> 00:06:42,718 P for population, 127 00:06:42,718 --> 00:06:46,218 st for standard, and dev for deviation. 128 00:06:47,068 --> 00:06:51,130 And if I execute that, you can see that the result is 1.5, 129 00:06:51,130 --> 00:06:53,746 and to confirm that, indeed, 130 00:06:53,746 --> 00:06:57,909 this is the square root of the value up above, 131 00:06:57,909 --> 00:07:01,012 we can actually use the math module to help us with that. 132 00:07:01,012 --> 00:07:03,089 So, let's import the math module. 133 00:07:03,089 --> 00:07:05,006 Of course, you probably can tell that 1.5 134 00:07:05,006 --> 00:07:07,000 is the square root of 2.25, 135 00:07:07,000 --> 00:07:10,381 or that 2.25 is the square of 1.5. 136 00:07:10,381 --> 00:07:14,760 But remember the math module has a square root function. 137 00:07:14,760 --> 00:07:17,042 So, let's use the math square root function 138 00:07:17,042 --> 00:07:20,923 and let's pass it the result of this code 139 00:07:20,923 --> 00:07:25,740 snippet here, which calculates the population variance. 140 00:07:25,740 --> 00:07:27,844 So, if this gives us the same result, 141 00:07:27,844 --> 00:07:32,760 that is the population standard deviation. 142 00:07:32,760 --> 00:07:35,924 So, some basic descriptive statistics to help you 143 00:07:35,924 --> 00:07:40,924 understand how spread out your values are around the mean.
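The two approaches just described might look like this in Python. `statistics.pstdev`, `statistics.pvariance`, and `math.sqrt` are the standard library functions the narration refers to; the `rolls` list is an assumption, since the on-screen values are not in the transcript. It was chosen because it matches the quoted results of 2.25 and 1.5.

```python
import math
import statistics

# Assumed data: the values used on screen are not given in the transcript.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Approach 1: ask the statistics module for the population standard deviation.
std_dev = statistics.pstdev(rolls)

# Approach 2: take the square root of the population variance.
std_dev_from_variance = math.sqrt(statistics.pvariance(rolls))

print(std_dev)                # 1.5
print(std_dev_from_variance)  # 1.5
```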
144 00:07:41,610 --> 00:07:45,370 So, in terms of the standard deviation for these values, 145 00:07:45,370 --> 00:07:50,370 the average was 3.5, and the spread around 3.5 was 1.5 146 00:07:52,330 --> 00:07:57,150 for the standard deviation, and 2.25 for the variance, 147 00:07:57,150 --> 00:07:59,540 which emphasizes the values that 148 00:07:59,540 --> 00:08:02,040 are further away from the mean. 149 00:08:02,040 --> 00:08:04,440 Now, going back to the slides for a second here, 150 00:08:05,990 --> 00:08:08,240 the smaller those values are 151 00:08:08,240 --> 00:08:10,720 for the variance and the standard deviation, 152 00:08:10,720 --> 00:08:14,400 the closer your overall data set is 153 00:08:14,400 --> 00:08:17,360 to the mean value in that data set. 154 00:08:17,360 --> 00:08:21,393 So, smaller values mean closer to the mean. 155 00:08:22,600 --> 00:08:24,910 Now, in terms of the advantage of 156 00:08:24,910 --> 00:08:27,254 standard deviation versus variance, 157 00:08:27,254 --> 00:08:31,400 when you are working with the variance 158 00:08:31,400 --> 00:08:34,080 you wind up with units that are 159 00:08:34,080 --> 00:08:36,260 different from your original data. 160 00:08:36,260 --> 00:08:38,850 Standard deviation has the same units 161 00:08:38,850 --> 00:08:40,800 as your original measurements. 162 00:08:40,800 --> 00:08:44,730 So for example, let's just suppose you're using 163 00:08:44,730 --> 00:08:48,060 temperatures in the month of March, Fahrenheit temperatures. 164 00:08:48,060 --> 00:08:50,600 So you wind up, well around here anyway 165 00:08:50,600 --> 00:08:52,297 where it gets cold in the month of March, 166 00:08:52,297 --> 00:08:55,917 we might have 31 days' worth of average temperatures 167 00:08:55,917 --> 00:08:58,325 and they may be numbers like these. 168 00:08:58,325 --> 00:09:01,030 Maybe a little higher, maybe a little lower.
169 00:09:01,030 --> 00:09:02,590 This particular March we had some 170 00:09:02,590 --> 00:09:05,270 pretty cold days around my neighborhood. 171 00:09:05,270 --> 00:09:08,590 But in any case, the unit of measurement 172 00:09:08,590 --> 00:09:11,070 for these temperatures is degrees. 173 00:09:11,070 --> 00:09:14,033 And when you're working with the variance, 174 00:09:14,033 --> 00:09:18,226 the unit of measurement becomes degrees squared 175 00:09:18,226 --> 00:09:21,880 because you're actually calculating squares 176 00:09:21,880 --> 00:09:24,610 of the differences between the temperatures 177 00:09:24,610 --> 00:09:26,760 and the average temperature. 178 00:09:26,760 --> 00:09:28,570 So, you wind up with, even though 179 00:09:28,570 --> 00:09:30,810 those initial measurements are in degrees, 180 00:09:30,810 --> 00:09:33,900 when you square the value, you now have degrees squared. 181 00:09:33,900 --> 00:09:38,900 So, that's not the same unit as your original measurement. 182 00:09:39,330 --> 00:09:41,420 By taking the standard deviation, 183 00:09:41,420 --> 00:09:43,560 which is the square root of the variance, 184 00:09:43,560 --> 00:09:47,060 you actually go back to the original measurement. 185 00:09:47,060 --> 00:09:50,100 So, if we find out that the standard deviation 186 00:09:50,100 --> 00:09:53,670 for our temperatures in March is two, 187 00:09:53,670 --> 00:09:57,320 that would mean a two degree difference from the mean 188 00:09:57,320 --> 00:10:01,693 in a given year of average temperature values.
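The point about units can be illustrated with a short Python sketch. The temperatures below are hypothetical values invented for this note (a week of readings rather than the 31 days on the slide, which the transcript does not list). The variance comes out in degrees Fahrenheit squared, while the standard deviation is back in degrees Fahrenheit.

```python
import statistics

# Hypothetical daily average temperatures in degrees Fahrenheit.
march_temps_f = [28, 31, 35, 30, 27, 33, 34]

# Variance is built from squared differences, so its unit is degrees squared.
variance_f2 = statistics.pvariance(march_temps_f)

# The square root returns the measure to the original unit, degrees.
std_dev_f = statistics.pstdev(march_temps_f)

print(f"variance: {variance_f2:.2f} degrees Fahrenheit squared")
print(f"standard deviation: {std_dev_f:.2f} degrees Fahrenheit")
```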