1 00:00:01,440 --> 00:00:03,390 - [Instructor] Next, let's continue the session 2 00:00:03,390 --> 00:00:05,850 that we started in the preceding video 3 00:00:05,850 --> 00:00:09,320 and in particular, let's continue to learn about our data 4 00:00:09,320 --> 00:00:13,240 by looking at some simple data analysis. 5 00:00:13,240 --> 00:00:14,880 So first of all, let's go ahead 6 00:00:14,880 --> 00:00:19,880 and execute a titanic.describe() method call. 7 00:00:19,910 --> 00:00:22,180 Now before I hit Enter, notice again, 8 00:00:22,180 --> 00:00:24,920 we have five columns of data. However, 9 00:00:24,920 --> 00:00:27,640 when I describe the titanic data set, 10 00:00:27,640 --> 00:00:30,380 I only get one column of output, 11 00:00:30,380 --> 00:00:32,500 and the reason for that is that, by default, 12 00:00:32,500 --> 00:00:35,610 the describe method looks for columns 13 00:00:35,610 --> 00:00:38,870 that contain numeric data and only calculates 14 00:00:38,870 --> 00:00:43,490 these descriptive statistics on the numeric columns. 15 00:00:43,490 --> 00:00:46,540 In a moment, I'll take a look at a non-numeric column 16 00:00:46,540 --> 00:00:50,220 and we'll do some descriptive statistics on that as well. 17 00:00:50,220 --> 00:00:53,120 So for the titanic's age column, 18 00:00:53,120 --> 00:00:54,650 first of all, I want you to notice 19 00:00:54,650 --> 00:00:58,540 that it only counted 1,046 records. 20 00:00:58,540 --> 00:00:59,563 Now, if you look up here, you can see 21 00:00:59,563 --> 00:01:02,810 that the highest numbered record was 1,308 22 00:01:02,810 --> 00:01:06,940 and the first one was zero, so there are 1,309 records 23 00:01:06,940 --> 00:01:10,030 in this data set, or rows in this data frame, 24 00:01:10,030 --> 00:01:13,890 yet it only counted 1,046 in the age column 25 00:01:13,890 --> 00:01:16,660 because some of the ages are missing.
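The behavior described so far can be tried on a small stand-in for the data. This is only a sketch: the rows below are made up, and the column names other than survived and age are assumptions for illustration, not taken from the video. It shows that describe reports only the numeric column and that its count skips missing ages.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the titanic data: made-up rows, five columns.
# Only 'age' is numeric; two of its values are missing (NaN).
titanic = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'survived': ['yes', 'no', 'no', 'yes', 'no'],
    'sex': ['female', 'male', 'male', 'female', 'male'],
    'age': [29.0, np.nan, 2.0, 30.0, np.nan],
    'class': ['1st', '1st', '3rd', '2nd', '3rd'],
})

# By default, describe() summarizes only the numeric columns.
summary = titanic.describe()
print(summary)

# The count statistic ignores missing values, so it can be smaller
# than the number of rows in the data frame.
total_rows = len(titanic)
age_count = int(summary.loc['count', 'age'])
missing_ages = int(titanic['age'].isna().sum())
print(total_rows, age_count, missing_ages)
```

With the real data set, the same comparison would be between the 1,309 rows and the 1,046 counted ages.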
26 00:01:16,660 --> 00:01:20,100 So we see one missing value for the age here 27 00:01:20,100 --> 00:01:23,760 represented as not a number (NaN), and in pandas, 28 00:01:23,760 --> 00:01:26,980 when you describe a column, by default 29 00:01:26,980 --> 00:01:29,430 it ignores the missing data. 30 00:01:29,430 --> 00:01:33,880 So it counted up only the rows of the data set 31 00:01:33,880 --> 00:01:36,470 that actually had an age value, 32 00:01:36,470 --> 00:01:39,190 and there were 1,046 of those. 33 00:01:39,190 --> 00:01:42,420 For those 1,046 people, we can see 34 00:01:42,420 --> 00:01:45,020 that the average age was close to 30, 35 00:01:45,020 --> 00:01:46,540 we can see that the minimum age 36 00:01:46,540 --> 00:01:48,640 was only 0.17 years, 37 00:01:48,640 --> 00:01:51,290 which works out to a little over two months, 38 00:01:51,290 --> 00:01:54,530 and we can see that the maximum age was 80. 39 00:01:54,530 --> 00:01:57,600 And remember, among the quartile values, 40 00:01:57,600 --> 00:02:02,600 the 50% quartile represents the median of the sorted age values. 41 00:02:02,640 --> 00:02:06,920 So the median age was 28, the median age 42 00:02:06,920 --> 00:02:11,920 for the first half of the folks who had age values was 21, 43 00:02:12,500 --> 00:02:15,290 and the median age for the last half 44 00:02:15,290 --> 00:02:18,970 of the folks who had age values was 39. 45 00:02:18,970 --> 00:02:21,340 So we can see some basic descriptive 46 00:02:21,340 --> 00:02:23,670 statistics for the age here. 47 00:02:23,670 --> 00:02:26,590 Now let's assume for a moment that we'd like 48 00:02:26,590 --> 00:02:28,550 to know some information about 49 00:02:28,550 --> 00:02:32,850 how many people survived versus did not survive. 50 00:02:32,850 --> 00:02:35,830 So we can actually figure that out 51 00:02:35,830 --> 00:02:38,320 by selecting from the data frame 52 00:02:38,320 --> 00:02:43,090 a series that checks whether each person survived or not.
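Looking back at the quartile values read from the age statistics, the following sketch shows where the 25%, 50%, and 75% entries come from. The ages are made up and chosen to be easy to check by hand. Note that pandas computes quantiles with linear interpolation by default, so on other data the 25% and 75% values may differ slightly from a literal "median of each half".

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value that describe() will ignore.
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, np.nan], name='age')

stats = ages.describe()

# The quartile rows of describe() match Series.quantile().
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(stats)
print(quartiles)

# The 50% quartile is the median of the non-missing values.
median_age = ages.median()
print(stats['50%'] == median_age)
```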
53 00:02:43,090 --> 00:02:45,540 So let's go ahead and copy another 54 00:02:45,540 --> 00:02:48,170 snippet in here and we'll paste it in. 55 00:02:48,170 --> 00:02:50,410 So we're going to first execute 56 00:02:50,410 --> 00:02:52,370 this parenthesized expression, 57 00:02:52,370 --> 00:02:56,700 which uses the NumPy-style capability 58 00:02:56,700 --> 00:03:01,700 of broadcasting to compare every single survived value 59 00:03:01,760 --> 00:03:05,130 in the survived column with the value yes, 60 00:03:05,130 --> 00:03:09,100 and each result is going to be True or False. 61 00:03:09,100 --> 00:03:12,040 I will get back, as a result of this highlighted expression, 62 00:03:12,040 --> 00:03:15,540 a series of True/False values where everyone 63 00:03:15,540 --> 00:03:18,750 that compared equal to yes will be True 64 00:03:18,750 --> 00:03:20,570 and everyone that did not compare 65 00:03:20,570 --> 00:03:22,540 equal to yes will be False. 66 00:03:22,540 --> 00:03:25,540 Those that are True are the folks who survived, 67 00:03:25,540 --> 00:03:28,180 those that are False are the folks who died. 68 00:03:28,180 --> 00:03:31,020 Now I can call describe on that series, 69 00:03:31,020 --> 00:03:33,110 which is non-numeric, by the way, 70 00:03:33,110 --> 00:03:34,860 and it will give me some other 71 00:03:34,860 --> 00:03:38,340 descriptive statistics for that non-numeric data. 72 00:03:38,340 --> 00:03:41,660 We still get a count, so we see 1,309 here, 73 00:03:41,660 --> 00:03:44,540 which means every single record had a yes 74 00:03:44,540 --> 00:03:48,920 or no value for that survived column. 75 00:03:48,920 --> 00:03:51,800 We can see that there are only two unique values 76 00:03:51,800 --> 00:03:54,160 in that column, in this case, True, 77 00:03:54,160 --> 00:03:57,020 they survived, or False, they did not.
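The element-by-element comparison described above can be sketched on its own. The values here are made up, and the lowercase string 'yes' is an assumption about how the survived column is stored.

```python
import pandas as pd

# Made-up survived values standing in for the real column.
survived = pd.Series(['yes', 'no', 'no', 'yes', 'no'], name='survived')

# The single value 'yes' is compared against every element,
# producing one True or False per row.
mask = (survived == 'yes')
print(mask)

# Summing a boolean series counts the True values.
survivor_count = int(mask.sum())
print(survivor_count)
```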
78 00:03:57,020 --> 00:03:59,400 And we can see that the top value was False, 79 00:03:59,400 --> 00:04:02,400 that is, the most frequently occurring value was False, 80 00:04:02,400 --> 00:04:05,780 and that occurred 809 times, 81 00:04:05,780 --> 00:04:08,650 so 809 people died and the other 82 00:04:08,650 --> 00:04:12,833 500 passengers survived the disaster.
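As a recap of the non-numeric statistics, here is a minimal sketch of calling describe on a True/False series. The data is made up, so the numbers differ from the 1,309 and 809 seen in the video, and the lowercase 'yes' and 'no' strings are an assumption.

```python
import pandas as pd

# Made-up survived values; three 'no' and two 'yes'.
survived = pd.Series(['no', 'yes', 'no', 'no', 'yes'], name='survived')
mask = (survived == 'yes')

# For a boolean series, describe() reports count, unique, top, and freq
# instead of the numeric statistics.
summary = mask.describe()
print(summary)

# top is the most frequent value and freq is how often it occurs.
died = int((~mask).sum())
lived = int(mask.sum())
print(died, lived)
```

Here the top value is False with a frequency of 3, which matches the number of rows that did not compare equal to 'yes'.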