1
00:00:01,440 --> 00:00:03,390
- [Instructor] Next,
let's continue the session

2
00:00:03,390 --> 00:00:05,850
that we started in the preceding video

3
00:00:05,850 --> 00:00:09,320
and in particular, let's
continue to learn about our data

4
00:00:09,320 --> 00:00:13,240
by looking at some simple data analysis.

5
00:00:13,240 --> 00:00:14,880
So first of all, let's go ahead

6
00:00:14,880 --> 00:00:19,880
and execute a titanic.describe
method call.

7
00:00:19,910 --> 00:00:22,180
Now before I hit Enter, notice again,

8
00:00:22,180 --> 00:00:24,920
we have five columns of data, however,

9
00:00:24,920 --> 00:00:27,640
when I describe the titanic data set,

10
00:00:27,640 --> 00:00:30,380
I only get one column of output

11
00:00:30,380 --> 00:00:32,500
and the reason for that is by default,

12
00:00:32,500 --> 00:00:35,610
the describe method looks for columns

13
00:00:35,610 --> 00:00:38,870
that contain numeric
data and only calculates

14
00:00:38,870 --> 00:00:43,490
these descriptive statistics
on the numeric columns.
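As a minimal sketch of that default behavior, assuming a small made-up DataFrame (the name df and its values are illustrative, not the real Titanic data), describe should summarize only the numeric age column:
```python
import pandas as pd
# Hypothetical toy data standing in for the Titanic DataFrame
df = pd.DataFrame({
    'survived': ['yes', 'no', 'no', 'yes'],
    'sex': ['female', 'male', 'male', 'female'],
    'age': [29.0, 2.0, 25.0, 30.0],
})
summary = df.describe()  # by default, only numeric columns are summarized
print(summary.columns.tolist())  # labels of the columns that were summarized
```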

15
00:00:43,490 --> 00:00:46,540
In a moment, I'll take a
look at a non-numeric column

16
00:00:46,540 --> 00:00:50,220
and we'll do some descriptive
statistics on that as well.

17
00:00:50,220 --> 00:00:53,120
So for the titanic's age column,

18
00:00:53,120 --> 00:00:54,650
first of all, I want you to notice

19
00:00:54,650 --> 00:00:58,540
that it only counted 1,046 records.

20
00:00:58,540 --> 00:00:59,563
Now, if you look up here, you can see

21
00:00:59,563 --> 00:01:02,810
that the highest
numbered record was 1,308

22
00:01:02,810 --> 00:01:06,940
and the first one was zero,
so there are 1,309 records

23
00:01:06,940 --> 00:01:10,030
in this data set or
rows in this data frame,

24
00:01:10,030 --> 00:01:13,890
yet it only counted
1,046 in the age column

25
00:01:13,890 --> 00:01:16,660
because some of the ages are missing.

26
00:01:16,660 --> 00:01:20,100
So we see one missing
value for the age here

27
00:01:20,100 --> 00:01:23,760
represented as not a number and in pandas,

28
00:01:23,760 --> 00:01:26,980
when you describe a column by default,

29
00:01:26,980 --> 00:01:29,430
it ignores the missing data.

30
00:01:29,430 --> 00:01:33,880
So it counted up only
the rows of the data set

31
00:01:33,880 --> 00:01:36,470
that actually had an age value

32
00:01:36,470 --> 00:01:39,190
and there were 1,046 of those.
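To make the difference between the row count and the describe count concrete, here is a rough sketch on a made-up Series of ages (the values and variable names are assumptions for illustration only):
```python
import numpy as np
import pandas as pd
# Hypothetical ages; np.nan marks a missing value
ages = pd.Series([29.0, 0.17, np.nan, 80.0, np.nan], name='age')
stats = ages.describe()
total_rows = len(ages)  # counts every row, including the missing ones
non_missing = int(stats['count'])  # describe counts only non-missing values
missing = int(ages.isna().sum())  # number of NaN entries
print(total_rows, non_missing, missing)
```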

33
00:01:39,190 --> 00:01:42,420
For those 1,046 people, we can see

34
00:01:42,420 --> 00:01:45,020
that the average age was close to 30,

35
00:01:45,020 --> 00:01:46,540
we can see that the minimum age

36
00:01:46,540 --> 00:01:48,640
was only 0.17 years,

37
00:01:48,640 --> 00:01:51,290
which works out to a
little over two months,

38
00:01:51,290 --> 00:01:54,530
we can see that the maximum age was 80

39
00:01:54,530 --> 00:01:57,600
and remember that, among the quartile values,

40
00:01:57,600 --> 00:02:02,600
the 50% quartile represents the
median of the sorted age values.

41
00:02:02,640 --> 00:02:06,920
So the median age was 28, the median age

42
00:02:06,920 --> 00:02:11,920
for the first half of the
folks who had age values was 21

43
00:02:12,500 --> 00:02:15,290
and the median age for the last half

44
00:02:15,290 --> 00:02:18,970
of the folks who had age values was 39.

45
00:02:18,970 --> 00:02:21,340
So we can see some basic descriptive

46
00:02:21,340 --> 00:02:23,670
statistics for the age here.

47
00:02:23,670 --> 00:02:26,590
Now let's assume for a
moment that we'd like

48
00:02:26,590 --> 00:02:28,550
to know some information about

49
00:02:28,550 --> 00:02:32,850
how many people survived
versus did not survive.

50
00:02:32,850 --> 00:02:35,830
So we can actually figure that out

51
00:02:35,830 --> 00:02:38,320
by selecting from the data frame,

52
00:02:38,320 --> 00:02:43,090
a series that checks whether
each person survived or not.

53
00:02:43,090 --> 00:02:45,540
So let's go ahead and copy another

54
00:02:45,540 --> 00:02:48,170
snippet in here and we'll paste it in.

55
00:02:48,170 --> 00:02:50,410
So we're going to first execute

56
00:02:50,410 --> 00:02:52,370
this parenthesized expression,

57
00:02:52,370 --> 00:02:56,700
which uses the NumPy-style capability

58
00:02:56,700 --> 00:03:01,700
of broadcasting to compare
every single survived value

59
00:03:01,760 --> 00:03:05,130
in the survived column with the value yes,

60
00:03:05,130 --> 00:03:09,100
and the result is going
to be true or false.

61
00:03:09,100 --> 00:03:12,040
I will get back as a result of
this highlighted expression,

62
00:03:12,040 --> 00:03:15,540
a series of true or false
values where everyone

63
00:03:15,540 --> 00:03:18,750
that compared equal to yes will be true

64
00:03:18,750 --> 00:03:20,570
and everyone that did not compare

65
00:03:20,570 --> 00:03:22,540
equal to yes will be false.

66
00:03:22,540 --> 00:03:25,540
Those that are true are
the folks who survived,

67
00:03:25,540 --> 00:03:28,180
those that are false
are the folks who died.
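A small sketch of that element-wise comparison, assuming a hypothetical survived column that holds the strings 'yes' and 'no' (the data here is invented for illustration):
```python
import pandas as pd
# Hypothetical survived column with 'yes'/'no' strings
survived = pd.Series(['yes', 'no', 'no', 'yes', 'no'], name='survived')
flags = (survived == 'yes')  # the scalar 'yes' is compared with every element
print(flags.tolist())  # one True/False value per row
```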

68
00:03:28,180 --> 00:03:31,020
Now I can call describe on that series,

69
00:03:31,020 --> 00:03:33,110
which is non-numeric, by the way,

70
00:03:33,110 --> 00:03:34,860
and it will give me some other

71
00:03:34,860 --> 00:03:38,340
descriptive statistics
for that non-numeric data.

72
00:03:38,340 --> 00:03:41,660
We still get a count, so we see 1,309 here,

73
00:03:41,660 --> 00:03:44,540
which means every single record had a yes

74
00:03:44,540 --> 00:03:48,920
or no value for that survived column.

75
00:03:48,920 --> 00:03:51,800
We can see that there are
only two unique values

76
00:03:51,800 --> 00:03:54,160
in that column, in this case, true,

77
00:03:54,160 --> 00:03:57,020
they survived or false, they did not.

78
00:03:57,020 --> 00:03:59,400
And we can see that the
top value was false,

79
00:03:59,400 --> 00:04:02,400
the most frequently
occurring value was false

80
00:04:02,400 --> 00:04:05,780
and that occurred 809 times

81
00:04:05,780 --> 00:04:08,650
so 809 people died and the other

82
00:04:08,650 --> 00:04:12,833
500 passengers survived the disaster.
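The same count, unique, top and freq summary can be sketched on made-up data, assuming a hypothetical survived column of 'yes' and 'no' strings; subtracting freq from count gives the other group only because there are exactly two unique values:
```python
import pandas as pd
# Hypothetical survived column, not the real Titanic data
survived = pd.Series(['yes', 'no', 'no', 'yes', 'no'], name='survived')
summary = (survived == 'yes').describe()  # Boolean data yields count, unique, top, freq
print(summary)
died = int(summary['freq'])  # top is False here, so freq counts those who died
lived = int(summary['count']) - died  # valid because only two unique values exist
print(died, lived)
```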