In this and the next couple of videos we're going to use pandas to read in a popular data set that's often used by people learning about data science, and work with that data set in the context of pandas to explore that data. Now, on the screen at the moment I have a URL for a popular data set repository called R Datasets. This repository has over 1,100 comma-separated-value data sets that you can play with, and one of them is the Titanic disaster data set, which has information about just over 1,300 passengers who were on the ship and what their fate was when the ship sank. Now, I wanted to show you this just so you have a sense of the fact that there are these kinds of massive repositories out there in which you can find data to work with and learn from as you get into data science, machine learning, and deep learning. The particular data set we'll be working with, like I said, has just over 1,300 records; as you can see, some of these have many fewer records.
Some of them have many more records, but this is just one data set repository of the many that are out there from which you can obtain data to work with for your own studies and for learning purposes as well. I just want to show you that if I search on this page for Titanic, you'll see that the first hit is the Titanic survival data set, and what I did to use this with pandas is I right-clicked the CSV link over here and copied the URL so that I can load the data set directly into pandas. With that said, let's switch over to a terminal window here; you can see I've already imported pandas. What we're going to do next is load that data set into a pandas DataFrame. Because the URL was long, I broke it into two pieces here and concatenated the strings, but you can see we're using that same read_csv function that we demonstrated in the preceding video, this time to load from a URL instead of from a local file on disk. So let's go ahead and read that in.
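The loading step described above might look something like the sketch below. The variable names are assumptions, and a small inline CSV stands in for the remote file so the example runs without network access; in the video, the same read_csv call simply takes the concatenated URL string instead.

```python
import io
import pandas as pd

# In the video the long R Datasets URL is split into two string literals that
# Python concatenates. Here a tiny made-up CSV stands in for the remote file.
csv_text = (
    '"","survived","sex","age","passengerClass"\n'
    '"Allen, Miss. Elisabeth Walton","yes","female",29,"1st"\n'
    '"Allison, Master. Hudson Trevor","yes","male",0.92,"1st"\n'
)

# read_csv accepts a URL, a local file path, or any file-like object.
titanic = pd.read_csv(io.StringIO(csv_text))
print(titanic.shape)  # (rows, columns)
```

Because read_csv treats a URL and a local path the same way, switching between the two is just a matter of changing the string you pass in.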
I'm also going to set an option within pandas to display floating-point numbers with two digits to the right of the decimal point, and that will enable me to compact the output in this example a little bit. set_option is a function in the pandas library; when you pass it the precision option, that indicates you're setting the display precision to 2 for all floating-point numbers that pandas formats. And that's for on-screen formatting purposes. I'll go ahead and set that as well. One of the great things about pandas is that it makes it really easy to explore your data, which is a key thing you need to do. You need to get to know your data before you can work with that data. For example, I can go ahead and type titanic.head() and it will give me just the first few rows of the Titanic data set to look at. The indices are just the indices that are provided automatically by pandas, starting from index number 0 for the first row. We have several columns of information here. This one is not named very well, but it's the column that specifies passenger names.
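The two settings just described could be sketched as follows. Note that recent pandas versions spell the option 'display.precision', while some older releases also accepted just 'precision'; the sample values here are made up for illustration.

```python
import pandas as pd

# Display all floats with two digits after the decimal point. Recent pandas
# uses the full option name 'display.precision'.
pd.set_option('display.precision', 2)

# A small frame with a fractional age, like the infants in the Titanic data.
df = pd.DataFrame({'age': [29.0, 0.9167, 2.0]})
print(df.head())  # head() shows the first rows (five by default)
```

With the option set, the 0.9167 value renders as 0.92; the underlying data is unchanged, only the on-screen formatting differs.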
We have a column called survived, which indicates whether they survived the disaster or not. We have a sex column, which indicates whether they were male or female; an age column for their age at the time of the disaster; and a passenger class column that indicates which class they were traveling in on the ship. Now, just like you can look at the first few, you also have the ability to look at the last few with the tail method. So these are the last 5 records of the data set, and all these folks were 3rd-class passengers. By the way, notice the NaN here in record number 305: that person didn't have an age recorded, so that's a missing piece of data in this particular data set. At this point we've now displayed just a few of the records so we can get a sense of what information is there. Let's also make a change. These column names are perhaps not as readable as they could be; it would be nice if this one were simply called name, and this one here is really wide while we have pretty narrow data, so let's also rename that one as well.
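The tail method and the NaN the narration points out could look like this in miniature. The two rows here are made up for illustration, with one missing age standing in for the gap in the real data set; isna() is one common way to flag such gaps programmatically.

```python
import numpy as np
import pandas as pd

# A tiny frame with one missing age, mirroring the NaN seen in the video.
titanic = pd.DataFrame({
    'name': ['Zimmerman, Mr. Leo', 'Zabour, Miss. Hileni'],
    'age': [29.0, np.nan],
})

print(titanic.tail(5))        # tail() shows the last rows (five by default)
print(titanic['age'].isna())  # True wherever a value is missing (NaN)
```

Missing values print as NaN, and isna() gives you a boolean mask you can use to count or filter them later.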
To do that, I'm going to set the DataFrame's columns attribute. I have 5 columns here, and I'm going to set all of those column names. For a few of them we're using the same names they had previously, but the first column will now be called name and the last column will now be called class. After doing that, if I go ahead and display the head of the DataFrame once again, you can see we now have our customized column headings displayed across the top of the DataFrame. So at this point we've loaded the DataFrame and we've started to get to know the data; in the next video we're going to do a little bit of simple data analysis with the data set.
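The renaming step above might be sketched like this. The starting headers are an assumption about how the R Datasets CSV loads (a blank header typically becomes "Unnamed: 0" in pandas), and the single sample row is made up:

```python
import pandas as pd

# One sample row under the original headers, as the CSV might load them.
titanic = pd.DataFrame(
    [['Allen, Miss. Elisabeth Walton', 'yes', 'female', 29.0, '1st']],
    columns=['Unnamed: 0', 'survived', 'sex', 'age', 'passengerClass'],
)

# Assigning to .columns replaces every header at once, so the list must name
# all five columns even though only the first and last actually change.
titanic.columns = ['name', 'survived', 'sex', 'age', 'class']
print(titanic.head())
```

If you only wanted to change a couple of headers without listing them all, the rename method with a mapping dictionary is a common alternative.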