1
00:00:00,000 --> 00:00:02,698
Hi, welcome back. When you're doing data

2
00:00:02,700 --> 00:00:05,190
analysis, the first thing you want to do

3
00:00:05,190 --> 00:00:07,890
is familiarize yourself with

4
00:00:07,890 --> 00:00:10,770
your data using the data

5
00:00:10,770 --> 00:00:13,170
analysis tool that you have chosen to

6
00:00:13,170 --> 00:00:15,690
use for your particular project. In this

7
00:00:15,690 --> 00:00:18,840
case, the tool is Python. So what we're

8
00:00:18,840 --> 00:00:20,640
going to do in this video is we are

9
00:00:20,640 --> 00:00:24,030
going to load the data into Python using

10
00:00:24,052 --> 00:00:26,632
a Jupyter Notebook. And then look

11
00:00:26,670 --> 00:00:29,250
at those data and extract some very

12
00:00:29,250 --> 00:00:32,610
basic information about the data, such

13
00:00:32,635 --> 00:00:35,065
as looking at what column names we have,

14
00:00:35,370 --> 00:00:38,310
how many rows we have, and other simple

15
00:00:38,430 --> 00:00:41,700
attributes of our data. So let's do

16
00:00:41,700 --> 00:00:43,950
that. I'd like to let you know that I

17
00:00:43,950 --> 00:00:46,390
created an empty folder named

18
00:00:46,391 --> 00:00:50,580
review_analysis, and I put the

19
00:00:50,610 --> 00:00:54,090
reviews.csv file in that folder.

20
00:00:54,870 --> 00:00:57,360
You can find this reviews.csv

21
00:00:57,360 --> 00:00:59,940
file attached to this lecture as a

22
00:00:59,940 --> 00:01:01,860
lecture resource, so please download

23
00:01:01,890 --> 00:01:04,500
that and place it in a folder just like

24
00:01:04,500 --> 00:01:07,440
I did. And then you should be able to

25
00:01:07,440 --> 00:01:10,770
locate that folder from the Jupyter

26
00:01:10,927 --> 00:01:16,036
homepage, which is localhost:8888/tree.

27
00:01:16,319 --> 00:01:18,900
So I created that folder directly

28
00:01:19,110 --> 00:01:21,900
in my users folder. So this is the

29
00:01:21,900 --> 00:01:24,311
folder. I can click it, and here is

30
00:01:24,313 --> 00:01:27,487
the reviews.csv file. In your case,

31
00:01:27,488 --> 00:01:29,945
you can create this wherever you want, and then

32
00:01:29,946 --> 00:01:32,472
locate it using this directory tree here.

33
00:01:32,473 --> 00:01:34,927
[No audio]

34
00:01:34,929 --> 00:01:39,257
While I am in this folder in Jupyter, I can go to

35
00:01:39,258 --> 00:01:44,130
the New dropdown list and choose Python 3 to

36
00:01:44,130 --> 00:01:47,520
create a new Jupyter Notebook. So the

37
00:01:47,520 --> 00:01:49,890
Jupyter Notebook has been created, I can

38
00:01:49,890 --> 00:01:54,000
rename this to something else, the name

39
00:01:54,030 --> 00:01:56,310
of the Jupyter Notebook. Let's keep

40
00:01:56,310 --> 00:01:59,940
it simple and say reviews, click Rename, and

41
00:01:59,940 --> 00:02:05,070
that will get a .ipynb extension.

42
00:02:05,640 --> 00:02:08,100
Let me make some more room here so that

43
00:02:08,100 --> 00:02:11,220
you can see more code. I'm going to toggle

44
00:02:11,220 --> 00:02:13,560
the header off, and now let's start

45
00:02:13,560 --> 00:02:16,710
coding. The very first thing we want to

46
00:02:16,710 --> 00:02:20,340
do, of course, is import pandas, the

47
00:02:20,340 --> 00:02:22,650
library that is used to perform data

48
00:02:22,650 --> 00:02:25,530
analysis with Python, and then we want

49
00:02:25,530 --> 00:02:27,660
to create a variable which is going to

50
00:02:27,660 --> 00:02:30,179
hold the data, that is equal to

51
00:02:30,181 --> 00:02:33,330
pandas.read, since we are working with a

52
00:02:33,355 --> 00:02:36,595
csv file, the function we want to use

53
00:02:37,170 --> 00:02:41,430
from the pandas library is read_csv.

54
00:02:42,150 --> 00:02:44,310
In parentheses, we want to pass in

55
00:02:44,310 --> 00:02:47,130
single quotes or double quotes, whatever

56
00:02:47,130 --> 00:02:51,330
you like, the path to the csv file. Now

57
00:02:51,330 --> 00:02:54,720
if you don't type in the name correctly,

58
00:02:54,720 --> 00:02:58,410
you're going to get an error. So let me

59
00:02:58,410 --> 00:03:02,760
try to execute this cell using Control

60
00:03:02,760 --> 00:03:04,920
Enter if you are on Windows, or Command

61
00:03:04,957 --> 00:03:09,127
Enter if you are on Mac, and as I warned you

62
00:03:09,152 --> 00:03:11,850
I got an error. It says No such file

63
00:03:11,850 --> 00:03:14,880
or directory, because I mistyped the

64
00:03:14,880 --> 00:03:18,840
file name. So, reviews. This time, if I

65
00:03:18,840 --> 00:03:20,970
execute, I don't get an error, that

66
00:03:20,970 --> 00:03:25,110
means the data was loaded successfully. And

67
00:03:25,110 --> 00:03:31,870
I can press Escape, B, Enter, type the data variable,

68
00:03:32,213 --> 00:03:35,653
and press Control Enter again to see the DataFrame.
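As a rough sketch of the loading step just shown: so that it runs anywhere, this snippet first writes a tiny stand-in reviews.csv to a temporary folder. Only the Rating column name comes from the lecture; the other column names and all the values are made-up placeholders, not the real data.
```python
import tempfile
from pathlib import Path
import pandas as pd
# Stand-in file: only "Rating" is a column name from the lecture; everything else is placeholder data.
folder = Path(tempfile.mkdtemp())
csv_path = folder / "reviews.csv"
csv_path.write_text("Course,Timestamp,Rating,Comment\nA,2021-01-01,5.0,Great\nB,2021-01-02,4.5,Good\n")
# A mistyped file name raises FileNotFoundError ("No such file or directory").
try:
    pd.read_csv(folder / "review.csv")
    load_error = None
except FileNotFoundError as err:
    load_error = err
# With the correct name, the data loads into a DataFrame.
data = pd.read_csv(csv_path)
print(data)
```
In a notebook, leaving data alone on the last line of a cell displays the DataFrame, so the print call is only needed outside Jupyter.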

69
00:03:35,654 --> 00:03:39,412
[No audio]

70
00:03:39,414 --> 00:03:44,079
So we can see that we have 1, 2, 3, 4 columns, and

71
00:03:44,080 --> 00:03:47,940
we also have this index column here added by pandas

72
00:03:47,940 --> 00:03:51,150
automatically. Basically, it's a range

73
00:03:51,180 --> 00:03:53,640
of numbers starting from 0. So that

74
00:03:53,640 --> 00:03:56,250
is the first row of our data. This

75
00:03:56,250 --> 00:04:00,180
one in here has this index of 0.

76
00:04:01,800 --> 00:04:04,950
And it ends at the last row, you can see

77
00:04:04,950 --> 00:04:07,050
that this is the first row, second,

78
00:04:07,110 --> 00:04:11,280
third, fourth, fifth row, and then Jupyter is

79
00:04:11,280 --> 00:04:14,250
not displaying the rows after row

80
00:04:14,310 --> 00:04:17,430
five, because there are a lot of rows

81
00:04:17,430 --> 00:04:21,270
and it'd be impractical to see them

82
00:04:21,270 --> 00:04:23,730
here. However, you get to see the last

83
00:04:23,946 --> 00:04:29,406
five rows of the DataFrame. And that

84
00:04:29,538 --> 00:04:31,950
gives you an overview of the DataFrame.

85
00:04:32,490 --> 00:04:36,500
However, what I like to do instead is just print out

86
00:04:36,768 --> 00:04:39,624
the head of the DataFrame, which is the first

87
00:04:39,649 --> 00:04:41,681
[No audio]

88
00:04:41,683 --> 00:04:46,725
rows only. So the first five rows, that gives you

89
00:04:46,726 --> 00:04:49,530
a more compact view. It gives you an idea

90
00:04:49,530 --> 00:04:51,750
what columns you have and what kind of

91
00:04:51,990 --> 00:04:54,780
rows you have. So it's good to have

92
00:04:54,780 --> 00:04:56,910
the head of the DataFrame displayed here.
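A minimal sketch of the head call, using a small made-up DataFrame in place of the lecture's data (the Course column and all the values are placeholders):
```python
import pandas as pd
# Placeholder data standing in for reviews.csv; seven rows so that head() has something to trim.
data = pd.DataFrame({"Course": list("ABCDEFG"), "Rating": [5.0, 4.5, 4.0, 5.0, 3.5, 1.0, 2.0]})
first_rows = data.head()  # first five rows by default
print(first_rows)
```
Passing a number, for example data.head(3), returns that many rows instead of five.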

93
00:04:57,180 --> 00:05:00,810
Then we can press Escape B and create a

94
00:05:00,810 --> 00:05:04,322
new cell here, then Enter to write some other code.

95
00:05:05,541 --> 00:05:08,586
You can get the shape of the DataFrame

96
00:05:09,258 --> 00:05:11,460
by accessing the shape property.

97
00:05:11,610 --> 00:05:16,290
So head is a method, with parentheses;

98
00:05:16,980 --> 00:05:19,380
shape is a property, so it doesn't need

99
00:05:19,380 --> 00:05:21,660
parentheses, and then you get the

100
00:05:21,660 --> 00:05:23,991
shape of the DataFrame which is basically the

101
00:05:23,992 --> 00:05:27,661
number of rows and the number of columns.
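To illustrate the difference between a method and a property, here is a small sketch on a made-up DataFrame (the column names other than Rating and the values are placeholders):
```python
import pandas as pd
# Placeholder data: three rows, two columns.
data = pd.DataFrame({"Course": ["A", "B", "C"], "Rating": [5.0, 4.5, 4.0]})
rows, columns = data.shape  # property, no parentheses: a (rows, columns) tuple
print(rows, columns)
# Adding parentheses tries to call the tuple itself, which raises TypeError.
try:
    data.shape()
    call_error = None
except TypeError as err:
    call_error = err
```
Because shape is a plain tuple, it can be unpacked into two variables as shown.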

102
00:05:27,662 --> 00:05:29,842
[No audio]

103
00:05:29,844 --> 00:05:32,881
You see 1, 2, 3, 4 columns.

104
00:05:32,906 --> 00:05:34,916
[No audio]

105
00:05:34,940 --> 00:05:41,026
You might also want to press B, Enter, and display

106
00:05:41,050 --> 00:05:43,050
[No audio]

107
00:05:43,053 --> 00:05:45,810
the columns that your DataFrame has.
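A short sketch of reading the column names, again on placeholder data (only the Rating name is taken from the lecture):
```python
import pandas as pd
data = pd.DataFrame({"Course": ["A", "B"], "Rating": [5.0, 4.5]})
print(data.columns)  # a pandas Index object holding the column names
column_names = list(data.columns)  # a plain Python list, if that is easier to work with
```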

108
00:05:46,200 --> 00:05:48,990
Even though we have them there, this is

109
00:05:48,990 --> 00:05:51,900
yet another way to see the names of the

110
00:05:51,900 --> 00:05:54,120
columns by accessing the columns

111
00:05:54,150 --> 00:05:56,940
property. Then next, usually, when you

112
00:05:56,940 --> 00:05:59,430
are working with data, you have some

113
00:05:59,430 --> 00:06:01,830
specific columns that you are interested

114
00:06:01,830 --> 00:06:04,680
in, which could be one column or

115
00:06:04,680 --> 00:06:06,750
more. In this case, we might be

116
00:06:06,750 --> 00:06:09,750
interested to see an overview of the

117
00:06:09,750 --> 00:06:13,050
Rating column to see what the minimum

118
00:06:13,200 --> 00:06:15,840
rating is and what the maximum rating

119
00:06:15,840 --> 00:06:18,420
is, and the distribution of those

120
00:06:18,420 --> 00:06:21,930
ratings, and have them as a graph

121
00:06:21,930 --> 00:06:24,570
displayed here, so that we can have a

122
00:06:24,570 --> 00:06:27,480
better understanding of our data. So I'm

123
00:06:27,480 --> 00:06:30,451
going to press Escape, B, Enter, and type

124
00:06:30,453 --> 00:06:34,140
data.hist. So this will be a histogram

125
00:06:34,830 --> 00:06:38,460
with parentheses. We are interested to

126
00:06:38,460 --> 00:06:41,619
see the distribution of the ratings.

127
00:06:42,382 --> 00:06:45,992
Therefore, I enter the Rating column name here as a string.
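A hedged sketch of the histogram call on made-up ratings (the values are placeholders, not the lecture's data). It assumes matplotlib is installed, since pandas relies on it for plotting, and it adds a value_counts line, which is not part of the lecture, to show the same distribution as exact numbers.
```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch also runs outside a notebook
import pandas as pd
# Placeholder ratings standing in for the Rating column of reviews.csv.
data = pd.DataFrame({"Rating": [5.0, 5.0, 5.0, 4.5, 4.0, 4.0, 1.0]})
axes = data.hist("Rating")  # histogram of the Rating column, passed as a string
counts = data["Rating"].value_counts().sort_index()  # how often each rating occurs
print(counts)
```
Inside Jupyter the backend line is not needed; the plot appears under the cell.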

128
00:06:45,993 --> 00:06:48,337
[No audio]

129
00:06:48,339 --> 00:06:53,413
Execute, and we get this graph. Let me explain

130
00:06:53,415 --> 00:06:56,130
to you what that graph means. What this means is

131
00:06:56,130 --> 00:06:57,870
that, for example, let's start from the

132
00:06:57,870 --> 00:07:01,980
right. This bar here, this first bar

133
00:07:03,141 --> 00:07:08,782
means that we have around 24,000

134
00:07:09,746 --> 00:07:14,490
5 star ratings. You see, for example,

135
00:07:14,490 --> 00:07:18,000
we have this, this is a 5.0 star rating.

136
00:07:18,180 --> 00:07:22,980
So we have around 24,000 5 star ratings in

137
00:07:22,980 --> 00:07:26,455
the whole DataFrame. And in total, we have

138
00:07:26,742 --> 00:07:33,156
45,000 ratings. Then the next bar here,

139
00:07:33,157 --> 00:07:35,282
[No audio]

140
00:07:35,284 --> 00:07:41,930
these are 4.5 star ratings like that one in there.

141
00:07:43,871 --> 00:07:47,970
And we have around, let's say, 7,000 of

142
00:07:47,970 --> 00:07:51,510
those, then we have 4 star ratings,

143
00:07:52,470 --> 00:07:58,890
around 9,000, perhaps. Then 3.5 star ratings

144
00:07:58,920 --> 00:08:03,420
in here. These are 3 star ratings.

145
00:08:03,840 --> 00:08:08,370
It's about 2,000. Then we have 2.5 star

146
00:08:08,370 --> 00:08:11,700
ratings, this one here, 2 star

147
00:08:11,700 --> 00:08:17,610
ratings, 1.5 star ratings. That's the

148
00:08:17,610 --> 00:08:20,922
lowest number among all the ratings.

149
00:08:21,672 --> 00:08:24,690
So people don't leave a lot of 1.5 star

150
00:08:24,690 --> 00:08:27,690
ratings. And we also have these 1 star

151
00:08:27,750 --> 00:08:30,270
ratings in here. It's not the best

152
00:08:30,270 --> 00:08:32,670
graph. I don't like it personally, but

153
00:08:32,670 --> 00:08:34,620
it's a quick way, so that you know

154
00:08:34,620 --> 00:08:37,710
how your data are distributed. You know

155
00:08:37,710 --> 00:08:40,740
that, okay, we have data from 1 to 5

156
00:08:40,740 --> 00:08:45,000
and 5 is the most frequently occurring value in

157
00:08:45,000 --> 00:08:47,790
your data. And that's it for this

158
00:08:47,790 --> 00:08:50,280
lecture. We saw how you can get an overview

159
00:08:50,310 --> 00:08:53,280
of your DataFrame. In the next lecture,

160
00:08:53,280 --> 00:08:56,929
we are going to zoom in on our DataFrame so that

161
00:08:56,931 --> 00:09:00,914
we can select particular rows, or particular

162
00:09:00,915 --> 00:09:05,490
slices of our DataFrame to see individual values.

163
00:09:05,700 --> 00:09:07,980
So in other words, we are going to use

164
00:09:07,980 --> 00:09:10,980
Python to navigate through our data and

165
00:09:10,980 --> 00:09:15,240
select particular sections of the data and

166
00:09:15,242 --> 00:09:17,931
display them. So I'll see you in the next video.