Hi, in this video I'm going to show you how to create a plot of daily average ratings. So our graph will show the average rating for each day. To do that, I am not going to work in the previous Jupyter file, because I like to keep things separate. So for the visualization part, I'm going to create a new notebook, and I would suggest you do the same. First of all, we need to load the DataFrame, so I'm just going to copy that cell from the previous notebook, paste it there, and quickly print the head of the data just to double-check everything is all right. Yes, so this is the data, the raw data. Then press Escape and press B on your keyboard to create a new cell, press Enter to enter the new cell, and let's create that graph. But before we create the graph, we need to do some sort of data aggregation, right? So we are turning data into information, but our data are quite raw, you see that.
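The loading cell might look roughly like this. Since the actual CSV from the previous notebook isn't shown here, this sketch builds a tiny stand-in DataFrame, with column names assumed from the video (Course Name, Timestamp, Rating):

```python
import pandas as pd

# Tiny stand-in for the reviews data loaded in the previous notebook;
# in the real notebook you would read the CSV instead, e.g.
# data = pd.read_csv("reviews.csv", parse_dates=["Timestamp"])
data = pd.DataFrame({
    "Course Name": ["Course A", "Course A", "Course B"],
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# head() prints the first rows, a quick sanity check on the raw data
print(data.head())
```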
For one day, we have multiple ratings left by different students. For example, on the 2nd of April we have this rating, we have that one, and that one, and so on. We want to aggregate those numbers into one average rating. How do we do that? We can do that by using the pandas groupby method, which produces a new DataFrame, but with aggregated data, so some new data which are averages of the raw DataFrame. Let me show you how groupby works. I said that groupby produces a new DataFrame, therefore I'm going to create a new variable where the new DataFrame is going to be stored. So day_average is my new variable, and that will be equal to data, so that DataFrame, dot groupby. I really like this method, it's very intuitive. So you're grouping these data by, what do we want to group by? Well, by Timestamp. Let's see what that gives us. So I'm going to print out day_average in here, a head of day_average, and execute. And let's see what we got.
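A quick way to see why grouping by the raw Timestamp achieves nothing: every timestamp is unique, so each group holds exactly one row (made-up timestamps for illustration):

```python
import pandas as pd

# Made-up ratings: two on the same day, one the day after
data = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# Each timestamp is unique, so groupby finds one group per row
# and averaging per group changes nothing
groups = data.groupby("Timestamp")
print(groups.ngroups)  # same as the number of rows
```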
We got basically the same DataFrame. That is because groupby was not able to group the data by Timestamp. The way groupby works is that it tries to find identical values in the given column, and in this case Timestamp doesn't have any identical values, because each value is unique. You can see it's the same day, but this review was left at this time, the other review was left at that time, and so on, so each value is different. Therefore, before we apply groupby, we need to do some processing here. We need to add a new column to the data DataFrame. I'm going to name this column Day, with a capital D; I'm just trying to be consistent with the names of the columns. Since they start with a capital letter, I'm going to create this new column with a capital letter too. So data Day equals data Timestamp dot dt. dt is a property which gives us access to a number of datetime attributes, such as date; you can also get month, and so on. For now we need the date.
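That step can be sketched like this, with the same kind of made-up timestamps as before:

```python
import pandas as pd

data = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# The .dt accessor exposes datetime attributes; .dt.date keeps only the
# calendar date, so rows from the same day now share an identical value
data["Day"] = data["Timestamp"].dt.date
print(data["Day"].nunique())  # 2 distinct days instead of 3 distinct timestamps
```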
Let me comment this out. So I'm going to select the lines and press Ctrl+/ or Command+/ to comment them out, and I'm going to show you the new DataFrame. So that is the data DataFrame. What I just did is that I extracted the date only from this Timestamp. Therefore, for each timestamp I got the date: for this timestamp it's the 2nd of April, then the 2nd of April again, and so on. This way we got some identical values. If you wanted month, you'd get the number of the month, so 4, 4, 4, and so on, but we need date, so I'm going to keep date there. Now we can uncomment these lines, and let's delete the data.head because we don't need it anymore. And now we can try this groupby method again. But be careful: this time we need Day here, because, as we said, we want to group by Day. Let's execute and see what we get. Maybe it's not the result we were waiting for. You see, the data are not yet aggregated, because we need to give another command here: we need to tell pandas the method of aggregation.
So, do you want to aggregate based on the mean or the count? In this case it's the mean, so we say .mean(), the method. Execute, and this time, this is what we got. So that's just the head, but if you print out the entire DataFrame, you see that it goes on like that up to this date. For each row, we have one day: the 1st of January, the 2nd of January, the 3rd of January, and so on. So that is the average rating of all the courses for that day. Now, you need to understand this product. As I told you, it is a DataFrame type, a pandas DataFrame, but it has one column. So Day is not a column; Day is actually the index. You can see that if you say day_average.columns: Rating is the only column. And if you want to access Rating, you do it like that, and you get this Series. Now, what is this? This is the index. Therefore, if you want to access the Day column, you don't do it like that, because that is the syntax to access columns.
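Putting the aggregation together, still on made-up data:

```python
import datetime as dt
import pandas as pd

data = pd.DataFrame({
    "Day": [dt.date(2018, 4, 2), dt.date(2018, 4, 2), dt.date(2018, 4, 3)],
    "Rating": [4.0, 5.0, 3.5],
})

# Rows that share a Day value are collected into one group, and
# .mean() averages each group's Rating; Day becomes the index
day_average = data.groupby("Day").mean()
print(day_average)
```

Here the two ratings on the 2nd of April (4.0 and 5.0) collapse into a single 4.5 average, which is exactly the kind of row the daily plot needs.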
When you want to access that special column, the index column, you say .index. And then we get this object, which is of type Index, but you can easily convert it into a list, for example if you'd like to plot it. So just like columns, indexes such as this one are also list-like types, so arrays of data. Let's now do the plotting. To do the plotting, we are going to need the Matplotlib library, so I'm going to import it here. So import matplotlib.pyplot; we need that module of the library. And a good practice is to import it as plt. You'll see on the web that everyone uses plt, so if you want to be consistent with other programmers, you want to import it as plt. That also makes your job easier, because you don't have to type the full name down; you can just say plt, as we will do here. So plt.plot is the method, and this method basically gets two arguments, the x and the y. So we are building a graph with an x and a y axis.
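The columns-versus-index distinction can be checked directly; here day_average is rebuilt by hand so the sketch stays self-contained:

```python
import datetime as dt
import pandas as pd

# Hand-built equivalent of the groupby result: Rating is the only
# column, while Day lives in the index
day_average = pd.DataFrame(
    {"Rating": [4.5, 3.5]},
    index=pd.Index([dt.date(2018, 4, 2), dt.date(2018, 4, 3)], name="Day"),
)

print(list(day_average.columns))  # only Rating, so Day is not a column
print(list(day_average.index))    # the days, easily converted to a plain list
```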
Along the x axis we are going to have the days, which means we want day_average.index, so that array there. And along y we want the day_average Rating column. Execute. I got an error, plt is not defined, because I forgot to execute this cell, so that the import takes effect. Now I can execute this again, and this is the product. So along the x axis we have dates, which I know are a bit hard to read, but we are going to fix that. And along the y axis we have the Rating column, and you'll see that it starts from 3.8, somewhere here, and goes up to 5. Now, Matplotlib picks that range automatically by looking at the data. So our Rating column, if you take a look at it, you see day_average Rating: if you extract the max, the maximum value, you'll see that it is 5.0, and if you check the minimum, you see that it's around 3.8. So one day the students left a 5.0 average rating, all of them, and another day they left 3.8.
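The plotting call can be sketched as follows; the data are made up, and the Agg backend line is only there so the snippet also runs without a display (inside Jupyter you don't need it):

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the groupby result: one average rating per day
day_average = pd.DataFrame(
    {"Rating": [4.5, 3.5, 5.0]},
    index=pd.Index(
        [dt.date(2018, 4, 2), dt.date(2018, 4, 3), dt.date(2018, 4, 4)],
        name="Day",
    ),
)

# x = the days (the index), y = the average ratings (the Rating column)
lines = plt.plot(day_average.index, day_average["Rating"])
```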
So Matplotlib is putting those two values as the limits of the y axis, instead of starting the axis from 0 and going up to 5, which would make the plot less readable. So that's a good thing about Matplotlib. The bad thing is, as you can see, this graph is not interactive. It is just an image; you cannot have popup capabilities where you would see some values when you hover your mouse somewhere. Matplotlib cannot do that. However, we can improve this a little bit by declaring a figure object and giving it a figsize argument of, let's say, 25 by 3: that is the width, and that is the height of the plot. So now you can see that we have a longer x axis and a shorter y axis. Now, if you don't agree with this graph, if you think that it is still not readable, that it doesn't tell you much about the trend, whether the rating has been increasing with time or not, then what we could do is downsample the data.
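The figure call, as a minimal sketch (again with the headless backend only so it runs outside Jupyter):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# figsize is (width, height) in inches: a wide, short canvas
# stretches the x axis and compresses the y axis
fig = plt.figure(figsize=(25, 3))
print(fig.get_size_inches())
```

Declaring the figure before calling plt.plot makes the subsequent plot land on this wide canvas.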
So instead of extracting daily averages, we could extract weekly averages, and therefore we would have fewer points along the x axis and a smoother line. In my opinion, these are too many data points; it's not very useful. Let's downsample them and make a better graph in the next video. So this is what you learned in this video: you learned how to group data, and you learned how to plot them. Now let me make a small revision of the groupby method, in case you are still confused. Let me do that in another cell here, and let me print out the head of the aggregated DataFrame. So you see we have Rating, and we have the Day index here. What happened to the Course Name column, what happened to the Comments column, what happened to the Timestamp column? Well, they disappeared, because this mean method only works with columns that have numeric values, such as Rating; it cannot calculate an average on the Comments, Course Name, or Timestamp columns.
Therefore, columns such as Comments, Timestamp, and Course Name will be automatically dropped by this method, and only the numeric columns will be kept. Similarly, you could do a count instead of the mean; in that case, you'd get a different DataFrame. So you see that we have a count of Course Name values, which means that 46 rows exist for that date, in other words 46 reviews, or 46 ratings, whatever you like to call them. We have 7 here for Comments, because pandas doesn't take into account NaN values, such as this one or that one, and so on. Therefore, we could make a count plot showing how many ratings we had each day. And that is what I wanted to teach you in this video. Thanks a lot. I'll talk to you later.
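The count variant can be sketched like this, with made-up data that includes missing comments. One caveat worth knowing: in recent pandas versions (2.0 and later), .mean() on a groupby no longer silently drops non-numeric columns but raises an error, so there you would write .mean(numeric_only=True) or select the Rating column first; .count() is unaffected.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Made-up reviews: three on one day, one on the next, some without comments
data = pd.DataFrame({
    "Day": [dt.date(2018, 4, 2)] * 3 + [dt.date(2018, 4, 3)],
    "Course Name": ["Course A", "Course B", "Course A", "Course B"],
    "Comment": ["great", np.nan, np.nan, "ok"],
    "Rating": [5.0, 4.0, 4.5, 3.5],
})

# count() tallies non-missing values per column within each group,
# so Comment comes out lower wherever a review had no comment
day_count = data.groupby("Day").count()
print(day_count)
```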