Hi, in this video I'm going to show you how to create a plot of daily average ratings. So our graph will show the average rating for each day. To do that, I am not going to work in the previous Jupyter file, because I like to keep things separate. So for the visualization part, I'm going to create a new notebook, and I would suggest you do the same. First of all, we need to load the DataFrame, so I'm just going to copy that cell from the previous notebook, paste it there, and quickly print the head of the data just to double-check everything is all right. Yes, so this is the data, the raw data. Then press Escape and press B on your keyboard to create a new cell, press Enter to enter the new cell, and let's create that graph. But before we create the graph, we need to do some sort of data aggregation, right? So we are turning data into information, but our data are quite raw, you see that.
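The loading cell might look roughly like this. Since the actual CSV from the previous notebook isn't shown here, this sketch builds a tiny stand-in DataFrame, with column names assumed from the video (Course Name, Timestamp, Rating):

```python
import pandas as pd

# Tiny stand-in for the reviews data loaded in the previous notebook;
# in the real notebook you would read the CSV instead, e.g.
# data = pd.read_csv("reviews.csv", parse_dates=["Timestamp"])
data = pd.DataFrame({
    "Course Name": ["Course A", "Course A", "Course B"],
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# head() prints the first rows, a quick sanity check on the raw data
print(data.head())
```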
For one day, we have multiple ratings left by different students. For example, on the 2nd of April we have this rating, we have that one, and that one, and so on. We want to aggregate those numbers into one average rating. How do we do that? We can do that by using the pandas groupby method, which produces a new DataFrame, but with aggregated data, so some new data which are averages of the raw DataFrame. Let me show you how groupby works. I said that groupby produces a new DataFrame, therefore I'm going to create a new variable where the new DataFrame is going to be stored. So day_average is my new variable, and that will be equal to data, so that DataFrame, dot groupby. I really like this method, it's very intuitive. So you're grouping these data by, what do we want to group by? Well, by Timestamp. Let's see what that gives us. So I'm going to print out day_average in here, a head of day_average, and execute. And let's see what we got.
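A quick way to see why grouping by the raw Timestamp achieves nothing: every timestamp is unique, so each group holds exactly one row (made-up timestamps for illustration):

```python
import pandas as pd

# Made-up ratings: two on the same day, one the day after
data = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# Each timestamp is unique, so groupby finds one group per row
# and averaging per group changes nothing
groups = data.groupby("Timestamp")
print(groups.ngroups)  # same as the number of rows
```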
We got basically the same DataFrame. That is because groupby was not able to group the data by Timestamp. The way groupby works is that it tries to find identical values in the given column, and in this case Timestamp doesn't have any identical values, because each value is unique. You can see it's the same day, but this review was left at this time, the other review was left at that time, and so on, so each value is different. Therefore, before we apply groupby, we need to do some processing here. We need to add a new column to the data DataFrame. I'm going to name this column Day, with a capital D; I'm just trying to be consistent with the names of the columns. Since they start with a capital letter, I'm going to create this new column with a capital letter too. So data Day equals data Timestamp dot dt. dt is a property which gives us access to a number of datetime attributes, such as date; you can also get month, and so on. For now we need the date.
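That step can be sketched like this, with the same kind of made-up timestamps as before:

```python
import pandas as pd

data = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-04-02 09:15:00",
        "2018-04-02 17:40:00",
        "2018-04-03 08:05:00",
    ]),
    "Rating": [4.5, 5.0, 3.5],
})

# The .dt accessor exposes datetime attributes; .dt.date keeps only the
# calendar date, so rows from the same day now share an identical value
data["Day"] = data["Timestamp"].dt.date
print(data["Day"].nunique())  # 2 distinct days instead of 3 distinct timestamps
```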
Let me comment this out. So I'm going to select the lines and press Ctrl+/ or Command+/ to comment them out, and I'm going to show you the new DataFrame. So that is the data DataFrame. What I just did is that I extracted the date only from this Timestamp. Therefore, for each timestamp I got the date: for this timestamp it's the 2nd of April, then the 2nd of April again, and so on. This way we got some identical values. If you wanted month, you'd get the number of the month, so 4, 4, 4, and so on, but we need date, so I'm going to keep date there. Now we can uncomment these lines, and let's delete the data.head because we don't need it anymore. And now we can try this groupby method again. But be careful: this time we need Day here, because, as we said, we want to group by Day. Let's execute and see what we get. Maybe it's not the result we were waiting for. You see, the data are not yet aggregated, because we need to give another command here: we need to tell pandas the method of aggregation.
So, do you want to aggregate based on the mean or the count? In this case it's the mean, so we say .mean(), the method. Execute, and this time, this is what we got. So that's just the head, but if you print out the entire DataFrame, you see that it goes on like that up to this date. For each row, we have one day: the 1st of January, the 2nd of January, the 3rd of January, and so on. So that is the average rating of all the courses for that day. Now, you need to understand this product. As I told you, it is a DataFrame type, a pandas DataFrame, but it has one column. So Day is not a column; Day is actually the index. You can see that if you say day_average.columns: Rating is the only column. And if you want to access Rating, you do it like that, and you get this Series. Now, what is this? This is the index. Therefore, if you want to access the Day column, you don't do it like that, because that is the syntax to access columns.
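Putting the aggregation together, still on made-up data:

```python
import datetime as dt
import pandas as pd

data = pd.DataFrame({
    "Day": [dt.date(2018, 4, 2), dt.date(2018, 4, 2), dt.date(2018, 4, 3)],
    "Rating": [4.0, 5.0, 3.5],
})

# Rows that share a Day value are collected into one group, and
# .mean() averages each group's Rating; Day becomes the index
day_average = data.groupby("Day").mean()
print(day_average)
```

Here the two ratings on the 2nd of April (4.0 and 5.0) collapse into a single 4.5 average, which is exactly the kind of row the daily plot needs.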
When you want to access that special column, the index column, you say .index. And then we get this object, which is of type Index, but you can easily convert it into a list, for example if you'd like to plot it. So just like columns, indexes such as this one are also list-like types, so arrays of data. Let's now do the plotting. To do the plotting, we are going to need the Matplotlib library, so I'm going to import it here. So import matplotlib.pyplot; we need that module of the library. And a good practice is to import it as plt. You'll see on the web that everyone uses plt, so if you want to be consistent with other programmers, you want to import it as plt. That also makes your job easier, because you don't have to type the full name down; you can just say plt, as we will do here. So plt.plot is the method, and this method basically gets two arguments, the x and the y. So we are building a graph with an x and a y axis.
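The columns-versus-index distinction can be checked directly; here day_average is rebuilt by hand so the sketch stays self-contained:

```python
import datetime as dt
import pandas as pd

# Hand-built equivalent of the groupby result: Rating is the only
# column, while Day lives in the index
day_average = pd.DataFrame(
    {"Rating": [4.5, 3.5]},
    index=pd.Index([dt.date(2018, 4, 2), dt.date(2018, 4, 3)], name="Day"),
)

print(list(day_average.columns))  # only Rating, so Day is not a column
print(list(day_average.index))    # the days, easily converted to a plain list
```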
Along the x axis we are going to have the days, which means we want day_average.index, so that array there. And along y we want the day_average Rating column. Execute. I got an error, plt is not defined, because I forgot to execute this cell, so that the import takes effect. Now I can execute this again, and this is the product. So along the x axis we have dates, which I know are a bit hard to read, but we are going to fix that. And along the y axis we have the Rating column, and you'll see that it starts from 3.8, somewhere here, and goes up to 5. Now, Matplotlib picks that range automatically by looking at the data. So our Rating column, if you take a look at it, you see day_average Rating: if you extract the max, the maximum value, you'll see that it is 5.0, and if you check the minimum, you see that it's around 3.8. So one day the students left a 5.0 average rating, all of them, and another day they left 3.8.
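The plotting call can be sketched as follows; the data are made up, and the Agg backend line is only there so the snippet also runs without a display (inside Jupyter you don't need it):

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the groupby result: one average rating per day
day_average = pd.DataFrame(
    {"Rating": [4.5, 3.5, 5.0]},
    index=pd.Index(
        [dt.date(2018, 4, 2), dt.date(2018, 4, 3), dt.date(2018, 4, 4)],
        name="Day",
    ),
)

# x = the days (the index), y = the average ratings (the Rating column)
lines = plt.plot(day_average.index, day_average["Rating"])
```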
So Matplotlib is putting those two values as the limits of the y axis, instead of starting the axis from 0 and going up to 5, which would make the plot less readable. So that's a good thing about Matplotlib. The bad thing is, as you can see, this graph is not interactive. It is just an image; you cannot have popup capabilities where you would see some values when you hover your mouse somewhere. Matplotlib cannot do that. However, we can improve this a little bit by declaring a figure object and giving it a figsize argument of, let's say, 25 by 3: that is the width, and that is the height of the plot. So now you can see that we have a longer x axis and a shorter y axis. Now, if you don't agree with this graph, if you think that it is still not readable, that it doesn't tell you much about the trend, whether the rating has been increasing with time or not, then what we could do is downsample the data.
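The figure call, as a minimal sketch (again with the headless backend only so it runs outside Jupyter):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# figsize is (width, height) in inches: a wide, short canvas
# stretches the x axis and compresses the y axis
fig = plt.figure(figsize=(25, 3))
print(fig.get_size_inches())
```

Declaring the figure before calling plt.plot makes the subsequent plot land on this wide canvas.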
So instead of extracting daily averages, we could extract weekly averages, and therefore we would have fewer points along the x axis and a smoother line. In my opinion, these are too many data points; it's not very useful. Let's downsample them and make a better graph in the next video. So this is what you learned in this video: you learned how to group data, and you learned how to plot them. Now let me make a small revision of the groupby method, in case you are still confused. Let me do that in another cell here, and let me print out the head of the aggregated DataFrame. So you see we have Rating, and we have the Day index here. What happened to the Course Name column, what happened to the Comments column, what happened to the Timestamp column? Well, they disappeared, because this mean method only works with columns that have numeric values, such as Rating; it cannot calculate an average on the Comments, Course Name, or Timestamp columns.
Therefore, columns such as Comments, Timestamp, and Course Name will be automatically dropped by this method, and only the numeric columns will be kept. Similarly, you could do a count instead of the mean; in that case, you'd get a different DataFrame. So you see that we have a count of Course Name values, which means that 46 rows exist for that date, in other words 46 reviews, or 46 ratings, whatever you like to call them. We have 7 here for Comments, because pandas doesn't take into account NaN values, such as this one or that one, and so on. Therefore, we could make a count plot showing how many ratings we had each day. And that is what I wanted to teach you in this video. Thanks a lot. I'll talk to you later.
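The count variant can be sketched like this, with made-up data that includes missing comments. One caveat worth knowing: in recent pandas versions (2.0 and later), .mean() on a groupby no longer silently drops non-numeric columns but raises an error, so there you would write .mean(numeric_only=True) or select the Rating column first; .count() is unaffected.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Made-up reviews: three on one day, one on the next, some without comments
data = pd.DataFrame({
    "Day": [dt.date(2018, 4, 2)] * 3 + [dt.date(2018, 4, 3)],
    "Course Name": ["Course A", "Course B", "Course A", "Course B"],
    "Comment": ["great", np.nan, np.nan, "ok"],
    "Rating": [5.0, 4.0, 4.5, 3.5],
})

# count() tallies non-missing values per column within each group,
# so Comment comes out lower wherever a review had no comment
day_count = data.groupby("Day").count()
print(day_count)
```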