1 00:00:00,000 --> 00:00:02,698 Hi, welcome back. When you're doing data 2 00:00:02,700 --> 00:00:05,190 analysis, the first thing you want to do 3 00:00:05,190 --> 00:00:07,890 is familiarize yourself with 4 00:00:07,890 --> 00:00:10,770 your data using the data 5 00:00:10,770 --> 00:00:13,170 analysis tool that you have chosen to 6 00:00:13,170 --> 00:00:15,690 use for your particular project. In this 7 00:00:15,690 --> 00:00:18,840 case, the tool is Python. So what we're 8 00:00:18,840 --> 00:00:20,640 going to do in this video is 9 00:00:20,640 --> 00:00:24,030 load the data into Python using 10 00:00:24,052 --> 00:00:26,632 a Jupyter Notebook, and then look 11 00:00:26,670 --> 00:00:29,250 at those data and extract some very 12 00:00:29,250 --> 00:00:32,610 basic information about the data, such 13 00:00:32,635 --> 00:00:35,065 as what column names we have, 14 00:00:35,370 --> 00:00:38,310 how many rows we have, and other simple 15 00:00:38,430 --> 00:00:41,700 attributes of our data. So let's do 16 00:00:41,700 --> 00:00:43,950 that. I'd like to let you know that I 17 00:00:43,950 --> 00:00:46,390 created an empty folder named 18 00:00:46,391 --> 00:00:50,580 review_analysis, and I put the 19 00:00:50,610 --> 00:00:54,090 reviews.csv file in that folder. 20 00:00:54,870 --> 00:00:57,360 You can find this reviews.csv 21 00:00:57,360 --> 00:00:59,940 file attached to this lecture as a 22 00:00:59,940 --> 00:01:01,860 lecture resource, so please download 23 00:01:01,890 --> 00:01:04,500 that and place it in a folder just like 24 00:01:04,500 --> 00:01:07,440 I did. And then you should be able to 25 00:01:07,440 --> 00:01:10,770 locate that folder from the Jupyter 26 00:01:10,927 --> 00:01:16,036 homepage, which is localhost:8888/tree. 27 00:01:16,319 --> 00:01:18,900 I created that folder directly 28 00:01:19,110 --> 00:01:21,900 in my users folder. 
So this is the 29 00:01:21,900 --> 00:01:24,311 folder. I can click that and here is the 30 00:01:24,313 --> 00:01:27,487 reviews.csv file. In your case, 31 00:01:27,488 --> 00:01:29,945 you can create this wherever you want, and then 32 00:01:29,946 --> 00:01:32,472 locate it using this directory tree here. 33 00:01:32,473 --> 00:01:34,927 [No audio] 34 00:01:34,929 --> 00:01:39,257 While I am in this folder in Jupyter, I can go to 35 00:01:39,258 --> 00:01:44,130 the New dropdown list and choose Python 3 to 36 00:01:44,130 --> 00:01:47,520 create a new Jupyter Notebook. So the 37 00:01:47,520 --> 00:01:49,890 Jupyter Notebook has been created. I can 38 00:01:49,890 --> 00:01:54,000 rename this to something else, that is, change the name 39 00:01:54,030 --> 00:01:56,310 of the Jupyter Notebook. Let's keep 40 00:01:56,310 --> 00:01:59,940 it simple and say reviews, then Rename, and 41 00:01:59,940 --> 00:02:05,070 that will get a .ipynb extension. 42 00:02:05,640 --> 00:02:08,100 Let me make some more room here so that 43 00:02:08,100 --> 00:02:11,220 you see more code. I'm going to toggle 44 00:02:11,220 --> 00:02:13,560 the header off, and now let's start 45 00:02:13,560 --> 00:02:16,710 coding. The very first thing we want to 46 00:02:16,710 --> 00:02:20,340 do, of course, is import pandas, the 47 00:02:20,340 --> 00:02:22,650 library that is used to perform data 48 00:02:22,650 --> 00:02:25,530 analysis with Python, and then we want 49 00:02:25,530 --> 00:02:27,660 to create a variable which is going to 50 00:02:27,660 --> 00:02:30,179 hold the data. That is equal to 51 00:02:30,181 --> 00:02:33,330 pandas.read, and since we are working with a 52 00:02:33,355 --> 00:02:36,595 CSV file, the function we want to use 53 00:02:37,170 --> 00:02:41,430 out of the pandas library is read_csv. 54 00:02:42,150 --> 00:02:44,310 In parentheses, we want to pass in, in 55 00:02:44,310 --> 00:02:47,130 single quotes or double quotes, whatever 56 00:02:47,130 --> 00:02:51,330 you like, the path to the CSV file. 
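The loading step just described might look like the sketch below. So that it runs anywhere, it first writes a tiny stand-in reviews.csv into a temporary folder. Only the Rating column name comes from the lecture; the other three column names and every value are made up for illustration and are not taken from the real file.

```python
import os
import tempfile

import pandas

# Assumption: a tiny stand-in for reviews.csv. Only "Rating" is named in the
# lecture; the other column names and all values here are hypothetical.
sample_csv = (
    "Course Name,Timestamp,Rating,Comment\n"
    "Course A,2021-04-02 06:25:52,4.0,\n"
    "Course B,2021-04-02 05:12:34,5.0,Great\n"
    "Course A,2021-04-02 05:11:03,4.5,\n"
)

# Write the stand-in file into a temporary folder.
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "reviews.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write(sample_csv)

# Import pandas and load the CSV into a variable that holds the data.
data = pandas.read_csv(csv_path)
print(data)
```

In the notebook itself, with reviews.csv sitting next to the notebook, the path would simply be the file name, as in pandas.read_csv("reviews.csv").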
Now 57 00:02:51,330 --> 00:02:54,720 if you don't type in the name correctly, 58 00:02:54,720 --> 00:02:58,410 you're going to get an error. So let me 59 00:02:58,410 --> 00:03:02,760 try to execute this cell using Control 60 00:03:02,760 --> 00:03:04,920 Enter if you are on Windows, or Command 61 00:03:04,957 --> 00:03:09,127 Enter if you are on Mac, and as I warned you, 62 00:03:09,152 --> 00:03:11,850 I got an error. It says No such file 63 00:03:11,850 --> 00:03:14,880 or directory, because I mistyped the 64 00:03:14,880 --> 00:03:18,840 file name. So, reviews. This time if I 65 00:03:18,840 --> 00:03:20,970 execute, I don't get an error, which 66 00:03:20,970 --> 00:03:25,110 means the data was loaded successfully. And 67 00:03:25,110 --> 00:03:31,870 I can press Escape B Enter and call the data variable 68 00:03:32,213 --> 00:03:35,653 and Control Enter again to see the DataFrame. 69 00:03:35,654 --> 00:03:39,412 [No audio] 70 00:03:39,414 --> 00:03:44,079 So we can see that we have 1, 2, 3, 4 columns, and 71 00:03:44,080 --> 00:03:47,940 we also have this index column here added by pandas 72 00:03:47,940 --> 00:03:51,150 automatically. Basically, it's a range 73 00:03:51,180 --> 00:03:53,640 of numbers starting from 0. So that 74 00:03:53,640 --> 00:03:56,250 is the first row of our data, this 75 00:03:56,250 --> 00:04:00,180 one in here that has this index of 0. 76 00:04:01,800 --> 00:04:04,950 And it ends at the last row. You can see 77 00:04:04,950 --> 00:04:07,050 that this is the first row, second, 78 00:04:07,110 --> 00:04:11,280 third, fourth, fifth row, and then Jupyter is 79 00:04:11,280 --> 00:04:14,250 not displaying the rows after the fifth 80 00:04:14,310 --> 00:04:17,430 row, because there are a lot of rows 81 00:04:17,430 --> 00:04:21,270 and it'd be impractical to see them all 82 00:04:21,270 --> 00:04:23,730 here. However, you get to see the last 83 00:04:23,946 --> 00:04:29,406 five rows of the DataFrame. 
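The "No such file or directory" error from a mistyped file name can be reproduced, and caught, with a sketch like the one below. The misspelled file name is hypothetical and chosen only to trigger the error; catching the exception is an illustration and not something done in the lecture.

```python
import pandas

# Hypothetical mistyped file name, used only to trigger the error.
wrong_path = "reveiws_does_not_exist.csv"

try:
    data = pandas.read_csv(wrong_path)
    error_message = None
except FileNotFoundError as err:
    # pandas passes along Python's built-in error for a missing file.
    error_message = str(err)
    print("Could not load the file:", error_message)
```

In a notebook you would normally just read the traceback, fix the file name, and run the cell again.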
And that 84 00:04:29,538 --> 00:04:31,950 gives you an overview of the DataFrame. 85 00:04:32,490 --> 00:04:36,500 However, what I like to do instead is just print out 86 00:04:36,768 --> 00:04:39,624 the head of the DataFrame, which is the first 87 00:04:39,649 --> 00:04:41,681 [No audio] 88 00:04:41,683 --> 00:04:46,725 rows only. So the first five rows, that gives you 89 00:04:46,726 --> 00:04:49,530 a more compact view. It gives you an idea of 90 00:04:49,530 --> 00:04:51,750 what columns you have and what kind of 91 00:04:51,990 --> 00:04:54,780 rows you have as well. So it's good to have 92 00:04:54,780 --> 00:04:56,910 the head of the DataFrame displayed in here. 93 00:04:57,180 --> 00:05:00,810 Then we can press Escape B to create a 94 00:05:00,810 --> 00:05:04,322 new cell here, and Enter to write some other code. 95 00:05:05,541 --> 00:05:08,586 You can get the shape of the DataFrame 96 00:05:09,258 --> 00:05:11,460 by accessing the shape property. 97 00:05:11,610 --> 00:05:16,290 So head is a method, with parentheses, 98 00:05:16,980 --> 00:05:19,380 while shape is a property, so it doesn't need 99 00:05:19,380 --> 00:05:21,660 parentheses. And then you get the 100 00:05:21,660 --> 00:05:23,991 shape of the DataFrame, which is basically the 101 00:05:23,992 --> 00:05:27,661 number of rows and the number of columns. 102 00:05:27,662 --> 00:05:29,842 [No audio] 103 00:05:29,844 --> 00:05:32,881 You see 1, 2, 3, 4 columns. 104 00:05:32,906 --> 00:05:34,916 [No audio] 105 00:05:34,940 --> 00:05:41,026 You might also want to, Escape B Enter, display 106 00:05:41,050 --> 00:05:43,050 [No audio] 107 00:05:43,053 --> 00:05:45,810 the columns that your DataFrame has. 108 00:05:46,200 --> 00:05:48,990 Even though we have them there, this is 109 00:05:48,990 --> 00:05:51,900 yet another way to see the names of the 110 00:05:51,900 --> 00:05:54,120 columns, by accessing the columns 111 00:05:54,150 --> 00:05:56,940 property. 
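The three quick checks covered here (head, shape, and columns) can be sketched as follows. The data is a small made-up stand-in built in memory so the example runs on its own; apart from Rating, the column names and all values are assumptions, not the contents of the real reviews.csv.

```python
import io

import pandas

# Assumption: hypothetical stand-in data. Only the "Rating" column name
# comes from the lecture; everything else is invented for illustration.
rows = ["Course Name,Timestamp,Rating,Comment"]
for i in range(12):
    rating = 1 + (i % 9) * 0.5  # cycles through 1.0, 1.5, ..., 5.0
    rows.append(f"Course {i % 3},2021-01-{i + 1:02d},{rating},")
data = pandas.read_csv(io.StringIO("\n".join(rows)))

first_rows = data.head()           # method: needs parentheses, first five rows
n_rows, n_cols = data.shape        # property: no parentheses, (rows, columns)
column_names = list(data.columns)  # property: the column labels

print(first_rows)
print(n_rows, n_cols)
print(column_names)
```

Typing data.shape() with parentheses would raise a TypeError, because shape is a plain tuple and not something you can call.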
Then next, usually, when you 112 00:05:56,940 --> 00:05:59,430 are working with data, you have some 113 00:05:59,430 --> 00:06:01,830 specific columns that you are interested 114 00:06:01,830 --> 00:06:04,680 in, which could be one column or 115 00:06:04,680 --> 00:06:06,750 more. In this case, we might be 116 00:06:06,750 --> 00:06:09,750 interested in seeing an overview of the 117 00:06:09,750 --> 00:06:13,050 Rating column to see what the minimum 118 00:06:13,200 --> 00:06:15,840 rating is and what the maximum rating 119 00:06:15,840 --> 00:06:18,420 is, and the distribution of those 120 00:06:18,420 --> 00:06:21,930 ratings, and have them displayed as a graph 121 00:06:21,930 --> 00:06:24,570 here, so that we can have a 122 00:06:24,570 --> 00:06:27,480 better understanding of our data. So I'm 123 00:06:27,480 --> 00:06:30,451 going to do Escape B, Enter and 124 00:06:30,453 --> 00:06:34,140 data.hist. So this will be a histogram, 125 00:06:34,830 --> 00:06:38,460 with parentheses. We are interested in 126 00:06:38,460 --> 00:06:41,619 seeing the distribution of the ratings. 127 00:06:42,382 --> 00:06:45,992 Therefore I enter the Rating column here, as a string. 128 00:06:45,993 --> 00:06:48,337 [No audio] 129 00:06:48,339 --> 00:06:53,413 Execute, and we get this graph. Let me explain 130 00:06:53,415 --> 00:06:56,130 what that graph means. What this means is 131 00:06:56,130 --> 00:06:57,870 that, for example, let's start from the 132 00:06:57,870 --> 00:07:01,980 right. This bar here, this first bar, 133 00:07:03,141 --> 00:07:08,782 means that we have around 24,000 134 00:07:09,746 --> 00:07:14,490 5 star ratings. You see, for example, 135 00:07:14,490 --> 00:07:18,000 we have this, this is a 5.0 star rating. 136 00:07:18,180 --> 00:07:22,980 So we have around 24,000 5 star ratings in 137 00:07:22,980 --> 00:07:26,455 the whole DataFrame. And in total, we have 138 00:07:26,742 --> 00:07:33,156 45,000 ratings. 
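The numbers read off the histogram can also be obtained as plain values. The sketch below uses value_counts, which is a swapped-in technique and not what the lecture runs, on a handful of made-up ratings. The histogram call from the lecture appears only as a comment because drawing it needs matplotlib installed.

```python
import pandas

# Assumption: a small made-up set of ratings standing in for the real column.
data = pandas.DataFrame({"Rating": [5.0, 5.0, 4.5, 4.0, 5.0, 3.0, 4.5, 1.0]})

# In the notebook, data.hist("Rating") draws the bars (requires matplotlib).
# Here the number of reviews per rating value is counted directly instead.
counts = data["Rating"].value_counts().sort_index()
print(counts)

lowest = data["Rating"].min()   # minimum rating
highest = data["Rating"].max()  # maximum rating
total = len(data)               # total number of ratings
print(lowest, highest, total)
```

Each entry in counts roughly corresponds to the height of one bar in the histogram, so on the real file the entry for 5.0 would be the figure of around 24,000 mentioned above.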
Then the next bar here, 139 00:07:33,157 --> 00:07:35,282 [No audio] 140 00:07:35,284 --> 00:07:41,930 these are 4.5 star ratings like that one in there. 141 00:07:43,871 --> 00:07:47,970 And we have around, let's say, 7,000 of 142 00:07:47,970 --> 00:07:51,510 those. Then we have 4 star ratings, 143 00:07:52,470 --> 00:07:58,890 around 9,000 perhaps. 3.5 star ratings are 144 00:07:58,920 --> 00:08:03,420 in here. These are 3 star ratings. 145 00:08:03,840 --> 00:08:08,370 It's about 2,000. Then we have 2.5 star 146 00:08:08,370 --> 00:08:11,700 ratings, this one here, 2 star 147 00:08:11,700 --> 00:08:17,610 ratings, 1.5 star ratings. That's the 148 00:08:17,610 --> 00:08:20,922 lowest count among all the ratings. 149 00:08:21,672 --> 00:08:24,690 So people don't leave a lot of 1.5 star 150 00:08:24,690 --> 00:08:27,690 ratings. And we also have these 1 star 151 00:08:27,750 --> 00:08:30,270 ratings in here. It's not the best 152 00:08:30,270 --> 00:08:32,670 graph. I don't like it personally, but 153 00:08:32,670 --> 00:08:34,620 it's a quick way to know 154 00:08:34,620 --> 00:08:37,710 how your data are distributed. You know 155 00:08:37,710 --> 00:08:40,740 that, okay, we have data from 1 to 5 156 00:08:40,740 --> 00:08:45,000 and 5 is the most frequent value in 157 00:08:45,000 --> 00:08:47,790 your data. And that's it for this 158 00:08:47,790 --> 00:08:50,280 lecture. We saw how you can get an overview 159 00:08:50,310 --> 00:08:53,280 of your DataFrame. In the next lecture, 160 00:08:53,280 --> 00:08:56,929 we are going to zoom in on our DataFrame so that 161 00:08:56,931 --> 00:09:00,914 we are able to select particular rows, or particular 162 00:09:00,915 --> 00:09:05,490 slices of our DataFrame to see individual values. 
163 00:09:05,700 --> 00:09:07,980 So in other words, we are going to use 164 00:09:07,980 --> 00:09:10,980 Python to navigate through our data and 165 00:09:10,980 --> 00:09:15,240 select particular sections of the data and 166 00:09:15,242 --> 00:09:17,931 display them. So I'll see you in the next video.