1 00:00:00,000 --> 00:00:02,700 Hi, welcome back. In this video, you are 2 00:00:02,700 --> 00:00:04,950 going to learn how to filter data from 3 00:00:04,950 --> 00:00:07,410 your DataFrame. By that I mean 4 00:00:07,620 --> 00:00:11,250 extracting particular groups of data 5 00:00:11,250 --> 00:00:13,650 based on different types of conditions. 6 00:00:13,770 --> 00:00:16,357 For example, let's say in our DataFrame, 7 00:00:16,751 --> 00:00:18,150 we have different courses, even 8 00:00:18,150 --> 00:00:19,620 though here you only see the Python 9 00:00:19,620 --> 00:00:21,810 Mega Course. Because we are only seeing the 10 00:00:21,810 --> 00:00:24,540 first five rows, we have other courses 11 00:00:24,540 --> 00:00:27,030 with ratings. So if you want it, for 12 00:00:27,030 --> 00:00:29,850 example, to know the average rating for 13 00:00:29,850 --> 00:00:33,300 a particular course only, then you'd 14 00:00:33,300 --> 00:00:36,360 have to filter the data by applying some 15 00:00:36,360 --> 00:00:39,539 conditions. So filtering allows you to do 16 00:00:39,540 --> 00:00:43,228 analysis on a particular group of your data. 17 00:00:45,087 --> 00:00:49,617 Let's do that. First, I'm going to assign a number 18 00:00:49,618 --> 00:00:54,570 to each of the main headings. So that was an overview, 19 00:00:54,750 --> 00:00:57,330 the second lecture was selecting. So 20 00:00:57,330 --> 00:01:00,832 that's number 2. This time we 21 00:01:00,833 --> 00:01:04,312 [No audio] 22 00:01:04,313 --> 00:01:10,230 are at number 3, filtering data based on 23 00:01:11,280 --> 00:01:14,790 conditions. Right, so let's apply one 24 00:01:14,790 --> 00:01:16,770 condition first. So I'm going to create 25 00:01:16,770 --> 00:01:20,760 a Markdown cell and say, One condition. 26 00:01:22,710 --> 00:01:25,740 Let's apply a condition which would give 27 00:01:25,770 --> 00:01:30,690 us the DataFrame where the ratings are 28 00:01:30,690 --> 00:01:36,030 greater than 4. To do that, you'd 29 00:01:36,030 --> 00:01:38,580 have to follow this syntax. Now first, 30 00:01:38,610 --> 00:01:41,910 make sure that you have your data 31 00:01:41,910 --> 00:01:45,450 variable in the namespace by executing 32 00:01:45,450 --> 00:01:48,750 this first cell. So Ctrl Enter to 33 00:01:48,750 --> 00:01:51,300 execute that cell. Now we are sure that 34 00:01:51,330 --> 00:01:54,480 data is in the namespace. So this 35 00:01:54,480 --> 00:01:57,720 is the syntax data square brackets, 36 00:01:57,750 --> 00:02:00,507 and inside here goes the condition. 37 00:02:01,956 --> 00:02:05,010 So we said that we want the DataFrame 38 00:02:05,010 --> 00:02:08,550 where the rating is greater than 4. 39 00:02:08,790 --> 00:02:11,820 Therefore, we say column Rating is 40 00:02:11,820 --> 00:02:15,030 greater, let's leave some space greater 41 00:02:15,030 --> 00:02:20,100 than 4. If we execute that, we get a 42 00:02:20,100 --> 00:02:23,040 DataFrame, and all these rows have a 43 00:02:23,040 --> 00:02:27,990 rating of 4.5 or more. So not 4, 44 00:02:27,990 --> 00:02:30,690 because it's greater, it's not greater 45 00:02:30,690 --> 00:02:33,720 or equal. If it was greater or equal, 46 00:02:33,840 --> 00:02:39,420 we'll also get the 4 ratings 4.0. But 47 00:02:39,420 --> 00:02:41,220 this is not the case here. So this is 48 00:02:41,220 --> 00:02:45,090 what we get. You see, for example, that 49 00:02:45,120 --> 00:02:48,060 rows with index 0, 1, 2, were not included 50 00:02:48,090 --> 00:02:51,476 in the DataFrame in the output DataFrame. 51 00:02:52,742 --> 00:02:54,900 And you can see the length of 52 00:02:54,900 --> 00:02:57,882 this using a len function at Python, 53 00:02:57,883 --> 00:03:04,440 len function. So we get 29,758 54 00:03:04,470 --> 00:03:07,800 rows that have a rating of greater than 55 00:03:07,800 --> 00:03:10,110 4. Another way, if you don't want to 56 00:03:10,110 --> 00:03:12,960 use a native Python function, another 57 00:03:12,960 --> 00:03:18,483 way is to use count. So that is the DataFrame 58 00:03:18,484 --> 00:03:21,270 object, and the DataFrame objects 59 00:03:21,270 --> 00:03:25,170 have a count method. And you get this 60 00:03:25,200 --> 00:03:28,110 output, which basically says that we 61 00:03:28,110 --> 00:03:30,630 have this amount of values, in Course 62 00:03:30,630 --> 00:03:32,790 Name this amount of values in Timestamp, 63 00:03:32,792 --> 00:03:34,632 these amounts in Rating and, 64 00:03:35,007 --> 00:03:38,880 and Comments. Because comments have this NaN 65 00:03:38,880 --> 00:03:41,910 values, that is basically a non existent 66 00:03:41,910 --> 00:03:45,150 value. It doesn't exist. Therefore, we 67 00:03:45,150 --> 00:03:50,610 have 4927 ratings with comments where 68 00:03:50,610 --> 00:03:53,460 people actually wrote something. So the 69 00:03:53,460 --> 00:03:56,580 rest of this minus that gives us the 70 00:03:56,580 --> 00:04:00,840 number of NaN value comments. Of 71 00:04:00,840 --> 00:04:03,390 course, you can apply different types of 72 00:04:03,660 --> 00:04:07,650 statistics to this now. For example, 73 00:04:07,941 --> 00:04:10,701 this is a DataFrame, right. How do we 74 00:04:10,950 --> 00:04:13,860 extract only the column? How do we select only 75 00:04:13,890 --> 00:04:17,580 the column Rating? Well, you could 76 00:04:18,270 --> 00:04:23,520 take a look at here, selecting a column. 77 00:04:24,150 --> 00:04:26,190 So this is how you select a column data 78 00:04:26,220 --> 00:04:30,180 Rating. Now, this is the DataFrame, 79 00:04:30,810 --> 00:04:33,660 therefore you apply Rating, and you 80 00:04:33,660 --> 00:04:37,109 get the ratings with a value of 4.5 or more. 81 00:04:37,581 --> 00:04:43,582 So to understand this, what we do is, let's say d2 82 00:04:43,583 --> 00:04:45,849 [No audio] 83 00:04:45,851 --> 00:04:51,186 is the DataFrame that we get to based on that 84 00:04:51,187 --> 00:04:53,761 condition, right. So it's a DataFrame, 85 00:04:54,230 --> 00:04:58,625 that DataFrame has higher rating column. So that's 86 00:04:58,626 --> 00:05:01,617 what we get. That's the same thing that we did 87 00:05:03,448 --> 00:05:07,499 in here, data Rating, d2 Rating. Okay, 88 00:05:07,500 --> 00:05:10,620 I hope that's clear. That also allows you to 89 00:05:10,622 --> 00:05:13,260 apply functions such as mean. So you get 90 00:05:13,260 --> 00:05:17,970 the mean of the ratings greater than 91 00:05:17,970 --> 00:05:21,914 4, right. I'm going to leave it as it was. 92 00:05:23,230 --> 00:05:25,905 And then next is 93 00:05:25,906 --> 00:05:27,938 [No audio] 94 00:05:27,940 --> 00:05:31,276 Multiple conditions. 95 00:05:31,277 --> 00:05:33,395 [No audio] 96 00:05:33,397 --> 00:05:35,820 Let's say we want to filter 97 00:05:35,820 --> 00:05:38,430 the DataFrame where the rating is 98 00:05:38,430 --> 00:05:41,220 greater than 4, and the Course Name 99 00:05:41,250 --> 00:05:46,890 is equal to the complete Python Course, 100 00:05:47,460 --> 00:05:50,760 Built 10 Professional OOP Apps. 101 00:05:52,530 --> 00:05:55,710 Alright, so this course here, let me 102 00:05:55,710 --> 00:05:57,420 copy the string, so I don't make any 103 00:05:57,420 --> 00:06:00,720 typos. This is how you do that 104 00:06:00,870 --> 00:06:03,630 filtering, double filtering, you start 105 00:06:03,630 --> 00:06:07,230 with that, and then instead of putting 106 00:06:07,230 --> 00:06:09,840 one condition as we did in here, we 107 00:06:09,840 --> 00:06:12,480 would put two conditions which would be 108 00:06:12,600 --> 00:06:14,850 separated inside parentheses. So the 109 00:06:14,850 --> 00:06:18,270 first condition there, and the second 110 00:06:18,300 --> 00:06:21,510 condition in there. So as the first 111 00:06:21,510 --> 00:06:25,211 condition we put the same thing, so that's data 112 00:06:25,212 --> 00:06:27,387 [No audio] 113 00:06:27,388 --> 00:06:31,710 Rating is greater than 4. So 114 00:06:31,710 --> 00:06:33,660 you see this, the condition is inside 115 00:06:33,660 --> 00:06:36,120 the parenthesis. That's the first 116 00:06:36,120 --> 00:06:38,610 condition. The second condition is where 117 00:06:38,610 --> 00:06:44,902 data we set Course Name is equal to 118 00:06:45,632 --> 00:06:49,244 the string, The Complete Python Course: 119 00:06:49,432 --> 00:06:52,500 Build 10 Professional OOP Apps, right. 120 00:06:53,269 --> 00:06:58,050 Execute, and we get a DataFrame based 121 00:06:58,050 --> 00:07:00,720 on that filtering. So, since this is a 122 00:07:00,750 --> 00:07:04,380 DataFrame.DataFrame, we can 123 00:07:04,380 --> 00:07:09,600 extract rating out of that and then get 124 00:07:09,600 --> 00:07:12,060 the mean of the rating column. And that 125 00:07:12,060 --> 00:07:15,990 will give us this number. So the average 126 00:07:15,990 --> 00:07:21,570 rating of the ratings above 4, '4.' course. 127 00:07:21,571 --> 00:07:27,328 [No audio] 128 00:07:27,330 --> 00:07:30,990 And this is the filter DataFrame again. 129 00:07:30,991 --> 00:07:33,148 [No audio] 130 00:07:33,150 --> 00:07:34,650 And that is above this lecture in the 131 00:07:34,650 --> 00:07:36,450 next lecture, we are going to look at 132 00:07:36,540 --> 00:07:41,220 filtering data based on times. So how 133 00:07:41,220 --> 00:07:43,183 you can extract portions of your 134 00:07:43,185 --> 00:07:46,080 DataFrame for a particular period. So you 135 00:07:46,080 --> 00:07:49,080 see we have this Timestamp column here, 136 00:07:49,080 --> 00:07:55,560 and that ranges from 2018 to 2021. 137 00:07:55,830 --> 00:07:58,336 Let's see how you do that in the next video.