1
00:00:00,000 --> 00:00:02,700
Hi, welcome back. In this video, you are

2
00:00:02,700 --> 00:00:04,950
going to learn how to filter data from

3
00:00:04,950 --> 00:00:07,410
your DataFrame. By that I mean

4
00:00:07,620 --> 00:00:11,250
extracting particular groups of data

5
00:00:11,250 --> 00:00:13,650
based on different types of conditions.

6
00:00:13,770 --> 00:00:16,357
For example, let's say in our DataFrame,

7
00:00:16,751 --> 00:00:18,150
we have different courses, even

8
00:00:18,150 --> 00:00:19,620
though here you only see the Python

9
00:00:19,620 --> 00:00:21,810
Mega Course. Because we are only seeing the

10
00:00:21,810 --> 00:00:24,540
first five rows, we have other courses

11
00:00:24,540 --> 00:00:27,030
with ratings. So if you want it, for

12
00:00:27,030 --> 00:00:29,850
example, to know the average rating for

13
00:00:29,850 --> 00:00:33,300
a particular course only, then you'd

14
00:00:33,300 --> 00:00:36,360
have to filter the data by applying some

15
00:00:36,360 --> 00:00:39,539
conditions. So filtering allows you to do

16
00:00:39,540 --> 00:00:43,228
analysis on a particular group of your data.

17
00:00:45,087 --> 00:00:49,617
Let's do that. First, I'm going to assign a number

18
00:00:49,618 --> 00:00:54,570
to each of the main headings. So that was an overview,

19
00:00:54,750 --> 00:00:57,330
the second lecture was selecting. So

20
00:00:57,330 --> 00:01:00,832
that's number 2. This time we

21
00:01:00,833 --> 00:01:04,312
[No audio]

22
00:01:04,313 --> 00:01:10,230
are at number 3, filtering data based on

23
00:01:11,280 --> 00:01:14,790
conditions. Right, so let's apply one

24
00:01:14,790 --> 00:01:16,770
condition first. So I'm going to create

25
00:01:16,770 --> 00:01:20,760
a Markdown cell and say, One condition.

26
00:01:22,710 --> 00:01:25,740
Let's apply a condition which would give

27
00:01:25,770 --> 00:01:30,690
us the DataFrame where the ratings are

28
00:01:30,690 --> 00:01:36,030
greater than 4. To do that, you'd

29
00:01:36,030 --> 00:01:38,580
have to follow this syntax. Now first,

30
00:01:38,610 --> 00:01:41,910
make sure that you have your data

31
00:01:41,910 --> 00:01:45,450
variable in the namespace by executing

32
00:01:45,450 --> 00:01:48,750
this first cell. So Ctrl Enter to

33
00:01:48,750 --> 00:01:51,300
execute that cell. Now we are sure that

34
00:01:51,330 --> 00:01:54,480
data is in the namespace. So this

35
00:01:54,480 --> 00:01:57,720
is the syntax data square brackets,

36
00:01:57,750 --> 00:02:00,507
and inside here goes the condition.

37
00:02:01,956 --> 00:02:05,010
So we said that we want the DataFrame

38
00:02:05,010 --> 00:02:08,550
where the rating is greater than 4.

39
00:02:08,790 --> 00:02:11,820
Therefore, we say column Rating is

40
00:02:11,820 --> 00:02:15,030
greater, let's leave some space greater

41
00:02:15,030 --> 00:02:20,100
than 4. If we execute that, we get a

42
00:02:20,100 --> 00:02:23,040
DataFrame, and all these rows have a

43
00:02:23,040 --> 00:02:27,990
rating of 4.5 or more. So not 4,

44
00:02:27,990 --> 00:02:30,690
because it's greater, it's not greater

45
00:02:30,690 --> 00:02:33,720
or equal. If it was greater or equal,

46
00:02:33,840 --> 00:02:39,420
we'll also get the 4 ratings 4.0. But

47
00:02:39,420 --> 00:02:41,220
this is not the case here. So this is

48
00:02:41,220 --> 00:02:45,090
what we get. You see, for example, that

49
00:02:45,120 --> 00:02:48,060
rows with index 0, 1, 2, were not included

50
00:02:48,090 --> 00:02:51,476
in the DataFrame in the output DataFrame.

51
00:02:52,742 --> 00:02:54,900
And you can see the length of

52
00:02:54,900 --> 00:02:57,882
this using a len function at Python,

53
00:02:57,883 --> 00:03:04,440
len function. So we get 29,758

54
00:03:04,470 --> 00:03:07,800
rows that have a rating of greater than

55
00:03:07,800 --> 00:03:10,110
4. Another way, if you don't want to

56
00:03:10,110 --> 00:03:12,960
use a native Python function, another

57
00:03:12,960 --> 00:03:18,483
way is to use count. So that is the DataFrame

58
00:03:18,484 --> 00:03:21,270
object, and the DataFrame objects

59
00:03:21,270 --> 00:03:25,170
have a count method. And you get this

60
00:03:25,200 --> 00:03:28,110
output, which basically says that we

61
00:03:28,110 --> 00:03:30,630
have this amount of values, in Course

62
00:03:30,630 --> 00:03:32,790
Name this amount of values in Timestamp,

63
00:03:32,792 --> 00:03:34,632
these amounts in Rating and,

64
00:03:35,007 --> 00:03:38,880
and Comments. Because comments have this NaN

65
00:03:38,880 --> 00:03:41,910
values, that is basically a non existent

66
00:03:41,910 --> 00:03:45,150
value. It doesn't exist. Therefore, we

67
00:03:45,150 --> 00:03:50,610
have 4927 ratings with comments where

68
00:03:50,610 --> 00:03:53,460
people actually wrote something. So the

69
00:03:53,460 --> 00:03:56,580
rest of this minus that gives us the

70
00:03:56,580 --> 00:04:00,840
number of NaN value comments. Of

71
00:04:00,840 --> 00:04:03,390
course, you can apply different types of

72
00:04:03,660 --> 00:04:07,650
statistics to this now. For example,

73
00:04:07,941 --> 00:04:10,701
this is a DataFrame, right. How do we

74
00:04:10,950 --> 00:04:13,860
extract only the column? How do we select only

75
00:04:13,890 --> 00:04:17,580
the column Rating? Well, you could

76
00:04:18,270 --> 00:04:23,520
take a look at here, selecting a column.

77
00:04:24,150 --> 00:04:26,190
So this is how you select a column data

78
00:04:26,220 --> 00:04:30,180
Rating. Now, this is the DataFrame,

79
00:04:30,810 --> 00:04:33,660
therefore you apply Rating, and you

80
00:04:33,660 --> 00:04:37,109
get the ratings with a value of 4.5 or more.

81
00:04:37,581 --> 00:04:43,582
So to understand this, what we do is, let's say d2

82
00:04:43,583 --> 00:04:45,849
[No audio]

83
00:04:45,851 --> 00:04:51,186
is the DataFrame that we get to based on that

84
00:04:51,187 --> 00:04:53,761
condition, right. So it's a DataFrame,

85
00:04:54,230 --> 00:04:58,625
that DataFrame has higher rating column. So that's

86
00:04:58,626 --> 00:05:01,617
what we get. That's the same thing that we did

87
00:05:03,448 --> 00:05:07,499
in here, data Rating, d2 Rating. Okay,

88
00:05:07,500 --> 00:05:10,620
I hope that's clear. That also allows you to

89
00:05:10,622 --> 00:05:13,260
apply functions such as mean. So you get

90
00:05:13,260 --> 00:05:17,970
the mean of the ratings greater than

91
00:05:17,970 --> 00:05:21,914
4, right. I'm going to leave it as it was.

92
00:05:23,230 --> 00:05:25,905
And then next is

93
00:05:25,906 --> 00:05:27,938
[No audio]

94
00:05:27,940 --> 00:05:31,276
Multiple conditions.

95
00:05:31,277 --> 00:05:33,395
[No audio]

96
00:05:33,397 --> 00:05:35,820
Let's say we want to filter

97
00:05:35,820 --> 00:05:38,430
the DataFrame where the rating is

98
00:05:38,430 --> 00:05:41,220
greater than 4, and the Course Name

99
00:05:41,250 --> 00:05:46,890
is equal to the complete Python Course,

100
00:05:47,460 --> 00:05:50,760
Built 10 Professional OOP Apps.

101
00:05:52,530 --> 00:05:55,710
Alright, so this course here, let me

102
00:05:55,710 --> 00:05:57,420
copy the string, so I don't make any

103
00:05:57,420 --> 00:06:00,720
typos. This is how you do that

104
00:06:00,870 --> 00:06:03,630
filtering, double filtering, you start

105
00:06:03,630 --> 00:06:07,230
with that, and then instead of putting

106
00:06:07,230 --> 00:06:09,840
one condition as we did in here, we

107
00:06:09,840 --> 00:06:12,480
would put two conditions which would be

108
00:06:12,600 --> 00:06:14,850
separated inside parentheses. So the

109
00:06:14,850 --> 00:06:18,270
first condition there, and the second

110
00:06:18,300 --> 00:06:21,510
condition in there. So as the first

111
00:06:21,510 --> 00:06:25,211
condition we put the same thing, so that's data

112
00:06:25,212 --> 00:06:27,387
[No audio]

113
00:06:27,388 --> 00:06:31,710
Rating is greater than 4. So

114
00:06:31,710 --> 00:06:33,660
you see this, the condition is inside

115
00:06:33,660 --> 00:06:36,120
the parenthesis. That's the first

116
00:06:36,120 --> 00:06:38,610
condition. The second condition is where

117
00:06:38,610 --> 00:06:44,902
data we set Course Name is equal to

118
00:06:45,632 --> 00:06:49,244
the string, The Complete Python Course:

119
00:06:49,432 --> 00:06:52,500
Build 10 Professional OOP Apps, right.

120
00:06:53,269 --> 00:06:58,050
Execute, and we get a DataFrame based

121
00:06:58,050 --> 00:07:00,720
on that filtering. So, since this is a

122
00:07:00,750 --> 00:07:04,380
DataFrame.DataFrame, we can

123
00:07:04,380 --> 00:07:09,600
extract rating out of that and then get

124
00:07:09,600 --> 00:07:12,060
the mean of the rating column. And that

125
00:07:12,060 --> 00:07:15,990
will give us this number. So the average

126
00:07:15,990 --> 00:07:21,570
rating of the ratings above 4, '4.' course.

127
00:07:21,571 --> 00:07:27,328
[No audio]

128
00:07:27,330 --> 00:07:30,990
And this is the filter DataFrame again.

129
00:07:30,991 --> 00:07:33,148
[No audio]

130
00:07:33,150 --> 00:07:34,650
And that is above this lecture in the

131
00:07:34,650 --> 00:07:36,450
next lecture, we are going to look at

132
00:07:36,540 --> 00:07:41,220
filtering data based on times. So how

133
00:07:41,220 --> 00:07:43,183
you can extract portions of your

134
00:07:43,185 --> 00:07:46,080
DataFrame for a particular period. So you

135
00:07:46,080 --> 00:07:49,080
see we have this Timestamp column here,

136
00:07:49,080 --> 00:07:55,560
and that ranges from 2018 to 2021.

137
00:07:55,830 --> 00:07:58,336
Let's see how you do that in the next video.