1
00:00:00,000 --> 00:00:02,693
All right, now that you know how to load

2
00:00:02,699 --> 00:00:05,579
data in Python via Pandas, and you know

3
00:00:05,579 --> 00:00:07,289
how to do that using different

4
00:00:07,289 --> 00:00:10,229
data sources, such as CSV, JSON, and text

5
00:00:10,229 --> 00:00:13,649
files, and Excel, you now want to

6
00:00:13,679 --> 00:00:15,539
understand and learn how to

7
00:00:15,539 --> 00:00:18,869
manipulate these DataFrames, and by

8
00:00:18,869 --> 00:00:21,569
manipulation, what I mean is deleting

9
00:00:21,569 --> 00:00:23,759
rows and columns from your DataFrame,

10
00:00:24,119 --> 00:00:26,969
and adding new rows and columns, and

11
00:00:26,969 --> 00:00:29,429
also modifying existing rows and

12
00:00:29,429 --> 00:00:31,379
columns. So that's what you're going to

13
00:00:31,379 --> 00:00:34,529
learn throughout these lectures. But

14
00:00:34,529 --> 00:00:37,079
first of all, I'd like you to understand

15
00:00:37,144 --> 00:00:40,144
how DataFrames are indexed, and with

16
00:00:40,169 --> 00:00:43,319
indexing, I mean, you know, we have this

17
00:00:43,319 --> 00:00:45,539
DataFrame here, and this can be a big

18
00:00:45,539 --> 00:00:47,219
one, although this one happens to be a

19
00:00:47,219 --> 00:00:49,919
shorter one with only six rows. But if

20
00:00:49,919 --> 00:00:51,419
you have big DataFrames with lots of

21
00:00:51,419 --> 00:00:53,969
columns and rows, then you may want to

22
00:00:53,999 --> 00:00:56,821
extract information out of the DataFrame,

23
00:00:56,845 --> 00:00:58,619
and to extract information, you

24
00:00:58,619 --> 00:01:01,589
need to have something like a coordinate system

25
00:01:02,429 --> 00:01:04,949
on that DataFrame, like an embedded

26
00:01:04,949 --> 00:01:07,379
coordinate system. So that if you want

27
00:01:07,379 --> 00:01:10,409
to access, let's say, these two rows

28
00:01:10,409 --> 00:01:13,289
here, this portion here, you want to

29
00:01:13,289 --> 00:01:15,689
know how to do that. So that's what

30
00:01:15,689 --> 00:01:17,609
you're going to learn now: how DataFrames

31
00:01:17,639 --> 00:01:19,439
are indexed and how you can slice them.

32
00:01:19,769 --> 00:01:23,339
So let's try to extract that portion of

33
00:01:23,339 --> 00:01:24,959
the DataFrame. There might be different

34
00:01:24,959 --> 00:01:27,521
ways to access that portion of the DataFrame.

35
00:01:27,966 --> 00:01:30,988
The first way is to use label-based

36
00:01:31,012 --> 00:01:33,149
indexing. The other way is to use

37
00:01:33,149 --> 00:01:36,145
position-based indexing. So your DataFrame

38
00:01:36,169 --> 00:01:38,909
has column labels and index

39
00:01:38,909 --> 00:01:41,999
labels. So now you can use labels from

40
00:01:41,999 --> 00:01:44,759
your index column and labels from

41
00:01:44,759 --> 00:01:47,849
your header, your column names to access

42
00:01:47,879 --> 00:01:50,849
portions of your DataFrame. With label

43
00:01:50,874 --> 00:01:53,394
indexing, you want to use loc in there

44
00:01:54,029 --> 00:01:56,489
so the loc indexer, and then you pass

45
00:01:56,489 --> 00:01:58,529
square brackets in there, and then that

46
00:01:58,529 --> 00:02:01,769
gets two elements, and the first element

47
00:02:01,769 --> 00:02:06,179
could be a range of the index column. So

48
00:02:06,179 --> 00:02:07,799
we're talking about labels, and these are

49
00:02:07,799 --> 00:02:10,469
strings, so you will have to pass, you

50
00:02:10,469 --> 00:02:16,649
know, "735 Dolores St", and then a

51
00:02:16,649 --> 00:02:23,121
range, so with a colon there, "332 Hill St",

52
00:02:23,145 --> 00:02:24,859
and then from Country

53
00:02:24,883 --> 00:02:27,234
[No audio]

54
00:02:27,259 --> 00:02:29,249
to ID, execute

55
00:02:29,249 --> 00:02:32,279
that, and now this is our portion. So when

56
00:02:32,279 --> 00:02:34,529
you use labels, you're including the

57
00:02:34,529 --> 00:02:36,749
first label that you pass there and the

58
00:02:36,749 --> 00:02:39,509
last one as well. So everything between

59
00:02:39,509 --> 00:02:42,569
those, and like here, Country and

60
00:02:42,594 --> 00:02:44,814
Employees are included as well, and ID

61
00:02:45,034 --> 00:02:48,904
also, and of course, similarly, almost

62
00:02:48,929 --> 00:02:52,949
similarly, you can access, you know,

63
00:02:52,949 --> 00:02:56,489
single cells from your DataFrame, just

64
00:02:56,489 --> 00:02:59,219
like that. So the intersection between

65
00:02:59,219 --> 00:03:03,629
this index label and this column name is

66
00:03:03,629 --> 00:03:06,959
USA, which would be this one here. If

67
00:03:06,959 --> 00:03:10,949
you want all the USAs, then you just

68
00:03:10,949 --> 00:03:13,619
pass everything there, and you get

69
00:03:13,769 --> 00:03:16,889
everything here, which of course, if you

70
00:03:16,889 --> 00:03:19,374
want, you can convert it to a list.
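
To make the label-based steps concrete, here is a minimal sketch. The small DataFrame is only a stand-in for df7: the name df, the City column, the extra addresses, and all the values are assumptions made for illustration, not the lecture's actual data.

```python
import pandas as pd

# Hypothetical stand-in for df7; values are made up for illustration.
df = pd.DataFrame(
    {
        "City": ["San Francisco"] * 4,
        "Country": ["USA"] * 4,
        "Employees": [8, 15, 25, 10],
        "ID": [1, 2, 3, 4],
    },
    index=["3666 21st St", "735 Dolores St", "332 Hill St", "3995 23rd St"],
)

# Label-based slice: both end labels are included, for rows and columns.
portion = df.loc["735 Dolores St":"332 Hill St", "Country":"ID"]

# A single cell: one index label and one column name.
cell = df.loc["332 Hill St", "Country"]

# A whole column, converted to a plain Python list.
countries = list(df.loc[:, "Country"])

print(portion)
print(cell)
print(countries)
```

Note that the slice keeps "332 Hill St" and "ID", the last labels passed, which is the behavior described above for loc.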

71
00:03:19,398 --> 00:03:21,655
[No audio]

72
00:03:21,685 --> 00:03:25,864
So a simple list using the Python built-in

73
00:03:25,894 --> 00:03:29,014
function, which is list(), and that's about

74
00:03:29,039 --> 00:03:31,559
label-based indexing. Now, this is not

75
00:03:31,559 --> 00:03:34,799
the most common way to extract data

76
00:03:34,799 --> 00:03:37,589
from a DataFrame. More common could be

77
00:03:37,589 --> 00:03:40,949
to access data based on position, not

78
00:03:40,949 --> 00:03:46,889
based on labels. So to do that, you do

79
00:03:48,042 --> 00:03:51,184
df7, and instead of loc, you do iloc.

80
00:03:53,014 --> 00:03:56,794
That, again, expects two items. So the first

81
00:03:56,819 --> 00:03:59,099
would be the range of your indexes.

82
00:03:59,123 --> 00:04:01,354
[No audio]

83
00:04:01,379 --> 00:04:04,755
Actually, let me print the whole DataFrame

84
00:04:04,779 --> 00:04:05,921
here so that you

85
00:04:05,945 --> 00:04:08,195
[No audio]

86
00:04:08,220 --> 00:04:13,690
can refer to that. So let me access from

87
00:04:13,715 --> 00:04:17,189
Dolores to 23rd Street, and that

88
00:04:17,189 --> 00:04:23,909
would be 1 to 3, I believe, yep, and also

89
00:04:23,909 --> 00:04:28,337
from Country to ID. So again 1 to

90
00:04:28,361 --> 00:04:30,361
[No audio]

91
00:04:30,371 --> 00:04:31,455
3,

92
00:04:31,479 --> 00:04:34,385
[No audio]

93
00:04:34,410 --> 00:04:36,269
and here, you can see the difference now,

94
00:04:36,869 --> 00:04:38,939
you know, the ID wasn't included there

95
00:04:38,969 --> 00:04:42,449
and neither was 23rd Street, because this

96
00:04:42,449 --> 00:04:45,119
is, just as with lists, upper

97
00:04:45,119 --> 00:04:47,697
bound exclusive. So as with a Python list,

98
00:04:47,723 --> 00:04:49,829
3 is not included in the

99
00:04:49,829 --> 00:04:53,279
slice, but with labels, the last

100
00:04:53,309 --> 00:04:56,009
item of the range was included in the

101
00:04:56,284 --> 00:04:58,984
slice. So in this case, you want to pass

102
00:04:59,009 --> 00:05:01,289
4 there and 4 there, and

103
00:05:01,289 --> 00:05:03,809
that's how you get your portion, and of

104
00:05:03,809 --> 00:05:06,329
course, similarly, you can do things

105
00:05:06,329 --> 00:05:09,959
like that. So you get all the rows or

106
00:05:09,959 --> 00:05:11,455
only one of them.

107
00:05:11,479 --> 00:05:14,826
[No audio]

108
00:05:14,851 --> 00:05:16,139
So that would be a row

109
00:05:16,139 --> 00:05:19,169
with index 3, which is this one, but

110
00:05:19,169 --> 00:05:21,299
only for the columns Country,

111
00:05:21,599 --> 00:05:24,779
Employees, and ID. So USA, 10, and 4.
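
The position-based version can be sketched the same way. Again, the DataFrame below is a hypothetical stand-in for df7: the name df, the City column, and the values are assumptions, chosen only so that the row at position 3 gives USA, 10, and 4 as in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for df7; values are made up for illustration.
df = pd.DataFrame(
    {
        "City": ["San Francisco"] * 4,
        "Country": ["USA"] * 4,
        "Employees": [8, 15, 25, 10],
        "ID": [1, 2, 3, 4],
    },
    index=["3666 21st St", "735 Dolores St", "332 Hill St", "3995 23rd St"],
)

# Position-based slice: the upper bound is excluded, as with Python lists,
# so 1:3 stops before row 3 and before the ID column.
too_short = df.iloc[1:3, 1:3]

# Passing 4 as the upper bound brings in row 3 and the ID column.
portion = df.iloc[1:4, 1:4]

# A single row by position, restricted to Country, Employees, and ID.
row = df.iloc[3, 1:4]

print(too_short)
print(portion)
print(list(row))
```

Comparing too_short with portion shows the difference from loc: with iloc the last position in the range is left out, so you pass one more than the last position you want.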

112
00:05:25,354 --> 00:05:27,634
All right, that is position-based

113
00:05:27,659 --> 00:05:29,579
indexing, and yeah, that's what I wanted

114
00:05:29,579 --> 00:05:32,009
to teach you about DataFrame indexing

115
00:05:32,009 --> 00:05:35,199
and slicing, and I'll talk to you in the next lecture.