1
00:00:00,804 --> 00:00:04,050
So, as you see, it's quite a simple one.

2
00:00:04,051 --> 00:00:06,372
And intentionally, I tried to find

3
00:00:06,373 --> 00:00:07,844
a simple web page for you.

4
00:00:07,845 --> 00:00:09,364
So here we go.

5
00:00:09,365 --> 00:00:13,780
I didn't want to distract you with lots of content

6
00:00:13,781 --> 00:00:18,292
for now. Later you will be able to grab information

7
00:00:18,293 --> 00:00:20,900
from a big website with lots of data.

8
00:00:20,901 --> 00:00:25,964
So, for now let's try to grab,

9
00:00:25,965 --> 00:00:29,068
let's say, we want to extract the

10
00:00:29,069 --> 00:00:32,600
names of the cities from this page.

11
00:00:33,370 --> 00:00:35,580
So if you want to follow me, please

12
00:00:35,581 --> 00:00:38,656
type this address into your address bar,

13
00:00:38,657 --> 00:00:41,260
so with .html at the end.

14
00:00:41,600 --> 00:00:46,608
And, so we've got only three cities here that we

15
00:00:46,609 --> 00:00:49,856
will be extracting, but the code that we will

16
00:00:49,857 --> 00:00:53,258
write will work with any number of rows.

17
00:00:53,259 --> 00:00:57,572
Here I'll be using the IPython Notebook, or

18
00:00:57,573 --> 00:01:00,916
the Jupyter Notebook as it is called now.

19
00:01:00,917 --> 00:01:03,370
So it was renamed to Jupyter Notebook.

20
00:01:03,371 --> 00:01:05,832
So, Shift + right-click and

21
00:01:05,833 --> 00:01:10,770
open your command line, then type jupyter notebook.

22
00:01:10,770 --> 00:01:12,790
[No Audio]

23
00:01:12,791 --> 00:01:17,750
And I'll create a Python 3 notebook.

24
00:01:19,290 --> 00:01:22,236
Great, so the first thing you want to do is

25
00:01:22,237 --> 00:01:27,190
you want to load this source code in Python.

26
00:01:28,010 --> 00:01:34,570
And the way to do that is by using the requests library.

27
00:01:35,550 --> 00:01:37,952
So if you don't have that installed, you can

28
00:01:37,953 --> 00:01:41,882
just go ahead and install it with pip

29
00:01:41,883 --> 00:01:45,680
install requests, just like that.

30
00:01:46,450 --> 00:01:50,910
I have it already, so already satisfied,

31
00:01:52,210 --> 00:01:53,832
but the process is very easy.

32
00:01:53,833 --> 00:01:57,634
So you already know how to install packages with pip.

33
00:01:58,166 --> 00:02:01,150
And you'll also need the BeautifulSoup library.

34
00:02:01,151 --> 00:02:05,708
So to install that, you need to say pip install again.

35
00:02:05,709 --> 00:02:08,520
And not BeautifulSoup, but bs4,

36
00:02:10,100 --> 00:02:13,996
which stands for BeautifulSoup 4.

37
00:02:13,997 --> 00:02:16,550
So that's the latest version of BeautifulSoup.

38
00:02:17,210 --> 00:02:22,100
And then, so you want to import requests, and,

39
00:02:22,101 --> 00:02:25,312
so, the first thing you want to do is load the source code.

40
00:02:25,313 --> 00:02:28,022
And then we start looking for html

41
00:02:28,023 --> 00:02:30,890
tags and extracting elements from those tags.

42
00:02:31,730 --> 00:02:34,452
But let me import BeautifulSoup as well.

43
00:02:34,453 --> 00:02:39,200
So from bs4 import

44
00:02:39,201 --> 00:02:41,330
[Author Typing]

45
00:02:41,331 --> 00:02:42,810
BeautifulSoup.

46
00:02:42,811 --> 00:02:43,982
So that's the syntax.

47
00:02:43,983 --> 00:02:47,460
You're importing the BeautifulSoup class from bs4.

48
00:02:48,230 --> 00:02:50,100
If you are on Python 2

49
00:02:50,533 --> 00:02:52,536
with the older BeautifulSoup 3, this is slightly different.

50
00:02:52,537 --> 00:02:54,766
So you want to import BeautifulSoup directly

51
00:02:54,767 --> 00:02:56,760
from the BeautifulSoup module, like this.

52
00:02:56,766 --> 00:02:58,890
[No Audio]

53
00:02:58,891 --> 00:03:01,980
Okay, Alt+Enter and go to the next line.

54
00:03:01,981 --> 00:03:06,972
So to load a web page, it's good

55
00:03:06,973 --> 00:03:09,286
to create a variable, so you can load

56
00:03:09,287 --> 00:03:12,454
the web page source code to this variable.

57
00:03:12,455 --> 00:03:17,840
So r=requests.get.

58
00:03:17,841 --> 00:03:19,302
So the get method.

59
00:03:19,303 --> 00:03:21,572
So you point to the library, and then to the get

60
00:03:21,573 --> 00:03:26,602
method, and all you need to pass here is the URL

61
00:03:26,603 --> 00:03:28,362
of the webpage that you want to load.

62
00:03:28,363 --> 00:03:35,006
So in this case, http://pythonhow.com/example.html.

63
00:03:35,007 --> 00:03:36,382
So don't forget the html.

64
00:03:36,383 --> 00:03:37,960
This is just a static web page,

65
00:03:37,961 --> 00:03:41,780
so you should pass html there.

66
00:03:42,310 --> 00:03:45,033
Now this should create,

67
00:03:45,034 --> 00:03:47,500
[Author Typing]

68
00:03:47,501 --> 00:03:49,450
a response object.

69
00:03:49,451 --> 00:03:52,010
So we're still not there.

70
00:03:52,011 --> 00:03:54,162
And what you want to do is grab

71
00:03:54,163 --> 00:03:59,632
the content from this response object, and

72
00:03:59,633 --> 00:04:01,382
maybe store it in another variable.

73
00:04:01,383 --> 00:04:04,646
So the content stored in a c variable

74
00:04:04,647 --> 00:04:08,912
like that, and if you want to check

75
00:04:08,913 --> 00:04:14,022
now what this c is, you'll see that this is a bytes

76
00:04:14,023 --> 00:04:18,480
data type, and you can print it if you want.
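As a rough sketch of the loading step just described, the two lines could be wrapped up like this. The load_page helper, the timeout value, and the raise_for_status call are assumptions added for illustration and are not part of the lecture. The function is only defined here, not called.

```python
import requests


def load_page(url, timeout=10):
    """Download a page and return its raw source as bytes."""
    r = requests.get(url, timeout=timeout)  # r is a Response object
    r.raise_for_status()  # assumption: fail loudly on HTTP errors
    return r.content  # bytes, not str
```

Calling c = load_page("http://pythonhow.com/example.html") and then checking type(c) should report bytes, provided the page is still reachable.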

77
00:04:18,480 --> 00:04:20,850
[No Audio]

78
00:04:20,851 --> 00:04:25,688
Even though this doesn't look very nice, this is

79
00:04:25,689 --> 00:04:29,700
actually the source code that you see in here.

80
00:04:30,310 --> 00:04:33,928
So we have the head tags and

81
00:04:33,929 --> 00:04:37,564
the html tags, and everything else there.

82
00:04:37,565 --> 00:04:40,012
And now, here is where

83
00:04:40,013 --> 00:04:42,920
BeautifulSoup comes into play.

84
00:04:43,450 --> 00:04:47,452
So all the requests library does is load the

85
00:04:47,453 --> 00:04:50,288
source code of the webpage, but in a

86
00:04:50,289 --> 00:04:53,660
very scrambled form as you see here.

87
00:04:54,270 --> 00:04:58,422
Now if you want to make this beautiful, and extract

88
00:04:58,423 --> 00:05:01,648
the elements and the text and everything out of this

89
00:05:01,649 --> 00:05:05,014
source code, you want to use BeautifulSoup.

90
00:05:05,015 --> 00:05:09,028
So all BeautifulSoup does is parse this

91
00:05:09,029 --> 00:05:12,116
source code and give you what you want.

92
00:05:12,117 --> 00:05:14,388
So giving you the elements of

93
00:05:14,389 --> 00:05:16,740
the HTML text you're interested in.

94
00:05:17,510 --> 00:05:21,288
So you have already loaded this content and

95
00:05:21,289 --> 00:05:22,856
now what you want to do is maybe

96
00:05:22,857 --> 00:05:25,090
create a variable and call it soup.

97
00:05:25,100 --> 00:05:28,070
[Author Typing]

98
00:05:28,071 --> 00:05:31,130
And that would be equal to a BeautifulSoup object.

99
00:05:31,131 --> 00:05:33,468
And guess what you want to pass here?

100
00:05:33,469 --> 00:05:38,270
Well, that would be the content, and maybe another argument.

101
00:05:38,271 --> 00:05:43,920
So you want to specify the parser you want to use

102
00:05:43,921 --> 00:05:49,078
for parsing this data. That is normally the html.parser.

103
00:05:49,079 --> 00:05:51,810
So this is what you want to use almost always.

104
00:05:51,811 --> 00:05:55,818
If you don't specify this, you'll

105
00:05:55,819 --> 00:05:58,666
get a warning, but still things will work.

106
00:05:58,948 --> 00:06:02,292
So I normally pass it there and once

107
00:06:02,293 --> 00:06:04,980
you've done that, so execute that cell.

108
00:06:05,750 --> 00:06:13,918
If you now print soup.prettify with empty

109
00:06:13,919 --> 00:06:17,724
brackets there, you'll see the source code of

110
00:06:17,725 --> 00:06:21,640
the webpage in an organized form.
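A minimal sketch of the parsing step described here, assuming a tiny stand-in for the page source instead of the real download. The bytes in c and the placeholder paragraph text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a small stand-in for the bytes that r.content would hold.
c = b'<html><body><div class="cities"><h2>London</h2><p>Text about London.</p></div></body></html>'

# Parse the raw bytes, naming the parser explicitly to avoid the warning.
soup = BeautifulSoup(c, "html.parser")

# prettify() returns the parsed markup as an indented string.
pretty = soup.prettify()
print(pretty)
```

Each tag should come out on its own line with indentation, which is easier to read than the raw bytes.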

111
00:06:22,170 --> 00:06:26,210
So BeautifulSoup is built to actually recognize

112
00:06:26,211 --> 00:06:29,184
these tags, and then render them in a

113
00:06:29,185 --> 00:06:31,610
visual way for the human eye.

114
00:06:32,270 --> 00:06:34,998
However, this is just for demonstration.

115
00:06:34,999 --> 00:06:38,048
Normally you'll not have to actually use the

116
00:06:38,049 --> 00:06:43,230
prettify method a lot, because a better way

117
00:06:43,810 --> 00:06:46,884
to see this code, as I already mentioned

118
00:06:46,885 --> 00:06:49,166
before, is... let me delete the cell.

119
00:06:49,172 --> 00:06:50,300
We don't need that.

120
00:06:50,300 --> 00:06:52,700
So a better way to see that source code is

121
00:06:52,708 --> 00:06:55,966
to go to your webpage and go to Inspect.

122
00:06:55,967 --> 00:06:58,066
[No Audio]

123
00:06:58,070 --> 00:07:02,800
And here you see a better view of the HTML code.

124
00:07:04,150 --> 00:07:07,333
So here you'll see that we have

125
00:07:07,333 --> 00:07:11,733
three divisions here, with a cities class.

126
00:07:11,740 --> 00:07:14,666
We have some more divisions here, but

127
00:07:14,668 --> 00:07:16,599
this is what we're interested in.

128
00:07:16,600 --> 00:07:18,666
[No Audio]

129
00:07:18,667 --> 00:07:21,466
So, and the body is everything.

130
00:07:23,070 --> 00:07:26,454
And if you expand one of these divisions, you'll

131
00:07:26,455 --> 00:07:28,800
see that we have an h2 tag.

132
00:07:29,310 --> 00:07:33,082
So a heading tag, and also a paragraph tag.

133
00:07:33,083 --> 00:07:35,470
So p tag and h2 tags.

134
00:07:36,050 --> 00:07:39,988
And also the other division, which is this one here

135
00:07:39,989 --> 00:07:42,666
has this h2 tag and the paragraph tag.

136
00:07:42,667 --> 00:07:44,900
And Tokyo also has the same thing.

137
00:07:45,670 --> 00:07:49,048
So our task now is to

138
00:07:49,049 --> 00:07:53,830
extract the names of these cities.

139
00:07:53,831 --> 00:07:57,788
So that should be the text of the

140
00:07:57,789 --> 00:08:00,900
h2 tags inside the cities divs.

141
00:08:01,530 --> 00:08:05,756
So naturally you start thinking about iterating through

142
00:08:05,757 --> 00:08:10,368
these boxes, which are actually divisions, so you

143
00:08:10,369 --> 00:08:12,608
want to go through here, here and here

144
00:08:12,609 --> 00:08:14,890
and extract what you want to extract.

145
00:08:16,350 --> 00:08:21,412
So we go back to the code, and what you

146
00:08:21,413 --> 00:08:27,810
want to do is perform a method called find_all.

147
00:08:27,811 --> 00:08:31,630
And what you want to find is divs.

148
00:08:31,966 --> 00:08:38,766
So, divs. But there may be lots of divs in the webpage.

149
00:08:39,030 --> 00:08:41,539
So, for instance, we have two more divs here.

150
00:08:42,710 --> 00:08:44,824
And we don't want these to be found,

151
00:08:44,825 --> 00:08:46,360
we only want these three.

152
00:08:46,766 --> 00:08:49,692
So, but these three, as you see, they have a

153
00:08:49,693 --> 00:08:54,570
common class attribute, which is equal to cities.

154
00:08:54,571 --> 00:08:56,920
So we want to make use of that.

155
00:08:57,790 --> 00:09:00,752
And we pass here a dictionary, which

156
00:09:00,753 --> 00:09:07,533
should map class to cities.

157
00:09:09,230 --> 00:09:12,700
Okay, and let me create a variable here,

158
00:09:12,900 --> 00:09:15,940
and call it all and execute it.

159
00:09:15,941 --> 00:09:18,633
Now, if you print all,

160
00:09:18,634 --> 00:09:22,333
[No Audio]

161
00:09:22,334 --> 00:09:27,620
you'll see that the divisions have been extracted, from the source code.

162
00:09:28,230 --> 00:09:32,500
So from the soup, which was the entire source code.

163
00:09:33,030 --> 00:09:37,100
And I'd like you to actually see closely here.

164
00:09:37,101 --> 00:09:38,332
You can see that the first

165
00:09:38,333 --> 00:09:41,404
division is separated by a comma here.

166
00:09:41,405 --> 00:09:44,146
And then the second division starts, for Paris.

167
00:09:44,147 --> 00:09:48,848
Paris is the second, and it ends here.

168
00:09:48,849 --> 00:09:50,656
And then Tokyo starts here.

169
00:09:50,657 --> 00:09:52,272
So we've got a list with

170
00:09:52,273 --> 00:09:54,810
three elements, one for each division.
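The filtering step above could be sketched like this, assuming a stand-in page with three cities divs and two unrelated divs. The HTML string and the header and footer class names are made up for illustration, and the result is named all_divs rather than all only to avoid shadowing Python's built-in all().

```python
from bs4 import BeautifulSoup

# Assumption: a stand-in for the example page, not its real source.
html = """
<div class="header"><h1>Cities</h1></div>
<div class="cities"><h2>London</h2><p>Text about London.</p></div>
<div class="cities"><h2>Paris</h2><p>Text about Paris.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Text about Tokyo.</p></div>
<div class="footer"><p>Footer</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only divs whose class attribute is "cities" are returned.
all_divs = soup.find_all("div", {"class": "cities"})
print(len(all_divs))
```

The header and footer divs are skipped because their class does not match.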

171
00:09:55,710 --> 00:09:58,048
Now, if you want to find only the

172
00:09:58,049 --> 00:10:03,498
first element, with this class attribute of cities,

173
00:10:03,499 --> 00:10:08,800
you'd want to use the find method instead of find_all.

174
00:10:10,050 --> 00:10:13,233
So in this case, you don't get a list, but you get the

175
00:10:14,566 --> 00:10:17,860
code for the division, for the first division only,

176
00:10:18,710 --> 00:10:20,299
which happens to be

177
00:10:20,300 --> 00:10:23,533
[No Audio]

178
00:10:23,534 --> 00:10:25,830
a tag element of BeautifulSoup.

179
00:10:26,490 --> 00:10:29,788
So it's not a plain string, but it's a

180
00:10:29,789 --> 00:10:33,010
special, let's say a special BeautifulSoup object.

181
00:10:33,011 --> 00:10:36,652
So that BeautifulSoup knows its structure, so it

182
00:10:36,653 --> 00:10:39,968
knows what are elements, so where the tags are

183
00:10:39,969 --> 00:10:41,440
and where the text is and so on.

184
00:10:41,441 --> 00:10:44,066
So that BeautifulSoup is able to give you,

185
00:10:44,067 --> 00:10:47,357
the information that you are looking for.
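To illustrate the difference, here is a small sketch of find returning a single Tag object rather than a list. The HTML string is a made-up stand-in for the example page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a stand-in for the example page, not its real source.
html = '<div class="cities"><h2>London</h2></div><div class="cities"><h2>Paris</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# find returns only the first match, as a Tag rather than a list.
first = soup.find("div", {"class": "cities"})
print(type(first))
print(first.name, first["class"])
```

Because the Tag keeps the parsed structure, its name and attributes can be read directly instead of being searched for in a string.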

186
00:10:49,100 --> 00:10:51,170
So, all again.

187
00:10:51,171 --> 00:10:53,194
So you extract the first element.

188
00:10:53,195 --> 00:10:55,818
Now, an alternative way to extract

189
00:10:55,819 --> 00:10:58,030
the first element, logically,

190
00:10:59,250 --> 00:11:05,490
since we have all elements here, is to use list indexing.

191
00:11:07,030 --> 00:11:10,760
So this object that I just showed you,

192
00:11:10,761 --> 00:11:16,418
the ResultSet object of BeautifulSoup, supports indexing.

193
00:11:16,419 --> 00:11:18,120
So you execute that.

194
00:11:19,050 --> 00:11:21,300
And in this case, as you see,

195
00:11:22,533 --> 00:11:26,140
you extracted the first item of the ResultSet.

196
00:11:26,141 --> 00:11:27,820
Or you could do it like this.

197
00:11:27,821 --> 00:11:30,512
So you grab all of them.

198
00:11:30,513 --> 00:11:31,872
So here you have all of them,

199
00:11:31,873 --> 00:11:34,432
and zero is the first one.

200
00:11:34,433 --> 00:11:35,950
You get the idea?

201
00:11:35,951 --> 00:11:38,768
Okay, but what if you want only the

202
00:11:38,769 --> 00:11:43,200
h2 tags from this div class?

203
00:11:43,970 --> 00:11:48,228
Well, in that case, what you'd want to do is refer to

204
00:11:48,229 --> 00:11:52,880
the all object, and then apply the find_all method again.

205
00:11:54,290 --> 00:11:57,966
And this time, you'd want to get the h2 element.

206
00:11:57,967 --> 00:12:01,518
And in this case, you don't have a class attribute, so you'll

207
00:12:01,519 --> 00:12:04,622
have to leave it like that, and you get an error.

208
00:12:04,623 --> 00:12:07,938
Because, what I did here, is I didn't

209
00:12:07,939 --> 00:12:13,596
point to this division, but I pointed to

210
00:12:13,597 --> 00:12:19,634
actually the list, containing all these divisions.

211
00:12:19,635 --> 00:12:23,024
So Python is trying to get the h2, but

212
00:12:23,025 --> 00:12:26,790
this ResultSet object doesn't have a find_all method.

213
00:12:26,791 --> 00:12:29,184
So what you want to do is, you want to

214
00:12:29,185 --> 00:12:35,268
point to the first element of the list, and

215
00:12:35,269 --> 00:12:39,090
that gives you the h2 element with its tags

216
00:12:39,091 --> 00:12:43,160
and text, which is like a list.

217
00:12:43,161 --> 00:12:46,680
So you want to perform a zero indexing there.

218
00:12:46,681 --> 00:12:49,288
And if you want London only, you

219
00:12:49,289 --> 00:12:51,766
apply text, and you get London.
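The error and its fix could be reproduced with a sketch like the one below. The HTML string is a made-up stand-in for the example page, and all_divs stands in for the lecture's all variable.

```python
from bs4 import BeautifulSoup

# Assumption: a stand-in for the example page, not its real source.
html = (
    '<div class="cities"><h2>London</h2><p>Text about London.</p></div>'
    '<div class="cities"><h2>Paris</h2><p>Text about Paris.</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find_all("div", {"class": "cities"})

# Calling find_all on the whole result list fails: it is not a single tag.
try:
    all_divs.find_all("h2")
except AttributeError as err:
    print("Error:", err)

# Index into the list first, then search inside that one div.
first_city = all_divs[0].find_all("h2")[0].text
print(first_city)
```

The first [0] picks the first div and the second [0] picks the first h2 inside it.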

220
00:12:51,767 --> 00:12:53,830
[No Audio]

221
00:12:53,831 --> 00:12:55,256
So this is what we wanted,

222
00:12:55,257 --> 00:12:57,110
right, to extract the cities.

223
00:12:57,633 --> 00:13:00,940
So we extracted London. Now,

224
00:13:00,941 --> 00:13:03,750
how about extracting Paris and Tokyo?

225
00:13:04,810 --> 00:13:09,154
Well, as you might guess, we need to use a for loop.

226
00:13:09,155 --> 00:13:12,128
But first, let me summarize what we did here.

227
00:13:12,129 --> 00:13:15,392
So we loaded the content up

228
00:13:15,393 --> 00:13:17,552
here, which is this one here.

229
00:13:17,553 --> 00:13:18,832
And then we loaded this

230
00:13:18,833 --> 00:13:22,298
content into the BeautifulSoup class.

231
00:13:22,299 --> 00:13:25,924
And BeautifulSoup makes this soup beautiful,

232
00:13:25,925 --> 00:13:28,666
so that it recognizes these tags.

233
00:13:28,667 --> 00:13:32,564
And so what we did then is we found, we

234
00:13:32,565 --> 00:13:38,670
extracted from this content, we extracted all the division elements.

235
00:13:38,671 --> 00:13:41,928
So together with the text, and the attributes and

236
00:13:41,929 --> 00:13:45,986
the text inside them, so everything inside these divisions

237
00:13:45,987 --> 00:13:49,756
with a class equal to cities. Then we can

238
00:13:49,757 --> 00:13:52,600
perform for each of these

239
00:13:52,601 --> 00:13:54,433
[No Audio]

240
00:13:54,434 --> 00:13:56,600
elements of this list.

241
00:13:57,130 --> 00:14:00,108
We can perform again a find_all method, so

242
00:14:00,109 --> 00:14:04,533
we can find subtags of these division tags.

243
00:14:04,806 --> 00:14:08,246
And in this case, we found the h2 tags.

244
00:14:08,247 --> 00:14:11,360
And then we grabbed the first item of the list, which in

245
00:14:11,361 --> 00:14:14,762
this case, happened to be a list with only one item.

246
00:14:14,763 --> 00:14:19,066
So each of these divisions has one h2 tag.

247
00:14:19,067 --> 00:14:21,620
Or alternatively, you could just use find

248
00:14:21,621 --> 00:14:25,030
here, without using this indexing.

249
00:14:25,031 --> 00:14:27,374
But this is a general method.

250
00:14:27,375 --> 00:14:31,032
And then we apply the text attribute there.

251
00:14:31,033 --> 00:14:34,633
So to extract the text out of this element.

252
00:14:34,974 --> 00:14:36,390
So we got London.

253
00:14:36,970 --> 00:14:39,548
Now we need to do the same,

254
00:14:39,549 --> 00:14:41,874
but in this case by iterating.

255
00:14:41,875 --> 00:14:49,500
So for, let's say, item in all, you want to print out.

256
00:14:51,150 --> 00:14:55,232
So item here is, this one here.

257
00:14:55,233 --> 00:14:57,782
So this would be the first item.

258
00:14:57,783 --> 00:15:04,566
So you want to print out the item.find_all,

259
00:15:06,133 --> 00:15:12,160
and you want to find the h2 tags from this first item, for example.

260
00:15:12,950 --> 00:15:18,824
So h2 tags. And then you need to apply this zero indexing there

261
00:15:18,825 --> 00:15:22,120
and you want to grab the text from this.

262
00:15:22,121 --> 00:15:23,300
And that's it.
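Put together, the loop just described might look like the sketch below. The HTML string is a made-up stand-in for the example page, and collecting the names in a cities list (instead of only printing them) is an addition for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a stand-in for the example page, not its real source.
html = """
<div class="cities"><h2>London</h2><p>Text about London.</p></div>
<div class="cities"><h2>Paris</h2><p>Text about Paris.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Text about Tokyo.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find_all("div", {"class": "cities"})

cities = []
for item in all_divs:
    # Each item is one div; take the text of its first h2 tag.
    cities.append(item.find_all("h2")[0].text)

print(cities)
```

Swapping "h2" for "p" in the loop would collect the paragraph text instead, and the same loop works for any number of cities divs.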

263
00:15:24,470 --> 00:15:26,092
Here are the data.

264
00:15:26,093 --> 00:15:28,652
Alternatively, you could just pass p

265
00:15:28,653 --> 00:15:30,950
here, and you get the paragraphs.

266
00:15:32,650 --> 00:15:35,320
So this one's here, the text.

267
00:15:37,130 --> 00:15:40,454
So that's the idea of loading webpages

268
00:15:40,455 --> 00:15:44,198
in Python, and parsing them with BeautifulSoup

269
00:15:44,199 --> 00:15:49,070
and extracting text out of the webpage.

270
00:15:49,071 --> 00:15:52,058
So, sorry if I was a bit repetitive

271
00:15:52,059 --> 00:15:55,332
in explaining this stuff, but I really want

272
00:15:55,333 --> 00:15:58,750
to make sure you understand the core concepts.

273
00:15:59,570 --> 00:16:01,620
On the other hand, if you found this

274
00:16:01,621 --> 00:16:05,188
very basic, I would say let's move on

275
00:16:05,189 --> 00:16:08,362
to the next lectures, where we'll be extracting

276
00:16:08,363 --> 00:16:11,280
some information from a more advanced website.

277
00:16:11,890 --> 00:16:15,060
And we'll be extracting links and not only text.

278
00:16:15,061 --> 00:16:17,628
So that's a real world program,

279
00:16:17,629 --> 00:16:19,610
and a very interesting one.

280
00:16:19,611 --> 00:16:21,666
So I'll talk to you later.