1 00:00:00,804 --> 00:00:04,050 So, as you see, it's quite a simple one. 2 00:00:04,051 --> 00:00:06,372 And intentionally, I tried to find 3 00:00:06,373 --> 00:00:07,844 a simple web page for you. 4 00:00:07,845 --> 00:00:09,364 So here we go. 5 00:00:09,365 --> 00:00:13,780 I didn't want to distract you with lots of content 6 00:00:13,781 --> 00:00:18,292 for now, later you will be able to grab information 7 00:00:18,293 --> 00:00:20,900 from a big website with lots of data. 8 00:00:20,901 --> 00:00:25,964 So, for now let's try to grab, 9 00:00:25,965 --> 00:00:29,068 let's say, we want to extract the 10 00:00:29,069 --> 00:00:32,600 names of the cities from this page. 11 00:00:33,370 --> 00:00:35,580 So if you want to follow me, please 12 00:00:35,581 --> 00:00:38,656 type in this address on your address bar, 13 00:00:38,657 --> 00:00:41,260 so with .html at the end. 14 00:00:41,600 --> 00:00:46,608 And, so we've got only three cities here that we 15 00:00:46,609 --> 00:00:49,856 will be extracting, but the code that we will 16 00:00:49,857 --> 00:00:53,258 write, will work with any number of rows, 17 00:00:53,259 --> 00:00:57,572 here I'll be using the iPython Notebook, or 18 00:00:57,573 --> 00:01:00,916 the Jupyter Notebook as it is called now. 19 00:01:00,917 --> 00:01:03,370 So it was renamed to Jupyter Notebook. 20 00:01:03,371 --> 00:01:05,832 So right, Shift, right click and 21 00:01:05,833 --> 00:01:10,770 open your command line, jupyter notebook. 22 00:01:10,770 --> 00:01:12,790 [No Audio] 23 00:01:12,791 --> 00:01:17,750 And I'll create a Python 3 notebook. 24 00:01:19,290 --> 00:01:22,236 Great, so the first thing you want to do is 25 00:01:22,237 --> 00:01:27,190 you want to load, this source code in Python. 26 00:01:28,010 --> 00:01:34,570 And the way to do that, is by using the requests library. 
27 00:01:35,550 --> 00:01:37,952 So if you don't have that installed, you can 28 00:01:37,953 --> 00:01:41,882 just go ahead and install it with pip 29 00:01:41,883 --> 00:01:45,680 install requests, just like that. 30 00:01:46,450 --> 00:01:50,910 I have it already, so it says already satisfied, 31 00:01:52,210 --> 00:01:53,832 but the process is very easy. 32 00:01:53,833 --> 00:01:57,634 So you already know how to install packages with pip. 33 00:01:58,166 --> 00:02:01,150 And you'll also need the BeautifulSoup library. 34 00:02:01,151 --> 00:02:05,708 So to install that, you need to say pip install again. 35 00:02:05,709 --> 00:02:08,520 And not BeautifulSoup, but bs4, 36 00:02:10,100 --> 00:02:13,996 which stands for BeautifulSoup 4. 37 00:02:13,997 --> 00:02:16,550 So that's the latest version of BeautifulSoup. 38 00:02:17,210 --> 00:02:22,100 And then you want to import requests, 39 00:02:22,101 --> 00:02:25,312 so the first thing you want to do is load the source code. 40 00:02:25,313 --> 00:02:28,022 And then we start looking for HTML 41 00:02:28,023 --> 00:02:30,890 tags and extracting elements from those tags. 42 00:02:31,730 --> 00:02:34,452 But let me import BeautifulSoup as well. 43 00:02:34,453 --> 00:02:39,200 So from bs4 import 44 00:02:39,201 --> 00:02:41,330 [Author Typing] 45 00:02:41,331 --> 00:02:42,810 BeautifulSoup. 46 00:02:42,811 --> 00:02:43,982 So that's the syntax. 47 00:02:43,983 --> 00:02:47,460 You're importing the BeautifulSoup class from bs4. 48 00:02:48,230 --> 00:02:50,100 If you are on Python 2, 49 00:02:50,533 --> 00:02:52,536 this should be slightly different. 50 00:02:52,537 --> 00:02:54,766 So you want to import BeautifulSoup directly, 51 00:02:54,767 --> 00:02:56,760 like this. 52 00:02:56,766 --> 00:02:58,890 [No Audio] 53 00:02:58,891 --> 00:03:01,980 Okay, Alt+Enter and go to the next line. 
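The install and import steps just described can be sketched roughly like this. It is a minimal sketch assuming Python 3; the version printout at the end is only an illustrative check added here, not something shown in the lecture.

```python
# Install once from the command line (not inside Python):
#   pip install requests
#   pip install bs4      # this pulls in the beautifulsoup4 package
import bs4
import requests
from bs4 import BeautifulSoup  # the class used later for parsing

# Illustrative check (an addition, not from the lecture): print the
# installed versions to confirm that both libraries can be imported.
print("requests", requests.__version__)
print("bs4", bs4.__version__)
```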
54 00:03:01,981 --> 00:03:06,972 So to load a web page, it's good 55 00:03:06,973 --> 00:03:09,286 to create a variable, so you can load 56 00:03:09,287 --> 00:03:12,454 the web page source code to this variable. 57 00:03:12,455 --> 00:03:17,840 So r = requests.get. 58 00:03:17,841 --> 00:03:19,302 So the get method. 59 00:03:19,303 --> 00:03:21,572 So you point to the library, and then to the get 60 00:03:21,573 --> 00:03:26,602 method, and all you need to pass here is the URL 61 00:03:26,603 --> 00:03:28,362 of the webpage that you want to load. 62 00:03:28,363 --> 00:03:35,006 So in this case, http://pythonhow.com/example.html. 63 00:03:35,007 --> 00:03:36,382 So don't forget the .html. 64 00:03:36,383 --> 00:03:37,960 This is just a static web page, 65 00:03:37,961 --> 00:03:41,780 so you should pass .html there. 66 00:03:42,310 --> 00:03:45,033 Now this should create 67 00:03:45,034 --> 00:03:47,500 [Author Typing] 68 00:03:47,501 --> 00:03:49,450 a response object. 69 00:03:49,451 --> 00:03:52,010 So we're still not there. 70 00:03:52,011 --> 00:03:54,162 And what you want to do is grab 71 00:03:54,163 --> 00:03:59,632 the content from this response object, and 72 00:03:59,633 --> 00:04:01,382 maybe store it in another variable. 73 00:04:01,383 --> 00:04:04,646 So the content is stored in a c variable 74 00:04:04,647 --> 00:04:08,912 like that, and if you want to check 75 00:04:08,913 --> 00:04:14,022 now what this c is, you'll see that this is a bytes 76 00:04:14,023 --> 00:04:18,480 data type, and you can print it if you want. 77 00:04:18,480 --> 00:04:20,850 [No Audio] 78 00:04:20,851 --> 00:04:25,688 Even though this doesn't look very nice, this is 79 00:04:25,689 --> 00:04:29,700 actually the source code that you see in here. 80 00:04:30,310 --> 00:04:33,928 So we have the head tags and 81 00:04:33,929 --> 00:04:37,564 the html tags, and everything else there. 
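Here is a hedged sketch of the loading step. The URL is the one given in the lecture; the `load_page` helper, the timeout, and the error handling are assumptions added for illustration, and the page may no longer be online.

```python
import requests

URL = "http://pythonhow.com/example.html"  # address given in the lecture


def load_page(url, timeout=5):
    """Hypothetical helper: download a page and return its raw bytes."""
    r = requests.get(url, timeout=timeout)  # r is a Response object
    r.raise_for_status()                    # raise on 4xx/5xx status codes
    return r.content                        # the page source as bytes


if __name__ == "__main__":
    try:
        c = load_page(URL)
        print(type(c))   # the content is a bytes object
        print(c[:200])   # first part of the raw source code
    except requests.RequestException as exc:
        # Offline, or the page has moved: report instead of crashing.
        print(f"Could not load {URL}: {exc}")
```

In the notebook, the lecture does the same thing in two short lines: one assigns the result of `requests.get` to `r`, the other stores `r.content` in `c`.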
82 00:04:37,565 --> 00:04:40,012 And now, here is where the 83 00:04:40,013 --> 00:04:42,920 BeautifulSoup comes into play. 84 00:04:43,450 --> 00:04:47,452 So all the request does is, it loads the 85 00:04:47,453 --> 00:04:50,288 source code of the webpage, but in a 86 00:04:50,289 --> 00:04:53,660 very scrambled form as you see here. 87 00:04:54,270 --> 00:04:58,422 Now if you want to make this beautiful, and extract 88 00:04:58,423 --> 00:05:01,648 the elements and the text and everything out of this 89 00:05:01,649 --> 00:05:05,014 source code, you want to use BeautifulSoup. 90 00:05:05,015 --> 00:05:09,028 So all BeautifulSoup does, is parsing this 91 00:05:09,029 --> 00:05:12,116 source code, and giving you what you want. 92 00:05:12,117 --> 00:05:14,388 So giving you the elements of 93 00:05:14,389 --> 00:05:16,740 the html text, you're interested about. 94 00:05:17,510 --> 00:05:21,288 So you have already loaded this content and 95 00:05:21,289 --> 00:05:22,856 now what you want to do is maybe 96 00:05:22,857 --> 00:05:25,090 create a variable and call it soup. 97 00:05:25,100 --> 00:05:28,070 [Author Typing] 98 00:05:28,071 --> 00:05:31,130 And that would be equal to a BeautifulSoup. 99 00:05:31,131 --> 00:05:33,468 And guess what you want to pass here? 100 00:05:33,469 --> 00:05:38,270 Well, that would be the content, and maybe another argument. 101 00:05:38,271 --> 00:05:43,920 So you want to specify, the parser you want to use 102 00:05:43,921 --> 00:05:49,078 for parsing this data. That is normally the html.parser. 103 00:05:49,079 --> 00:05:51,810 So this is what you want to use. 104 00:05:51,811 --> 00:05:55,818 Almost always, if you don't specify this, you'll 105 00:05:55,819 --> 00:05:58,666 get a warning, but still things will work. 106 00:05:58,948 --> 00:06:02,292 So I normally pass it there and once 107 00:06:02,293 --> 00:06:04,980 you've done that, so execute that cell. 
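As a rough sketch of this parsing step, the example below feeds bytes into BeautifulSoup with the `html.parser` parser. To keep it runnable offline, `SAMPLE_HTML` is a hypothetical stand-in for the downloaded content; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes downloaded with requests;
# the real page's markup may differ.
SAMPLE_HTML = b"""
<html>
<head><title>Example page</title></head>
<body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
</body>
</html>
"""

# Pass the content and name the parser explicitly to avoid the warning.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(type(soup))       # a BeautifulSoup object, not plain bytes
print(soup.title.text)  # text inside the <title> tag
```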
108 00:06:05,750 --> 00:06:13,918 If you now print soup.prettify with empty 109 00:06:13,919 --> 00:06:17,724 brackets there, you'll see the source code of 110 00:06:17,725 --> 00:06:21,640 the webpage in an organized form. 111 00:06:22,170 --> 00:06:26,210 So BeautifulSoup is trained to actually recognize 112 00:06:26,211 --> 00:06:29,184 these tags, and then render them in a 113 00:06:29,185 --> 00:06:31,610 visual way for the human eye. 114 00:06:32,270 --> 00:06:34,998 However, this is just for demonstration. 115 00:06:34,999 --> 00:06:38,048 Normally you'll not have to actually use the 116 00:06:38,049 --> 00:06:43,230 prettify, method a lot, because a better method 117 00:06:43,810 --> 00:06:46,884 to see this code, as I already mentioned 118 00:06:46,885 --> 00:06:49,166 before, is to let me delete the cell. 119 00:06:49,172 --> 00:06:50,300 We don't need that. 120 00:06:50,300 --> 00:06:52,700 So a better way to see that source code is, 121 00:06:52,708 --> 00:06:55,966 to go to your webpage and go to Inspect. 122 00:06:55,967 --> 00:06:58,066 [No Audio] 123 00:06:58,070 --> 00:07:02,800 And here you see a better syntax of the html code. 124 00:07:04,150 --> 00:07:07,333 So here you'll see that, we have 125 00:07:07,333 --> 00:07:11,733 three divisions here, with a cities class. 126 00:07:11,740 --> 00:07:14,666 We have some more divisions here, but 127 00:07:14,668 --> 00:07:16,599 this is what we're interested about. 128 00:07:16,600 --> 00:07:18,666 [No Audio] 129 00:07:18,667 --> 00:07:21,466 So, and the body is everything. 130 00:07:23,070 --> 00:07:26,454 And if you expand one of these divisions, you'll 131 00:07:26,455 --> 00:07:28,800 see that, we have an h2 tag. 132 00:07:29,310 --> 00:07:33,082 So a heading tag, and also a paragraph tag. 133 00:07:33,083 --> 00:07:35,470 So p tag and h2 tags. 
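A small sketch of the prettify step, again on a hypothetical stand-in for the page content (`SAMPLE_HTML` is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = b"""<html><body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# prettify() returns the markup as an indented string, one tag per line.
pretty = soup.prettify()
print(pretty)
```

This is only for looking at the structure. For real work, the browser's Inspect view is usually more convenient, as the lecture says.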
134 00:07:36,050 --> 00:07:39,988 And also the other division, which is this one here, 135 00:07:39,989 --> 00:07:42,666 has this h2 tag and the paragraph tag. 136 00:07:42,667 --> 00:07:44,900 And Tokyo also has the same thing. 137 00:07:45,670 --> 00:07:49,048 So our duty now is to 138 00:07:49,049 --> 00:07:53,830 extract the names of these cities. 139 00:07:53,831 --> 00:07:57,788 So that should be the text of the 140 00:07:57,789 --> 00:08:00,900 h2 tags inside the divs with the cities class. 141 00:08:01,530 --> 00:08:05,756 So naturally you start thinking about iterating through 142 00:08:05,757 --> 00:08:10,368 these boxes, which are actually divisions, so you 143 00:08:10,369 --> 00:08:12,608 want to go through here, here and here 144 00:08:12,609 --> 00:08:14,890 and extract what you want to extract. 145 00:08:16,350 --> 00:08:21,412 So we go back to the code, and what you 146 00:08:21,413 --> 00:08:27,810 want to do is call a method called find_all on the soup. 147 00:08:27,811 --> 00:08:31,630 And what you want to find is divs. 148 00:08:31,966 --> 00:08:38,766 So, divs, but there may be lots of divs in the webpage. 149 00:08:39,030 --> 00:08:41,539 So, for instance, we have two more divs here. 150 00:08:42,710 --> 00:08:44,824 And we don't want these to be found, 151 00:08:44,825 --> 00:08:46,360 we only want these three. 152 00:08:46,766 --> 00:08:49,692 But these three, as you see, have a 153 00:08:49,693 --> 00:08:54,570 common class attribute, which is equal to cities. 154 00:08:54,571 --> 00:08:56,920 So we want to make use of that. 155 00:08:57,790 --> 00:09:00,752 And we pass here a dictionary, which 156 00:09:00,753 --> 00:09:07,533 should map class to cities. 157 00:09:09,230 --> 00:09:12,700 Okay, and let me create a variable here, 158 00:09:12,900 --> 00:09:15,940 and call it all and execute it. 
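The filtering idea can be sketched like this. `SAMPLE_HTML` is a hypothetical stand-in for the page (three city divs plus two others), and the variable is named `all_divs` here instead of `all`, because `all` would shadow a Python built-in.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: three "cities" divs and two others.
SAMPLE_HTML = b"""<html><body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
<div class="other"><p>Some other block.</p></div>
<div class="other"><p>Another block.</p></div>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# All divs on the page, regardless of class.
every_div = soup.find_all("div")

# Only the divs whose class attribute is "cities".
all_divs = soup.find_all("div", {"class": "cities"})

print(len(every_div), len(all_divs))
```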
159 00:09:15,941 --> 00:09:18,633 Now, if you print all, 160 00:09:18,634 --> 00:09:22,333 [No Audio] 161 00:09:22,334 --> 00:09:27,620 you'll see that the divisions have been extracted from the source code. 162 00:09:28,230 --> 00:09:32,500 So from the soup, which was the entire source code. 163 00:09:33,030 --> 00:09:37,100 And I'd like you to actually look closely here. 164 00:09:37,101 --> 00:09:38,332 You can see that the first 165 00:09:38,333 --> 00:09:41,404 division is followed by a comma here. 166 00:09:41,405 --> 00:09:44,146 And then the second division starts, for Paris. 167 00:09:44,147 --> 00:09:48,848 Paris is the second, and it ends here. 168 00:09:48,849 --> 00:09:50,656 And then Tokyo starts here. 169 00:09:50,657 --> 00:09:52,272 So we've got a list with 170 00:09:52,273 --> 00:09:54,810 three elements, one for each division. 171 00:09:55,710 --> 00:09:58,048 Now, if you want to find only the 172 00:09:58,049 --> 00:10:03,498 first element with this class attribute of cities, 173 00:10:03,499 --> 00:10:08,800 you'd want to use the find method, without all. 174 00:10:10,050 --> 00:10:13,233 So in this case, you don't get a list, but you get the 175 00:10:14,566 --> 00:10:17,860 code for the division, for the first division only, 176 00:10:18,710 --> 00:10:20,299 which happens to be 177 00:10:20,300 --> 00:10:23,533 [No Audio] 178 00:10:23,534 --> 00:10:25,830 a Tag object of BeautifulSoup. 179 00:10:26,490 --> 00:10:29,788 So it's not a plain string, but it's a 180 00:10:29,789 --> 00:10:33,010 special, let's say a special BeautifulSoup object. 181 00:10:33,011 --> 00:10:36,652 So BeautifulSoup knows its structure, so it 182 00:10:36,653 --> 00:10:39,968 knows what the elements are, so where the tags are 183 00:10:39,969 --> 00:10:41,440 and where the text is and so on. 184 00:10:41,441 --> 00:10:44,066 So that BeautifulSoup is able to give you 185 00:10:44,067 --> 00:10:47,357 the information that you are looking for. 
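To make the difference between `find_all` and `find` concrete, here is a small sketch on a hypothetical stand-in for the page (`SAMPLE_HTML` is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = b"""<html><body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet with every match.
all_divs = soup.find_all("div", {"class": "cities"})

# find returns only the first match, as a single Tag object.
first_div = soup.find("div", {"class": "cities"})

print(type(all_divs))
print(type(first_div))
print(first_div)
```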
186 00:10:49,100 --> 00:10:51,170 So, all again. 187 00:10:51,171 --> 00:10:53,194 So you extract the first element. 188 00:10:53,195 --> 00:10:55,818 Now, an alternative way to extract 189 00:10:55,819 --> 00:10:58,030 the first element, logically, 190 00:10:59,250 --> 00:11:05,490 since we have all elements here, is to use list indexing. 191 00:11:07,030 --> 00:11:10,760 So this object that I just showed you, 192 00:11:10,761 --> 00:11:16,418 the ResultSet object of BeautifulSoup, supports indexing. 193 00:11:16,419 --> 00:11:18,120 So you execute that. 194 00:11:19,050 --> 00:11:21,300 And in this case, as you see, 195 00:11:22,533 --> 00:11:26,140 you extracted the first item of the ResultSet object. 196 00:11:26,141 --> 00:11:27,820 Or you could do it like this. 197 00:11:27,821 --> 00:11:30,512 So you grab all of them. 198 00:11:30,513 --> 00:11:31,872 So here you have all of them, 199 00:11:31,873 --> 00:11:34,432 and zero is the first one. 200 00:11:34,433 --> 00:11:35,950 You get the idea? 201 00:11:35,951 --> 00:11:38,768 Okay, but what if you want only the 202 00:11:38,769 --> 00:11:43,200 h2 tags from this div? 203 00:11:43,970 --> 00:11:48,228 Well, in that case, what you'd want to do is refer to 204 00:11:48,229 --> 00:11:52,880 the all object, and then apply the find_all method again. 205 00:11:54,290 --> 00:11:57,966 And this time, you'd want to get the h2 element. 206 00:11:57,967 --> 00:12:01,518 And in this case, you don't have a class attribute, so you'll 207 00:12:01,519 --> 00:12:04,622 have to leave it like that, and you get an error. 208 00:12:04,623 --> 00:12:07,938 Because what I did here is I didn't 209 00:12:07,939 --> 00:12:13,596 point to this division, but I pointed to 210 00:12:13,597 --> 00:12:19,634 the list containing all these divisions. 211 00:12:19,635 --> 00:12:23,024 So Python is trying to get the h2, but 212 00:12:23,025 --> 00:12:26,790 this ResultSet object doesn't have a find_all method. 
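The error described here, and the fix of pointing to one element first, can be sketched as follows. `SAMPLE_HTML` is a hypothetical stand-in for the page, and `all_divs` is used instead of `all` to avoid shadowing the built-in.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = b"""<html><body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
all_divs = soup.find_all("div", {"class": "cities"})

# Calling find_all on the whole ResultSet fails: it is a list of tags,
# not a single tag.
try:
    all_divs.find_all("h2")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print("Error:", error_message)

# Pointing to one element first works: index the list, search inside
# that div, take the first h2, then read its text.
first_city = all_divs[0].find_all("h2")[0].text
print(first_city)
```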
213 00:12:26,791 --> 00:12:29,184 So what you want to do is, you want to 214 00:12:29,185 --> 00:12:35,268 point to the first element of the list, and 215 00:12:35,269 --> 00:12:39,090 that gives you the h2 element with its tags 216 00:12:39,091 --> 00:12:43,160 and text, which is inside a list. 217 00:12:43,161 --> 00:12:46,680 So you want to perform a zero indexing there. 218 00:12:46,681 --> 00:12:49,288 And if you want London only, you 219 00:12:49,289 --> 00:12:51,766 apply text, and you get London. 220 00:12:51,767 --> 00:12:53,830 [No Audio] 221 00:12:53,831 --> 00:12:55,256 So this is what we wanted, 222 00:12:55,257 --> 00:12:57,110 right, to extract the cities. 223 00:12:57,633 --> 00:13:00,940 So we extracted London. Now, 224 00:13:00,941 --> 00:13:03,750 how about extracting Paris and Tokyo? 225 00:13:04,810 --> 00:13:09,154 Well, as you might guess, we need to use a for loop. 226 00:13:09,155 --> 00:13:12,128 But first, let me summarize what we did here. 227 00:13:12,129 --> 00:13:15,392 So we loaded the content up 228 00:13:15,393 --> 00:13:17,552 here, which is this one here. 229 00:13:17,553 --> 00:13:18,832 And then we loaded this 230 00:13:18,833 --> 00:13:22,298 content into the BeautifulSoup class. 231 00:13:22,299 --> 00:13:25,924 And BeautifulSoup makes this soup beautiful, 232 00:13:25,925 --> 00:13:28,666 so that it recognizes these tags. 233 00:13:28,667 --> 00:13:32,564 And so what we did then is 234 00:13:32,565 --> 00:13:38,670 we extracted from this content all the division elements. 235 00:13:38,671 --> 00:13:41,928 So together with the tags, and the attributes and 236 00:13:41,929 --> 00:13:45,986 the text inside them, so everything inside these divisions 237 00:13:45,987 --> 00:13:49,756 with a class equal to cities. Then we can 238 00:13:49,757 --> 00:13:52,600 perform, for each of these 239 00:13:52,601 --> 00:13:54,433 [No Audio] 240 00:13:54,434 --> 00:13:56,600 elements of this list. 
241 00:13:57,130 --> 00:14:00,108 We can perform again a find_all method, so 242 00:14:00,109 --> 00:14:04,533 we can find subtags of these division tags. 243 00:14:04,806 --> 00:14:08,246 And in this case, we found the h2 tags. 244 00:14:08,247 --> 00:14:11,360 And then we grabbed the first item of the list, which in 245 00:14:11,361 --> 00:14:14,762 this case happened to be a list with only one item. 246 00:14:14,763 --> 00:14:19,066 So each of these divisions has one h2 tag. 247 00:14:19,067 --> 00:14:21,620 Or alternatively, you could just use find 248 00:14:21,621 --> 00:14:25,030 here, without using this indexing. 249 00:14:25,031 --> 00:14:27,374 But this is a general method. 250 00:14:27,375 --> 00:14:31,032 And then we apply the text attribute there. 251 00:14:31,033 --> 00:14:34,633 So to extract the text out of this element. 252 00:14:34,974 --> 00:14:36,390 So we got London. 253 00:14:36,970 --> 00:14:39,548 Now we need to do the same, 254 00:14:39,549 --> 00:14:41,874 but in this case by iterating. 255 00:14:41,875 --> 00:14:49,500 So for, let's say, item in all, you want to print out. 256 00:14:51,150 --> 00:14:55,232 So item here is this one here. 257 00:14:55,233 --> 00:14:57,782 So this would be the first item. 258 00:14:57,783 --> 00:15:04,566 So you want to print out the item.find_all, 259 00:15:06,133 --> 00:15:12,160 and you want to find the h2 tags from this first item, for example. 260 00:15:12,950 --> 00:15:18,824 So h2 tags. And then you need to apply this zero indexing there 261 00:15:18,825 --> 00:15:22,120 and you want to grab the text from this. 262 00:15:22,121 --> 00:15:23,300 And that's it. 263 00:15:24,470 --> 00:15:26,092 Here is the data. 264 00:15:26,093 --> 00:15:28,652 Alternatively, you could just pass p 265 00:15:28,653 --> 00:15:30,950 here, and you get the paragraphs. 266 00:15:32,650 --> 00:15:35,320 So these ones here, the text. 
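Putting the loop together, a minimal sketch might look like this. `SAMPLE_HTML` is a hypothetical stand-in for the page, `all_divs` replaces the lecture's `all` to avoid shadowing the built-in, and collecting the results into lists is an addition for illustration (the lecture simply prints them).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = b"""<html><body>
<div class="cities"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
<div class="other"><p>Some other block.</p></div>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
all_divs = soup.find_all("div", {"class": "cities"})

names = []
paragraphs = []
for item in all_divs:
    # General form: find_all returns a list, so take item 0, then its text.
    names.append(item.find_all("h2")[0].text)
    # Shorter form when only the first match matters: find, no indexing.
    paragraphs.append(item.find("p").text)

for name, paragraph in zip(names, paragraphs):
    print(name, "|", paragraph)
```

The same loop works for any number of city divs, which is the point made at the start of the lecture.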
267 00:15:37,130 --> 00:15:40,454 So that's the idea of loading webpages 268 00:15:40,455 --> 00:15:44,198 in Python, and parsing them with BeautifulSoup 269 00:15:44,199 --> 00:15:49,070 and extracting text out of the webpage. 270 00:15:49,071 --> 00:15:52,058 So, sorry if I was a bit repetitive 271 00:15:52,059 --> 00:15:55,332 in explaining this stuff, but I really want 272 00:15:55,333 --> 00:15:58,750 to make sure you understand the core concepts. 273 00:15:59,570 --> 00:16:01,620 On the other hand, if you found this 274 00:16:01,621 --> 00:16:05,188 very basic, I would say let's move on 275 00:16:05,189 --> 00:16:08,362 to the next lectures, where we'll be extracting 276 00:16:08,363 --> 00:16:11,280 some information from a more advanced website. 277 00:16:11,890 --> 00:16:15,060 And we'll be extracting links and not only text. 278 00:16:15,061 --> 00:16:17,628 So that's a real world program, 279 00:16:17,629 --> 00:16:19,610 and a very interesting one. 280 00:16:19,611 --> 00:16:21,666 So I'll talk to you later.