1 00:00:00,170 --> 00:00:03,716 Hey there, I'm glad you're watching this. 2 00:00:03,717 --> 00:00:05,732 It's great that you have made it this far 3 00:00:05,733 --> 00:00:09,242 in the course. And in this lecture, actually throughout 4 00:00:09,243 --> 00:00:12,740 the next few lectures, you'll learn how 5 00:00:12,741 --> 00:00:15,680 to scrape data from this website. 6 00:00:16,290 --> 00:00:20,068 So this is a real estate website and what it 7 00:00:20,069 --> 00:00:24,906 does, it lists properties for sale or for rent. 8 00:00:24,907 --> 00:00:29,350 So basically let's say Rock Springs. 9 00:00:30,010 --> 00:00:31,442 There are actually a few Rock 10 00:00:31,443 --> 00:00:33,522 Springs there, so let's say Wyoming, 11 00:00:33,523 --> 00:00:35,026 Rock Springs in Wyoming. 12 00:00:35,027 --> 00:00:37,466 So we are looking for some properties there 13 00:00:37,467 --> 00:00:55,233 [No Audio] 14 00:00:55,234 --> 00:01:00,010 in Wyoming, and it says it found 28 listings. 15 00:01:00,011 --> 00:01:03,066 So it's quite a small city, small town. 16 00:01:03,067 --> 00:01:07,288 So the idea is that you learn how to scrape data 17 00:01:07,289 --> 00:01:11,700 for each of these properties, which can be the price, 18 00:01:13,033 --> 00:01:15,838 you have the address there and the number of beds 19 00:01:15,839 --> 00:01:19,066 that the property has and baths and so on. 20 00:01:19,530 --> 00:01:22,274 And you also get the square feet 21 00:01:22,275 --> 00:01:25,532 of the properties, if that is available. 22 00:01:25,533 --> 00:01:28,988 Some properties don't have that, so we 23 00:01:28,989 --> 00:01:30,492 have to account for that too. 24 00:01:30,493 --> 00:01:35,782 And also, you will scrape data from multiple pages.
25 00:01:35,783 --> 00:01:39,712 So we have ten properties here in this first 26 00:01:39,713 --> 00:01:41,632 page and then in the next page we have 27 00:01:41,633 --> 00:01:44,416 ten others, and then in the third, the last 28 00:01:44,417 --> 00:01:47,892 page, we have the rest which should be eight. 29 00:01:47,893 --> 00:01:49,620 So 28 in total. 30 00:01:49,621 --> 00:01:52,196 Now normally, I assume that you know about 31 00:01:52,197 --> 00:01:55,082 the requests and the BeautifulSoup libraries. 32 00:01:55,083 --> 00:01:57,288 So you should have taken the previous lectures, where 33 00:01:57,289 --> 00:02:01,160 we scraped data from a simple web page. 34 00:02:01,161 --> 00:02:03,896 So that was a trivial example and I believe 35 00:02:03,897 --> 00:02:09,032 after that example, you end up asking, "now what?" 36 00:02:09,033 --> 00:02:13,644 So for that reason, I want you to 37 00:02:13,645 --> 00:02:16,508 learn how to scrape some real data. 38 00:02:16,509 --> 00:02:18,492 So this is one of the real-world 39 00:02:18,493 --> 00:02:20,716 programs that we are building in this course. 40 00:02:20,717 --> 00:02:23,056 And in the script that we are about to write, 41 00:02:23,057 --> 00:02:27,072 you'll face some real programming issues, which is very 42 00:02:27,073 --> 00:02:30,890 important to build up your skills, your Python skills. 43 00:02:31,710 --> 00:02:35,428 Just one note, before scraping data from a 44 00:02:35,429 --> 00:02:38,522 website, it's good to read the data policies 45 00:02:38,523 --> 00:02:40,868 of that website, so they may have some 46 00:02:40,869 --> 00:02:44,564 policies against using or getting their data. 47 00:02:44,565 --> 00:02:47,460 I'm using this for educational purposes, so 48 00:02:47,461 --> 00:02:50,264 I believe you'd be doing the same. 49 00:02:50,265 --> 00:02:52,020 So that shouldn't be a problem. 50 00:02:52,630 --> 00:02:56,340 So let's go ahead and write the program.
51 00:02:56,950 --> 00:02:59,670 And I'll be using the Jupyter notebook. 52 00:03:01,210 --> 00:03:03,160 I suggest you do the same. 53 00:03:03,166 --> 00:03:06,810 [Author Typing] 54 00:03:06,811 --> 00:03:10,514 So this will create a Jupyter notebook file. 55 00:03:10,515 --> 00:03:17,820 So I'm using Python 3, and I'll call this century21. 56 00:03:17,833 --> 00:03:20,590 [No Audio] 57 00:03:20,591 --> 00:03:24,612 Great, so you now know that the very first thing 58 00:03:24,613 --> 00:03:27,396 you want to do when writing a program is 59 00:03:27,397 --> 00:03:31,620 import the libraries that you'll be using. 60 00:03:31,621 --> 00:03:38,200 So you'll be using requests and BeautifulSoup, so from bs4 import, 61 00:03:38,201 --> 00:03:40,310 [Author Typing] 62 00:03:40,311 --> 00:03:44,712 great. Now go to the next line and 63 00:03:44,713 --> 00:03:46,180 let's go back to the website. 64 00:03:47,190 --> 00:03:51,460 So now the first thing you may want to think about is, 65 00:03:51,990 --> 00:03:56,050 how do you load the source code of the web pages? 66 00:03:56,870 --> 00:04:00,136 Now actually this is a bit complicated, not very 67 00:04:00,137 --> 00:04:04,720 complicated, but it's different from the static web page 68 00:04:04,721 --> 00:04:07,850 that we scraped in the previous lectures. 69 00:04:08,670 --> 00:04:11,152 So here we are going to be 70 00:04:11,153 --> 00:04:13,600 scraping three pages, as I said. 71 00:04:13,601 --> 00:04:15,972 And the good thing is that every one 72 00:04:15,973 --> 00:04:18,750 of these pages has a unique URL. 73 00:04:19,033 --> 00:04:24,499 So, when you are on the main page here, 74 00:04:24,770 --> 00:04:27,496 you see the URL is just plain simple. 75 00:04:27,497 --> 00:04:32,078 But then when you search for a place, so look at the URL 76 00:04:32,079 --> 00:04:34,820 now, when I search, the URL will change. 77 00:04:35,350 --> 00:04:40,098 So it went to real-estate/rock-springs and Wyoming.
78 00:04:40,099 --> 00:04:44,556 So this is the string for the place 79 00:04:44,557 --> 00:04:48,898 that you searched, Rock Springs and wy. 80 00:04:48,899 --> 00:04:50,732 And you also got something else here 81 00:04:50,733 --> 00:04:53,392 that you need to be aware of. 82 00:04:53,393 --> 00:04:58,214 So the idea is that, now you get this URL. 83 00:04:58,215 --> 00:05:01,552 So first we'll be scraping the first page only. 84 00:05:01,553 --> 00:05:05,524 And once we grab that, then we think about the 85 00:05:05,525 --> 00:05:09,588 next pages, because the next pages are the same. 86 00:05:09,589 --> 00:05:12,850 So the structure is the same, but we'll just be 87 00:05:12,851 --> 00:05:16,030 writing a loop to iterate through the next pages. 88 00:05:16,710 --> 00:05:19,140 So let's go ahead and load the first page. 89 00:05:19,910 --> 00:05:25,200 Let's say requests.get, here is the URL 90 00:05:25,201 --> 00:05:29,545 [No Audio] 91 00:05:29,546 --> 00:05:33,080 and we want the content of this response object. 92 00:05:33,930 --> 00:05:38,588 So r.content and let's print it out. 93 00:05:38,589 --> 00:05:40,200 So a simple test there. 94 00:05:40,666 --> 00:05:42,100 Hmm, Hmm. 95 00:05:42,101 --> 00:05:44,990 [No Audio] 96 00:05:44,991 --> 00:05:48,832 Alright, so my Internet connection is working. 97 00:05:48,833 --> 00:05:51,936 That's all we know about this 98 00:05:51,937 --> 00:05:53,968 code because we can't read it. 99 00:05:53,969 --> 00:05:56,928 So what we want to do is, go to 100 00:05:56,929 --> 00:06:00,794 the next cell and make that code more readable. 101 00:06:00,795 --> 00:06:04,333 So we need to use the BeautifulSoup library here, 102 00:06:04,334 --> 00:06:07,166 [Author Typing] 103 00:06:07,167 --> 00:06:12,333 c and the parser, which is html.parser. 104 00:06:12,334 --> 00:06:14,400 [No Audio] 105 00:06:14,401 --> 00:06:20,230 And maybe print out soup.prettify(). 106 00:06:21,530 --> 00:06:26,120 Now let's see, so this is the page.
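The loading and parsing steps described above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook from the lecture: the URL is a placeholder (the full address is not spelled out in the transcript), and the function names `fetch_content` and `make_soup`, as well as the tiny sample page, are hypothetical and introduced only for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder only: the real search URL is not given in full in the lecture.
URL = "https://example.com/real-estate/placeholder"


def fetch_content(url):
    """Download a page and return its raw bytes (r.content in the lecture)."""
    r = requests.get(url, timeout=10)
    return r.content


def make_soup(content):
    """Parse raw HTML with Python's built-in html.parser."""
    return BeautifulSoup(content, "html.parser")


# Offline demonstration on a hand-written page, so no network is needed.
sample = b"<html><head><title>Listings</title></head><body><p>28 listings</p></body></html>"
soup = make_soup(sample)
print(soup.prettify())

# In the notebook you would instead run (requires network access):
# c = fetch_content(URL)
# soup = make_soup(c)
```

Printing `r.content` directly gives one long, hard-to-read byte string, which is why the lecture passes it through BeautifulSoup and `prettify()` before looking at it.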
107 00:06:27,770 --> 00:06:31,292 Sometimes you may get kicked out of the web page. 108 00:06:31,293 --> 00:06:33,500 So you may want to make sure 109 00:06:34,030 --> 00:06:37,494 that the page has loaded correctly. 110 00:06:37,495 --> 00:06:42,800 So maybe I could go here and search for something. 111 00:06:42,966 --> 00:06:46,700 So Winchester, I'll search for Winchester here. 112 00:06:46,701 --> 00:06:48,700 [No Audio] 113 00:06:48,701 --> 00:06:52,460 Yeah, so the page seems to have loaded correctly 114 00:06:52,830 --> 00:06:56,066 and we don't need the prettify there. 115 00:06:56,067 --> 00:06:58,808 So let me clean the notebook there. 116 00:06:58,809 --> 00:07:00,558 So we were able to load 117 00:07:00,559 --> 00:07:02,782 the page correctly with requests. 118 00:07:02,783 --> 00:07:03,976 And now what is next? 119 00:07:03,977 --> 00:07:05,960 Well, next we need to understand 120 00:07:05,961 --> 00:07:08,040 the structure of the web page. 121 00:07:08,041 --> 00:07:12,000 So we need to use the inspect tool in our browser. 122 00:07:12,190 --> 00:07:15,200 And we'll do that in the next lecture, so see you.
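The sanity check mentioned above, making sure the page really loaded, can also be approximated in code instead of searching the printed output by hand. This is only a hedged sketch: the helper name `looks_loaded` and the sample bytes are hypothetical and stand in for the real `r.content`.

```python
def looks_loaded(content, expected_text):
    """Return True if the expected text appears in the downloaded page.

    content is the raw bytes from r.content; expected_text is something you
    know should be on the page, such as the name of the place you searched.
    """
    html = content.decode("utf-8", errors="replace")
    return expected_text.lower() in html.lower()


# Stand-in for r.content, so the example runs offline.
sample_content = b"<html><body><h1>Homes for sale in Rock Springs, WY</h1></body></html>"

print(looks_loaded(sample_content, "Rock Springs"))
print(looks_loaded(sample_content, "Access Denied"))
```

With a real response you could additionally look at `r.status_code` before parsing, since a blocked request often comes back with an error status rather than the listings page.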