1 00:00:00,633 --> 00:00:03,840 Great, so let's go to our page. 2 00:00:03,840 --> 00:00:09,200 [No Audio] 3 00:00:09,210 --> 00:00:11,692 So we have searched for Rock Springs there. 4 00:00:11,693 --> 00:00:13,692 And what we want to do now is 5 00:00:13,693 --> 00:00:16,412 to understand the structure of the web page. 6 00:00:16,413 --> 00:00:18,920 So we use the Inspect tool there. 7 00:00:18,933 --> 00:00:22,666 [No Audio] 8 00:00:22,666 --> 00:00:27,852 And the logic is that you want to 9 00:00:27,853 --> 00:00:31,690 iterate through all of these boxes, 10 00:00:31,691 --> 00:00:37,812 so to speak. So we grab the HTML of these boxes and 11 00:00:37,813 --> 00:00:41,770 then we go inside each of those HTML blocks. 12 00:00:41,771 --> 00:00:45,608 So we iterate through those HTML blocks, and then we find 13 00:00:45,609 --> 00:00:50,664 the tags for the data that we want to get. 14 00:00:50,665 --> 00:00:52,728 So that brings us to the point that we should 15 00:00:52,729 --> 00:00:57,430 be looking for the elements that identify these boxes. 16 00:00:58,090 --> 00:01:04,652 So if I go to Inspect here again, I'll see that 17 00:01:04,653 --> 00:01:08,620 this whole box has this div with this id. 18 00:01:09,470 --> 00:01:11,696 And actually I want the level above. 19 00:01:11,697 --> 00:01:14,112 So I want the entire box there. 20 00:01:14,113 --> 00:01:16,640 So if I go here now, this 21 00:01:16,641 --> 00:01:20,766 looks like the entire box here. 22 00:01:20,990 --> 00:01:24,372 So inside here should be the data. 23 00:01:24,373 --> 00:01:28,080 This is the picture and this is the Price. 24 00:01:28,850 --> 00:01:32,840 So, here is the text for the price. 25 00:01:32,841 --> 00:01:36,168 And we have this propPrice as a class for 26 00:01:36,169 --> 00:01:40,340 the h4 tag, which displays this number here. 27 00:01:40,340 --> 00:01:42,390 [No Audio] 28 00:01:42,391 --> 00:01:46,770 So, this is the first div.
29 00:01:47,930 --> 00:01:52,410 Then we should have the next div somewhere here. 30 00:01:52,411 --> 00:01:55,356 So propertyRow, propertyRow here, 31 00:01:55,357 --> 00:01:58,288 propertyRow, propertyRow again. 32 00:01:58,289 --> 00:02:00,330 So the class is propertyRow. 33 00:02:02,030 --> 00:02:03,420 And again here. 34 00:02:03,433 --> 00:02:06,540 [No Audio] 35 00:02:06,541 --> 00:02:09,900 And let me put this down here. 36 00:02:10,560 --> 00:02:13,218 So I'm docking it at the bottom, so that we 37 00:02:13,219 --> 00:02:15,266 can see the entire box there, 38 00:02:15,267 --> 00:02:17,460 [No Audio] 39 00:02:17,461 --> 00:02:22,754 like this. So here we go. 40 00:02:22,755 --> 00:02:26,438 So what I would like to scrape and save in 41 00:02:26,439 --> 00:02:30,268 a CSV file or Excel file with Pandas later is this: I'm 42 00:02:30,269 --> 00:02:34,492 going to get the Price, the Address, the number of Beds, 43 00:02:34,493 --> 00:02:40,366 the number of Baths, the Area of the Property, and also 44 00:02:41,433 --> 00:02:45,408 the Lot Size, if there is a Lot. 45 00:02:45,409 --> 00:02:48,280 So some properties don't have a Lot there. 46 00:02:48,281 --> 00:02:50,964 So there is a trick for that, and you're 47 00:02:50,965 --> 00:02:52,734 going to learn how to handle that. 48 00:02:52,735 --> 00:02:54,302 And so these are the data that 49 00:02:54,303 --> 00:02:57,050 I'm going to grab from each property. 50 00:02:58,060 --> 00:03:00,590 And let me put this up here again. 51 00:03:00,600 --> 00:03:03,200 [No Audio] 52 00:03:03,201 --> 00:03:06,184 And you can either click here and Inspect, 53 00:03:06,185 --> 00:03:09,554 so it goes directly to the element, 54 00:03:09,555 --> 00:03:15,030 to the Price, or go manually, which is 55 00:03:15,031 --> 00:03:18,198 probably better because it helps you better understand 56 00:03:18,200 --> 00:03:21,200 the website, the structure of the web page. 57 00:03:21,460 --> 00:03:23,350 So this is the Price.
58 00:03:23,351 --> 00:03:26,202 And then down here, inside here should 59 00:03:26,203 --> 00:03:31,968 be the address elements, primaryDetails. 60 00:03:31,969 --> 00:03:33,408 These are the Beds. 61 00:03:33,409 --> 00:03:38,254 So if we expand this, you'll see that 62 00:03:38,255 --> 00:03:40,490 here is the text for the Address. 63 00:03:41,580 --> 00:03:46,410 So it has a span with this class name. 64 00:03:47,580 --> 00:03:50,626 And this here is the name of the town, the 65 00:03:50,627 --> 00:03:56,114 code for the state and the zip code as well, great. 66 00:03:56,115 --> 00:03:59,282 But first of all, as I said, I need to 67 00:03:59,283 --> 00:04:03,490 go through the divs with this propertyRow class. 68 00:04:04,340 --> 00:04:06,130 So let's do that here. 69 00:04:06,820 --> 00:04:10,742 And you know that you have a method called 70 00:04:10,743 --> 00:04:15,274 find_all, which applies to the soup object. 71 00:04:15,275 --> 00:04:20,026 So find_all, and this will generate a list with all the div 72 00:04:20,027 --> 00:04:28,884 elements that have a class of, what was it, propertyRow? 73 00:04:28,885 --> 00:04:30,600 Yeah, it's propertyRow. 74 00:04:31,980 --> 00:04:37,854 propertyRow with capital R and that's it. 75 00:04:37,855 --> 00:04:41,420 So Alt+Enter, execute that. 76 00:04:41,421 --> 00:04:44,946 And so what can you do? 77 00:04:44,947 --> 00:04:49,790 Well, print out all and maybe see what you get. 78 00:04:49,800 --> 00:04:52,340 [No Audio] 79 00:04:52,341 --> 00:04:55,484 So it starts at the very beginning 80 00:04:55,485 --> 00:04:59,460 of the very first propertyRow division. 81 00:04:59,461 --> 00:05:04,342 So that would be the first price, which 82 00:05:04,343 --> 00:05:07,830 was this one here, up here, here. 83 00:05:07,833 --> 00:05:10,680 [No Audio] 84 00:05:10,681 --> 00:05:12,368 Then there should be a comma 85 00:05:12,369 --> 00:05:14,740 here, after the first division ends.
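As a rough sketch of the find_all step just described: the HTML string below is a made-up stand-in for the real page source (only the propertyRow and propPrice class names come from the lecture), and the result is stored in `rows` rather than `all`, the name used in the lecture, because `all` shadows a Python built-in.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the downloaded page source.
html = """
<div class="propertyRow"><h4 class="propPrice">
        $725,000
</h4></div>
<div class="propertyRow"><h4 class="propPrice">
        $452,900
</h4></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every div whose class is propertyRow (capital R).
rows = soup.find_all("div", {"class": "propertyRow"})

# Printing shows the matched divs one after another, separated by commas.
print(rows)
```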
86 00:05:14,740 --> 00:05:17,260 [No Audio] 87 00:05:17,261 --> 00:05:19,950 But anyway, if you don't want to 88 00:05:19,951 --> 00:05:22,590 find that manually, you can do something else. 89 00:05:22,591 --> 00:05:25,610 You can find the length of this 90 00:05:26,140 --> 00:05:28,782 all object, which is like a list. 91 00:05:28,783 --> 00:05:30,490 It's not exactly a list. 92 00:05:31,500 --> 00:05:39,120 Actually, it's a ResultSet object of bs4, of the BeautifulSoup library. 93 00:05:39,121 --> 00:05:43,106 But it works with the len function, just like a list does. 94 00:05:43,107 --> 00:05:47,062 So len(all) and you get 10, and 95 00:05:47,063 --> 00:05:51,090 we have exactly ten results on each page. 96 00:05:51,780 --> 00:05:53,622 So first page here, second page 97 00:05:53,623 --> 00:05:55,510 has ten results and so on. 98 00:05:56,440 --> 00:05:59,792 Now, this is like a list, so it doesn't 99 00:05:59,793 --> 00:06:05,424 have a find_all method, but its elements do. 100 00:06:05,425 --> 00:06:07,344 So let's say the first element, 101 00:06:07,345 --> 00:06:10,244 this element has a find_all method. 102 00:06:10,245 --> 00:06:13,358 So just like you do with BeautifulSoup, so 103 00:06:13,359 --> 00:06:15,900 you applied a method, a find_all method, to BeautifulSoup 104 00:06:15,901 --> 00:06:18,958 to find Tag elements, you can do the 105 00:06:18,959 --> 00:06:22,482 same for the elements of the all list. 106 00:06:22,483 --> 00:06:25,550 So to speak, let's call the ResultSet a list. 107 00:06:25,700 --> 00:06:27,633 And, so 108 00:06:28,080 --> 00:06:36,530 And that means you can apply a find_all method to this source code. 109 00:06:38,180 --> 00:06:40,690 So let's look for the Price. 110 00:06:42,340 --> 00:06:45,354 Well, you can go to Inspect or just 111 00:06:45,355 --> 00:06:50,870 look through here, if that is, yeah, this is 112 00:06:51,480 --> 00:06:56,070 not much code, so I found the Price here.
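A minimal sketch of the two points above, namely that the result of find_all can be measured with len, and that find_all is called on one of its elements rather than on the result itself. The sample HTML is hypothetical and has three rows instead of the ten on the real page; `rows` stands in for the lecture's `all` variable.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with three property boxes.
html = """
<div class="propertyRow"><h4 class="propPrice">$725,000</h4></div>
<div class="propertyRow"><h4 class="propPrice">$452,900</h4></div>
<div class="propertyRow"><h4 class="propPrice">$396,900</h4></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("div", {"class": "propertyRow"})

# The result behaves like a list, so len() works on it.
print(type(rows).__name__)
print(len(rows))

# find_all is applied to a single element, here the first box.
first_row = rows[0]
prices = first_row.find_all("h4", {"class": "propPrice"})
print(prices)
```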
113 00:06:56,860 --> 00:07:01,486 So we have the h4 tags, h4, and 114 00:07:01,487 --> 00:07:06,200 it also has a propertyPrice, a propPrice class. 115 00:07:06,540 --> 00:07:08,142 So let's pass that. 116 00:07:08,143 --> 00:07:11,742 You can choose not to pass that, but the problem 117 00:07:11,743 --> 00:07:14,226 you may run into, if you don't pass the class 118 00:07:14,227 --> 00:07:17,416 name, is that if you have other h4 tags 119 00:07:17,417 --> 00:07:20,562 in the code, Python will extract them as well. 120 00:07:20,563 --> 00:07:24,134 So you need to specify which h4 we want. 121 00:07:24,135 --> 00:07:30,210 So propPrice there, fine, and here is a price. 122 00:07:30,820 --> 00:07:34,850 So this is a Price, but with the tags as well. 123 00:07:36,660 --> 00:07:40,314 And this is actually a list, as you see. 124 00:07:40,315 --> 00:07:45,818 Now, because we have only one price for each property, in 125 00:07:45,819 --> 00:07:49,560 this case, we are allowed to use the find method. 126 00:07:50,300 --> 00:07:53,012 So that would give us not a result 127 00:07:53,013 --> 00:07:57,240 list, but the actual Tag element. 128 00:07:57,900 --> 00:08:01,186 That means we can now apply a text attribute 129 00:08:01,187 --> 00:08:06,274 here, and we get that funky string there. 130 00:08:06,275 --> 00:08:10,530 So things are not that simple in real life, as you see. 131 00:08:10,531 --> 00:08:15,000 But luckily, this whole object is actually 132 00:08:15,001 --> 00:08:19,900 [No Audio] 133 00:08:19,901 --> 00:08:22,466 a string, so it's a plain Python string. 134 00:08:22,924 --> 00:08:26,870 That means you can apply string methods to that object. 135 00:08:27,640 --> 00:08:31,510 So let me Ctrl+Z to remove the type. 136 00:08:31,510 --> 00:08:33,559 [No Audio] 137 00:08:33,560 --> 00:08:35,946 So in this case, what we want to do 138 00:08:35,947 --> 00:08:40,200 is, we want to remove all these characters.
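The difference between find_all and find described above might look roughly like this. The sample box is hypothetical: the second, class-less h4 is added only to show why the class name is passed, and the whitespace around the price imitates the "funky string" from the lecture.

```python
from bs4 import BeautifulSoup

# Hypothetical property box with one price h4 and one unrelated h4.
html = """
<div class="propertyRow">
<h4 class="propPrice">
        $725,000
</h4>
<h4>Open house</h4>
</div>
"""

row = BeautifulSoup(html, "html.parser").find("div", {"class": "propertyRow"})

# Without the class name, every h4 in the box is returned.
all_h4 = row.find_all("h4")

# With the class name, only the price h4 is returned, still inside a list.
price_list = row.find_all("h4", {"class": "propPrice"})

# find returns the Tag itself, so .text can be read from it directly.
price_tag = row.find("h4", {"class": "propPrice"})
raw_price = price_tag.text

print(len(all_h4), len(price_list))
print(type(raw_price))
print(repr(raw_price))  # repr makes the newlines and spaces visible
```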
139 00:08:41,580 --> 00:08:47,204 And a way to do that, that I can think of, is to replace. 140 00:08:47,205 --> 00:08:54,386 So you want to replace the \n with nothing. 141 00:08:54,387 --> 00:08:56,952 So just pass an empty string 142 00:08:56,953 --> 00:08:59,874 there, and see what you get. 143 00:08:59,875 --> 00:09:03,940 Okay, these guys from this century21 website 144 00:09:03,941 --> 00:09:07,170 have decided to make our life difficult. 145 00:09:07,780 --> 00:09:13,036 But, how about applying it again? 146 00:09:13,037 --> 00:09:16,986 So we've got some white space there, as you see. 147 00:09:16,987 --> 00:09:19,818 So we replace the white space with 148 00:09:19,819 --> 00:09:23,626 nothing and we get the actual string, great. 149 00:09:23,627 --> 00:09:26,282 So we sort of know that things 150 00:09:26,283 --> 00:09:28,542 are working well at this point. 151 00:09:28,543 --> 00:09:31,700 And for now, I'm just printing out the results. 152 00:09:31,701 --> 00:09:34,846 So, as I already mentioned, it's good to first 153 00:09:34,847 --> 00:09:39,358 use print statements when you're building your programs, and 154 00:09:39,359 --> 00:09:43,672 then you replace those print statements with other functions 155 00:09:43,673 --> 00:09:46,248 that you want to use for the data you're 156 00:09:46,249 --> 00:09:48,818 getting or other objects that you're working with. 157 00:09:48,819 --> 00:09:52,018 So in our example here, later on we will 158 00:09:52,019 --> 00:09:56,972 be adding some Pandas methods to grab these values 159 00:09:56,973 --> 00:09:59,180 and send them to a CSV file. 160 00:09:59,181 --> 00:10:01,974 So that's the first thing I would like to say. 161 00:10:01,975 --> 00:10:05,066 The second thing is that here we have now 162 00:10:05,067 --> 00:10:07,733 grabbed the value of the first element. 163 00:10:08,280 --> 00:10:10,784 Now we start to think about being efficient.
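The two replace calls described above can be sketched on a plain string. The value of `raw_price` is an assumed example of what the price tag's text might look like, not the exact string from the site.

```python
# Assumed example of the raw text taken from the price tag.
raw_price = "\n            $725,000\n        "

# First pass: drop the newline characters.
no_newlines = raw_price.replace("\n", "")

# Second pass: drop the spaces that are still left.
price = no_newlines.replace(" ", "")

print(repr(price))  # '$725,000'
```

Both calls can also be chained as `raw_price.replace("\n", "").replace(" ", "")`. For a value like this one, with no spaces inside the price itself, `raw_price.strip()` gives the same result, although that is a different technique from the one used in the lecture.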
164 00:10:10,785 --> 00:10:15,680 So instead of going and extracting 165 00:10:15,681 --> 00:10:20,110 all the other elements like the property address and 166 00:10:20,111 --> 00:10:23,758 property state and so on, maybe it's good to 167 00:10:23,759 --> 00:10:26,932 actually start building our for loop. 168 00:10:26,933 --> 00:10:29,982 So we now know that individual values are 169 00:10:29,983 --> 00:10:33,218 being extracted correctly, but now we want to 170 00:10:33,219 --> 00:10:35,602 make sure that a loop that iterates through 171 00:10:35,603 --> 00:10:38,722 all these properties is also working. 172 00:10:38,723 --> 00:10:40,888 And let's go ahead and start writing 173 00:10:40,889 --> 00:10:42,504 the loop in the next lecture. 174 00:10:42,505 --> 00:10:46,134 But for now, let's actually organize this code. 175 00:10:46,135 --> 00:10:48,960 So here is a trick you can use in Jupyter. 176 00:10:50,020 --> 00:10:53,532 So I go to the first cell and I'm in command mode. 177 00:10:53,533 --> 00:10:56,336 So you press Escape to go to the command mode 178 00:10:56,337 --> 00:11:00,310 and with Shift+J, you select the next cell. 179 00:11:00,840 --> 00:11:03,194 J again, select the next cell. 180 00:11:03,195 --> 00:11:07,066 Or you can go up with K, so J, J, J. 181 00:11:07,320 --> 00:11:09,498 And what I want to do now is I want 182 00:11:09,499 --> 00:11:12,378 to merge all these cells into one single cell. 183 00:11:12,379 --> 00:11:15,690 And to do that with Shift pressed, you press M. 184 00:11:15,691 --> 00:11:17,818 So Shift+M and you get all 185 00:11:17,819 --> 00:11:20,770 the cells merged into one single cell. 186 00:11:20,771 --> 00:11:24,162 So this is more an issue of preference, but 187 00:11:24,163 --> 00:11:26,533 it's good to have a clean notebook there. 188 00:11:26,882 --> 00:11:28,990 So this is what we did so far. 189 00:11:29,760 --> 00:11:33,500 Let's go ahead and run the loop in the next lecture, see you.