1 00:00:00,000 --> 00:00:01,715 [No audio] 2 00:00:01,740 --> 00:00:04,500 And yeah, let's close this section on 3 00:00:04,500 --> 00:00:07,200 Pandas by looking at a real world 4 00:00:07,230 --> 00:00:10,170 example, and in this example, what we're 5 00:00:10,170 --> 00:00:13,050 going to do is, we're going to grab the 6 00:00:13,260 --> 00:00:15,810 Address of each of these rows that we 7 00:00:15,810 --> 00:00:18,930 have in the DataFrame, and we're going to 8 00:00:18,930 --> 00:00:21,780 convert it to latitude and longitude 9 00:00:21,780 --> 00:00:24,900 coordinates. So geographic coordinates, 10 00:00:24,900 --> 00:00:27,720 in other words. Now addresses, they 11 00:00:27,720 --> 00:00:31,320 define a unique point on the earth. So 12 00:00:31,320 --> 00:00:35,100 if you pass this address to some 13 00:00:35,100 --> 00:00:37,530 service, let's say you throw it on 14 00:00:37,530 --> 00:00:39,750 Google Maps, then Google Maps will point 15 00:00:39,750 --> 00:00:41,760 to it for you. So it will generate a marker 16 00:00:41,850 --> 00:00:43,320 that will tell you where this point is 17 00:00:43,320 --> 00:00:45,480 located. Now, in the background, that 18 00:00:45,480 --> 00:00:48,360 marker actually has a pair of latitude 19 00:00:48,360 --> 00:00:50,850 and longitude coordinates, and every 20 00:00:50,850 --> 00:00:53,010 point on Earth has such a pair of 21 00:00:53,010 --> 00:00:55,890 coordinates, and if you followed the 22 00:00:55,918 --> 00:00:58,228 section on building web maps with Folium 23 00:00:58,260 --> 00:01:00,480 in the course, then you know how 24 00:01:00,510 --> 00:01:03,330 these latitudes and longitudes 25 00:01:03,330 --> 00:01:05,550 come in handy when you build 26 00:01:05,580 --> 00:01:08,850 applications, such as web maps, or other 27 00:01:08,850 --> 00:01:11,910 maps.
Now the process to convert from 28 00:01:11,910 --> 00:01:14,220 addresses to coordinates is called 29 00:01:14,220 --> 00:01:16,980 geocoding, and if you want to convert 30 00:01:16,980 --> 00:01:20,040 from latitude and longitude to addresses, 31 00:01:20,370 --> 00:01:23,010 that is called reverse geocoding. In 32 00:01:23,010 --> 00:01:24,420 this lecture, we're going to be 33 00:01:24,420 --> 00:01:27,390 looking at geocoding. So basically, what 34 00:01:27,390 --> 00:01:30,420 we're going to do is we'll add here a 35 00:01:30,420 --> 00:01:33,270 column to the DataFrame, actually two 36 00:01:33,270 --> 00:01:35,250 columns, one for latitude, and one for 37 00:01:35,250 --> 00:01:38,490 longitude, for each of the rows. Now, 38 00:01:38,490 --> 00:01:41,100 pandas cannot do that directly. So you 39 00:01:41,100 --> 00:01:43,500 need the help of another library that is 40 00:01:43,525 --> 00:01:48,744 called geopy, and you can install geopy 41 00:01:49,183 --> 00:01:54,049 with pip. So pip install geopy 42 00:01:55,140 --> 00:02:00,570 and just wait a while. Oh, great, and then 43 00:02:01,500 --> 00:02:03,810 you know, before going ahead and applying 44 00:02:03,835 --> 00:02:07,615 the geocoder to my DataFrame values, 45 00:02:08,010 --> 00:02:10,560 I'd like to actually convert a single 46 00:02:10,805 --> 00:02:14,055 address, an address string, with the geocoder.
47 00:02:14,518 --> 00:02:16,440 Something you should be aware of 48 00:02:16,440 --> 00:02:19,530 is that to use geopy, actually to use 49 00:02:19,530 --> 00:02:21,900 the geocoder, which is, if you say 50 00:02:21,925 --> 00:02:27,300 import geopy, and if you say 51 00:02:27,360 --> 00:02:31,230 geopy, like that, you'll see that 52 00:02:31,255 --> 00:02:34,855 you have a geocoder module in there, 53 00:02:35,495 --> 00:02:37,595 which is this one here, it says geocoders 54 00:02:37,620 --> 00:02:41,580 actually, and for the geocoders to 55 00:02:41,580 --> 00:02:44,070 work, you need an internet connection, 56 00:02:44,405 --> 00:02:46,625 because what geocoders will do is, it 57 00:02:46,650 --> 00:02:48,810 will get your address and then it will 58 00:02:48,810 --> 00:02:52,680 send that to an online service that has 59 00:02:52,680 --> 00:02:55,410 all of these addresses in a database, and 60 00:02:55,410 --> 00:02:58,650 then for your address, it will calculate 61 00:02:58,650 --> 00:03:00,330 the corresponding latitude and longitude 62 00:03:00,330 --> 00:03:01,890 values. So you need an internet 63 00:03:01,890 --> 00:03:06,300 connection, and yeah, what you normally 64 00:03:06,325 --> 00:03:07,521 do is, you know, 65 00:03:07,545 --> 00:03:09,700 [No audio] 66 00:03:09,725 --> 00:03:11,575 I want to import from 67 00:03:11,945 --> 00:03:16,505 geopy.geocoders import, actually, 68 00:03:16,530 --> 00:03:18,124 there are a few 69 00:03:18,149 --> 00:03:20,245 [Author typing] 70 00:03:20,270 --> 00:03:23,084 geocoders there, but we will use Nominatim. 71 00:03:23,108 --> 00:03:31,535 [No audio] 72 00:03:31,560 --> 00:03:33,600 And then what you do is, you know, you 73 00:03:33,600 --> 00:03:39,090 create a Nominatim object. So 74 00:03:39,090 --> 00:03:40,650 you store that object in a variable.
75 00:03:40,674 --> 00:03:46,864 [No audio] 76 00:03:46,889 --> 00:03:49,319 And once you have that object, then you 77 00:03:49,319 --> 00:03:52,589 point to the geocode method of the 78 00:03:52,619 --> 00:03:56,099 Nominatim object, and you pass an 79 00:03:56,099 --> 00:03:59,459 address as a string in there. Let's say 80 00:03:59,489 --> 00:04:02,502 395 23rd, 81 00:04:02,527 --> 00:04:04,630 [No audio] 82 00:04:04,661 --> 00:04:06,065 and then maybe the City, 83 00:04:07,613 --> 00:04:11,704 and the zip code 94114, and if you 84 00:04:11,762 --> 00:04:17,759 execute that, you get a Location datatype there, 85 00:04:18,154 --> 00:04:20,914 so all of this. What that includes is, you 86 00:04:20,939 --> 00:04:22,679 know, it includes the address that you 87 00:04:22,679 --> 00:04:26,159 passed there, so this one, and it has also 88 00:04:26,159 --> 00:04:28,139 added United States of America, so the 89 00:04:28,139 --> 00:04:29,520 Country in here, 90 00:04:29,544 --> 00:04:31,693 [No audio] 91 00:04:31,718 --> 00:04:32,819 and you also get the 92 00:04:32,819 --> 00:04:36,449 latitude and longitude, and this one here, 93 00:04:36,479 --> 00:04:38,637 just ignore that, this is a response 94 00:04:38,662 --> 00:04:41,369 from the geocoder. So it doesn't 95 00:04:41,369 --> 00:04:44,909 mean much. Sometimes, it's 96 00:04:44,934 --> 00:04:47,574 rare, but sometimes you may get a None 97 00:04:47,609 --> 00:04:50,099 object. So for instance, if you pass 98 00:04:50,099 --> 00:04:55,259 this address, which probably is not a 99 00:04:55,259 --> 00:04:57,555 real address, I'm not sure about that. I don't know.
100 00:04:57,844 --> 00:05:00,887 But if you say San Francisco, 101 00:05:00,917 --> 00:05:03,364 [No audio] 102 00:05:03,389 --> 00:05:04,490 CA 103 00:05:05,224 --> 00:05:09,934 94119, if you execute that, nothing will 104 00:05:09,959 --> 00:05:12,839 happen, and actually, you can see that 105 00:05:13,445 --> 00:05:17,015 if you store this in a variable, and then print 106 00:05:17,719 --> 00:05:22,144 n, this will say that it's a None 107 00:05:22,169 --> 00:05:23,789 object, so it doesn't have anything 108 00:05:23,789 --> 00:05:26,309 inside. So yeah, be aware of these 109 00:05:26,339 --> 00:05:29,399 scenarios as well, and yeah, we had our 110 00:05:29,459 --> 00:05:31,781 working address, this one here. 111 00:05:32,663 --> 00:05:38,134 Let's store that in this variable, and once 112 00:05:38,159 --> 00:05:40,739 you have that, to extract the latitude 113 00:05:42,329 --> 00:05:44,879 and longitude, you use latitude 114 00:05:44,904 --> 00:05:49,045 for the latitude value and longitude for the longitude, 115 00:05:49,069 --> 00:05:51,069 [No audio] 116 00:05:51,094 --> 00:05:52,275 and that should do it. 117 00:05:52,534 --> 00:05:56,194 Because you know, n, type n, n is a 118 00:05:56,219 --> 00:05:58,259 special object, it's called a Location 119 00:05:58,259 --> 00:06:01,079 object of geopy. So you need to use 120 00:06:01,079 --> 00:06:03,659 those attributes, and yeah, that's how you 121 00:06:03,659 --> 00:06:06,509 convert an address string to a 122 00:06:06,509 --> 00:06:08,609 location, or to latitude and longitude 123 00:06:08,609 --> 00:06:10,859 values. But how about converting an 124 00:06:10,859 --> 00:06:14,039 entire column of a DataFrame into 125 00:06:14,039 --> 00:06:18,029 latitude and longitude?
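As a minimal sketch of the single-address flow just described, assuming nothing beyond what the lecture says: a geocode call returns either a location with latitude and longitude, or None. So that it runs offline, the geocoder here is a made-up stand-in (FakeLocation, fake_geocode, and the sample address are all hypothetical, not part of geopy), and to_lat_lon is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FakeLocation:
    """Hypothetical stand-in for the Location object a geocoder returns."""
    address: str
    latitude: float
    longitude: float


def fake_geocode(address: str) -> Optional[FakeLocation]:
    """Offline stand-in for nom.geocode: knows one made-up address, else None."""
    known = {"1 Example St, Sample City": (10.5, -20.25)}
    if address not in known:
        return None  # mirrors the None result described for an unknown address
    lat, lon = known[address]
    return FakeLocation(address, lat, lon)


def to_lat_lon(geocode: Callable, address: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for an address, or None if nothing was found."""
    location = geocode(address)
    if location is None:
        return None
    # latitude and longitude are read as plain attributes, without brackets
    return location.latitude, location.longitude


found = to_lat_lon(fake_geocode, "1 Example St, Sample City")
missing = to_lat_lon(fake_geocode, "No Such Place")
print(found, missing)
```

With geopy installed and an internet connection, the same helper should work if you pass it the geocode method of a Nominatim object instead of fake_geocode. As far as I know, recent geopy releases expect a user_agent argument when you create the Nominatim object, so check the version you have.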
So we've got 126 00:06:18,029 --> 00:06:23,962 this DataFrame, df equals pandas.read_csv, 127 00:06:23,986 --> 00:06:30,449 say super.csv, and this should 128 00:06:30,449 --> 00:06:34,204 be an underscore, and let me import pandas first, 129 00:06:34,265 --> 00:06:36,841 [No audio] 130 00:06:36,885 --> 00:06:38,278 and print out the DataFrame. 131 00:06:38,940 --> 00:06:40,968 So this is our new DataFrame. 132 00:06:41,194 --> 00:06:42,724 Actually, this is the old one that 133 00:06:42,749 --> 00:06:45,569 we've been using, we have five, six 134 00:06:45,569 --> 00:06:48,719 addresses there, 6 rows with an address, 135 00:06:48,749 --> 00:06:53,909 a city, and state, and country. Now the 136 00:06:53,909 --> 00:06:56,699 geocode method, more or less, accepts 137 00:06:57,029 --> 00:07:00,539 this kind of format. So it expects from 138 00:07:00,539 --> 00:07:04,619 you the road name in here, and then 139 00:07:04,619 --> 00:07:08,849 the city, the zip code in here, and the 140 00:07:08,874 --> 00:07:12,359 country. So what we need to do is 141 00:07:12,359 --> 00:07:16,855 construct such a column in our DataFrame first, 142 00:07:16,884 --> 00:07:18,064 and yeah, you can either 143 00:07:18,089 --> 00:07:20,819 create a new column, or you can add it 144 00:07:20,819 --> 00:07:23,249 to an existing one. So let's say I'll update 145 00:07:23,249 --> 00:07:26,069 the existing Address column. So 146 00:07:26,069 --> 00:07:30,359 that will be equal to df Address. So 147 00:07:30,384 --> 00:07:35,459 this value in here, plus, well, I'll 148 00:07:35,459 --> 00:07:38,939 need a comma in there.
So a comma and maybe 149 00:07:38,939 --> 00:07:44,854 a space and plus df City, so a comma 150 00:07:44,884 --> 00:07:47,404 between Address and City, and then 151 00:07:47,429 --> 00:07:53,789 another comma and then plus df State 152 00:07:53,789 --> 00:07:59,369 again, and, yeah, yet another comma like that, 153 00:08:00,705 --> 00:08:04,965 plus, again, df, and lastly, Country. 154 00:08:06,521 --> 00:08:08,221 That should do it. So df. 155 00:08:08,245 --> 00:08:10,607 [No audio] 156 00:08:10,632 --> 00:08:11,944 And yeah, we've got 157 00:08:11,969 --> 00:08:14,099 a complete Address column in there. 158 00:08:15,329 --> 00:08:18,929 Great, and now we need to send this 159 00:08:18,929 --> 00:08:25,409 string to the geocode method, and we need 160 00:08:25,409 --> 00:08:28,859 to do it for all the rows. Now you're 161 00:08:28,859 --> 00:08:31,139 probably thinking of iterating, but with 162 00:08:31,139 --> 00:08:33,149 pandas actually, you don't need to 163 00:08:33,149 --> 00:08:36,569 iterate. Pandas is designed in a way 164 00:08:36,600 --> 00:08:39,060 that it has some methods 165 00:08:39,419 --> 00:08:43,589 that allow you to apply a method or a 166 00:08:43,589 --> 00:08:46,855 function to all the rows of the DataFrame 167 00:08:46,881 --> 00:08:48,779 without having to write a for 168 00:08:48,779 --> 00:08:53,729 loop, and to do that, you know, you'll need 169 00:08:53,729 --> 00:08:57,329 to create a new column. Let's call it 170 00:08:57,329 --> 00:08:59,909 Coordinates, where you store the string. 171 00:09:01,529 --> 00:09:04,109 You know, this long string in here. 172 00:09:05,164 --> 00:09:07,204 So this, actually, it's not a string, 173 00:09:07,229 --> 00:09:09,839 it's a Location object, but you can 174 00:09:09,839 --> 00:09:12,269 store it in your DataFrame. So we need 175 00:09:12,269 --> 00:09:15,059 to store locations for each of the rows.
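The column-building step described above might look roughly like this. It is only a sketch: the column names Address, City, State and Country follow the lecture, but the rows are made-up placeholders rather than the contents of the lecture's CSV file.

```python
import pandas as pd

# Illustrative rows only; the values are invented, the column names follow the lecture.
df = pd.DataFrame({
    "Address": ["1 Example St", "2 Sample Ave"],
    "City": ["Sample City", "Other Town"],
    "State": ["CA 00000", "CA 00001"],
    "Country": ["USA", "USA"],
})

# Adding string columns joins them row by row; ", " is the separator between parts.
df["Address"] = df["Address"] + ", " + df["City"] + ", " + df["State"] + ", " + df["Country"]

print(df["Address"].tolist())
```

Note that this overwrites the existing Address column, as in the lecture. Writing the result to a new column instead would keep the original street address around.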
176 00:09:15,449 --> 00:09:17,159 And the way you do that is, you know, 177 00:09:17,159 --> 00:09:21,479 you point to the column that you want 178 00:09:21,479 --> 00:09:25,949 to pass to your geocoder and then use a 179 00:09:25,949 --> 00:09:29,339 pandas method called apply. So what 180 00:09:29,939 --> 00:09:31,649 method do you want to apply to the 181 00:09:31,649 --> 00:09:34,637 values of the Address column? Well, that'll 182 00:09:34,667 --> 00:09:41,854 be n. So n is the Nominatim object 183 00:09:41,884 --> 00:09:45,574 that we have here. Oh, sorry, it's nom, 184 00:09:45,637 --> 00:09:52,281 sorry, so that will be nom.geocode, 185 00:09:53,642 --> 00:09:58,564 and so the same as this one. So 186 00:09:58,589 --> 00:10:01,499 nom.geocode. But in this case, you 187 00:10:01,499 --> 00:10:04,019 don't pass brackets there, because 188 00:10:04,019 --> 00:10:06,299 the apply method will do it for you. 189 00:10:06,874 --> 00:10:08,884 So just like that, and then, maybe 190 00:10:08,909 --> 00:10:11,279 you print out the DataFrame in there 191 00:10:12,046 --> 00:10:13,191 and see what you get. 192 00:10:13,215 --> 00:10:18,424 [No audio] 193 00:10:18,449 --> 00:10:21,659 I got a service timed out. The geocoder is 194 00:10:21,659 --> 00:10:23,489 not working properly, maybe I have a 195 00:10:23,519 --> 00:10:25,409 problem with my internet connection. So 196 00:10:25,409 --> 00:10:28,349 if you get this long error, that is not 197 00:10:28,349 --> 00:10:31,321 your fault, it is a problem with the geocoder, 198 00:10:31,345 --> 00:10:34,229 with geopy. I'll try that again. 199 00:10:34,254 --> 00:10:38,014 [No audio] 200 00:10:38,039 --> 00:10:40,859 And yeah, this time it worked. It was able to 201 00:10:40,859 --> 00:10:44,999 fetch the Location objects in here. I 202 00:10:44,999 --> 00:10:46,559 mean, you cannot see the latitude and 203 00:10:46,559 --> 00:10:48,989 longitude, because it's a long string.
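Here is a rough sketch of passing a geocoding function to apply without brackets, as described above. To keep it runnable without a network connection, fake_geocode and FakeLocation are hypothetical stand-ins for nom.geocode and the object it returns, and the addresses are invented.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FakeLocation:
    """Hypothetical stand-in for the Location object a geocoder returns."""
    address: str
    latitude: float
    longitude: float


def fake_geocode(address: str) -> Optional[FakeLocation]:
    """Offline stand-in for nom.geocode: one known made-up address, else None."""
    if address == "1 Example St, Sample City, USA":
        return FakeLocation(address, 10.5, -20.25)
    return None


df = pd.DataFrame({
    "Address": ["1 Example St, Sample City, USA", "No Such Place, Nowhere"],
})

# Pass the function itself (no brackets); apply calls it once per value in the column.
df["Coordinates"] = df["Address"].apply(fake_geocode)

# Each cell holds a whole object, so pick one row to inspect its attributes.
first = df["Coordinates"].iloc[0]
print(first.latitude, first.longitude)
```

With the real geocoder, every row is a separate request to the online service, which is presumably why a timeout like the one in the lecture can show up partway through.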
204 00:10:49,834 --> 00:10:51,188 But if you do it like that, 205 00:10:51,212 --> 00:10:53,523 [Author typing] 206 00:10:53,548 --> 00:10:56,129 Coordinates, then you get the series for 207 00:10:56,129 --> 00:11:00,449 coordinates, and, yeah, that is not 208 00:11:00,449 --> 00:11:04,379 showing it either, but you can do it 209 00:11:04,379 --> 00:11:07,469 like, you know, df.Coordinates, and then 210 00:11:07,469 --> 00:11:10,439 you access the first item in it, like 211 00:11:10,439 --> 00:11:13,379 that, and then you get the entire text 212 00:11:13,404 --> 00:11:17,484 for the location. If you add latitude, 213 00:11:18,498 --> 00:11:19,580 you get the latitude only. 214 00:11:19,610 --> 00:11:21,610 [No audio] 215 00:11:21,637 --> 00:11:23,554 And that brings us to the point that 216 00:11:23,579 --> 00:11:25,229 you may want now to 217 00:11:25,229 --> 00:11:27,588 add another two columns in your DataFrame, 218 00:11:27,612 --> 00:11:29,639 where you fetch the latitude and 219 00:11:29,639 --> 00:11:33,089 the longitude values. So our DataFrame 220 00:11:33,114 --> 00:11:37,169 is this one at the moment, and what you 221 00:11:37,169 --> 00:11:38,863 could do is, you know, you could create 222 00:11:38,887 --> 00:11:40,887 [Author typing] 223 00:11:40,903 --> 00:11:43,379 a Latitude column in there, that would 224 00:11:43,404 --> 00:11:49,206 be equal to df.Coordinates.apply. 225 00:11:49,230 --> 00:11:51,766 [No audio] 226 00:11:51,817 --> 00:11:55,645 You know, you cannot apply latitude directly in there. 227 00:11:55,675 --> 00:11:58,037 [No audio] 228 00:11:58,062 --> 00:12:00,089 Because you get this kind of error that 229 00:12:00,089 --> 00:12:02,669 says Series has no attribute latitude. 230 00:12:03,149 --> 00:12:05,369 So you're applying the latitude attribute to a 231 00:12:05,369 --> 00:12:07,739 series. But a series doesn't recognize 232 00:12:07,739 --> 00:12:11,879 that.
What series recognizes is the apply 233 00:12:11,879 --> 00:12:15,591 method. So there you can write your other 234 00:12:15,621 --> 00:12:18,904 methods. Now latitude will 235 00:12:18,934 --> 00:12:22,894 point to the values of this 236 00:12:22,919 --> 00:12:26,009 Coordinates column, and in such a 237 00:12:26,009 --> 00:12:29,009 scenario, you use a lambda function, 238 00:12:29,033 --> 00:12:31,534 [Author typing] 239 00:12:31,559 --> 00:12:33,929 which is an inline way to define a 240 00:12:33,929 --> 00:12:37,739 function. So you'd say lambda x, x is a 241 00:12:37,739 --> 00:12:41,221 temporary variable there. You say x.latitude, 242 00:12:41,245 --> 00:12:43,699 [No audio] 243 00:12:43,724 --> 00:12:45,179 and let's leave it like that 244 00:12:45,179 --> 00:12:50,129 for now. df there, and let's see 245 00:12:50,129 --> 00:12:53,955 what we got. Well, it says that a NoneType 246 00:12:53,980 --> 00:12:55,860 object has no attribute latitude, 247 00:12:56,434 --> 00:12:58,474 so this can be quite tricky if you're 248 00:12:58,499 --> 00:13:01,919 not experienced with geocoding. And yeah, 249 00:13:01,919 --> 00:13:04,619 the reason we got this is that we 250 00:13:04,619 --> 00:13:08,219 have a None value in there among 251 00:13:08,219 --> 00:13:11,789 our rows, and None, which is not a 252 00:13:11,789 --> 00:13:14,729 Location datatype, does not have a 253 00:13:14,729 --> 00:13:18,899 latitude attribute. Because, you know, 254 00:13:18,899 --> 00:13:21,269 what we did here is we are storing all 255 00:13:21,269 --> 00:13:24,359 of these values. So it is like a loop, we 256 00:13:24,359 --> 00:13:26,039 are storing all these values in the 257 00:13:26,069 --> 00:13:28,949 temporary x variable, then for each of 258 00:13:28,949 --> 00:13:31,769 these values, we apply the latitude.
So 259 00:13:31,769 --> 00:13:33,899 what Python will do is, it will go 260 00:13:33,899 --> 00:13:37,199 through the first row, and it will apply 261 00:13:37,199 --> 00:13:39,959 the latitude attribute to the first row. 262 00:13:40,814 --> 00:13:42,754 And it will store it in the Latitudes 263 00:13:42,779 --> 00:13:44,849 column, and then it goes to the second 264 00:13:44,849 --> 00:13:46,739 value, but in this value, latitude 265 00:13:46,739 --> 00:13:48,929 does not exist for None. So you get an 266 00:13:48,929 --> 00:13:51,719 error. To fix that, you could apply 267 00:13:52,229 --> 00:13:54,509 a conditional, an inline if conditional. 268 00:13:55,182 --> 00:13:58,043 You say if x is not 269 00:13:58,068 --> 00:14:00,073 [No audio] 270 00:14:00,097 --> 00:14:03,192 None else None. 271 00:14:04,139 --> 00:14:06,719 Yeah, I know it's a bit confusing, but what we 272 00:14:06,719 --> 00:14:11,069 did is, you know, apply x.latitude if x 273 00:14:11,099 --> 00:14:14,069 is not None. So it will apply this 274 00:14:14,069 --> 00:14:17,279 attribute for those rows, for those values. 275 00:14:18,574 --> 00:14:21,724 Otherwise, it will store None in the 276 00:14:21,749 --> 00:14:24,269 current cell of the Latitude column. 277 00:14:25,559 --> 00:14:27,599 So I hope that is clear. Now I'll 278 00:14:27,599 --> 00:14:31,559 execute here, and yeah, we got the 279 00:14:31,559 --> 00:14:32,879 Latitude column in there. 280 00:14:34,079 --> 00:14:38,270 Great, and we can do the same for Longitude 281 00:14:38,314 --> 00:14:41,131 [No audio] 282 00:14:41,156 --> 00:14:43,590 or Longitudes, and here as well. 283 00:14:43,614 --> 00:14:47,660 [Author typing] 284 00:14:47,685 --> 00:14:50,970 Yeah, that was quick and yeah, that's it. You 285 00:14:50,970 --> 00:14:53,070 have a Latitude and Longitude column in 286 00:14:53,070 --> 00:14:55,320 your DataFrame. So please, have a 287 00:14:55,320 --> 00:14:58,920 second look at what I wrote in here.
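The lambda with the inline if described above could be sketched like this. The Coordinates column is filled by hand with a hypothetical FakeLocation object and a None, standing in for what the geocoder would have returned, so the example runs offline.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeLocation:
    """Hypothetical stand-in for the Location object a geocoder returns."""
    latitude: float
    longitude: float


# One geocoded row and one row where the geocoder found nothing (None).
df = pd.DataFrame({
    "Coordinates": pd.Series([FakeLocation(10.5, -20.25), None], dtype=object),
})

# Read the attribute only when there is a location; otherwise leave the cell empty.
df["Latitude"] = df["Coordinates"].apply(lambda x: x.latitude if x is not None else None)
df["Longitude"] = df["Coordinates"].apply(lambda x: x.longitude if x is not None else None)

print(df[["Latitude", "Longitude"]])
```

Without the inline if, the second row would raise the NoneType error from the lecture. With it, that row simply ends up with a missing value in both new columns.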
So 288 00:14:58,920 --> 00:15:00,690 yeah, we have quite a lot of flexibility 289 00:15:00,690 --> 00:15:03,900 working with DataFrames. I hope you 290 00:15:03,900 --> 00:15:06,540 enjoyed this and I'll talk to you in the 291 00:15:06,565 --> 00:15:08,546 next lectures. See you.