1 00:00:00,760 --> 00:00:02,560 - [Instructor] In this video, we're going to take a look 2 00:00:02,560 --> 00:00:05,680 at how you can use regular expressions with pandas 3 00:00:05,680 --> 00:00:07,940 to validate some data. 4 00:00:07,940 --> 00:00:10,550 So as you can see, we've already imported pandas 5 00:00:10,550 --> 00:00:13,850 as pd, like we've demonstrated previously. 6 00:00:13,850 --> 00:00:17,900 And let's go ahead and create a one-dimensional Series object 7 00:00:17,900 --> 00:00:21,280 that has custom indices, Boston and Miami, 8 00:00:21,280 --> 00:00:24,910 and the values associated with those indices. 9 00:00:24,910 --> 00:00:29,170 Now in the first case we have a valid five-digit ZIP code, 10 00:00:29,170 --> 00:00:32,670 in the second case we do not, so keep that in mind 11 00:00:32,670 --> 00:00:35,620 as we get ready to validate the data. 12 00:00:35,620 --> 00:00:38,420 So I've displayed the zips Series just so you 13 00:00:38,420 --> 00:00:41,120 can see the indices in the left-hand column 14 00:00:41,120 --> 00:00:43,320 and the values in the right-hand column. 15 00:00:43,320 --> 00:00:46,815 Notice, by the way, the data type is object. 16 00:00:46,815 --> 00:00:49,570 Remember that underneath the hood of Series 17 00:00:49,570 --> 00:00:54,260 and DataFrames are NumPy arrays, and in NumPy, 18 00:00:54,260 --> 00:00:58,490 if the value is not numeric, it uses type object 19 00:00:58,490 --> 00:01:03,420 to represent the actual data that's stored in each element. 20 00:01:03,420 --> 00:01:08,420 Now let's go ahead and insert an expression here that's 21 00:01:08,680 --> 00:01:13,680 going to validate the data in the values of this Series.
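A minimal sketch of the Series described above. The transcript does not show the literal values, so the ZIP strings used here ('02215' and '3310') are assumed for illustration: one valid five-digit code and one four-digit code.

```python
import pandas as pd

# Assumed values: the transcript only says Boston's ZIP is valid and Miami's is not.
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

print(zips)        # index labels in the left column, values in the right column
print(zips.dtype)  # typically object for strings; newer pandas versions may report a dedicated string dtype
```

Passing a dictionary is one convenient way to get custom indices: the keys become the index labels and the dictionary values become the Series values.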
22 00:01:14,150 --> 00:01:17,610 So you may recall that you can access various 23 00:01:17,610 --> 00:01:21,700 string processing capabilities for a Series through 24 00:01:21,700 --> 00:01:26,700 the Series str attribute, and in pandas the str attribute 25 00:01:27,320 --> 00:01:30,130 also gives you access to regular 26 00:01:30,130 --> 00:01:32,860 expression processing capabilities. 27 00:01:32,860 --> 00:01:35,940 So this is the same match function that I talked 28 00:01:35,940 --> 00:01:38,260 about briefly in the context of 29 00:01:38,260 --> 00:01:41,370 our Python regular expressions presentations. 30 00:01:41,370 --> 00:01:46,370 match is going to check whether each value in 31 00:01:46,610 --> 00:01:51,610 this Series matches, starting at its first character, the regular expression 32 00:01:51,650 --> 00:01:53,960 that is provided as an argument. 33 00:01:53,960 --> 00:01:58,130 And match is anchored at the start of the string, it does not look for a substring 34 00:01:58,130 --> 00:01:59,670 elsewhere within the string. 35 00:01:59,670 --> 00:02:01,590 So what's going to happen, this is basically 36 00:02:01,590 --> 00:02:05,140 a functional-style programming expression, 37 00:02:05,140 --> 00:02:08,040 rather than me iterating through all of the values 38 00:02:08,040 --> 00:02:11,960 in this Series, I simply say to the str attribute, 39 00:02:11,960 --> 00:02:14,860 match this regular expression against every item in 40 00:02:14,860 --> 00:02:18,450 the Series, and it gives me back a new Series 41 00:02:18,450 --> 00:02:22,090 of True/False values telling me which cities 42 00:02:22,090 --> 00:02:26,470 had matching strings for that regular expression. 43 00:02:26,470 --> 00:02:30,280 So Boston would've been a valid five-digit ZIP code, 44 00:02:30,280 --> 00:02:34,080 again this is saying five digits in a row. 45 00:02:34,080 --> 00:02:37,230 Whereas Miami only had four digits and therefore 46 00:02:37,230 --> 00:02:40,420 did not match that regular expression.
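The validation step might look like the sketch below. The ZIP values are assumed, as is the pattern `\d{5}` ("five digits in a row"). One detail worth knowing: `str.match` is based on `re.match`, which anchors the pattern at the start of each string but does not require it to reach the end, so the sketch also shows `str.fullmatch` for a strict whole-string check.

```python
import pandas as pd

# Assumed values: one valid five-digit ZIP and one four-digit ZIP.
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

# Apply the pattern to every element; the result is a Boolean Series.
valid = zips.str.match(r'\d{5}')
print(valid)

# str.match anchors only at the start, so extra trailing characters still match.
extra = pd.Series(['021155'])
starts_with_five = extra.str.match(r'\d{5}')   # True: begins with five digits
exactly_five = extra.str.fullmatch(r'\d{5}')   # False: six digits in total
```

If the data may contain over-long values, `str.fullmatch` (or a pattern ending in `$`) is the safer choice for validation.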
47 00:02:40,420 --> 00:02:44,170 Now sometimes you may not want to match from the start of the string 48 00:02:44,170 --> 00:02:48,150 but simply to check whether a string contains 49 00:02:48,150 --> 00:02:50,570 a matching substring. 50 00:02:50,570 --> 00:02:54,370 So let's create another Series here called cities, 51 00:02:54,370 --> 00:02:56,940 in this case we're just giving it a list as 52 00:02:56,940 --> 00:02:59,180 an argument containing two strings, 53 00:02:59,180 --> 00:03:04,180 'Boston, MA 02215' and 'Miami, FL 33101'. 54 00:03:07,150 --> 00:03:11,190 When you don't specify indices for the values in a Series, 55 00:03:11,190 --> 00:03:16,170 it automatically uses zero-based indexing, so if I go ahead 56 00:03:16,170 --> 00:03:20,100 and display this cities Series, you see element zero 57 00:03:20,100 --> 00:03:24,010 is the first string and element one is the second string 58 00:03:24,010 --> 00:03:25,150 in this case. 59 00:03:25,150 --> 00:03:29,340 Now let's say we wanted to simply check whether each of 60 00:03:29,340 --> 00:03:34,340 these strings contains a space followed by 61 00:03:34,400 --> 00:03:38,520 a capital letter A through Z, and in this case we're saying 62 00:03:38,520 --> 00:03:42,680 the quantifier {2}, so we want back-to-back capital letters, 63 00:03:42,680 --> 00:03:46,820 like space M A, and another space after that, 64 00:03:46,820 --> 00:03:50,140 so space M A space or space F L space, 65 00:03:50,140 --> 00:03:53,460 both of those are matches in this case. 66 00:03:53,460 --> 00:03:57,320 So again, just like match, contains is going to return 67 00:03:57,320 --> 00:04:02,287 a Series of Booleans, indicating which indices had matches, 68 00:04:02,287 --> 00:04:05,880 and in this case both of them have matches.
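A sketch of the substring check just described. The two strings are written with state abbreviations so that the "space, two capital letters, space" pattern has something to find; the exact literals and the pattern ` [A-Z]{2} ` are assumptions based on the narration.

```python
import pandas as pd

# No index given, so pandas assigns the default zero-based index 0, 1.
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])

# A space, exactly two capital letters, then another space, anywhere in the string.
has_state = cities.str.contains(r' [A-Z]{2} ')
print(has_state)
```

`str.contains` treats its argument as a regular expression by default, so no extra flag is needed here.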
69 00:04:05,880 --> 00:04:10,250 Now if you were to use the match function in this case, 70 00:04:10,250 --> 00:04:14,820 remember match only tries the regular expression at the start 71 00:04:14,820 --> 00:04:17,410 of each value, and we are still using 72 00:04:17,410 --> 00:04:21,020 the same regular expression, but in this case we get False 73 00:04:21,020 --> 00:04:25,280 for both of those because match is anchored at the beginning 74 00:04:25,280 --> 00:04:29,410 of each of the values of the Series, and neither one begins with a space, whereas contains 75 00:04:29,410 --> 00:04:32,710 searches throughout that string looking for a match, 76 00:04:32,710 --> 00:04:35,240 and it can be anywhere in the string. 77 00:04:35,240 --> 00:04:38,320 And there are many additional regular 78 00:04:38,320 --> 00:04:42,030 expression capabilities that you can perform via 79 00:04:42,030 --> 00:04:44,633 that str attribute as well.
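To see the contrast side by side, here is a small sketch that runs both methods with the same pattern. The string literals, the pattern, and the `comparison` DataFrame are illustrative assumptions and not part of the original demonstration.

```python
import pandas as pd

cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])
pattern = r' [A-Z]{2} '

anywhere = cities.str.contains(pattern)  # searches the whole string
at_start = cities.str.match(pattern)     # tries the pattern only at position 0

# Put both results next to each other for comparison.
comparison = pd.DataFrame({'contains': anywhere, 'match': at_start})
print(comparison)
```

Both strings begin with a letter, not a space, so `match` fails immediately, while `contains` finds the state abbreviation in the middle of each string.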