1 00:00:00,480 --> 00:00:02,360 - (Instructor) To finish off our discussion 2 00:00:02,360 --> 00:00:04,240 of regular expressions, let's talk 3 00:00:04,240 --> 00:00:06,869 about how you can capture substrings 4 00:00:06,869 --> 00:00:11,160 within a matched set of characters. 5 00:00:11,160 --> 00:00:12,420 So for that purpose, 6 00:00:12,420 --> 00:00:14,320 I've already put in a few snippets here. 7 00:00:14,320 --> 00:00:19,290 We've imported the re module so we can use its functions. 8 00:00:19,290 --> 00:00:21,640 We've defined a variable called text, 9 00:00:21,640 --> 00:00:25,080 which contains the text in which we will be searching, 10 00:00:25,080 --> 00:00:28,770 we've defined a regular expression pattern variable, 11 00:00:28,770 --> 00:00:30,930 and this is the regular expression 12 00:00:30,930 --> 00:00:32,310 we're going to search for. 13 00:00:32,310 --> 00:00:34,730 And you'll notice that we have within 14 00:00:34,730 --> 00:00:39,570 that expression a couple of parenthesized subexpressions, 15 00:00:39,570 --> 00:00:41,270 two of them to be exact, 16 00:00:41,270 --> 00:00:45,450 and what the parentheses metacharacters mean is that 17 00:00:45,450 --> 00:00:49,010 if the entire pattern matches whatever string 18 00:00:49,010 --> 00:00:50,850 we're searching in, then we want 19 00:00:50,850 --> 00:00:53,376 to capture just the pieces of that pattern 20 00:00:53,376 --> 00:00:56,600 that are enclosed in parentheses. 21 00:00:56,600 --> 00:00:59,762 So this first piece here is looking for two words 22 00:00:59,762 --> 00:01:02,810 each of which starts with a capital first letter, 23 00:01:02,810 --> 00:01:05,174 followed by one or more lowercase letters, 24 00:01:05,174 --> 00:01:08,111 separated by a space from one another. 25 00:01:08,111 --> 00:01:09,861 So you can see again, capital letter 26 00:01:09,861 --> 00:01:12,260 and one or more lowercase letters.
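The on-screen code is not part of this transcript, so here is a minimal sketch of what the first subexpression described above might look like on its own. The variable name name_pattern and the lowercase test string are assumptions made for illustration; only "Charlie Cyan" comes from the narration.

```python
import re

# Two words, each a capital letter followed by one or more lowercase
# letters, separated by a single space (as described in the narration).
# The name `name_pattern` is hypothetical.
name_pattern = r'[A-Z][a-z]+ [A-Z][a-z]+'

# fullmatch requires the whole string to fit the pattern.
matches_name = re.fullmatch(name_pattern, 'Charlie Cyan') is not None
matches_lowercase = re.fullmatch(name_pattern, 'charlie cyan') is not None

print(matches_name)       # True
print(matches_lowercase)  # False
```

Wrapping this piece in parentheses inside a larger pattern is what turns it into a captured subexpression.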
27 00:01:12,260 --> 00:01:15,340 And that subexpression is going to be matching up 28 00:01:15,340 --> 00:01:19,014 with Charlie Cyan in the search string up above. 29 00:01:19,014 --> 00:01:21,902 Then we have some literal characters to match, 30 00:01:21,902 --> 00:01:24,790 and you can see these characters match precisely 31 00:01:24,790 --> 00:01:26,880 this piece of the search string. 32 00:01:26,880 --> 00:01:29,710 And then we have another subexpression 33 00:01:30,690 --> 00:01:33,490 in which we are looking for a primitive email address 34 00:01:33,490 --> 00:01:36,396 consisting of one or more word characters, 35 00:01:36,396 --> 00:01:39,959 an at sign, one or more word characters, 36 00:01:39,959 --> 00:01:43,860 a period, and three word characters. 37 00:01:43,860 --> 00:01:46,240 So that would be the com in .com. 38 00:01:46,240 --> 00:01:49,181 Now notice that we put a backslash before the dot here, 39 00:01:49,181 --> 00:01:50,967 that's because the dot is a 40 00:01:50,967 --> 00:01:54,260 regular expression metacharacter as well, 41 00:01:54,260 --> 00:01:56,288 so if we want to use it as a literal, 42 00:01:56,288 --> 00:01:58,570 which is what we're using here, 43 00:01:58,570 --> 00:02:01,230 we need to put a backslash in front of it. 44 00:02:01,230 --> 00:02:03,030 And because we have some backslashes 45 00:02:03,030 --> 00:02:04,620 in this regular expression, 46 00:02:04,620 --> 00:02:07,940 we are using a raw string once again. 47 00:02:07,940 --> 00:02:11,130 So now let's go ahead and execute a search. 48 00:02:11,130 --> 00:02:13,444 We're going to locate that pattern in text.
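As a rough reconstruction of the setup being narrated, the sketch below builds a pattern with the two parenthesized subexpressions and runs the search. The narration states only the name and the e-mail address, so the literal separator ", e-mail: " in both text and pattern is an assumption. The last two lines are an added illustration of why the dot is escaped.

```python
import re

# Assumed sample text: the ', e-mail: ' separator is a guess, since the
# narration only names 'Charlie Cyan' and 'demo1@deitel.com'.
text = 'Charlie Cyan, e-mail: demo1@deitel.com'

# Raw string, so the backslashes reach the regex engine unchanged.
# Group 1: two capitalized words. Group 2: a primitive e-mail address.
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

result = re.search(pattern, text)
print(result is not None)  # True

# An unescaped dot matches any character, so it accepts the underscore here;
# the escaped dot insists on a literal period and finds nothing.
unescaped = re.search(r'\w+@\w+.\w{3}', 'demo1@deitel_com')
escaped = re.search(r'\w+@\w+\.\w{3}', 'demo1@deitel_com')
print(unescaped is not None)  # True
print(escaped is not None)    # False
```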
49 00:02:13,444 --> 00:02:15,846 Now search is going to find the first match 50 00:02:15,846 --> 00:02:19,200 of the entire regular expression, 51 00:02:19,200 --> 00:02:20,490 so we'll only get a match 52 00:02:20,490 --> 00:02:24,640 if the entire regular expression matches the string 53 00:02:24,640 --> 00:02:27,584 in which we're searching or a portion of that string. 54 00:02:27,584 --> 00:02:32,584 If there is a match, only then will the two subexpressions 55 00:02:33,600 --> 00:02:37,800 actually get captured, and once we have that match, 56 00:02:37,800 --> 00:02:41,140 we can then access those subexpressions. 57 00:02:41,140 --> 00:02:44,030 So if I say result, and I want to see 58 00:02:44,030 --> 00:02:46,460 what those subexpression contents were, 59 00:02:46,460 --> 00:02:49,985 I can use the groups method, and it will give me back 60 00:02:49,985 --> 00:02:53,860 all of the captured substrings as a tuple. In this case, 61 00:02:53,860 --> 00:02:57,030 Charlie Cyan was the first subexpression match 62 00:02:57,030 --> 00:03:02,030 and demo1@deitel.com was the second subexpression match. 63 00:03:02,083 --> 00:03:05,344 Now if you want to see what the entire match was, 64 00:03:05,344 --> 00:03:09,430 you can say result.group, again, remember 65 00:03:09,430 --> 00:03:13,284 that the match has to be for the entire regular expression 66 00:03:13,284 --> 00:03:17,743 so you can see it matched the entire string up above. 67 00:03:17,743 --> 00:03:20,950 And then, finally, if you want to 68 00:03:20,950 --> 00:03:24,190 individually access the subexpressions 69 00:03:24,190 --> 00:03:26,270 that were captured, you can do that 70 00:03:26,270 --> 00:03:29,630 by passing arguments to the group method. 71 00:03:29,630 --> 00:03:32,680 Now, interestingly, those arguments are indexed 72 00:03:32,680 --> 00:03:35,320 from one, not from zero.
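The groups and group calls described above can be sketched as follows. The sample text and its ", e-mail: " separator are assumptions, since the transcript does not show the on-screen string, and the no_match case is an added illustration of what search returns when the whole pattern fails.

```python
import re

# Assumed sample text and pattern; the separator is not given in the narration.
text = 'Charlie Cyan, e-mail: demo1@deitel.com'
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

result = re.search(pattern, text)

captured = result.groups()   # every captured substring, as a tuple
whole_match = result.group() # the text matched by the entire pattern

print(captured)     # ('Charlie Cyan', 'demo1@deitel.com')
print(whole_match)  # Charlie Cyan, e-mail: demo1@deitel.com

# If the whole pattern does not match, search returns None and nothing
# is captured, so check the result before calling groups() or group().
no_match = re.search(pattern, 'charlie cyan, e-mail: demo1@deitel.com')
print(no_match)  # None
```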
73 00:03:35,320 --> 00:03:38,721 So if I say result.group 74 00:03:38,721 --> 00:03:41,280 and I give it the argument one, 75 00:03:41,280 --> 00:03:44,112 that would give me the 76 00:03:44,112 --> 00:03:46,773 first subexpression match, which was Charlie Cyan, 77 00:03:46,773 --> 00:03:48,731 and as you'd expect therefore, 78 00:03:48,731 --> 00:03:53,731 if I give it two, the second piece was demo1@deitel.com. 79 00:03:54,440 --> 00:03:58,000 So if you're working with structured data 80 00:03:58,000 --> 00:04:00,530 where each of the strings 81 00:04:00,530 --> 00:04:03,540 that you're searching through has the same format 82 00:04:03,540 --> 00:04:06,971 and you want to extract information from that data, 83 00:04:06,971 --> 00:04:10,450 capturing subexpressions is a super easy 84 00:04:10,450 --> 00:04:12,263 and convenient way to do that.
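To illustrate the closing point about structured data, here is a small sketch that applies group(1) and group(2) to several strings sharing one format. The records list, the second and third entries, and the ", e-mail: " separator are all hypothetical; only the first name and address come from the narration.

```python
import re

# Assumed pattern; the literal ', e-mail: ' separator is a guess.
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

# Hypothetical records; only the first mirrors the narrated example.
records = [
    'Charlie Cyan, e-mail: demo1@deitel.com',
    'Dana Magenta, e-mail: demo2@deitel.com',
    'not a well-formed record',
]

contacts = []
for record in records:
    match = re.search(pattern, record)
    if match:  # skip strings that do not fit the format
        # Group numbers start at 1; group(0) is the entire match.
        contacts.append((match.group(1), match.group(2)))

print(contacts)
# [('Charlie Cyan', 'demo1@deitel.com'), ('Dana Magenta', 'demo2@deitel.com')]
```

Checking the result before calling group keeps malformed strings from raising an AttributeError on None.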