So, next let's go ahead and take a look at the reducer script. Just like we did in the mapper script, we start out by telling it to use Python 3 to execute this script. Now, for this example, we again define a generator function called tokenize_input. In this case, every line we receive is going to represent one of those key-value pairs produced by the mapper, but we're going to be reading them from the standard input stream, which the Hadoop system automatically redirects for us from our mapper script, so we don't have to worry about that. Now, what we're going to do in this example is strip off any whitespace at the beginning or end of the line. Remember, the print function, which we used over in the mapper to print out each key-value pair, inserts a newline character at the end, so we're basically stripping off that newline character. Then we're splitting the string at the tab character, which gives us back a list of the two strings in that tuple of information: the key and the corresponding value.
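A minimal sketch of what a generator like that might look like. The name tokenize_input follows the narration, but the stream parameter is added here purely for illustration; the course's actual script presumably reads sys.stdin directly:

```python
import sys


def tokenize_input(stream=sys.stdin):
    """Yield [key, value] string pairs from the mapper's tab-separated lines."""
    for line in stream:
        # strip() removes the trailing newline that the mapper's print()
        # appended, along with any other surrounding whitespace
        yield line.strip().split("\t", 1)
```

Splitting with a maxsplit of 1 keeps the result to exactly two strings, the key and the value, even if a value ever contained a tab.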
So, as you can see here, we have a for loop in which we're using the groupby function from the itertools module in the Python standard library, and it's going to group all of the keys together for us. What we're going to do here is feed it the tokenize_input call, which is going to grab all the key-value pairs, and we're going to group all of those key-value pairs by key. And, of course, the keys are the word lengths, and the corresponding group will contain the key-value pairs that represent each word of a given length. So, if the word length is 10, the group associated with the word length 10 will have all the tuples that have the key 10 and the value 1, and then the code in the body here is going to be responsible for summing up those values into a single key-value pair. You'll notice that, as the second argument of groupby, we have to tell it, for each of the lists we're getting back from tokenize_input, which item number within that list we would like to group by, and item number 0 in the list is going to be the key of the key-value pair that was split up above.
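To illustrate that grouping step on its own, here is how itertools.groupby behaves on a small, hand-made sample of mapper output (the sample pairs are invented for this example). One detail worth knowing: groupby only merges runs of consecutive equal keys, which is sufficient here because Hadoop Streaming sorts the mapper output by key before the reducer ever sees it:

```python
from itertools import groupby

# Hypothetical mapper output, already sorted by key as Hadoop guarantees:
# two words of length 5 and one word of length 10
pairs = [("5", "1"), ("5", "1"), ("10", "1")]

# The second argument selects item 0 of each pair as the grouping key
for key, group in groupby(pairs, lambda pair: pair[0]):
    # group is an iterator over every (key, value) pair sharing this key
    print(key, list(group))
```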
Then we're going to use the sum function that's built into Python, and we're going to iterate through all the word-length/count pairs in the current group that we're processing and sum up those values for the counts. So, we're going to total up the counts, and then the final resulting output of our reducer for that given word length will be a new tuple of the word length, a tab, and the string representation of the total number of words of that length. We'll do that for all the different groups, and as soon as the reducer completes its task, our Hadoop MapReduce application will be complete and we'll be able to look at the output.
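Putting those steps together, a reducer along these lines might look as follows. This is a sketch under the assumptions above, not the course's exact file; the function names and the stream parameter (added so the logic can be exercised outside Hadoop) are illustrative:

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def tokenize_input(stream=sys.stdin):
    """Yield [key, value] string pairs from the mapper's tab-separated lines."""
    for line in stream:
        yield line.strip().split("\t", 1)


def reduce_pairs(stream=sys.stdin):
    """Yield one 'length<TAB>total' line per word-length group."""
    # groupby relies on Hadoop having sorted the mapper output by key,
    # so every pair with the same word length arrives consecutively
    for key, group in groupby(tokenize_input(stream), lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# In the actual Hadoop Streaming job, the script would end by printing
# each reduced line to standard output:
#     for line in reduce_pairs():
#         print(line)
```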