So, next let's go ahead and take a look at the reducer script. Just like we did in the mapper script, we start out by telling it to use Python 3 to execute this script. Now, for this example, we again define a generator function called tokenize_input. In this case, every line we receive is going to represent one of those key-value pairs produced by the mapper, but we're going to be reading them from the standard input stream, which the Hadoop system automatically redirects for us from our mapper script, so we don't have to worry about that. Now, what we're going to do in this example is strip off any whitespace at the beginning or end of the line. Remember, the print function, which we used over in the mapper to print out each key-value pair, inserts a newline character at the end, so we're basically stripping off that newline character. Then we're splitting the string at the tab character, which gives us back a list of the two strings in that tuple of information: the key and the corresponding value.
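A minimal sketch of what a generator like that might look like. The name tokenize_input follows the narration, but the stream parameter is added here purely for illustration; the course's actual script presumably reads sys.stdin directly:

```python
import sys


def tokenize_input(stream=sys.stdin):
    """Yield [key, value] string pairs from the mapper's tab-separated lines."""
    for line in stream:
        # strip() removes the trailing newline that the mapper's print()
        # appended, along with any other surrounding whitespace
        yield line.strip().split("\t", 1)
```

Splitting with a maxsplit of 1 keeps the result to exactly two strings, the key and the value, even if a value ever contained a tab.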
So, as you can see here, we have a for loop in which we're using the groupby function from the itertools module in the Python standard library, and it's going to group all of the keys together for us. What we're going to do here is feed it the tokenize_input call, which is going to grab all the key-value pairs, and we're going to group all of those key-value pairs by key. And, of course, the keys are the word lengths, and the corresponding group will contain the key-value pairs that represent each word of a given length. So, if the word length is 10, the group associated with the word length 10 will have all the tuples that have the key 10 and the value 1, and then the code in the body here is going to be responsible for summing up those values into a single key-value pair. You'll notice that, as the second argument of groupby, we have to tell it, for each of the lists we're getting back from tokenize_input, which item number within that list we would like to group by, and item number 0 in the list is going to be the key of the key-value pair that was split up above.
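To illustrate that grouping step on its own, here is how itertools.groupby behaves on a small, hand-made sample of mapper output (the sample pairs are invented for this example). One detail worth knowing: groupby only merges runs of consecutive equal keys, which is sufficient here because Hadoop Streaming sorts the mapper output by key before the reducer ever sees it:

```python
from itertools import groupby

# Hypothetical mapper output, already sorted by key as Hadoop guarantees:
# two words of length 5 and one word of length 10
pairs = [("5", "1"), ("5", "1"), ("10", "1")]

# The second argument selects item 0 of each pair as the grouping key
for key, group in groupby(pairs, lambda pair: pair[0]):
    # group is an iterator over every (key, value) pair sharing this key
    print(key, list(group))
```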
Then we're going to use the sum function that's built into Python, and we're going to iterate through all the word-length/count pairs in the current group that we're processing and sum up those values for the counts. So, we're going to total up the counts, and then the final resulting output of our reducer for that given word length will be a new tuple of the word length, a tab, and the string representation of the total number of words of that length. We'll do that for all the different groups, and as soon as the reducer completes its task, our Hadoop MapReduce application will be complete and we'll be able to look at the output.
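Putting those steps together, a reducer along these lines might look as follows. This is a sketch under the assumptions above, not the course's exact file; the function names and the stream parameter (added so the logic can be exercised outside Hadoop) are illustrative:

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def tokenize_input(stream=sys.stdin):
    """Yield [key, value] string pairs from the mapper's tab-separated lines."""
    for line in stream:
        yield line.strip().split("\t", 1)


def reduce_pairs(stream=sys.stdin):
    """Yield one 'length<TAB>total' line per word-length group."""
    # groupby relies on Hadoop having sorted the mapper output by key,
    # so every pair with the same word length arrives consecutively
    for key, group in groupby(tokenize_input(stream), lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# In the actual Hadoop Streaming job, the script would end by printing
# each reduced line to standard output:
#     for line in reduce_pairs():
#         print(line)
```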