- [Instructor] Over the next couple of videos, I'm going to be demonstrating how to create a cloud-based, multi-node cluster of computers via Microsoft's Azure HDInsight service. One of its many capabilities is to provide Hadoop as a service running in the cloud. There are ways to run Hadoop locally as well. For example, companies like Hortonworks and Cloudera, which are merging, provide downloadable setups that you can use, but they have massive system requirements. So it's actually somewhat easier to play around with this concept in the cloud if you can. For the purpose of this example, we used the free credits that Microsoft provided with a brand-new account that we set up. If you haven't set up such an account previously, you could do that as well. Otherwise, you would have to pay for using those services, at least for the purpose of running the example. But as you'll see, we're going to configure a minimal cluster, and the application itself only takes a few seconds to run.
So as soon as you finish executing the application, you can actually shut down your cluster and delete all its resources, and potentially be charged only a few cents if, in fact, you are not working with the new-account credit. Once we set up the cluster, we're going to use it to demonstrate Hadoop's MapReduce capability. For our example, what we're going to do is parse all of the words in "Romeo and Juliet," and for each of those words we're going to determine its length. Then our reduction step is going to summarize how many words there are of each length. The canonical example for getting started with Hadoop is word frequency counting, but we wanted to do something a little bit different, since we've already done word frequency counting in earlier examples. Now, once we have the code for our MapReduce task, we're going to use YARN to submit that task to the HDInsight cluster for execution. From that point forward, YARN and Hadoop are going to decide how to use the cluster of computers we set up to perform that task.
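To make the map and reduce steps described above concrete, here is a minimal local sketch in Python in the style of a Hadoop Streaming job. The function names and the punctuation handling are illustrative assumptions, not the course's actual code; on the cluster, Hadoop would shuffle the mapper's output by key before the reduce phase.

```python
from collections import Counter

def mapper(line):
    """Map phase: emit a (word_length, 1) pair for each word in a line."""
    for word in line.split():
        # Strip surrounding punctuation so "Romeo," counts as length 5, not 6.
        cleaned = word.strip('.,;:!?"\'-')
        if cleaned:
            yield len(cleaned), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word length."""
    totals = Counter()
    for length, count in pairs:
        totals[length] += count
    return dict(totals)

# Local simulation of both phases on a tiny sample of the play's text.
text = ["O Romeo, Romeo! wherefore art thou Romeo?"]
pairs = [pair for line in text for pair in mapper(line)]
print(reducer(pairs))  # e.g. {1: 1, 5: 3, 9: 1, 3: 1, 4: 1}
```

Running the same logic over the full text of "Romeo and Juliet" would produce a table of word lengths and their frequencies, which is the output the cluster job computes at scale.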
And at the end of that, we'll take a look at the final results.