- [Instructor] So before we get into an actual example of Hadoop, let's talk a little bit about its background. Google launched back in 1998 as a web search company, and at that time there were about 2.4 million websites. Already that was a tremendous amount of data online. But just to give you a sense of how much the web has exploded since then, there are now about 2 billion websites, which is approximately a 1,000-fold increase over the number of websites there were when Google first started handling search. Nowadays Google actually handles over 2 trillion searches per year. Early on, when Google was working on their search engine, they knew they needed the ability to return results quickly, and they determined that the only practical way to do that was literally to store the entire internet and index it for fast search. Unfortunately, the computers of that time were unable to store and analyze that volume of data economically and quickly enough, so what Google did was develop their own hardware and software for that purpose.
They developed a clustering system with massive numbers of computers connected by a network, and those nodes were where all the searching was actually performed. Because there were so many different nodes on the network, there was a much higher chance that some of that hardware, spread around the world, could fail, so they built in a lot of redundancy as well. Basically, what they did was take the internet and distribute it in chunks across all the nodes on the network they created. They implemented these nodes with commodity computers, so if one failed, they could just swap it out quickly with a new one. Once the system was in place and a search request came in, they would put all the computers in the cluster to work searching their portion of the web in parallel; then the results would be gathered and returned to the user. To implement that system, Google created their own clustering hardware and software.
They didn't make that hardware and software available to everybody else, but they did publish their designs. One of the key publications they put out was the Google File System paper, which you can find at this URL here. Some programmers over at Yahoo took that and wound up implementing their own system, and they did open source their work, which the Apache organization picked up and implemented as what is now called Hadoop. By the way, Hadoop was named for a stuffed elephant toy that belonged to a child of one of Hadoop's creators. It was also the inspiration for the cover of our textbook, Intro to Python for Computer Science and Data Science. Now there were also a couple of additional Google papers that contributed to Hadoop's evolution. One of them was on MapReduce, which you can find at this URL here; that's the programming technique we're going to be demonstrating to you as part of our next example.
Then they also had the Bigtable paper, which is the basis for another Apache project called HBase, one of the NoSQL key-value and column-based databases that we discussed a little bit earlier.

Now there are a couple of key components to working with Hadoop. One of them is the Hadoop file system, the Hadoop Distributed File System to be more specific, or HDFS for short. What that file system does is enable you to store a massive amount of data across a cluster. Then we have MapReduce, which is a system of programming for processing the data across the cluster. Basically, in a massively parallel manner, it enables you to implement functional-style filter, map, and reduce operations. It's called MapReduce, but what you do in the mapping step is really up to you, and it can include filtering as well. The mapping step is where you process the original data into tuples of key-value pairs.
Then the reduction step combines those tuples to produce the final results of whatever the MapReduce task is. Now when you run your task, Hadoop divides the data into batches. The data gets distributed over however many nodes there are in your cluster, which could be anywhere from two nodes, as in what we'll do, up to thousands of nodes in some of the biggest clusters in the world. Along with the data, Hadoop also distributes the MapReduce task's code so that it can be executed in parallel across all the nodes in the cluster, on the particular batches of data that are stored on each node. So a given node executes a copy of the MapReduce algorithm that you implement on the piece of the data stored within that node. Then, during the reduction step, all of the results are combined into the final results of your application. And all of that is coordinated by a tool called YARN, which stands for Yet Another Resource Negotiator.
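The flow just described, mapping to key-value tuples, grouping by key, then reducing, can be sketched in plain Python. This is only a toy simulation of what Hadoop does, not Hadoop's actual API; the sample "batches" and the two-node split are made up for illustration:

```python
from collections import defaultdict
from itertools import chain

# Toy data standing in for the batches stored on each node
# (hypothetical; in real Hadoop each node holds blocks of an HDFS file).
node_batches = [
    ["big data tools", "hadoop stores big data"],
    ["mapreduce processes data in parallel"],
]

def map_phase(lines):
    """Map step: emit (key, value) tuples -- here (word, 1) per word."""
    for line in lines:
        for word in line.split():
            if len(word) > 2:          # the map step may also filter
                yield (word, 1)

# Each "node" runs the mapper on its own batch; on a real cluster
# these would execute in parallel on separate machines.
mapped = chain.from_iterable(map_phase(batch) for batch in node_batches)

# Shuffle: group all values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: combine each key's values into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```

Note that the short word "in" never reaches the reducer, since the mapper filtered it out, which is the sense in which the mapping step "can include filtering as well."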
YARN is responsible for managing all of the resources in the cluster and scheduling the tasks that will be performed. Now although Hadoop began with HDFS and MapReduce, followed very closely by YARN, it's actually a massive ecosystem with many other components nowadays. So here are a couple of links to articles that talk about various ecosystem components, and I've listed some of them for you here and on the next two slides as well. These are all Apache projects that you can take a look at, and I've provided URLs for each of them if you want to get an overview of what they're about. Ambari and Drill are two of them; Flume, HBase, Hive, and Impala are the next four; and then finally Kafka, Pig, Sqoop, Storm, and ZooKeeper. And that's just some of the Hadoop ecosystem components that are now out there. In this lesson we'll focus on just the basic use of Hadoop, with MapReduce and YARN being used to execute our application.
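As a preview of the shape of the job we'll run, a classic analogy for a streaming MapReduce word count is an ordinary Unix pipeline, where the first stage plays the mapper, `sort` plays the shuffle, and `uniq -c` plays the reducer. The sample input here is made up; this is just the pattern, not an actual Hadoop command:

```shell
# map: emit one word (key) per line; shuffle: sort groups identical
# keys together; reduce: uniq -c counts each key's occurrences.
printf 'big data\nhadoop big data\n' | tr ' ' '\n' | sort | uniq -c
```

On a real cluster, the mapper and reducer would instead be programs submitted to Hadoop, with YARN scheduling where each copy runs, which is what we'll see in the upcoming example.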