- [Instructor] So before we get into an actual example of Hadoop, let's talk a little bit about its background. Google launched back in 1998 as a web search company, and at that time there were about 2.4 million websites. Already that was a tremendous amount of data online. But just to give you a sense of how much the web has exploded since then, there are now about 2 billion websites, which is approximately a 1,000-fold increase over the number of websites there were when Google first started handling search. Nowadays Google actually handles over 2 trillion searches per year. Early on, when Google was working on their search engine, they knew they needed the ability to return results quickly, and they determined that the only practical way to do that was literally to store the entire internet and index it for fast search. Unfortunately, the computers of that time were unable to store and analyze that volume of data economically and quickly enough, so what Google did was develop their own hardware and software for that purpose.
They developed a clustering system with massive numbers of computers connected by a network, and those nodes were where all the searching was actually performed. Because there were so many different nodes on the network, there was a much higher chance that some of that hardware, spread around the world, could fail, so they built in a lot of redundancy as well. Basically, what they did was take the internet and distribute it in chunks across all the nodes on the network they created. They implemented these nodes with commodity computers, so if one failed, they could just swap it out quickly with a new one. Once the system was in place and a search request came in, they would put all the computers in the cluster to work searching their portion of the web in parallel; then the results would be gathered and returned to the user. To implement that system, Google created their own clustering hardware and software.
They didn't make that hardware and software available to everybody else, but they did publish their designs. One of the key publications they put out was the Google File System paper, which you can find at this URL here. Some programmers over at Yahoo took that and wound up implementing their own system, and they did open source their work, which the Apache organization picked up and implemented as what is now called Hadoop. By the way, Hadoop was named for a stuffed elephant toy that belonged to a child of one of Hadoop's creators. It was also the inspiration for the cover of our textbook, Intro to Python for Computer Science and Data Science. Now there were also a couple of additional Google papers that contributed to Hadoop's evolution. One of them was on MapReduce, which you can find at this URL here; that's the programming technique we're going to be demonstrating to you as part of our next example.
Then they also had the Bigtable paper, which is the basis for another Apache project called HBase, one of the NoSQL key-value and column-based databases that we discussed a little bit earlier.

Now there are a couple of key components to working with Hadoop. One of them is the Hadoop file system, the Hadoop Distributed File System to be more specific, or HDFS for short. What that file system does is enable you to store a massive amount of data across a cluster. Then we have MapReduce, which is a system of programming for processing the data across the cluster. Basically, in a massively parallel manner, it enables you to implement functional-style filter, map, and reduce operations. It's called MapReduce, but what you do in the mapping step is really up to you, and it can include filtering as well. The mapping step is where you process the original data into tuples of key-value pairs.
Then the reduction step combines those tuples to produce the final results of whatever the MapReduce task is. Now when you run your task, Hadoop divides the data into batches. The data gets distributed over however many nodes there are in your cluster, which could be anywhere from two nodes, as in what we'll do, up to thousands of nodes in some of the biggest clusters in the world. Along with the data, Hadoop also distributes the MapReduce task's code so that it can be executed in parallel across all the nodes in the cluster, on the particular batches of data that are stored on each node. So a given node executes a copy of the MapReduce algorithm that you implement on the piece of the data stored within that node. Then, during the reduction step, all of the results are combined into the final results of your application. And all of that is coordinated by a tool called YARN, which stands for Yet Another Resource Negotiator.
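The flow just described, mapping to key-value tuples, grouping by key, then reducing, can be sketched in plain Python. This is only a toy simulation of what Hadoop does, not Hadoop's actual API; the sample "batches" and the two-node split are made up for illustration:

```python
from collections import defaultdict
from itertools import chain

# Toy data standing in for the batches stored on each node
# (hypothetical; in real Hadoop each node holds blocks of an HDFS file).
node_batches = [
    ["big data tools", "hadoop stores big data"],
    ["mapreduce processes data in parallel"],
]

def map_phase(lines):
    """Map step: emit (key, value) tuples -- here (word, 1) per word."""
    for line in lines:
        for word in line.split():
            if len(word) > 2:          # the map step may also filter
                yield (word, 1)

# Each "node" runs the mapper on its own batch; on a real cluster
# these would execute in parallel on separate machines.
mapped = chain.from_iterable(map_phase(batch) for batch in node_batches)

# Shuffle: group all values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: combine each key's values into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```

Note that the short word "in" never reaches the reducer, since the mapper filtered it out, which is the sense in which the mapping step "can include filtering as well."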
YARN is responsible for managing all of the resources in the cluster and scheduling the tasks that will be performed. Now although Hadoop began with HDFS and MapReduce, followed very closely by YARN, it's actually a massive ecosystem with many other components nowadays. So here are a couple of links to articles that talk about various ecosystem components, and I've listed some of them for you here and on the next two slides as well. These are all Apache projects that you can take a look at, and I've provided URLs for each of them if you want to get an overview of what they're about. Ambari and Drill are two of them; Flume, HBase, Hive, and Impala are the next four; and then finally Kafka, Pig, Sqoop, Storm, and ZooKeeper. And that's just some of the Hadoop ecosystem components that are now out there. In this lesson we'll focus on just the basic use of Hadoop, with MapReduce and YARN being used to execute our application.
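As a preview of the shape of the job we'll run, a classic analogy for a streaming MapReduce word count is an ordinary Unix pipeline, where the first stage plays the mapper, `sort` plays the shuffle, and `uniq -c` plays the reducer. The sample input here is made up; this is just the pattern, not an actual Hadoop command:

```shell
# map: emit one word (key) per line; shuffle: sort groups identical
# keys together; reduce: uniq -c counts each key's occurrences.
printf 'big data\nhadoop big data\n' | tr ' ' '\n' | sort | uniq -c
```

On a real cluster, the mapper and reducer would instead be programs submitted to Hadoop, with YARN scheduling where each copy runs, which is what we'll see in the upcoming example.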