- [Instructor] Over the next couple of videos, I'm going to be demonstrating how to create a cloud-based, multi-node cluster of computers via Microsoft's Azure HDInsight service. One of its many capabilities is to provide Hadoop as a service running in the cloud. There are ways to run Hadoop locally as well. For example, companies like Hortonworks and Cloudera, which are merging, provide downloadable setups that you can use, but they have massive system requirements. So it's actually somewhat easier to play around with this concept in the cloud if you can. For the purpose of this example, we used the free credits that Microsoft provided with a brand-new account that we set up. If you haven't set up such an account previously, you could do that as well. Otherwise, you would have to pay for using those services, at least for the purpose of running the example. But as you'll see, we're going to configure a minimal cluster, and the application itself only takes a few seconds to run.
So as soon as you finish executing the application, you can actually shut down your cluster and delete all its resources, and potentially be charged only a few cents if, in fact, you are not working with the new-account credit. Once we set up the cluster, we're going to use it to demonstrate Hadoop's MapReduce capability. For our example, what we're going to do is parse all of the words in "Romeo and Juliet," and for each of those words we're going to determine its length. Then our reduction step is going to summarize how many words there are of each length. The canonical example for getting started with Hadoop is word frequency counting, but we wanted to do something a little bit different, since we've already done word frequency counting in earlier examples. Now, once we have the code for our MapReduce task, we're going to use YARN to submit that task to the HDInsight cluster for execution. From that point forward, YARN and Hadoop are going to decide how to use the cluster of computers we set up to perform that task.
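To make the map and reduce steps described above concrete, here is a minimal local sketch in Python in the style of a Hadoop Streaming job. The function names and the punctuation handling are illustrative assumptions, not the course's actual code; on the cluster, Hadoop would shuffle the mapper's output by key before the reduce phase.

```python
from collections import Counter

def mapper(line):
    """Map phase: emit a (word_length, 1) pair for each word in a line."""
    for word in line.split():
        # Strip surrounding punctuation so "Romeo," counts as length 5, not 6.
        cleaned = word.strip('.,;:!?"\'-')
        if cleaned:
            yield len(cleaned), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word length."""
    totals = Counter()
    for length, count in pairs:
        totals[length] += count
    return dict(totals)

# Local simulation of both phases on a tiny sample of the play's text.
text = ["O Romeo, Romeo! wherefore art thou Romeo?"]
pairs = [pair for line in text for pair in mapper(line)]
print(reducer(pairs))  # e.g. {1: 1, 5: 3, 9: 1, 3: 1, 4: 1}
```

Running the same logic over the full text of "Romeo and Juliet" would produce a table of word lengths and their frequencies, which is the output the cluster job computes at scale.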
And at the end of that, we'll take a look at the final results.