1 00:00:00,350 --> 00:00:01,183 - [Narrator] In this video, 2 00:00:01,183 --> 00:00:03,740 I'm going to talk about a tool called Docker 3 00:00:03,740 --> 00:00:05,860 and the Jupyter Docker Stacks. 4 00:00:05,860 --> 00:00:09,890 Now, the software that we're dealing with in this chapter, 5 00:00:09,890 --> 00:00:12,630 things like MongoDB 6 00:00:12,630 --> 00:00:14,350 clusters in the cloud, 7 00:00:14,350 --> 00:00:17,870 Hadoop clusters in the cloud, Spark clusters in the cloud, 8 00:00:17,870 --> 00:00:21,600 all of these require a ton of setup, 9 00:00:21,600 --> 00:00:25,840 and they're complex configurations that are really difficult 10 00:00:25,840 --> 00:00:26,900 to get through, 11 00:00:26,900 --> 00:00:29,990 especially if you're somebody who's just getting started 12 00:00:29,990 --> 00:00:31,310 with these technologies. 13 00:00:31,310 --> 00:00:33,680 And that's why it's super handy 14 00:00:33,680 --> 00:00:37,110 to work with pre-configured environments, such as 15 00:00:37,110 --> 00:00:39,310 being able to set up a cluster 16 00:00:39,310 --> 00:00:42,500 conveniently in Microsoft's HDInsight 17 00:00:42,500 --> 00:00:45,860 or in similar services from Amazon or Google 18 00:00:45,860 --> 00:00:47,510 or IBM, et cetera. 19 00:00:47,510 --> 00:00:52,510 Or you sometimes will take advantage of pre-configured, 20 00:00:52,530 --> 00:00:56,110 what are known as, containers that have all of the software 21 00:00:56,110 --> 00:00:59,630 setup done for you already, and you can run them 22 00:00:59,630 --> 00:01:01,770 on your local computer. And that's what we're going 23 00:01:01,770 --> 00:01:04,850 to start with. As we get into Spark here, 24 00:01:04,850 --> 00:01:07,210 we're going to use a tool called Docker, 25 00:01:07,210 --> 00:01:09,210 which basically enables you 26 00:01:09,210 --> 00:01:11,570 to take a pre-configured container 27 00:01:11,570 --> 00:01:14,240 and run it on your local computer.
28 00:01:14,240 --> 00:01:16,800 And to do that, we're going to take advantage 29 00:01:16,800 --> 00:01:19,950 of a Jupyter team Docker Stack. 30 00:01:19,950 --> 00:01:22,550 The Jupyter team has configured a bunch of these for us 31 00:01:22,550 --> 00:01:24,480 for different development scenarios, 32 00:01:24,480 --> 00:01:26,930 one of which is already configured 33 00:01:26,930 --> 00:01:29,090 with the Spark application framework 34 00:01:29,090 --> 00:01:31,360 and the PySpark module that we need 35 00:01:31,360 --> 00:01:33,540 to access the Spark application, 36 00:01:33,540 --> 00:01:36,810 the Spark application framework, from Python. 37 00:01:36,810 --> 00:01:39,510 Now, in order to take advantage of this, 38 00:01:39,510 --> 00:01:43,680 you will have to install Docker on your local computer. 39 00:01:43,680 --> 00:01:45,300 And in order to do that, 40 00:01:45,300 --> 00:01:47,790 I've provided you with a number of links here. 41 00:01:47,790 --> 00:01:49,510 I don't go through the setup 42 00:01:49,510 --> 00:01:51,500 and installation of Docker itself. 43 00:01:51,500 --> 00:01:53,420 Generally, it's pretty straightforward, 44 00:01:53,420 --> 00:01:56,880 especially if you're on Windows 10 or macOS. 45 00:01:56,880 --> 00:01:58,710 You'll simply go to this link, 46 00:01:58,710 --> 00:02:01,710 download the appropriate installer, and run it. 47 00:02:01,710 --> 00:02:04,010 For those of you who are on Windows 10 Pro, 48 00:02:04,010 --> 00:02:07,890 you must allow the installer to make changes to your system, 49 00:02:07,890 --> 00:02:10,690 otherwise Docker will not run correctly. 50 00:02:10,690 --> 00:02:14,210 So, do make sure that you follow the prompts carefully 51 00:02:14,210 --> 00:02:16,150 as you're doing your install. 52 00:02:16,150 --> 00:02:18,030 If you're a Windows 10 Home user, 53 00:02:18,030 --> 00:02:21,360 unfortunately you can't install the same version of Docker.
54 00:02:21,360 --> 00:02:23,420 In your case, you will have to take advantage 55 00:02:23,420 --> 00:02:25,380 of a tool called VirtualBox. 56 00:02:25,380 --> 00:02:28,510 So, they tell you how to set that up at this link. 57 00:02:28,510 --> 00:02:30,040 And if you're a Linux user, 58 00:02:30,040 --> 00:02:32,500 you'll need to use the Docker Community Edition, 59 00:02:32,500 --> 00:02:36,830 and you can learn about installing that at this link here. 60 00:02:36,830 --> 00:02:39,870 And then everybody who's getting started with Docker 61 00:02:39,870 --> 00:02:42,900 for the very first time should take some time 62 00:02:42,900 --> 00:02:45,380 to read their getting started guide, 63 00:02:45,380 --> 00:02:48,708 which will give you an overview of the fundamentals 64 00:02:48,708 --> 00:02:50,800 of working with Docker. 65 00:02:50,800 --> 00:02:55,277 And in our case, we're running on a computer that has 66 00:02:55,277 --> 00:02:57,703 one CPU with four cores. 67 00:02:58,760 --> 00:03:02,550 Generally, you're going to need at least two cores in order 68 00:03:02,550 --> 00:03:06,420 to run something like Docker because it is a form 69 00:03:06,420 --> 00:03:09,130 of virtualization software. 70 00:03:09,130 --> 00:03:11,760 Now, once you've installed Docker, 71 00:03:11,760 --> 00:03:15,370 then you will be able to proceed with what we're going 72 00:03:15,370 --> 00:03:19,660 to show you next, which is how to get the Docker container 73 00:03:19,660 --> 00:03:20,653 up and running. 74 00:03:22,150 --> 00:03:24,770 Now, as I mentioned, the Jupyter Notebooks team 75 00:03:24,770 --> 00:03:27,190 has already pre-configured a bunch 76 00:03:27,190 --> 00:03:29,670 of different Jupyter Docker Stacks 77 00:03:29,670 --> 00:03:33,150 for various common Python development scenarios. 78 00:03:33,150 --> 00:03:38,010 One of them is called the jupyter/pyspark-notebook, 79 00:03:38,010 --> 00:03:40,950 and it has the Spark application framework.
80 00:03:40,950 --> 00:03:44,690 It has PySpark, it's got JupyterLab, the interface in 81 00:03:44,690 --> 00:03:47,920 which we've been running Jupyter Notebooks a couple of times 82 00:03:47,920 --> 00:03:51,790 in these Python fundamentals videos. Specifically, 83 00:03:51,790 --> 00:03:54,670 we used them back in the deep learning lesson. 84 00:03:54,670 --> 00:03:56,180 And you can see a full list 85 00:03:56,180 --> 00:03:59,393 of their pre-configured Jupyter Docker Stacks at this URL. 86 00:04:00,230 --> 00:04:03,690 And this docker run command 87 00:04:03,690 --> 00:04:05,620 that you see here is meant to be entered 88 00:04:05,620 --> 00:04:09,640 as one long command without pressing Enter 89 00:04:09,640 --> 00:04:12,700 until you get to the very end. And I do want to point out 90 00:04:12,700 --> 00:04:14,910 that you don't want to type it exactly as is 91 00:04:14,910 --> 00:04:18,180 because you want to replace "full path to" 92 00:04:18,180 --> 00:04:20,210 with the exact location 93 00:04:20,210 --> 00:04:24,540 of our ch16 folder on your computer. 94 00:04:24,540 --> 00:04:28,950 The full path to that location is going to be required 95 00:04:28,950 --> 00:04:30,300 for this to work correctly. 96 00:04:30,300 --> 00:04:32,950 And what's going to happen is if you type in 97 00:04:32,950 --> 00:04:36,310 your full path to the ch16 folder correctly, 98 00:04:36,310 --> 00:04:41,310 you will be able, in the work folder of the environment, 99 00:04:41,840 --> 00:04:46,190 the Docker container, to see your own local files 100 00:04:46,190 --> 00:04:49,950 in that ch16 folder. You'll be able to open files 101 00:04:49,950 --> 00:04:52,970 from there, you'll be able to create new files there. 102 00:04:52,970 --> 00:04:55,400 And that's important because if you ever delete 103 00:04:55,400 --> 00:04:58,690 the Docker container, you don't want to lose your work.
104 00:04:58,690 --> 00:05:01,310 So, this is a critical aspect 105 00:05:01,310 --> 00:05:03,520 of launching the Docker container. 106 00:05:03,520 --> 00:05:07,580 Now, let's just talk briefly about the different pieces 107 00:05:07,580 --> 00:05:09,810 of this and what's going to happen the very first time 108 00:05:09,810 --> 00:05:10,643 you do this. 109 00:05:10,643 --> 00:05:13,870 So, the docker run command is used to execute 110 00:05:13,870 --> 00:05:15,560 a Docker container. 111 00:05:15,560 --> 00:05:19,140 The run command has a lot of different options. 112 00:05:19,140 --> 00:05:22,030 In this case, we have a couple of -p options, 113 00:05:22,030 --> 00:05:25,740 which means that we're going to open a port 114 00:05:25,740 --> 00:05:26,890 in the container. 115 00:05:26,890 --> 00:05:30,650 So, this first 8888:8888 116 00:05:30,650 --> 00:05:34,280 is going to enable you to go to your web browser 117 00:05:34,280 --> 00:05:37,450 and go to localhost:8888, 118 00:05:37,450 --> 00:05:41,600 and be able to, for example, view JupyterLab. 119 00:05:41,600 --> 00:05:46,140 The second one, 4040, is where you can actually view 120 00:05:46,140 --> 00:05:49,090 a Spark monitoring webpage, 121 00:05:49,090 --> 00:05:52,680 where you can see what the Spark applications are doing 122 00:05:52,680 --> 00:05:57,510 in your Docker container. And the next couple of options 123 00:05:57,510 --> 00:06:02,250 are going to enable us to log in as the root user 124 00:06:02,250 --> 00:06:03,490 in that container. 125 00:06:03,490 --> 00:06:07,830 They're basically Linux containers behind the scenes. 126 00:06:07,830 --> 00:06:10,960 Next, the -v option is what's going to help us mount 127 00:06:10,960 --> 00:06:14,980 our local ch16 folder on our system 128 00:06:14,980 --> 00:06:19,720 into the work folder inside the Jupyter Docker Stack.
129 00:06:19,720 --> 00:06:24,430 And this is the actual name of the Jupyter Docker Stack, 130 00:06:24,430 --> 00:06:29,240 followed by a colon, and the specific version number 131 00:06:29,240 --> 00:06:33,080 that we would like to download to our computer. 132 00:06:33,080 --> 00:06:35,840 On the Docker site, they have a repository 133 00:06:35,840 --> 00:06:39,190 of containers. The Jupyter Docker Stacks are in that 134 00:06:39,190 --> 00:06:40,590 repository as well, 135 00:06:40,590 --> 00:06:43,550 and the very first time you run this command, 136 00:06:43,550 --> 00:06:46,740 it's going to go find this specific version 137 00:06:46,740 --> 00:06:49,460 of the Jupyter PySpark Notebook Stack 138 00:06:49,460 --> 00:06:52,130 and download it onto your computer. 139 00:06:52,130 --> 00:06:53,480 If I remember correctly, 140 00:06:53,480 --> 00:06:56,390 it's about five or six gigabytes of information. 141 00:06:56,390 --> 00:06:58,550 So, it does take up space, number one, 142 00:06:58,550 --> 00:07:00,720 and it does take a while to download, 143 00:07:00,720 --> 00:07:03,760 especially if you have a slower network connection. 144 00:07:03,760 --> 00:07:07,730 So, the very first time you run this, it is going to perform 145 00:07:07,730 --> 00:07:11,300 that download to bring this onto your computer. 146 00:07:11,300 --> 00:07:15,040 Then we're going to execute the start.sh script, 147 00:07:15,040 --> 00:07:18,490 and we're going to launch JupyterLab 148 00:07:18,490 --> 00:07:21,393 as part of that Docker container. 149 00:07:22,560 --> 00:07:26,900 So, let's switch over to a terminal window here on my Mac 150 00:07:26,900 --> 00:07:30,990 and just show you that I executed the docker run command. 151 00:07:30,990 --> 00:07:35,440 Notice that I did specify the full path to the ch16 folder 152 00:07:35,440 --> 00:07:36,550 on my machine.
153 00:07:36,550 --> 00:07:37,810 If you're a Windows user, 154 00:07:37,810 --> 00:07:40,860 this is probably going to be a path that starts with 155 00:07:40,860 --> 00:07:45,640 C:\ and then has the path separators 156 00:07:45,640 --> 00:07:49,570 as backslashes leading up to the ch16. 157 00:07:49,570 --> 00:07:51,610 Now, when you run this command 158 00:07:51,610 --> 00:07:54,360 the very first time, you're going to see some information 159 00:07:54,360 --> 00:07:58,470 here about the Docker Stack downloading to your computer. 160 00:07:58,470 --> 00:08:01,950 I already did this a while back, so it just went ahead 161 00:08:01,950 --> 00:08:04,740 and launched the Docker Stack for me, 162 00:08:04,740 --> 00:08:07,800 and it started to display some log information here, 163 00:08:07,800 --> 00:08:09,840 and then this key piece, 164 00:08:09,840 --> 00:08:13,610 which is where you're going to go in your web browser 165 00:08:13,610 --> 00:08:18,320 to see the JupyterLab interface and be able to work 166 00:08:18,320 --> 00:08:20,550 with this Docker Stack. 167 00:08:20,550 --> 00:08:22,350 Now you can copy this, 168 00:08:22,350 --> 00:08:25,540 but you are going to need to modify this piece. 169 00:08:25,540 --> 00:08:28,210 So, let me go ahead and copy this here for a moment 170 00:08:29,550 --> 00:08:32,830 and I'm going to switch over to a web browser 171 00:08:33,710 --> 00:08:36,980 and paste that in. 172 00:08:36,980 --> 00:08:39,430 And before I hit Enter, I'm going to go ahead 173 00:08:39,430 --> 00:08:43,300 and change this parenthesized piece to just say localhost. 174 00:08:45,080 --> 00:08:49,150 And so, this token here is just a security token 175 00:08:49,150 --> 00:08:51,720 that you need only the very first time 176 00:08:51,720 --> 00:08:56,260 you launch JupyterLab in that environment.
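The URL edit just described can be sketched in a few lines of Python. This assumes the logged URL has a parenthesized host piece followed by the port and a token; the sample URL and its token are made up for illustration and are not taken from the video.

```python
import re


def localhost_url(logged_url):
    """Replace the first parenthesized host piece in the logged URL with 'localhost'."""
    return re.sub(r"\([^)]*\)", "localhost", logged_url, count=1)


# Hypothetical log entry; the real one contains your own container ID and token.
example = "http://(0123abcd or 127.0.0.1):8888/?token=EXAMPLE_TOKEN"
print(localhost_url(example))
```

The port and the token are left untouched, so the result can be pasted straight into the browser's address bar.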
177 00:08:56,260 --> 00:08:58,400 So, you notice it shows you a work folder, 178 00:08:58,400 --> 00:09:00,490 but if you navigate into that folder, 179 00:09:00,490 --> 00:09:04,810 you're now looking at the contents of the ch16 folder, 180 00:09:04,810 --> 00:09:08,470 and we're going to be doing the Spark word count example. 181 00:09:08,470 --> 00:09:10,650 So, I'll go ahead and navigate in there. 182 00:09:10,650 --> 00:09:14,110 Your folder will only have one notebook in it. 183 00:09:14,110 --> 00:09:17,200 I have two. The version that I'm going to be using 184 00:09:17,200 --> 00:09:19,990 during the presentation has some additional text bullets 185 00:09:19,990 --> 00:09:22,220 in it for discussion purposes. 186 00:09:22,220 --> 00:09:25,230 Yours just has the code in it. 187 00:09:25,230 --> 00:09:27,490 So, I'm going to go ahead and open that up, 188 00:09:27,490 --> 00:09:31,830 and while that's happening, I'm going to talk about a couple 189 00:09:31,830 --> 00:09:36,700 of other key items with respect to the Docker container. 190 00:09:36,700 --> 00:09:40,090 So first of all, once the Docker container is up 191 00:09:40,090 --> 00:09:42,810 and running, we are going to need to install 192 00:09:42,810 --> 00:09:44,900 some software into it. 193 00:09:44,900 --> 00:09:47,400 And the reason we need to do that is we're going to take 194 00:09:47,400 --> 00:09:50,010 advantage of some libraries that are not, 195 00:09:50,010 --> 00:09:53,630 by default, part of that Docker container. 196 00:09:53,630 --> 00:09:57,800 So, I'm going to actually switch to another terminal window, 197 00:09:57,800 --> 00:10:00,740 where I already did this, and I just want to show you 198 00:10:00,740 --> 00:10:05,600 what I did to install software into my Docker container.
199 00:10:05,600 --> 00:10:08,910 First of all, again, this is a separate terminal window. 200 00:10:08,910 --> 00:10:12,620 You would use a separate terminal shell or command prompt 201 00:10:12,620 --> 00:10:14,060 on your own system, 202 00:10:14,060 --> 00:10:18,350 and you'll notice I executed the docker ps command, 203 00:10:18,350 --> 00:10:20,750 which lists out a bunch of information 204 00:10:20,750 --> 00:10:24,050 about the container that's running 205 00:10:24,050 --> 00:10:26,300 or the containers that are running, 206 00:10:26,300 --> 00:10:29,170 if you have multiple ones running on your machine. 207 00:10:29,170 --> 00:10:33,130 The most important thing I need from this list is the name 208 00:10:33,130 --> 00:10:35,990 of the container, which it randomly assigns. 209 00:10:35,990 --> 00:10:39,580 It's a two-word name, normally separated by an underscore, 210 00:10:39,580 --> 00:10:42,350 and the reason I need that is for this command, 211 00:10:42,350 --> 00:10:46,350 which is going to allow me to log into the container, 212 00:10:46,350 --> 00:10:49,040 so that I can install software into it. 213 00:10:49,040 --> 00:10:52,440 So, the docker exec command with the -it option 214 00:10:52,440 --> 00:10:54,320 means interactive mode. 215 00:10:54,320 --> 00:10:57,110 This is the name of the container I want to log into, 216 00:10:57,110 --> 00:11:00,210 and I would like it to execute its shell, 217 00:11:00,210 --> 00:11:03,730 so that I can then interact with that container. 218 00:11:03,730 --> 00:11:07,240 And again, this is a Linux-based interaction that you are 219 00:11:07,240 --> 00:11:09,640 going to be doing in this case.
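As a rough sketch of those two steps, the Python below picks container names out of the text that `docker ps --format '{{.Names}}'` prints (one name per line) and builds the matching docker exec command. The shell path `/bin/bash` and the sample container name are assumptions for illustration; the video only says that the container's shell is executed.

```python
def container_names(ps_names_output):
    """Parse one-name-per-line output such as that of: docker ps --format '{{.Names}}'."""
    return [line.strip() for line in ps_names_output.splitlines() if line.strip()]


def build_exec_shell(container_name, shell="/bin/bash"):
    """Return the docker exec arguments that open an interactive shell in the container."""
    return ["docker", "exec", "-it", container_name, shell]


# Hypothetical output for a single running container with a randomly assigned name.
sample_output = "happy_turing\n"

names = container_names(sample_output)
print(names)
print(build_exec_shell(names[0]))
```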
220 00:11:09,640 --> 00:11:13,210 So, this launches the shell for the container, 221 00:11:13,210 --> 00:11:17,060 and then I was able to execute a conda install command 222 00:11:17,060 --> 00:11:20,530 to install the textblob module and the tweepy module. 223 00:11:20,530 --> 00:11:23,210 And down below here, it went through the process 224 00:11:23,210 --> 00:11:26,310 of figuring out all the different packages that I needed, 225 00:11:26,310 --> 00:11:28,770 and if I keep scrolling down, 226 00:11:28,770 --> 00:11:31,240 you'll see it tells me what's going to be installed, 227 00:11:31,240 --> 00:11:32,850 what was going to be updated, 228 00:11:32,850 --> 00:11:37,020 and it did have to actually downgrade a couple of modules 229 00:11:37,020 --> 00:11:38,420 in this case as well. 230 00:11:38,420 --> 00:11:40,610 I said yes, because I needed to install 231 00:11:40,610 --> 00:11:42,640 those for our demo purposes. 232 00:11:42,640 --> 00:11:44,570 It then installed all the software, 233 00:11:44,570 --> 00:11:49,170 and now I'm at the command prompt for my Docker container. 234 00:11:49,170 --> 00:11:54,050 Now, with that said, you don't want to have to go 235 00:11:54,050 --> 00:11:58,150 through this process every time you launch the container. 236 00:11:58,150 --> 00:12:02,450 So, it turns out that Docker has this cool little tool, 237 00:12:02,450 --> 00:12:04,890 which you can access through your Docker menu, 238 00:12:04,890 --> 00:12:08,390 called Kitematic, which I'm going to go ahead and launch. 239 00:12:08,390 --> 00:12:10,980 It's not installed by default. 240 00:12:10,980 --> 00:12:13,230 I'll talk about the interface here in a second, 241 00:12:13,230 --> 00:12:15,480 let me go back to my browser for a moment.
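One quick way to confirm that the install worked is to ask Python, from a notebook cell or the container's shell, whether the new modules can be found. This is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both modules from the conda install step are available.
print(missing_modules(["textblob", "tweepy"]))
```

If either name comes back in the list, the conda install step did not complete in the environment you are currently running.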
242 00:12:15,480 --> 00:12:18,660 You can download this from kitematic.com. 243 00:12:18,660 --> 00:12:20,420 It's the Docker Toolbox, 244 00:12:20,420 --> 00:12:24,290 as they call it, and what's nice about it is it keeps track 245 00:12:24,290 --> 00:12:27,770 of every container you've ever executed on your machine. 246 00:12:27,770 --> 00:12:31,090 And you can see this green circle to the left of the one 247 00:12:31,090 --> 00:12:32,700 that we were just talking about, 248 00:12:32,700 --> 00:12:35,110 which indicates that it's currently running. 249 00:12:35,110 --> 00:12:38,030 These are some other containers that I had 250 00:12:38,030 --> 00:12:41,750 executed in the past that are configured 251 00:12:41,750 --> 00:12:44,960 for various things, separate from the demonstration 252 00:12:44,960 --> 00:12:47,750 that I'm going to do for you here 253 00:12:47,750 --> 00:12:49,440 in the next several videos. 254 00:12:49,440 --> 00:12:53,090 Once you select a particular container, you can stop it, 255 00:12:53,090 --> 00:12:54,670 and you can restart it. 256 00:12:54,670 --> 00:12:58,480 This is important because if you simply stop and restart 257 00:12:58,480 --> 00:13:00,970 your container, rather than going back out 258 00:13:00,970 --> 00:13:03,910 to the command line to do that docker run command, 259 00:13:03,910 --> 00:13:07,610 you will not have to go through the process of reinstalling 260 00:13:07,610 --> 00:13:09,453 all the software again. 261 00:13:10,600 --> 00:13:14,120 So, when you next come back into your system 262 00:13:14,120 --> 00:13:17,340 and want to rerun the PySpark Notebook 263 00:13:17,340 --> 00:13:19,160 with that software installed, 264 00:13:19,160 --> 00:13:23,100 you would go to your Docker menu, open up Kitematic, 265 00:13:23,100 --> 00:13:26,600 select the particular container that you want to execute.
266 00:13:26,600 --> 00:13:29,580 And in your case, instead of saying stop 267 00:13:29,580 --> 00:13:31,770 and restart, you would have a run option 268 00:13:31,770 --> 00:13:34,370 because the container presumably is not running 269 00:13:34,370 --> 00:13:35,330 at that time. 270 00:13:35,330 --> 00:13:38,050 And you would then be able to launch the container, 271 00:13:38,050 --> 00:13:41,130 and separately you also have the ability to click 272 00:13:41,130 --> 00:13:45,060 this button to open up a command window on your system 273 00:13:45,060 --> 00:13:48,160 that's already logged into that container, 274 00:13:48,160 --> 00:13:50,970 and you have the ability to click this icon 275 00:13:50,970 --> 00:13:55,970 over here to open up the main webpage for your container 276 00:13:56,270 --> 00:13:59,610 as well. So, you can basically access your container 277 00:13:59,610 --> 00:14:01,830 and manipulate it right here 278 00:14:01,830 --> 00:14:04,560 in this nice little Kitematic tool. 279 00:14:04,560 --> 00:14:07,320 So, at this point we've got our Docker container 280 00:14:07,320 --> 00:14:10,770 up and running, and we are actually ready to start jumping 281 00:14:10,770 --> 00:14:13,050 into our Spark example, 282 00:14:13,050 --> 00:14:15,333 which I'll do starting with the next video.
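The stop and run buttons have command-line counterparts, `docker stop` and `docker start`, which reuse the existing container instead of creating a fresh one the way docker run does. The Python sketch below only builds and prints those commands; the container name is a hypothetical example, and using the command line here is an alternative to the Kitematic workflow shown in the video, not something the video demonstrates.

```python
def lifecycle_commands(container_name):
    """Return docker commands that stop and later restart an existing container."""
    return {
        "stop": ["docker", "stop", container_name],
        "start": ["docker", "start", container_name],  # reuses the container and its installed software
    }


# Hypothetical container name; use the one reported by docker ps on your system.
commands = lifecycle_commands("happy_turing")
for action, argv in commands.items():
    print(action, "->", " ".join(argv))
```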