1 00:00:06,750 --> 00:00:10,410 - Welcome to 14.1, Modern Large Language Models. 2 00:00:10,410 --> 00:00:13,590 You may be asking yourself, "Large language models? 3 00:00:13,590 --> 00:00:16,440 Haven't we already been working with language models?" 4 00:00:16,440 --> 00:00:18,210 And the answer is yes. 5 00:00:18,210 --> 00:00:21,780 If you were to ask someone, "What is a large language model?" 6 00:00:21,780 --> 00:00:22,652 they would say, "Well, 7 00:00:22,652 --> 00:00:26,700 LLMs, large language models, are simply machine 8 00:00:26,700 --> 00:00:28,757 learning models that are trained to process 9 00:00:28,757 --> 00:00:31,611 and generate natural language text." 10 00:00:31,611 --> 00:00:33,568 But we have already seen these. 11 00:00:33,568 --> 00:00:38,568 In these lessons, we've seen BERT, GPT-2, and T5; 12 00:00:39,660 --> 00:00:44,160 all of these are LLMs, large language models, 13 00:00:44,160 --> 00:00:45,810 but we're gonna shift our attention 14 00:00:45,810 --> 00:00:49,050 to massively large language models. 15 00:00:49,050 --> 00:00:52,499 Models like GPT-3 and beyond. 16 00:00:52,499 --> 00:00:56,070 These contain billions of parameters, 17 00:00:56,070 --> 00:00:58,620 whereas the previous models we've seen were 18 00:00:58,620 --> 00:01:02,610 in the meager hundreds of millions of parameters. 19 00:01:02,610 --> 00:01:04,490 But these massive language models are 20 00:01:04,490 --> 00:01:08,326 an order of magnitude larger in the number 21 00:01:08,326 --> 00:01:10,830 of parameters that they contain, 22 00:01:10,830 --> 00:01:14,820 and also in the size of the datasets they were trained on. 23 00:01:14,820 --> 00:01:17,665 And again, we've talked about pre-training before. 24 00:01:17,665 --> 00:01:21,381 We've seen WebText, which GPT-2 was trained on.
25 00:01:21,381 --> 00:01:26,100 We've talked about Common Crawl, which T5 was trained on. 26 00:01:26,100 --> 00:01:28,710 But these massively large language models, 27 00:01:28,710 --> 00:01:33,570 these massive LLMs, can perform a wide range 28 00:01:33,570 --> 00:01:37,322 of language tasks, ranging from translation and summarization 29 00:01:37,322 --> 00:01:38,756 to question answering. 30 00:01:38,756 --> 00:01:42,124 And, as we'll see, a lot more, with absolutely 31 00:01:42,124 --> 00:01:45,074 no further fine-tuning required. 32 00:01:45,074 --> 00:01:49,506 Coming back to the idea of size, we saw, for example, that 33 00:01:49,506 --> 00:01:54,330 BERT has around 110 million parameters, at least the base 34 00:01:54,330 --> 00:01:58,320 model of BERT as it was introduced in 2018, 35 00:01:58,320 --> 00:02:00,870 which is still considered large. 36 00:02:00,870 --> 00:02:04,063 But again, when we talk about massively large, we're talking 37 00:02:04,063 --> 00:02:07,530 on the order of, for example, GPT-3, 38 00:02:07,530 --> 00:02:12,530 which has 175 billion parameters to its name. 39 00:02:13,020 --> 00:02:15,759 And this number will only go up as these models 40 00:02:15,759 --> 00:02:17,970 continue to advance. 41 00:02:17,970 --> 00:02:21,368 And they are obviously comparatively massive, 42 00:02:21,368 --> 00:02:25,230 but size is not the only factor to consider here. 43 00:02:25,230 --> 00:02:28,110 Bigger does not mean better for all 44 00:02:28,110 --> 00:02:30,047 natural language processing tasks. 45 00:02:30,047 --> 00:02:33,727 BERT still achieves strong results, and in some cases 46 00:02:33,727 --> 00:02:37,892 stronger results than these massively large language models, 47 00:02:37,892 --> 00:02:41,822 on particular types of tasks that BERT is suited for, 48 00:02:41,822 --> 00:02:44,712 like, for example, sequence classification, which we 49 00:02:44,712 --> 00:02:46,293 have seen before.
50 00:02:47,490 --> 00:02:50,130 To work with these massively large language models, 51 00:02:50,130 --> 00:02:53,040 we are generally talking about what are called 52 00:02:53,040 --> 00:02:55,410 closed-source models, 53 00:02:55,410 --> 00:02:58,070 meaning they are owned and operated by 54 00:02:58,070 --> 00:03:01,964 the company or organization that created them. 55 00:03:01,964 --> 00:03:05,666 And to interact with them, we generally have to use 56 00:03:05,666 --> 00:03:09,732 either what's called a playground, or an API. 57 00:03:09,732 --> 00:03:13,313 Playgrounds are graphical interfaces to play 58 00:03:13,313 --> 00:03:17,467 with and also develop with these LLMs. 59 00:03:17,467 --> 00:03:20,365 And we'll see an example of that in this lesson. 60 00:03:20,365 --> 00:03:21,680 We'll also see an example 61 00:03:21,680 --> 00:03:24,660 of interacting with these massive LLMs 62 00:03:24,660 --> 00:03:28,022 through an API, an application programming interface, 63 00:03:28,022 --> 00:03:31,950 which is just a programmatic interface to the LLM, 64 00:03:31,950 --> 00:03:35,316 meaning we'll be able to actually call the LLM using, 65 00:03:35,316 --> 00:03:38,730 in our case, Python from a Jupyter notebook, 66 00:03:38,730 --> 00:03:43,099 which we will also see in our examples later in this lesson. 67 00:03:43,099 --> 00:03:48,099 For example, if we wanted to interact with, say, GPT-3, 68 00:03:48,300 --> 00:03:51,150 which is owned and operated by OpenAI, 69 00:03:51,150 --> 00:03:53,700 the same people who made GPT-2, which we saw 70 00:03:53,700 --> 00:03:57,980 in earlier lessons, you would have to use either their API 71 00:03:57,980 --> 00:04:01,185 or, as we're looking at here, their playground. 72 00:04:01,185 --> 00:04:03,109 This is literally a website 73 00:04:03,109 --> 00:04:05,791 that you can go to to talk to and interact with 74 00:04:05,791 --> 00:04:09,850 GPT-3, the massively large language model.
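The API route described above, calling the LLM from Python in a Jupyter notebook, might look roughly like the sketch below. The helper function, the model name, and the parameter values are illustrative assumptions, not the lesson's actual code; the commented-out lines show the general shape of a completion request with a provider SDK and would require an installed client library plus an API key.

```python
# A rough sketch of what a programmatic request to a hosted LLM
# contains. build_completion_request is a hypothetical helper; the
# model name and defaults below are placeholders for illustration.

def build_completion_request(prompt, temperature=0.7, max_tokens=64):
    """Assemble the parameters we would send to a completion endpoint."""
    return {
        "model": "text-davinci-003",  # placeholder model name
        "prompt": prompt,
        "temperature": temperature,   # same randomness knob as the playground
        "max_tokens": max_tokens,     # cap on generated tokens
    }

request = build_completion_request(
    "Write a tweet talking about how great GPT-3 is."
)
print(request["prompt"])

# With a provider SDK installed and an API key configured, the actual
# call would look something like this (not run here):
#
#   import openai
#   response = openai.Completion.create(**request)
#   print(response.choices[0].text)
```

The point is that the API exposes the same prompt-plus-inference-parameters interface as the playground, just as a function call instead of a web page.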
75 00:04:09,850 --> 00:04:12,577 As you can see, the playground is mostly taken 76 00:04:12,577 --> 00:04:14,790 up by this large text area 77 00:04:14,790 --> 00:04:16,793 in the middle, where you actually get to type in 78 00:04:16,793 --> 00:04:21,081 a request, or what we'll come to call a prompt, 79 00:04:21,081 --> 00:04:24,337 to the LLM and see its response. 80 00:04:24,337 --> 00:04:25,797 However, you'll notice 81 00:04:25,797 --> 00:04:28,323 on the right we do have some other options, 82 00:04:28,323 --> 00:04:31,516 and some of these options should look familiar to you. 83 00:04:31,516 --> 00:04:34,200 We see, for example, temperature, 84 00:04:34,200 --> 00:04:37,801 which means exactly the same thing as it did with GPT-2. 85 00:04:37,801 --> 00:04:40,560 It is an inference parameter that is used to 86 00:04:40,560 --> 00:04:44,760 adjust the randomness of the model's output. 87 00:04:44,760 --> 00:04:48,360 And we'll see these in action when we turn our attention 88 00:04:48,360 --> 00:04:51,480 to actually using the playground for ourselves. 89 00:04:51,480 --> 00:04:53,267 But pretty much all playgrounds will look 90 00:04:53,267 --> 00:04:54,930 something like this. 91 00:04:54,930 --> 00:04:57,630 You have a text area, you type in a request 92 00:04:57,630 --> 00:04:59,875 to the LLM, you get a response back, 93 00:04:59,875 --> 00:05:03,399 and what you do with that response is now up to you. 94 00:05:03,399 --> 00:05:05,730 So, for example, 95 00:05:05,730 --> 00:05:09,826 if I wanted to ask GPT-3 to write me a tweet talking 96 00:05:09,826 --> 00:05:13,752 about how great GPT-3 is, it might look something 97 00:05:13,752 --> 00:05:17,755 like this: in that large text area of the playground, 98 00:05:17,755 --> 00:05:22,755 I would type in my prompt, my request to the LLM. 99 00:05:22,860 --> 00:05:27,491 In this case, that prompt is "Write a tweet talking 100 00:05:27,491 --> 00:05:30,901 about how great GPT-3 is."
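Since the temperature slider comes up here, a quick sketch of what that knob actually does may help: the model's raw next-token scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution (more deterministic output) and high values flatten it (more random output). The logits below are made-up numbers for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                  # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)

# At low temperature the top token dominates; at high temperature the
# probabilities spread out across all candidates.
print(round(cold[0], 3), round(hot[0], 3))
```

This is why a temperature near zero makes the playground give nearly the same completion every time, while a high temperature produces more varied, sometimes stranger, output.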
101 00:05:30,901 --> 00:05:35,003 Highlighted in green is GPT-3's response 102 00:05:35,003 --> 00:05:39,787 to me, giving me exactly what I asked for. 103 00:05:39,787 --> 00:05:41,130 "GPT-3 is amazing. 104 00:05:41,130 --> 00:05:43,140 It's already revolutionizing AI, 105 00:05:43,140 --> 00:05:44,580 and it's only getting better." 106 00:05:44,580 --> 00:05:45,743 With appropriate hashtags, 107 00:05:45,743 --> 00:05:48,815 because I asked specifically for a tweet. 108 00:05:48,815 --> 00:05:53,190 Now, this is not a summarization task, nor is it a 109 00:05:53,190 --> 00:05:55,800 question-answering task, nor is it really one 110 00:05:55,800 --> 00:05:59,366 of the structured NLP tasks that we've talked 111 00:05:59,366 --> 00:06:01,470 about throughout our lessons. 112 00:06:01,470 --> 00:06:04,260 And this is what has caught the attention 113 00:06:04,260 --> 00:06:06,490 of so many people: 114 00:06:06,490 --> 00:06:10,262 as you take these large language models and you enter 115 00:06:10,262 --> 00:06:13,483 into the realm of massively large language models, 116 00:06:13,483 --> 00:06:17,490 we start to see a great enhancement 117 00:06:17,490 --> 00:06:22,369 in the types of tasks, or just day-to-day requests, 118 00:06:22,369 --> 00:06:26,190 that these language models can perform for us. 119 00:06:26,190 --> 00:06:28,260 We stop thinking so much 120 00:06:28,260 --> 00:06:30,768 in terms of "What sequences am I classifying, 121 00:06:30,768 --> 00:06:34,742 or is this abstractive or extractive summarization?" 122 00:06:34,742 --> 00:06:38,130 and we start to interact with these models on more 123 00:06:38,130 --> 00:06:42,270 of an everyday basis: "Well, can you help me do this small task? 124 00:06:42,270 --> 00:06:43,603 Can you help me solve this? 125 00:06:43,603 --> 00:06:46,170 Can you help me rewrite this paragraph?
126 00:06:46,170 --> 00:06:50,880 For example, let's say for my resume or my cover letter, 127 00:06:50,880 --> 00:06:54,504 can you help me with these text-based tasks I'm 128 00:06:54,504 --> 00:06:58,865 facing that aren't necessarily structured academic tasks?" 129 00:06:58,865 --> 00:07:01,800 And this is what has really popularized these 130 00:07:01,800 --> 00:07:04,938 large language models and thrust them into the media, 131 00:07:04,938 --> 00:07:08,613 and also into the hands of everyday developers. 132 00:07:10,590 --> 00:07:14,160 So here is that GPT-3 playground that I was talking about, 133 00:07:14,160 --> 00:07:17,386 in its real form; I am actually on the playground. 134 00:07:17,386 --> 00:07:20,394 So if I wanted to interact with the model, 135 00:07:20,394 --> 00:07:21,616 I can just start typing 136 00:07:21,616 --> 00:07:24,876 in any kind of instruction that I have. 137 00:07:24,876 --> 00:07:27,715 For example, let's say I was tasked 138 00:07:27,715 --> 00:07:30,897 with planning a birthday party 139 00:07:30,897 --> 00:07:34,440 for a five-year-old who happens to love Marvel and Disney. 140 00:07:34,440 --> 00:07:38,000 So I'm gonna ask GPT-3, straight up, 141 00:07:38,000 --> 00:07:40,570 what are some birthday party ideas 142 00:07:40,570 --> 00:07:44,790 for a five-year-old who loves Marvel and Disney? 143 00:07:44,790 --> 00:07:46,920 Now, for the most part, on the side here 144 00:07:46,920 --> 00:07:49,770 I get to choose a lot of my parameters, 145 00:07:49,770 --> 00:07:52,860 and I've actually not toggled any of these parameters. 146 00:07:52,860 --> 00:07:55,445 So we're gonna see just a straight answer 147 00:07:55,445 --> 00:07:57,797 from the model as is.
148 00:07:57,797 --> 00:08:01,500 To do this, I'll hit submit, take my hands off the screen, 149 00:08:01,500 --> 00:08:05,550 and all of a sudden GPT-3, like its cousin 150 00:08:05,550 --> 00:08:09,389 GPT-2, being an autoregressive language model, 151 00:08:09,389 --> 00:08:11,293 is going to start thinking token 152 00:08:11,293 --> 00:08:16,293 by token about a response to my command, to my instruction. 153 00:08:16,633 --> 00:08:19,110 So it's given me four ideas. 154 00:08:19,110 --> 00:08:21,960 I'm not gonna read off all of them, but they range 155 00:08:21,960 --> 00:08:25,680 from holding a superhero-themed party, all the way 156 00:08:25,680 --> 00:08:28,500 to having a combined Marvel-Disney party 157 00:08:28,500 --> 00:08:30,630 with decorations and activities. 158 00:08:30,630 --> 00:08:32,487 Now, this could be the end of it. 159 00:08:32,487 --> 00:08:34,922 We could be done here, but 160 00:08:34,922 --> 00:08:38,939 if I wanna continue this conversation with the LLM, 161 00:08:38,939 --> 00:08:42,382 I could hit enter a few times in the playground and say, 162 00:08:42,382 --> 00:08:47,382 "Can you tell me more about the third option?" 163 00:08:50,070 --> 00:08:52,860 Now, if we're thinking in terms of an LLM, or just 164 00:08:52,860 --> 00:08:56,419 a language model in general, we 165 00:08:56,419 --> 00:09:00,540 kind of now understand how the language model is thinking. 166 00:09:00,540 --> 00:09:03,356 It's going to take all of this information 167 00:09:03,356 --> 00:09:06,360 as a prompt, or an input 168 00:09:06,360 --> 00:09:10,816 to the model, and generate a second output to our question. 169 00:09:10,816 --> 00:09:15,816 This is not unlike how, say, T5 or GPT-2 works: 170 00:09:16,290 --> 00:09:18,750 using an input to the model, 171 00:09:18,750 --> 00:09:20,679 we would give it a certain task 172 00:09:20,679 --> 00:09:23,391 and ask it to solve that task.
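The "conversation" described above can be sketched as simple prompt concatenation: each follow-up sends everything generated so far, plus the new question, back to the autoregressive model as one long prompt. The helper function below is hypothetical, and the transcript string is an abbreviated stand-in for the lesson's example exchange.

```python
# Multi-turn interaction as prompt concatenation: the model has no
# memory between calls, so "remembering" the third option just means
# the third option is literally in the next prompt.

def extend_prompt(transcript, follow_up):
    """Append a follow-up question to everything generated so far."""
    return transcript + "\n\n" + follow_up

transcript = (
    "What are some birthday party ideas for a five-year-old "
    "who loves Marvel and Disney?\n"
    "1. A superhero-themed party ...\n"
    "3. A Marvel-themed party with a Marvel-inspired cake ...\n"
)
prompt = extend_prompt(transcript, "Can you tell me more about the third option?")

# The model sees its own earlier output inside this new prompt,
# which is why it can elaborate on "the third option".
print(prompt)
```

This is the same input-to-output pattern as T5 or GPT-2: one string in, one generated continuation out.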
173 00:09:23,391 --> 00:09:27,822 So if I hit submit again, it understands, well, 174 00:09:27,822 --> 00:09:32,790 the third option it gave me, being the Marvel-themed party 175 00:09:32,790 --> 00:09:35,587 with a Marvel-inspired cake and decorations, 176 00:09:35,587 --> 00:09:38,818 it understands, it remembers rather, 177 00:09:38,818 --> 00:09:43,440 or, I should say more literally, it can see the third option 178 00:09:43,440 --> 00:09:46,316 that it itself wrote, and say, "Yeah, sure. 179 00:09:46,316 --> 00:09:49,500 For the third option, you could host a Marvel-themed party 180 00:09:49,500 --> 00:09:51,790 with decorations featuring Marvel characters, 181 00:09:51,790 --> 00:09:54,030 serve a Marvel-inspired cake, 182 00:09:54,030 --> 00:09:55,546 and have activities like a Marvel movie 183 00:09:55,546 --> 00:09:59,580 marathon, arts and crafts, and scavenger hunts with prizes." 184 00:09:59,580 --> 00:10:03,060 Now, if this is all sounding a little common 185 00:10:03,060 --> 00:10:07,470 and not so interesting, I'm gonna throw that back and say, 186 00:10:07,470 --> 00:10:08,499 but doesn't this open up 187 00:10:08,499 --> 00:10:11,109 an infinite number of possibilities? 188 00:10:11,109 --> 00:10:14,258 It now understands what Marvel and Disney are. 189 00:10:14,258 --> 00:10:17,731 It knows what parties are like for five-year-olds. 190 00:10:17,731 --> 00:10:20,190 The world is now my oyster. 191 00:10:20,190 --> 00:10:24,390 I don't have to be a natural language processing specialist 192 00:10:24,390 --> 00:10:25,759 or a machine learning engineer 193 00:10:25,759 --> 00:10:29,403 and train a model to understand what parties are.
194 00:10:29,403 --> 00:10:34,296 At the massive level, we start to see general information 195 00:10:34,296 --> 00:10:38,431 and general knowledge being encoded directly 196 00:10:38,431 --> 00:10:41,589 into the parameters of these autoregressive 197 00:10:41,589 --> 00:10:45,960 language models, and we can see the result 198 00:10:45,960 --> 00:10:50,960 of that being very specific, tailored answers to our question. 199 00:10:50,970 --> 00:10:53,820 Now, I'm painting in broad strokes here, 200 00:10:53,820 --> 00:10:55,617 because later on in our lesson we're going to 201 00:10:55,617 --> 00:10:59,036 see exactly how this came to be. 202 00:10:59,036 --> 00:11:01,358 Because if we were to type something like this 203 00:11:01,358 --> 00:11:05,569 into GPT-2, for example, we would not get anything close 204 00:11:05,569 --> 00:11:09,372 to a specific answer like we're seeing here. 205 00:11:09,372 --> 00:11:13,439 That comes with a special addition to GPT-3, 206 00:11:13,439 --> 00:11:17,223 courtesy of its creator, OpenAI.