We just talked about the concept of training a model in the previous lecture. In that context, it is important to understand that this is probably the hardest and most complicated part of developing a machine learning solution. Training a model is not simply a matter of running some algorithm on the training dataset and getting the best trained model. The challenge is to make the model generic, making sure it performs well on unseen data. We need to remember that the learning algorithm creates the model while trying to optimize something. It's all about optimization: a process of adjusting a model step by step to get the best performance on the training dataset. On the other hand, the objective of a machine learning system is to make good predictions on data it has never seen before. This is called generalization. A well-generalized model is one where the patterns learned from the examples in the training dataset can also be successfully applied to new, unseen data instances.
That's the whole objective of any machine learning solution: to make good predictions on new data, not just on the training dataset. As I mentioned, this is not an easy task, and there are two main challenges to overcome before we can get a well-generalized model. Those challenges are called underfitting and overfitting.

Starting with underfitting: underfitting refers to a situation where the trained model does not work well on the training data and, of course, cannot generalize to new data. The trained model didn't capture the underlying structure of the data. If this is the end result of the learning algorithm, then something is not working. Take a look at the following two simple charts. We have multiple points as the training dataset and one straight line, created by some learning algorithm, which is supposed to represent the model. The line itself is the model. We can easily see that on the left side, this straight line does not really represent the patterns in the data, which is the problem of underfitting. On the other hand, the line in the second graph is not linear and can better represent the patterns.
It has a better fit to the training data.

Now, what are the main reasons for underfitting? The first one is that the model is probably too simple, and we need to build a more complex model that can better learn the underlying structure of the data. In that case, it makes sense to try a different learning algorithm. For example, here we moved from an algorithm that builds a straight line to an algorithm that can build a non-linear curve, which better represents the underlying data. The second reason for underfitting is that the training dataset is not good enough. Maybe there are not enough examples, or maybe the input features of the provided examples are not informative enough; giving an algorithm just the size of a house, without other related features, is not enough. On the other hand, underfitting is also a standard transition phase when training any model. During the training process, the learning algorithm will build and adjust the model while performing a certain number of iterations.
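As a minimal sketch of that first reason, the following NumPy snippet (my own toy data, not from the lecture) fits both a straight line and a cubic curve to points that follow a non-linear pattern; the straight line's much larger training error is exactly the underfitting problem described above.

```python
import numpy as np

# Toy dataset with a clearly non-linear (cubic) pattern plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x**3 - 2 * x + rng.normal(scale=0.5, size=x.shape)

def train_error(degree):
    """Fit a polynomial of the given degree and return its mean squared
    error on the *training* points."""
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    return np.mean((y - predictions) ** 2)

mse_linear = train_error(1)   # straight line -> too simple, underfits
mse_cubic = train_error(3)    # cubic curve -> matches the structure

print(f"linear model MSE: {mse_linear:.2f}")
print(f"cubic model MSE:  {mse_cubic:.2f}")
```

Switching from degree 1 to degree 3 here plays the role of "trying a different learning algorithm": the model family becomes expressive enough to capture the underlying structure.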
At the beginning of training, the model will underfit the training data because it has only just started to model the relevant patterns. With each learning iteration, the model's performance should improve again and again, making the model a much better fit to the training data. So there is a transition phase from an underfitting model to a fitting model.

If we keep trying to improve the model, there is a danger that we create a model that is overfitting the dataset, and now we are moving to the second challenge. After reaching some optimum point while the algorithm is running over the dataset, the model's test performance will start to degrade, which means the model is starting to overfit the training data: learning patterns that are too specific to the training data and irrelevant to new data.

As a simple analogy, let's say we just bought a few grocery items, say 20, in a local supermarket during our vacation in a different country. The overall price of that basket was cheaper than we expected; it surprised us in a positive way.
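That underfit-to-fit transition can be seen directly in a tiny gradient descent loop. This is a hypothetical sketch on synthetic data (names and values are my own, not from the lecture): the model starts out badly underfitting, and each iteration nudges the parameters so the training loss shrinks.

```python
import numpy as np

# Simple 1-D linear regression trained by gradient descent, to show the
# training loss shrinking over iterations (the underfit -> fit transition).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 3 * x + 1 + rng.normal(scale=0.1, size=x.shape)

w, b = 0.0, 0.0            # start from a badly underfitting model
lr = 0.5                   # learning rate
losses = []
for _ in range(200):
    pred = w * x + b
    err = pred - y
    losses.append(np.mean(err**2))
    # Gradients of the mean-squared-error loss w.r.t. w and b.
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(f"loss at start: {losses[0]:.3f}")
print(f"loss at end:   {losses[-1]:.4f}")
```

The first recorded loss is large (pure underfitting); after a few hundred iterations the loss settles near the noise floor, i.e. the model now fits the training data.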
When we got back home, we told our friends that the supermarkets in that country are unbelievably cheap. What do you think, is that a reasonable conclusion? Well, not really. We just overgeneralized a pattern from a very small number of samples: the few items we bought in the supermarket. The conclusion seems to make sense; it fits our observation nicely. Maybe we are right, and maybe we are completely wrong. It makes sense to check a much larger number of items in the supermarket before drawing a conclusion. When the same thing happens in machine learning, it is called overfitting. Overfitting is a very common situation when training models. It means that the trained model performs very well on the training data but does not generalize well to new data; the model does not perform well on new data.

Looking at the same graph, drawing a line that perfectly connects the points is an example of overfitting. For the points in our training dataset it is perfect.
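Here is a small sketch of that "perfectly connecting the points" situation, again on made-up data of my own: a degree-9 polynomial passes essentially exactly through 10 training points, but a fresh sample from the same distribution exposes how poorly it generalizes.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n points from the same underlying noisy sine pattern."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.1, size=n)

x_train, y_train = sample(10)
x_new, y_new = sample(200)

# Degree 9 with 10 points "connects the dots" exactly -> overfits.
overfit = np.polyfit(x_train, y_train, 9)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Near-zero error on the training points...
print("train MSE (degree 9):", mse(overfit, x_train, y_train))
# ...but a much larger error on points the model has never seen.
print("new-data MSE (degree 9):", mse(overfit, x_new, y_new))
```

The training error is essentially zero, yet the error on new points is far larger: the model memorized the noise in the 10 samples instead of the underlying pattern.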
But when we use it for new data points, this model will not perform so well. Why do we encounter such an overfitting situation when using the training dataset? Well, there are a few common reasons. The training dataset is a sample of a much larger distribution. If we take 100 items in a supermarket to compare their prices with the same items in a different supermarket, it is a small sample; it's not all the items available in the supermarket, which can be millions of items. Maybe if we compare 1,000 or 10,000 items, we will get a much better picture of the distribution of prices in the supermarket. The same challenge applies when training a model using the training dataset. The training dataset is a sample; it is a group of examples. It should be a large enough sample to resemble the true distribution of the data as closely as possible. This is the key issue to remember: the training data should represent the distribution of the data as much as possible, otherwise the model will just overfit the training data. The next reason can also be a model that is too complex.
The objective of a model is to fit the data well, but at the same time to stay as simple as possible. It is a careful balance. If the model is too complex, then while trying to fit the training data perfectly, we increase the risk of overfitting anyway, even if we used a very large dataset. How can we discover such problems? How can we trust that the model will also do a good job on new data? Maybe it is overfitting the training dataset and does not generalize well to new data. The answer is that we need to test the model's performance on a separate dataset, to check and validate that our model works well on new data. It is called the test dataset. The concept is quite simple: we have one group of examples to train a model and another group of examples to test the model. We'll talk about it later in this training.

As a quick summary of the key things to remember: training a model is not an easy task. It's actually the core job of a data scientist when building a machine learning project. The challenge is to make the trained model more generic.
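The train/test split idea can be sketched in a few lines of NumPy (a minimal illustration with invented data, not the lecture's own example): hold out part of the data, train only on the rest, and measure the model on the examples it never saw.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Shuffle the indices, then keep 80% for training and 20% for testing.
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:80], idx[80:]

# Fit the model on the training portion only.
coeffs = np.polyfit(x[train_idx], y[train_idx], 1)

def mse(i):
    return np.mean((np.polyval(coeffs, x[i]) - y[i]) ** 2)

print(f"train MSE: {mse(train_idx):.4f}")
print(f"test MSE:  {mse(test_idx):.4f}")
# A test error far above the training error would be the
# signature of overfitting; here the two stay close.
```

The key design point is that the test examples never influence the fitted coefficients, so the test error is an honest estimate of performance on new data.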
That means making sure it performs well on unseen data, in other words, making the model well generalized. There are two main challenges to overcome: underfitting and overfitting. Underfitting is when the trained model does not work well on the training dataset and, of course, cannot generalize to new data. Overfitting, which is a more complex problem, is when the trained model performs very well on the training data but does not generalize well to new data; the model does not perform well on new data.

Overall, this section was a high-level introduction to the basic machine learning terminology. Please post a question if you would like to ask something. In the next section, we are going to talk about the main classification of machine learning systems.