We just talked about the concept of training a model in the previous lecture. In that context, it is important to understand that this is probably the hardest and most complicated part of developing a machine learning solution. Training a model is not simply a matter of running some algorithm on the training dataset and getting the best trained model. The challenge is to make the model generic, making sure it performs well on unseen data. We need to remember that the learning algorithm creates the model while trying to optimize something. It's all about optimization: a process of adjusting a model step by step to get the best performance on the training dataset. On the other hand, the objective of a machine learning system is to make good predictions on data it has never seen before. This is called generalization. A well-generalized model is one where the patterns learned from the examples in the training dataset can also be successfully applied to new, unseen data instances.
That's the whole objective of any machine learning solution: to make good predictions on new data, not just on the training dataset. As I mentioned, this is not an easy task, and there are two main challenges to overcome before we can get a well-generalized model. Those challenges are called underfitting and overfitting.

Starting with underfitting: underfitting refers to a situation where the trained model does not work well on the training data and, of course, cannot generalize to new data. The trained model didn't capture the underlying structure of the data. If this is the end result of the learning algorithm, then something is not working. Take a look at the following two simple charts. We have multiple points as the training dataset and one straight line, created by some learning algorithm, which is supposed to represent the model. The line itself is the model. We can easily see that on the left side, this straight line does not really represent the patterns in the data, which is the problem of underfitting. On the other hand, the line in the second graph is not linear and can better represent the patterns.
It has a better fit to the training data.

Now, what are the main reasons for underfitting? The first one is that the model is probably too simple, and we need to build a more complex model that can better learn the underlying structure of the data. In that case, it makes sense to try a different learning algorithm. For example, here we moved from an algorithm that builds a straight line to an algorithm that can build a non-linear curve, which better represents the underlying data. The second reason for underfitting is that the training dataset is not good enough. Maybe there are not enough examples, or maybe the input features of the provided examples are not informative enough; giving an algorithm just the size of a house, without other related features, is not enough. On the other hand, underfitting is also a standard transition phase when training any model. During the training process, the learning algorithm will build and adjust the model while performing a certain number of iterations.
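As a minimal sketch of that first reason, the following NumPy snippet (my own toy data, not from the lecture) fits both a straight line and a cubic curve to points that follow a non-linear pattern; the straight line's much larger training error is exactly the underfitting problem described above.

```python
import numpy as np

# Toy dataset with a clearly non-linear (cubic) pattern plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x**3 - 2 * x + rng.normal(scale=0.5, size=x.shape)

def train_error(degree):
    """Fit a polynomial of the given degree and return its mean squared
    error on the *training* points."""
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    return np.mean((y - predictions) ** 2)

mse_linear = train_error(1)   # straight line -> too simple, underfits
mse_cubic = train_error(3)    # cubic curve -> matches the structure

print(f"linear model MSE: {mse_linear:.2f}")
print(f"cubic model MSE:  {mse_cubic:.2f}")
```

Switching from degree 1 to degree 3 here plays the role of "trying a different learning algorithm": the model family becomes expressive enough to capture the underlying structure.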
At the beginning of training, the model will underfit the training data because it has only just started to model the relevant patterns. With each learning iteration, the model's performance should improve again and again, making the model a much better fit to the training data. So there is a transition phase from an underfitting model to a fitting model.

If we keep trying to improve the model, there is a danger that we create a model that is overfitting the dataset, and now we are moving to the second challenge. After reaching some optimum point while the algorithm is running over the dataset, the model's test performance will start to degrade, which means the model is starting to overfit the training data: learning patterns that are too specific to the training data and irrelevant to new data.

As a simple analogy, let's say we just bought a few grocery items, say 20, in a local supermarket during our vacation in a different country. The overall price of that basket was cheaper than we expected; it surprised us in a positive way.
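That underfit-to-fit transition can be seen directly in a tiny gradient descent loop. This is a hypothetical sketch on synthetic data (names and values are my own, not from the lecture): the model starts out badly underfitting, and each iteration nudges the parameters so the training loss shrinks.

```python
import numpy as np

# Simple 1-D linear regression trained by gradient descent, to show the
# training loss shrinking over iterations (the underfit -> fit transition).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 3 * x + 1 + rng.normal(scale=0.1, size=x.shape)

w, b = 0.0, 0.0            # start from a badly underfitting model
lr = 0.5                   # learning rate
losses = []
for _ in range(200):
    pred = w * x + b
    err = pred - y
    losses.append(np.mean(err**2))
    # Gradients of the mean-squared-error loss w.r.t. w and b.
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(f"loss at start: {losses[0]:.3f}")
print(f"loss at end:   {losses[-1]:.4f}")
```

The first recorded loss is large (pure underfitting); after a few hundred iterations the loss settles near the noise floor, i.e. the model now fits the training data.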
When we got back home, we told our friends that the supermarkets in that country are unbelievably cheap. What do you think, is that a reasonable conclusion? Well, not really. We just overgeneralized a pattern from a very small number of samples: the few items we bought in the supermarket. The conclusion seems to make sense; it fits our observation nicely. Maybe we are right, and maybe we are completely wrong. It makes sense to check a much larger number of items in the supermarket before drawing a conclusion. When the same thing happens in machine learning, it is called overfitting. Overfitting is a very common situation when training models. It means that the trained model performs very well on the training data but does not generalize well to new data; the model does not perform well on new data.

Looking at the same graph, drawing a line that perfectly connects the points is an example of overfitting. For the points in our training dataset it is perfect.
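Here is a small sketch of that "perfectly connecting the points" situation, again on made-up data of my own: a degree-9 polynomial passes essentially exactly through 10 training points, but a fresh sample from the same distribution exposes how poorly it generalizes.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n points from the same underlying noisy sine pattern."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.1, size=n)

x_train, y_train = sample(10)
x_new, y_new = sample(200)

# Degree 9 with 10 points "connects the dots" exactly -> overfits.
overfit = np.polyfit(x_train, y_train, 9)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Near-zero error on the training points...
print("train MSE (degree 9):", mse(overfit, x_train, y_train))
# ...but a much larger error on points the model has never seen.
print("new-data MSE (degree 9):", mse(overfit, x_new, y_new))
```

The training error is essentially zero, yet the error on new points is far larger: the model memorized the noise in the 10 samples instead of the underlying pattern.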
But when we use it for new data points, this model will not perform so well. Why do we encounter such an overfitting situation when using the training dataset? Well, there are a few common reasons. The training dataset is a sample of a much larger distribution. If we take 100 items in a supermarket to compare their prices with the same items in a different supermarket, it is a small sample; it's not all the items available in the supermarket, which can be millions of items. Maybe if we compare 1,000 or 10,000 items, we will get a much better picture of the distribution of prices in the supermarket. The same challenge applies when training a model using the training dataset. The training dataset is a sample; it is a group of examples. It should be a large enough sample to resemble the true distribution of the data as closely as possible. This is the key issue to remember: the training data should represent the distribution of the data as much as possible, otherwise the model will just overfit the training data. The next reason can also be a model that is too complex.
The objective of a model is to fit the data well, but at the same time to stay as simple as possible. It is a careful balance. If the model is too complex, then while trying to fit the training data perfectly, we increase the risk of overfitting anyway, even if we used a very large dataset. How can we discover such problems? How can we trust that the model will also do a good job on new data? Maybe it is overfitting the training dataset and does not generalize well to new data. The answer is that we need to test the model's performance on a separate dataset, to check and validate that our model works well on new data. It is called the test dataset. The concept is quite simple: we have one group of examples to train a model and another group of examples to test the model. We'll talk about it later in this training.

As a quick summary of the key things to remember: training a model is not an easy task. It's actually the core job of a data scientist when building a machine learning project. The challenge is to make the trained model more generic.
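The train/test split idea can be sketched in a few lines of NumPy (a minimal illustration with invented data, not the lecture's own example): hold out part of the data, train only on the rest, and measure the model on the examples it never saw.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Shuffle the indices, then keep 80% for training and 20% for testing.
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:80], idx[80:]

# Fit the model on the training portion only.
coeffs = np.polyfit(x[train_idx], y[train_idx], 1)

def mse(i):
    return np.mean((np.polyval(coeffs, x[i]) - y[i]) ** 2)

print(f"train MSE: {mse(train_idx):.4f}")
print(f"test MSE:  {mse(test_idx):.4f}")
# A test error far above the training error would be the
# signature of overfitting; here the two stay close.
```

The key design point is that the test examples never influence the fitted coefficients, so the test error is an honest estimate of performance on new data.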
That means making sure it performs well on unseen data, in other words, making the model well generalized. There are two main challenges to overcome: underfitting and overfitting. Underfitting is when the trained model does not work well on the training dataset and, of course, cannot generalize to new data. Overfitting, which is a more complex problem, is when the trained model performs very well on the training data but does not generalize well to new data; the model does not perform well on new data.

Overall, this section was a high-level introduction to the basic machine learning terminology. Please post a question if you would like to ask something. In the next section, we are going to talk about the main classification of machine learning systems.