1
00:00:00,000 --> 00:00:03,882
The last learning type we will encounter

2
00:00:03,883 --> 00:00:07,572
in machine learning is called reinforcement learning,

3
00:00:07,573 --> 00:00:11,722
and it is a completely different approach compared

4
00:00:11,723 --> 00:00:15,892
to supervised or unsupervised learning.

5
00:00:15,893 --> 00:00:19,316
In reinforcement learning, we are not using a group of

6
00:00:19,317 --> 00:00:25,012
labeled or unlabeled examples as input to train a model.

7
00:00:25,013 --> 00:00:29,026
It may sound a little strange,

8
00:00:29,027 --> 00:00:32,040
but don't worry, it will be clear in a few minutes.

9
00:00:32,650 --> 00:00:36,450
This method is used as a framework

10
00:00:36,451 --> 00:00:40,774
for goal-based decision-making tasks.

11
00:00:40,775 --> 00:00:43,622
It can be used to achieve a complex

12
00:00:43,623 --> 00:00:48,570
objective by performing multiple sequences of actions.

13
00:00:49,550 --> 00:00:53,124
For example, it is widely used in building

14
00:00:53,125 --> 00:00:55,972
AI systems for playing all kinds of computer

15
00:00:55,973 --> 00:01:00,778
games while trying to achieve superhuman performance.

16
00:01:00,779 --> 00:01:04,760
It is used for teaching robots to perform

17
00:01:04,761 --> 00:01:09,064
tasks in dynamic environments, building real-time

18
00:01:09,065 --> 00:01:12,120
recommendation systems for websites, and much more.

19
00:01:12,121 --> 00:01:15,308
It is not as popular as supervised and

20
00:01:15,309 --> 00:01:19,932
unsupervised learning, but it is gaining momentum as

21
00:01:19,933 --> 00:01:23,858
ML practitioners try different approaches to handle

22
00:01:23,859 --> 00:01:26,410
all kinds of complex tasks.

23
00:01:26,411 --> 00:01:29,904
Let's take the example of playing a chess game.

24
00:01:29,905 --> 00:01:33,952
The objective is to win the game by deciding how

25
00:01:33,953 --> 00:01:38,224
to play multiple turns that are correlated with each other.
26
00:01:38,225 --> 00:01:41,738
Every move we would like to play has thousands

27
00:01:41,739 --> 00:01:46,948
of future options to consider, while also anticipating

28
00:01:46,949 --> 00:01:49,316
what the other player is going to do.

29
00:01:49,317 --> 00:01:52,470
And every move the two players make

30
00:01:52,471 --> 00:01:55,304
changes the ongoing state of the game.

31
00:01:55,305 --> 00:01:58,232
They influence the environment, which is the

32
00:01:58,233 --> 00:02:01,646
game board, by taking a sequence of actions.

33
00:02:01,647 --> 00:02:04,728
It is a dynamic environment. To

34
00:02:04,729 --> 00:02:07,607
make it even more complicated,

35
00:02:07,690 --> 00:02:12,978
sometimes the results of actions are delayed.

36
00:02:12,979 --> 00:02:16,450
We decide to play the game with some strategy,

37
00:02:16,451 --> 00:02:19,664
and only later in the game will we know

38
00:02:19,665 --> 00:02:22,704
whether we made a good decision or not.

39
00:02:22,705 --> 00:02:26,192
The feedback is delayed, and it is

40
00:02:26,193 --> 00:02:30,612
sometimes difficult to understand which actions led

41
00:02:30,613 --> 00:02:33,710
to which outcomes over multiple steps.

42
00:02:34,370 --> 00:02:38,852
This type of task, which involves some level of

43
00:02:38,853 --> 00:02:43,358
bidirectional interaction between a machine and the environment,

44
00:02:43,359 --> 00:02:46,648
does not fit easily into what we have covered so

45
00:02:46,649 --> 00:02:50,376
far under supervised or unsupervised learning.

46
00:02:50,377 --> 00:02:55,842
We can't use techniques here like classification, clustering,

47
00:02:55,843 --> 00:02:59,052
or making predictions based on historical data.
48
00:02:59,053 --> 00:03:02,716
The way to handle this kind of task is

49
00:03:02,717 --> 00:03:07,628
by using the concept of reinforcement learning, and the

50
00:03:07,629 --> 00:03:11,984
best place to find an example of a system

51
00:03:11,985 --> 00:03:16,144
that can interact with its environment is to look at

52
00:03:16,145 --> 00:03:21,204
what Mother Nature developed over billions of years.

53
00:03:21,205 --> 00:03:25,732
The concept of reinforcement learning is very similar to

54
00:03:25,733 --> 00:03:31,220
the way humans and other animals learn, and

55
00:03:31,221 --> 00:03:34,264
some of the algorithms used in reinforcement learning

56
00:03:34,265 --> 00:03:39,220
were inspired by biological learning systems.

57
00:03:39,750 --> 00:03:42,872
Each one of us can be described as

58
00:03:42,873 --> 00:03:47,852
a sophisticated biological machine that interacts with the

59
00:03:47,853 --> 00:03:51,554
physical environment in an endless feedback loop.

60
00:03:51,555 --> 00:03:54,610
Almost every action we perform

61
00:03:54,611 --> 00:03:57,554
produces some kind of feedback.

62
00:03:57,555 --> 00:04:00,112
We try things, get feedback, and

63
00:04:00,113 --> 00:04:02,060
based on that feedback, we learn.

64
00:04:02,750 --> 00:04:05,100
Let me give you a very simple example.

65
00:04:05,630 --> 00:04:08,886
If I try to pick up a 20-kilogram

66
00:04:08,887 --> 00:04:12,788
weight in the gym for freestyle training,

67
00:04:12,789 --> 00:04:16,387
then the immediate feedback will be that it's too heavy for

68
00:04:16,388 --> 00:04:20,516
me to exercise with. So I can decide to try a much lighter

69
00:04:20,517 --> 00:04:24,286
weight and drop down, for example, to 10 kilograms.

70
00:04:24,287 --> 00:04:26,216
And maybe the feedback will be

71
00:04:26,217 --> 00:04:28,280
that it's too light for me.
72
00:04:28,281 --> 00:04:30,622
Again, based on the feedback, I can raise

73
00:04:30,623 --> 00:04:35,016
it to 15 kilograms and continue making

74
00:04:35,017 --> 00:04:39,212
those adjustments until I find the weight that

75
00:04:39,213 --> 00:04:42,258
is perfect for my training goals.

76
00:04:42,259 --> 00:04:45,100
How did I know which one was the best?

77
00:04:45,101 --> 00:04:46,810
Well, I didn't.

78
00:04:46,811 --> 00:04:50,960
I tried a few options and learned from

79
00:04:50,961 --> 00:04:54,448
the experience based on the feedback I got

80
00:04:54,449 --> 00:04:56,540
while trying each of those options.

81
00:04:57,070 --> 00:05:00,996
Now, many things we learn during our

82
00:05:00,997 --> 00:05:04,490
lives are based on such a continuous

83
00:05:04,491 --> 00:05:08,052
feedback loop, based on actual experience.

84
00:05:08,053 --> 00:05:11,204
Think about how you learned to drive a car.

85
00:05:11,205 --> 00:05:13,492
We can't learn how to drive

86
00:05:13,493 --> 00:05:16,270
only by reading a user guide.

87
00:05:16,271 --> 00:05:18,808
There are, of course, basic rules we need to

88
00:05:18,809 --> 00:05:21,640
learn and follow while driving on the road,

89
00:05:21,641 --> 00:05:24,520
but the actual part of operating a real

90
00:05:24,521 --> 00:05:28,892
car and handling a variety of road situations

91
00:05:28,893 --> 00:05:31,930
is something that we must learn from experience:

92
00:05:31,931 --> 00:05:34,908
the interaction with the car as a

93
00:05:34,909 --> 00:05:37,698
machine that we need to operate, the interaction

94
00:05:37,699 --> 00:05:42,730
with the road conditions, and the interaction with other drivers.

95
00:05:42,731 --> 00:05:48,044
Learning from interaction is a fundamental idea

96
00:05:48,045 --> 00:05:50,868
in our daily lives, and it is

97
00:05:50,869 --> 00:05:54,708
a great analogy for reinforcement learning.

98
00:05:54,709 --> 00:05:57,563
Learning from interaction.
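[Editor's note: the gym-weight story above is a try → feedback → adjust loop. The sketch below, which is not part of the lecture, illustrates that loop in Python; the ideal weight of 15 kg and the "too heavy" / "too light" feedback signals are illustrative assumptions, not anything the lecture defines.]

```python
# A minimal sketch of the trial-and-feedback loop from the gym example:
# try an action (pick a weight), observe feedback from the environment,
# adjust, and repeat. The ideal weight (15 kg) is an assumed value.

def feedback(weight, ideal=15):
    """Environment: reports how the chosen weight feels."""
    if weight > ideal:
        return "too heavy"
    if weight < ideal:
        return "too light"
    return "just right"

def find_best_weight(low=0, high=20):
    """Agent: narrows the range based on feedback (20 -> 10 -> 15)."""
    while low <= high:
        guess = (low + high) // 2   # try a weight in the middle
        signal = feedback(guess)    # observe the environment's feedback
        if signal == "just right":
            return guess
        if signal == "too heavy":
            high = guess - 1        # adjust: try lighter next time
        else:
            low = guess + 1         # adjust: try heavier next time
    return (low + high) // 2

print(find_best_weight())  # converges to 15 under these assumptions
```

As in the lecture's story, the agent never knows the best weight in advance; it only discovers it through repeated interaction with the environment's feedback.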