So, going back to machine learning. Reinforcement learning is a method used to let machines learn how to behave through interaction with an environment while pursuing some end goal. We need to define this end goal, such as winning a chess game, but we don't need to tell the machines which actions to take. The machines must discover for themselves which actions will help achieve the goal. They can select their actions from a space of possible options. These algorithms are penalized when they make wrong decisions and rewarded when they make right ones. A visual way to describe a system that uses reinforcement learning is with two building blocks: a learning agent, which represents the machine, and the outside environment. The learning agent must be able to sense the state of the environment to some degree and to take actions that can influence that state. To be clear, the agent is not necessarily a physical, fully functional robot or anything like that. The agent can be a subcomponent in a larger system, or some software model.
As part of a sequence of interactions, the agent decides which actions to perform on the environment. Those actions will, of course, change the state of the environment, and the new state will then be translated into some numerical reward value that is used as a feedback signal to the agent. The idea is that this reward signal helps the agent navigate and understand which actions will help achieve the goal. It works like a feedback loop, helping the agent learn from its own experience and then select the next best strategy to get the most reward over time. Using the chess game example again: the chess-playing agent plays on a game board and performs a series of moves while making decisions. Those moves are the actions performed by the agent on the environment. The environment, in our example, is the game board. The goal of the game is winning, so the agent is rewarded for winning a game. Let's use a new diagram and some simple math to describe this situation. The process of learning here is based on multiple steps along a time dimension. Time is represented by the lowercase letter t, as in t0, t1, t2, t3, and so on.
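The feedback loop described here can be sketched in a few lines of Python. Note that the `Environment` and `Agent` classes below are illustrative stand-ins invented for this sketch (a toy number-guessing task), not a real library:

```python
import random

class Environment:
    """A toy environment: the agent must guess a hidden number (0-9)."""
    def __init__(self):
        self.target = random.randrange(10)
        self.state = None  # feedback from the last guess

    def step(self, action):
        """Apply an action; return the new state and a numerical reward."""
        self.state = "correct" if action == self.target else "wrong"
        reward = 1.0 if action == self.target else -0.1
        return self.state, reward

class Agent:
    """Picks actions at random; a learning agent would use the rewards."""
    def select_action(self, state):
        return random.randrange(10)

    def observe(self, state, reward):
        pass  # a learning agent would update its strategy here

# The interaction loop: action -> new state -> reward -> feedback to agent
env, agent = Environment(), Agent()
state = None
for t in range(100):
    action = agent.select_action(state)
    state, reward = env.step(action)
    agent.observe(state, reward)
```

The loop body is the whole picture: the agent acts, the environment changes state, and a numerical reward flows back as feedback.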
At each time step t, the agent first receives and analyzes the state of the environment, represented by the capital letter S at that particular time t. Using this information, combined with the knowledge gained so far, it selects some action, represented by the capital letter A at the same time t. One step later, at t plus 1, as a consequence of its action, the agent receives or calculates a numerical reward signal, called capital R at t plus 1, and it finds itself in a new state, S at t plus 1. So we get groups like S0, A0, R1; S1, A1, R2; and so on. This is the whole idea of a sequence of state, action, and reward. Assuming the agent has just started to interact with the environment, how does it decide which action to take next? Well, it is similar to learning something by trial and error. You can't teach a child how to ride a bike by explaining the rules of riding a bike. The child will learn by trying many times and learning from each experience.
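The indexing convention just described, where the action at time t produces the reward at time t plus 1, can be made concrete with a tiny trajectory generator. The states, actions, and rewards here are arbitrary toy values chosen only to show the S0, A0, R1, S1, A1, R2 pattern:

```python
import random

def rollout(steps=3, seed=0):
    """Generate a toy trajectory of (S_t, A_t, R_{t+1}) triples.

    States and actions are just small integers; the reward observed
    at t+1 depends on the action taken at t.
    """
    rng = random.Random(seed)
    state = 0  # S0
    trajectory = []
    for t in range(steps):
        action = rng.randrange(2)             # A_t
        reward = 1.0 if action == 1 else 0.0  # R_{t+1}
        trajectory.append((state, action, reward))
        state = state + action                # S_{t+1}
    return trajectory

for t, (s, a, r) in enumerate(rollout()):
    print(f"S{t}={s}, A{t}={a}, R{t+1}={r}")
```

Printing the triples makes the off-by-one visible: the reward paired with S0 and A0 is R1, the one paired with S1 and A1 is R2, and so on.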
Reinforcement learning builds a prediction model by gaining feedback from random trial and error, and by leveraging the cumulative insight collected from previous interactions. In our chess game example, the agent starts to play without knowing anything, exploring the space of options and then taking actions. During the first games it will be a very bad player, and as a result it will get strong negative feedback while losing games. Now, the algorithm running inside the agent is trying to maximize reward, meaning winning the game. So it will try different strategies, as part of the trial-and-error method, to make better decisions. Some of those actions will eventually lead to better results, and the agent will learn from that cumulative experience. So this is the concept of reinforcement learning. Reinforcement learning is used in applications where the machine must make a sequence of decisions, and those decisions come with positive or negative consequences that are collected as feedback. The feedback going back to the agent is used to learn from experience and to get better and better with each iteration.
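This trial-and-error idea, trying things at random at first and gradually favoring what has worked, is often implemented with an epsilon-greedy rule. The following is a minimal sketch on a made-up three-action problem (a toy bandit, not a chess engine); the reward values are invented for illustration:

```python
import random

random.seed(42)

# Toy problem: three actions with different hidden average rewards.
true_means = [0.2, 0.5, 0.8]

def pull(action):
    """Noisy reward for an action; the agent never sees true_means."""
    return true_means[action] + random.uniform(-0.1, 0.1)

q = [0.0, 0.0, 0.0]   # the agent's estimated value of each action
counts = [0, 0, 0]
epsilon = 0.1          # fraction of the time spent exploring at random

for t in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)   # explore: try something random
    else:
        action = q.index(max(q))       # exploit: best estimate so far
    reward = pull(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    q[action] += (reward - q[action]) / counts[action]

print("estimates:", [round(v, 2) for v in q])
```

After enough iterations the estimates settle near the hidden averages, and the agent picks the best action most of the time, which is exactly the "get better with each iteration" behavior described above.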
It's like playing a chess game thousands of times. Sometimes you win, sometimes you lose, but in every game you learn something and get better. The cumulative knowledge of how to achieve a specific goal is reinforced again and again by experience. Now we know why it is called reinforcement learning.