So, going back to machine learning. Reinforcement learning is a method used to let machines learn how to behave through interaction with an environment while pursuing some end goal. We need to define this end goal, such as winning a chess game, but we don't need to tell the machines which actions to take. The machines must discover for themselves which actions will help achieve the goal. They can select their actions from a space of possible options. These algorithms are penalized when they make wrong decisions and rewarded when they make right ones. A visual way to describe a system that uses reinforcement learning is with two building blocks: a learning agent, which represents the machine, and the outside environment. The learning agent must be able to sense the state of the environment to some degree and to take actions that can influence that state. To be clear, the agent is not necessarily a physical, fully functional robot or anything like that. The agent can be a subcomponent in a larger system, or some software model.
As part of a sequence of interactions, the agent decides which actions to perform on the environment. Those actions will, of course, change the state of the environment, and the new state will then be translated into some numerical reward value that is used as a feedback signal to the agent. The idea is that this reward signal helps the agent navigate and understand which actions will help achieve the goal. It works like a feedback loop, helping the agent learn from its own experience and then select the next best strategy to get the most reward over time. Using the chess game example again: the chess-playing agent plays on a game board and performs a series of moves while making decisions. Those moves are the actions performed by the agent on the environment. The environment, in our example, is the game board. The goal of the game is winning, so the agent is rewarded for winning a game. Let's use a new diagram and some simple math to describe this situation. The process of learning here is based on multiple steps along a time dimension. Time is represented by the lowercase letter t, as in t0, t1, t2, t3, and so on.
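The feedback loop described here can be sketched in a few lines of Python. Note that the `Environment` and `Agent` classes below are illustrative stand-ins invented for this sketch (a toy number-guessing task), not a real library:

```python
import random

class Environment:
    """A toy environment: the agent must guess a hidden number (0-9)."""
    def __init__(self):
        self.target = random.randrange(10)
        self.state = None  # feedback from the last guess

    def step(self, action):
        """Apply an action; return the new state and a numerical reward."""
        self.state = "correct" if action == self.target else "wrong"
        reward = 1.0 if action == self.target else -0.1
        return self.state, reward

class Agent:
    """Picks actions at random; a learning agent would use the rewards."""
    def select_action(self, state):
        return random.randrange(10)

    def observe(self, state, reward):
        pass  # a learning agent would update its strategy here

# The interaction loop: action -> new state -> reward -> feedback to agent
env, agent = Environment(), Agent()
state = None
for t in range(100):
    action = agent.select_action(state)
    state, reward = env.step(action)
    agent.observe(state, reward)
```

The loop body is the whole picture: the agent acts, the environment changes state, and a numerical reward flows back as feedback.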
At each time step t, the agent first receives and analyzes the state of the environment, represented by the capital letter S at that particular time t. Using this information, combined with the knowledge gained so far, it selects some action, represented by the capital letter A at the same time t. One step later, at t plus 1, as a consequence of its action, the agent receives or calculates a numerical reward signal, called capital R at t plus 1, and it finds itself in a new state, S at t plus 1. So we get groups like S0, A0, R1; S1, A1, R2; and so on. This is the whole idea of a sequence of state, action, and reward. Assuming the agent has just started to interact with the environment, how does it decide which action to take next? Well, it is similar to learning something by trial and error. You can't teach a child how to ride a bike by explaining the rules of riding a bike. The child will learn by trying many times and learning from each experience.
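The indexing convention just described, where the action at time t produces the reward at time t plus 1, can be made concrete with a tiny trajectory generator. The states, actions, and rewards here are arbitrary toy values chosen only to show the S0, A0, R1, S1, A1, R2 pattern:

```python
import random

def rollout(steps=3, seed=0):
    """Generate a toy trajectory of (S_t, A_t, R_{t+1}) triples.

    States and actions are just small integers; the reward observed
    at t+1 depends on the action taken at t.
    """
    rng = random.Random(seed)
    state = 0  # S0
    trajectory = []
    for t in range(steps):
        action = rng.randrange(2)             # A_t
        reward = 1.0 if action == 1 else 0.0  # R_{t+1}
        trajectory.append((state, action, reward))
        state = state + action                # S_{t+1}
    return trajectory

for t, (s, a, r) in enumerate(rollout()):
    print(f"S{t}={s}, A{t}={a}, R{t+1}={r}")
```

Printing the triples makes the off-by-one visible: the reward paired with S0 and A0 is R1, the one paired with S1 and A1 is R2, and so on.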
Reinforcement learning builds a prediction model by gaining feedback from random trial and error, and by leveraging the cumulative insight collected from previous interactions. In our chess game example, the agent starts to play without knowing anything, exploring the space of options and then taking actions. During the first games it will be a very bad player, and as a result it will get strong negative feedback while losing games. Now, the algorithm running inside the agent is trying to maximize reward, meaning winning the game. So it will try different strategies, as part of the trial-and-error method, to make better decisions. Some of those actions will eventually lead to better results, and the agent will learn from that cumulative experience. So this is the concept of reinforcement learning. Reinforcement learning is used in applications where the machine must make a sequence of decisions, and those decisions come with positive or negative consequences that are collected as feedback. The feedback going back to the agent is used to learn from experience and to get better and better with each iteration.
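This trial-and-error idea, trying things at random at first and gradually favoring what has worked, is often implemented with an epsilon-greedy rule. The following is a minimal sketch on a made-up three-action problem (a toy bandit, not a chess engine); the reward values are invented for illustration:

```python
import random

random.seed(42)

# Toy problem: three actions with different hidden average rewards.
true_means = [0.2, 0.5, 0.8]

def pull(action):
    """Noisy reward for an action; the agent never sees true_means."""
    return true_means[action] + random.uniform(-0.1, 0.1)

q = [0.0, 0.0, 0.0]   # the agent's estimated value of each action
counts = [0, 0, 0]
epsilon = 0.1          # fraction of the time spent exploring at random

for t in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)   # explore: try something random
    else:
        action = q.index(max(q))       # exploit: best estimate so far
    reward = pull(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    q[action] += (reward - q[action]) / counts[action]

print("estimates:", [round(v, 2) for v in q])
```

After enough iterations the estimates settle near the hidden averages, and the agent picks the best action most of the time, which is exactly the "get better with each iteration" behavior described above.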
It's like playing a chess game thousands of times. Sometimes you win, sometimes you lose, but in every game you learn something and get better. The cumulative knowledge of how to achieve a specific goal is reinforced again and again by experience. Now we know why it is called reinforcement learning.