1
00:00:00,000 --> 00:00:03,882
The last learning type we will encounter

2
00:00:03,883 --> 00:00:07,572
in machine learning is called reinforcement learning,

3
00:00:07,573 --> 00:00:11,722
and it is a completely different approach compared

4
00:00:11,723 --> 00:00:15,892
to supervised or unsupervised learning.

5
00:00:15,893 --> 00:00:19,316
In reinforcement learning, we are not using a group of

6
00:00:19,317 --> 00:00:25,012
labeled or unlabeled examples as input to train a model.

7
00:00:25,013 --> 00:00:29,026
It may sound a little strange,

8
00:00:29,027 --> 00:00:32,040
but don't worry, it will be clear in a few minutes.

9
00:00:32,650 --> 00:00:36,450
This method is used as a framework

10
00:00:36,451 --> 00:00:40,774
for goal-based decision-making tasks.

11
00:00:40,775 --> 00:00:43,622
It can be used to achieve a complex

12
00:00:43,623 --> 00:00:48,570
objective by performing multiple sequences of actions.

13
00:00:49,550 --> 00:00:53,124
For example, it is widely used in building

14
00:00:53,125 --> 00:00:55,972
AI systems for playing all kinds of computer

15
00:00:55,973 --> 00:01:00,778
games while trying to achieve superhuman performance.

16
00:01:00,779 --> 00:01:04,760
It is used for teaching robots to perform

17
00:01:04,761 --> 00:01:09,064
tasks in dynamic environments, building real-time

18
00:01:09,065 --> 00:01:12,120
recommendation systems for websites, and much more.

19
00:01:12,121 --> 00:01:15,308
It is not as popular as supervised and

20
00:01:15,309 --> 00:01:19,932
unsupervised learning, but it is gaining momentum as

21
00:01:19,933 --> 00:01:23,858
ML practitioners try different approaches to handle

22
00:01:23,859 --> 00:01:26,410
all kinds of complex tasks.

23
00:01:26,411 --> 00:01:29,904
Let's take the example of playing a chess game.

24
00:01:29,905 --> 00:01:33,952
The objective is to win the game by deciding how

25
00:01:33,953 --> 00:01:38,224
to play multiple turns that are correlated with each other.
26
00:01:38,225 --> 00:01:41,738
Every move we would like to play has thousands

27
00:01:41,739 --> 00:01:46,948
of future options to consider, while also anticipating

28
00:01:46,949 --> 00:01:49,316
what the other player is going to do.

29
00:01:49,317 --> 00:01:52,470
And every move the two players make

30
00:01:52,471 --> 00:01:55,304
changes the ongoing state of the game.

31
00:01:55,305 --> 00:01:58,232
They influence the environment, which is the

32
00:01:58,233 --> 00:02:01,646
game board, by taking a sequence of actions.

33
00:02:01,647 --> 00:02:04,728
It is a dynamic environment. To

34
00:02:04,729 --> 00:02:07,607
make it even more complicated,

35
00:02:07,690 --> 00:02:12,978
sometimes the results of actions are delayed.

36
00:02:12,979 --> 00:02:16,450
We decide to play the game with some strategy,

37
00:02:16,451 --> 00:02:19,664
and only later in the game will we know

38
00:02:19,665 --> 00:02:22,704
whether we made a good decision or not.

39
00:02:22,705 --> 00:02:26,192
The feedback is delayed, and it is

40
00:02:26,193 --> 00:02:30,612
sometimes difficult to understand which actions led

41
00:02:30,613 --> 00:02:33,710
to which outcomes over multiple steps.

42
00:02:34,370 --> 00:02:38,852
This type of task, which involves some level of

43
00:02:38,853 --> 00:02:43,358
bidirectional interaction between a machine and the environment,

44
00:02:43,359 --> 00:02:46,648
does not fit easily into what we have covered so

45
00:02:46,649 --> 00:02:50,376
far under supervised or unsupervised learning.

46
00:02:50,377 --> 00:02:55,842
We can't use techniques here like classification, clustering,

47
00:02:55,843 --> 00:02:59,052
or making predictions based on historical data.
48
00:02:59,053 --> 00:03:02,716
The way to handle this kind of task is

49
00:03:02,717 --> 00:03:07,628
by using the concept of reinforcement learning, and the

50
00:03:07,629 --> 00:03:11,984
best place to find an example of a system

51
00:03:11,985 --> 00:03:16,144
that can interact with its environment is to look at

52
00:03:16,145 --> 00:03:21,204
what Mother Nature developed over billions of years.

53
00:03:21,205 --> 00:03:25,732
The concept of reinforcement learning is very similar to

54
00:03:25,733 --> 00:03:31,220
the way humans and other animals learn, and

55
00:03:31,221 --> 00:03:34,264
some of the algorithms used in reinforcement learning

56
00:03:34,265 --> 00:03:39,220
were inspired by biological learning systems.

57
00:03:39,750 --> 00:03:42,872
Each one of us can be described as

58
00:03:42,873 --> 00:03:47,852
a sophisticated biological machine that interacts with the

59
00:03:47,853 --> 00:03:51,554
physical environment in an endless feedback loop.

60
00:03:51,555 --> 00:03:54,610
Almost every action we perform

61
00:03:54,611 --> 00:03:57,554
produces some kind of feedback.

62
00:03:57,555 --> 00:04:00,112
We try things, get feedback, and

63
00:04:00,113 --> 00:04:02,060
based on that feedback, we learn.

64
00:04:02,750 --> 00:04:05,100
Let me give you a very simple example.

65
00:04:05,630 --> 00:04:08,886
If I try to pick up a 20-kilogram

66
00:04:08,887 --> 00:04:12,788
weight in the gym for freestyle training,

67
00:04:12,789 --> 00:04:16,387
then the immediate feedback will be that it's too heavy for

68
00:04:16,388 --> 00:04:20,516
me to exercise with. So I can decide to try a much lighter

69
00:04:20,517 --> 00:04:24,286
weight and drop down, for example, to 10 kilograms.

70
00:04:24,287 --> 00:04:26,216
And maybe the feedback will be

71
00:04:26,217 --> 00:04:28,280
that it's too light for me.
72
00:04:28,281 --> 00:04:30,622
Again, based on the feedback, I can raise

73
00:04:30,623 --> 00:04:35,016
it to 15 kilograms and continue making

74
00:04:35,017 --> 00:04:39,212
those adjustments until I find the weight that

75
00:04:39,213 --> 00:04:42,258
is perfect for my training goals.

76
00:04:42,259 --> 00:04:45,100
How did I know which one was the best?

77
00:04:45,101 --> 00:04:46,810
Well, I didn't.

78
00:04:46,811 --> 00:04:50,960
I tried a few options and learned from

79
00:04:50,961 --> 00:04:54,448
the experience based on the feedback I got

80
00:04:54,449 --> 00:04:56,540
while trying each of those options.

81
00:04:57,070 --> 00:05:00,996
Now, many things we learn during our

82
00:05:00,997 --> 00:05:04,490
lives are based on such a continuous

83
00:05:04,491 --> 00:05:08,052
feedback loop, based on actual experience.

84
00:05:08,053 --> 00:05:11,204
Think about how you learned to drive a car.

85
00:05:11,205 --> 00:05:13,492
We can't learn how to drive

86
00:05:13,493 --> 00:05:16,270
only by reading a user guide.

87
00:05:16,271 --> 00:05:18,808
There are, of course, basic rules we need to

88
00:05:18,809 --> 00:05:21,640
learn and follow while driving on the road,

89
00:05:21,641 --> 00:05:24,520
but the actual part of operating a real

90
00:05:24,521 --> 00:05:28,892
car and handling a variety of road situations

91
00:05:28,893 --> 00:05:31,930
is something that we must learn from experience:

92
00:05:31,931 --> 00:05:34,908
the interaction with the car as a

93
00:05:34,909 --> 00:05:37,698
machine that we need to operate, the interaction

94
00:05:37,699 --> 00:05:42,730
with the road conditions, and the interaction with other drivers.

95
00:05:42,731 --> 00:05:48,044
Learning from interaction is a fundamental idea

96
00:05:48,045 --> 00:05:50,868
in our daily lives, and it is

97
00:05:50,869 --> 00:05:54,708
a great analogy for reinforcement learning.

98
00:05:54,709 --> 00:05:57,563
Learning from interaction.
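[Editor's note: the gym-weight story above is a try → feedback → adjust loop. The sketch below, which is not part of the lecture, illustrates that loop in Python; the ideal weight of 15 kg and the "too heavy" / "too light" feedback signals are illustrative assumptions, not anything the lecture defines.]

```python
# A minimal sketch of the trial-and-feedback loop from the gym example:
# try an action (pick a weight), observe feedback from the environment,
# adjust, and repeat. The ideal weight (15 kg) is an assumed value.

def feedback(weight, ideal=15):
    """Environment: reports how the chosen weight feels."""
    if weight > ideal:
        return "too heavy"
    if weight < ideal:
        return "too light"
    return "just right"

def find_best_weight(low=0, high=20):
    """Agent: narrows the range based on feedback (20 -> 10 -> 15)."""
    while low <= high:
        guess = (low + high) // 2   # try a weight in the middle
        signal = feedback(guess)    # observe the environment's feedback
        if signal == "just right":
            return guess
        if signal == "too heavy":
            high = guess - 1        # adjust: try lighter next time
        else:
            low = guess + 1         # adjust: try heavier next time
    return (low + high) // 2

print(find_best_weight())  # converges to 15 under these assumptions
```

As in the lecture's story, the agent never knows the best weight in advance; it only discovers it through repeated interaction with the environment's feedback.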