Clustering is probably one of the most common use cases of unsupervised learning. It is the task of identifying similar instances with shared attributes in a dataset and grouping them together into clusters: grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The output of the algorithm will be a set of labels, assigning each data point to one of the identified clusters.

Take a look at this graph, which represents the many data points we have in an unlabeled dataset. We can easily see that the points are denser in specific areas, meaning they may have something in common. This is where clustering algorithms can do their job: they can use the input features of the dataset and automatically assign each data point to a cluster.

Let's color the data points in three different colors: red, green, blue. In that scenario, the clustering algorithm will find those three clusters and automatically label them with something called a cluster ID: cluster number 1, cluster number 2, cluster number 3.
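The narration doesn't name a specific algorithm, but the process it describes — finding dense areas and handing each point a cluster ID — can be sketched with a minimal k-means implementation. Everything below (the `kmeans` function, the farthest-point initialization, the toy data) is an illustrative assumption, not something from the video:

```python
from math import dist

def kmeans(points, k, iters=20):
    """Cluster 2-D points and return one cluster ID (0..k-1) per point."""
    # Farthest-point initialization: deterministic, and it spreads the
    # starting centroids across the dense areas.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assignment step: the nearest centroid gives the cluster ID.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers

# Three dense areas, like the red/green/blue groups in the narration.
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # one dense area
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),   # another
        (0.1, 5.0), (0.0, 5.2), (0.2, 4.9)]   # a third
labels, centers = kmeans(data, k=3)
print(labels)  # → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each point ends up with one of the three cluster IDs, exactly the kind of output the narration describes: the labels came from the data's own structure, with no ground truth provided.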
So after clustering, each cluster is assigned a unique number called the cluster ID, and each data point, or instance, is assigned to one cluster ID. This kind of information, identified automatically by the algorithm, may be a useful insight about the dataset that we can use.

Clustering is used in a wide variety of use cases in the industry. For example, it is used for market segmentation, also called customer segmentation. All businesses today would like to better know and understand their customers: who they are and what's driving their purchase decisions. This kind of segmentation can help to adapt products, services, and also marketing campaigns to each identified segment. For example, suppose a business has data about customers, such as demographic information and their historic purchasing behavior. A clustering algorithm can identify subsegments of the whole market where a particular type of product is very successful, helping to design a focused marketing message for that specific segment.

Another interesting use case is called anomaly detection, or outlier detection.
For example, consider a scenario where you need to detect defects in the manufacturing process of some product. There will be all kinds of sensors measuring different physical characteristics of the products, and you can then run a clustering algorithm to find data points that are too far from the center of a specific cluster, which makes them look like an anomaly: the size of that product, the boundary of that product, and so on. Another example would be taking pictures of a product during the manufacturing process and then trying to identify products with defects using, again, the method of clustering.

The third one that I would like to mention is called semi-supervised learning. This is a method that sits between supervised learning and unsupervised learning. The idea here is that we can run a clustering algorithm on an unlabeled dataset, which will create a few clusters as labels, like cluster number one, number two, et cetera. Then I will get a very small number of clusters that I can manually label, like: this cluster is red, this cluster is blue, this cluster is green, or whatever criteria I would like to use.
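The anomaly-detection idea above — flag any point that lies too far from the center of its own cluster — can be sketched in a few lines. The `flag_anomalies` helper, the sensor readings, and the distance threshold are all hypothetical values chosen for illustration:

```python
from math import dist

def flag_anomalies(points, labels, centers, threshold):
    """Flag points that lie too far from the center of their own cluster."""
    return [p for p, lab in zip(points, labels)
            if dist(p, centers[lab]) > threshold]

# Sensor readings (say, size and weight) that have already been clustered.
centers = [(1.0, 1.0), (5.0, 5.0)]
points  = [(1.1, 0.9), (0.9, 1.0), (4.9, 5.1), (5.0, 4.8), (3.0, 9.0)]
labels  = [0, 0, 1, 1, 1]   # (3.0, 9.0) was lumped into cluster 1

print(flag_anomalies(points, labels, centers, threshold=1.5))
# → [(3.0, 9.0)] — far from its cluster center, so it looks like a defect
```

The threshold is the tuning knob here: too small and normal variation gets flagged, too large and real defects slip through.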
And then I can propagate those labels to all the instances in the same cluster. Now, suddenly, I have a labeled dataset that can be used for training a model with supervised learning. A very interesting approach.

Let's move to the next common task in unsupervised learning, called dimensionality reduction.
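The propagation step described above is mechanically simple: map each instance's cluster ID to the label that was chosen by hand for that cluster. This is a minimal sketch with made-up cluster IDs and labels, just to show the shape of the operation:

```python
def propagate_labels(cluster_ids, manual_labels):
    """Spread a handful of hand-picked cluster labels to every instance."""
    return [manual_labels[cid] for cid in cluster_ids]

# One cluster ID per instance, produced by a clustering algorithm.
cluster_ids = [0, 0, 1, 2, 1, 0, 2]
# One manual label per cluster — the only hand labeling required.
manual_labels = {0: "red", 1: "blue", 2: "green"}

print(propagate_labels(cluster_ids, manual_labels))
# → ['red', 'red', 'blue', 'green', 'blue', 'red', 'green']
```

Labeling three clusters by hand produced labels for all seven instances; on a real dataset the same three manual decisions could label thousands of points, which is what makes this semi-supervised shortcut attractive.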