Clustering is probably one of the most common use cases of unsupervised learning. It is the task of identifying similar instances with shared attributes in a dataset and grouping them together into clusters: grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The output of the algorithm will be a set of labels, assigning each data point to one of the identified clusters.

Take a look at this graph, which represents the many data points we have in an unlabeled dataset. We can easily see that the points are denser in specific areas, meaning they may have something in common. This is where clustering algorithms can do their job: they can use the input features of the dataset and automatically assign each data point to a cluster.

Let's color the data points in three different colors: red, green, blue. In that scenario, the clustering algorithm will find those three clusters and automatically label them with something called a cluster ID: cluster number 1, cluster number 2, cluster number 3.
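The narration doesn't name a specific algorithm, but the process it describes — finding dense areas and handing each point a cluster ID — can be sketched with a minimal k-means implementation. Everything below (the `kmeans` function, the farthest-point initialization, the toy data) is an illustrative assumption, not something from the video:

```python
from math import dist

def kmeans(points, k, iters=20):
    """Cluster 2-D points and return one cluster ID (0..k-1) per point."""
    # Farthest-point initialization: deterministic, and it spreads the
    # starting centroids across the dense areas.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assignment step: the nearest centroid gives the cluster ID.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers

# Three dense areas, like the red/green/blue groups in the narration.
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # one dense area
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),   # another
        (0.1, 5.0), (0.0, 5.2), (0.2, 4.9)]   # a third
labels, centers = kmeans(data, k=3)
print(labels)  # → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each point ends up with one of the three cluster IDs, exactly the kind of output the narration describes: the labels came from the data's own structure, with no ground truth provided.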
So after clustering, each cluster is assigned a unique number called the cluster ID, and each data point, or instance, is assigned to one cluster ID. This kind of information, identified automatically by the algorithm, may be a useful insight about the dataset that we can use.

Clustering is used in a wide variety of use cases in the industry. For example, it is used for market segmentation, also called customer segmentation. All businesses today would like to better know and understand their customers: who they are and what's driving their purchase decisions. This kind of segmentation can help to adapt products, services, and also marketing campaigns to each identified segment. For example, suppose a business has data about customers, such as demographic information and their historic purchasing behavior. A clustering algorithm can identify subsegments of the whole market where a particular type of product is very successful, helping to design a focused marketing message for that specific segment.

Another interesting use case is called anomaly detection, or outlier detection.
For example, consider a scenario where you need to detect defects in the manufacturing process of some product. There will be all kinds of sensors measuring different physical characteristics of the products, and you can then run a clustering algorithm to find data points that are too far from the center of a specific cluster, which makes them look like an anomaly: the size of that product, the boundary of that product, and so on. Another example would be taking pictures of a product during the manufacturing process and then trying to identify products with defects using, again, the method of clustering.

The third one that I would like to mention is called semi-supervised learning. This is a method that sits between supervised learning and unsupervised learning. The idea here is that we can run a clustering algorithm on an unlabeled dataset, which will create a few clusters as labels, like cluster number one, number two, et cetera. Then I will get a very small number of clusters that I can manually label, like: this cluster is red, this cluster is blue, this cluster is green, or whatever criteria I would like to use.
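The anomaly-detection idea above — flag any point that lies too far from the center of its own cluster — can be sketched in a few lines. The `flag_anomalies` helper, the sensor readings, and the distance threshold are all hypothetical values chosen for illustration:

```python
from math import dist

def flag_anomalies(points, labels, centers, threshold):
    """Flag points that lie too far from the center of their own cluster."""
    return [p for p, lab in zip(points, labels)
            if dist(p, centers[lab]) > threshold]

# Sensor readings (say, size and weight) that have already been clustered.
centers = [(1.0, 1.0), (5.0, 5.0)]
points  = [(1.1, 0.9), (0.9, 1.0), (4.9, 5.1), (5.0, 4.8), (3.0, 9.0)]
labels  = [0, 0, 1, 1, 1]   # (3.0, 9.0) was lumped into cluster 1

print(flag_anomalies(points, labels, centers, threshold=1.5))
# → [(3.0, 9.0)] — far from its cluster center, so it looks like a defect
```

The threshold is the tuning knob here: too small and normal variation gets flagged, too large and real defects slip through.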
And then I can propagate those labels to all the instances in the same cluster. Now, suddenly, I have a labeled dataset that can be used for training a model with supervised learning. A very interesting approach.

Let's move to the next common task in unsupervised learning, called dimensionality reduction.
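The propagation step described above is mechanically simple: map each instance's cluster ID to the label that was chosen by hand for that cluster. This is a minimal sketch with made-up cluster IDs and labels, just to show the shape of the operation:

```python
def propagate_labels(cluster_ids, manual_labels):
    """Spread a handful of hand-picked cluster labels to every instance."""
    return [manual_labels[cid] for cid in cluster_ids]

# One cluster ID per instance, produced by a clustering algorithm.
cluster_ids = [0, 0, 1, 2, 1, 0, 2]
# One manual label per cluster — the only hand labeling required.
manual_labels = {0: "red", 1: "blue", 2: "green"}

print(propagate_labels(cluster_ids, manual_labels))
# → ['red', 'red', 'blue', 'green', 'blue', 'red', 'green']
```

Labeling three clusters by hand produced labels for all seven instances; on a real dataset the same three manual decisions could label thousands of points, which is what makes this semi-supervised shortcut attractive.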