K-means clustering
Difficulty
Beginner
Duration
55m
Students
613
Ratings
4.1/5
Description

This course covers the concept of unsupervised learning within the context of machine learning. You'll learn the fundamentals of unsupervised learning and how it differs from supervised learning. We'll cover the topics of clustering, k-means clustering (and its limitations), and dimensionality reduction.

Transcript

- Now that you've seen the idea behind clustering, perhaps you're wondering: how does the machine produce the clusters that we've been looking at? How does it give us those purple circles, which are more specifically known as distributions? How does it give us those distributions that describe the underlying groupings of observations? What is the algorithm? What is the approach? Well, actually there are quite a few of them. Perhaps the simplest, and a good place to start in terms of practice, is known as k-means clustering. And we need to say something about the name: what does "k-means" mean? But perhaps we'll get that just by having a dataset first and then thinking through the algorithm.
So let's get back to a dataset. Let's choose a different couple of features now. I'm going to go for x1 and x2 again, two-dimensional, just for the sake of giving a visual explanation. What should we go for now? We've got retail, we've got finance, we've got healthcare. Why don't we look at sleep? Let's have heart rate down here, and sleep length up here. So the total duration of sleep, and then the heart rate of the person sleeping. We're going to have a long sleep with a low heart rate, so up there. Presumably a long sleep with a low heart rate would be a good-quality sleep. Then we have a low heart rate with a short sleep. Let's call that being hung over, I imagine. And then we'll have a random mixture of groups all across the place like that, right?
Remember, the goal of clustering is to arrive at an understanding of how this data is grouped, which I'm representing here as these purple circles. So let's just draw a few on in just a second. Now, the question is: how is the algorithm going to produce these circles, or how is it going to produce an understanding of the layout of this data? So let's go back and talk through it.
So k-means clustering is a clustering algorithm in which you choose, at the beginning, the number of clusters, the number of groups that you expect to be there. The symbol we use is K. So K here is going to represent the number of groups we expect to be present. And here, well, should I cheat? I don't know how much I want to cheat here. Let's simplify this problem quite substantially and not go for five, because that would take quite a long time to do algorithmically. Let's just put in two groups here and say K equals two.
Now, what you do at the beginning of the process is you say how many groups you expect to find, here two. What the algorithm will do is place two points at random on this surface, which are candidate centers of clusters. Let me just explain that. I'm going to put two points on here at random: a point here in yellow, and another point in yellow. At the moment this is the hypothesis, if you like, that there is a group at yellow point one and another group at yellow point two. So this is the first step in the algorithmic process: to put K points onto the surface. Let's write down the steps. Step one: randomly place K centroids onto the feature space, which is the technical term, but here it's just the graph area. "Centroid" is the technical term for a center point. So here we place two center points.
Now, the way we allocate observations to these centers is that we draw a line between the two centroids and we split the surface perpendicular to that line. Let me show you: if I draw a line between these two points, what we're going to do is split the surface at the midpoint of the two. Each observation then belongs to the centroid on its side of the split, which is the centroid nearest to it. So we split the surface here, right.
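The first step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six observations are made-up numbers (heart rate in beats per minute, sleep length in hours), the random seed is arbitrary, and names such as `labels_split` are hypothetical. The sketch places K = 2 centroids at random, splits the surface with the perpendicular line through the midpoint, and checks that this gives the same allocation as "assign each point to its nearest centroid".

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Hypothetical observations: columns are (heart rate, sleep length in hours).
X = np.array([
    [52.0, 8.5], [53.0, 8.0], [51.0, 9.0],   # long sleep, low heart rate
    [53.0, 4.0], [51.0, 3.5], [52.0, 4.5],   # short sleep, low heart rate
])

K = 2

# Step one: place K centroids at random inside the range of the data.
low, high = X.min(axis=0), X.max(axis=0)
centroids = rng.uniform(low, high, size=(K, X.shape[1]))

# Split the surface with the line that is perpendicular to the segment
# joining the two centroids and passes through its midpoint.
c0, c1 = centroids
midpoint = (c0 + c1) / 2
direction = c1 - c0
labels_split = ((X - midpoint) @ direction > 0).astype(int)  # 1 = c1's side

# The same allocation, phrased as "nearest centroid".
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels_nearest = distances.argmin(axis=1)

print(labels_split)
print(labels_nearest)
```

The nearest-centroid form is the one that carries over when K is larger than two, where a single dividing line is no longer enough.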
That's actually disappointingly good, so let me try and make it a little bit worse, just to give you a sense that there might be a few more steps needed. So let's just shift the points over here and split it here. What you can see is that above the line there is a group of points, and below the line there is a group of points, but the group of points below the line is spread across the entire feature space. So let's have a look at that. Above the line we have just a cluster over here. That's not too bad. But below the line we have this full range, this huge range of points. What does that tell us? It tells us that this particular choice of split, of saying "on this line, above and below", does not cleanly separate the observations that we have. If one group of observations is spread all over the place, that's unlikely to be a good cluster, right? Or really a cluster at all, spread everywhere.
So what we're going to do now, as step two in the algorithmic process, is choose a different way of splitting. And how would we do that? What we're going to do is move these points towards the center, or the mean position, of the observations which lie on each side of the line. Let me just say what I mean by that. So let's do the second step in red. The mean, or center position, of the black points above the line is over here. And the mean position, or center position, of all the black points below the line is probably around here, which is to say it's essentially halfway between the ones in the lower corner and the ones in the upper corner.
Now we do the same thing again, which is to say we draw a line between the center points and split perpendicular to it, here, and we have above and we have below. Now this is a very good split, and you can see why: the points below are all grouped together, and the points above are all grouped together, in red here.
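Here is a small sketch of that second step, again in Python with NumPy. The data and the deliberately poor starting centroids are invented for illustration, and the helper name `assign` is hypothetical. With this starting position one group holds a single observation while the other is spread over the whole surface; after moving each centroid to the mean of its group and splitting again, the two groups separate cleanly.

```python
import numpy as np

# Hypothetical observations: columns are (heart rate, sleep length in hours).
X = np.array([
    [52.0, 8.5], [53.0, 8.0], [51.0, 9.0],   # long sleep
    [53.0, 4.0], [51.0, 3.5], [52.0, 4.5],   # short sleep
])

def assign(X, centroids):
    """Return, for each observation, the index of its nearest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# A deliberately poor starting position (an assumption for this example).
centroids = np.array([[50.5, 9.5], [52.5, 7.5]])
labels_before = assign(X, centroids)

# Step two: move each centroid to the mean position of its observations.
# (Both groups are non-empty here; an empty group would need special care.)
new_centroids = np.array([X[labels_before == j].mean(axis=0) for j in range(2)])
labels_after = assign(X, new_centroids)

print(labels_before)   # one group is tiny, the other spread everywhere
print(new_centroids)
print(labels_after)    # the split after moving the centroids
```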
And then the mean position of the ones below the line, that's a good center point, and the mean position of the ones above the line, that is a good center point.
So let's just summarize what we've done. Step one: randomly place K points. Step two: move the centroids to the mean positions of the candidate clusters. So we're moving the centroids, moving those yellow points, to the mean position of the black points in each group. And then we repeat that. Step three: repeat step two until our centroids don't move, until the centroids are stable, if you like.
So let's just review that a little bit. We start out with two points, randomly placed. We split the surface above and below. Then we move these points closer to the mass of black points, right? Then we keep moving the points, and moving the points, and moving the points, until effectively they are centered within the two masses of black points, which are evenly divided. There are limitations to the algorithm, which I think we'll move on to now.
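The three steps summarized above can be put together as one short function. This is a minimal sketch rather than a reference implementation: the function name `kmeans`, its `init` and `max_iter` parameters, the choice to leave an empty group's centroid where it is, and the example data are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, rng, init=None, max_iter=100):
    """Plain k-means following the three steps; returns (centroids, labels)."""
    if init is None:
        # Step one: place k centroids at random within the range of the data.
        centroids = rng.uniform(X.min(axis=0), X.max(axis=0),
                                size=(k, X.shape[1]))
    else:
        centroids = np.array(init, dtype=float)

    for _ in range(max_iter):
        # Allocate every observation to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step two: move each centroid to the mean of its observations.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # an empty group keeps its old centroid
                new_centroids[j] = members.mean(axis=0)

        # Step three: repeat until the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Hypothetical observations: columns are (heart rate, sleep length in hours).
X = np.array([
    [52.0, 8.5], [53.0, 8.0], [51.0, 9.0],
    [53.0, 4.0], [51.0, 3.5], [52.0, 4.5],
])

centroids, labels = kmeans(X, k=2, rng=np.random.default_rng(0))
print(centroids)
print(labels)
```

Because the starting points are random, which group ends up labelled 0 and which 1 can change from run to run. Passing `init` fixes the starting centroids when a repeatable result is wanted.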