Practical Machine Learning
This course covers the concept of unsupervised learning within the context of machine learning. You'll learn the fundamentals of unsupervised learning and how it differs from supervised learning. We'll cover the topics of clustering, k-means clustering (and its limitations), and dimensionality reduction.
- So K-means clustering accepts a K, which is the number of clusters you expect there to be in a dataset, and it will find the centroids, or mean points, of those candidate clusters. Now, I think in that description you may be able to detect there's gonna be some limitations to that approach. So in this video, or this section, I would like to talk about some of those limitations and how we get around them. This is gonna be K-means Clustering Limitations. Now the first and most obvious one, perhaps, is: what if we choose the wrong K? So let's just have a look at a dataset here. Let's start with nice obvious clusters, but maybe a nice diffuse cluster in the middle there. But if I say K = 2, what will the algorithm do? Well, it may choose a center down here and a center up there. And I'm not gonna say that's definitely what it will converge to and stop at, but maybe. What does that tell you? Well, you've got this huge spread of points here which would be taken to belong to cluster one, and another spread which would be taken to belong to cluster two. That doesn't seem a good partitioning, right? I mean, just look at that. Without that partitioning on, you've got three discrete clusters, obviously, visually, and that seems like a good place to stop. So what can we do if we don't know how many clusters we're gonna find? Well, we can just try lots of K's, right? So what we're gonna do here, methodologically, is try K = 2, K = 3, K = 4, 5, as many as you like. Technically you could go up to the number of points you have. Typically people would probably go up to, I don't know, maybe 31 or something, some very large number, whatever. So we will try the clustering algorithm with different values of K. Now what typically will be reported alongside the cluster centers is how spread out the data is around those centers. So when I run the K-means algorithm on my dataset, the output will be some centroids and what's known as an inertia, or spread of the data.
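As a rough illustration of what was just described (give the algorithm a K, get back centroids and an inertia), here is a minimal pure-Python sketch. It is not the implementation used in the course. The function name `kmeans`, the `visitors` data, and the deterministic farthest-point starting rule are all assumptions made here so the example is small and reproducible; real libraries usually start from random or smarter initial centers, and the two features are left unscaled for simplicity.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal k-means for a small list of 2-D tuples.

    Returns (centroids, labels, inertia). The starting centroids come from a
    deterministic farthest-point rule, an assumption made for reproducibility.
    """
    # Start from the first point, then keep adding the point that is
    # farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def nearest(p):
        # Index of the centroid closest to point p.
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(max_iter):
        # Assignment step: every point joins its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: every centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster where it is
        if updated == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = updated

    labels = [nearest(p) for p in points]
    # Inertia: sum of squared distances from each point to its own centroid.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical (age, rating) pairs: three visually obvious groups.
visitors = [
    (10, 9), (11, 9), (12, 8), (11, 10), (13, 9),
    (40, 5), (42, 6), (45, 5), (43, 4), (41, 5),
    (70, 2), (72, 1), (68, 2), (71, 3), (69, 2),
]

# Try several values of K and report the spread for each.
for k in range(1, 8):
    centroids, labels, inertia = kmeans(visitors, k)
    print(k, round(inertia, 1))
```

On this made-up data, K = 2 has to merge two of the three groups under one center, which is the "huge spread" situation described above, while K = 3 gives one center per group.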
And the term inertia here is very specific to K-means; as far as I can tell, no one else uses inertia anywhere else. But here with K-means it's often called inertia. It has an analogy to physics we don't need to get into, but anyway, we get these centroids and we get that spread. So if I try K = 2, what I would find is I've got a centroid over here, let's say that position is (2, 3), and I've got a centroid over here, let's say it's (5, 6), right? We have two centroids, and we would see that there's a spread here. Now, I don't know how to characterize the spread; let's say it's going from zero to five, so let's say the spread is five. That's not exactly what the formula would be, but there we are. And on the other side, let's say we're going from about five to six, so let's say the spread here is about one. So we say inertia about one, inertia about five, just to give us some simple numbers to work with. Now, as we increase K, what we would find is that the inertia would drop, and it would always drop, because as you increase the number of groups that things can belong to, when you consider how spread out those groups are, that spread gets lower and lower and lower. You could go all the way up to the number of points that you have, the number of observations you have, and you basically have no spread, because each person is its own group. There's no area around that point, right? One point is one group. So if we look at how inertia will change with K, we would see a drop-off effect. So here, let me put K on the horizontal axis and here inertia, or, let's call it spread.
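For reference, the quantity usually reported as inertia (for example, by scikit-learn's `KMeans` as `inertia_`) is the sum of squared distances from each point to its nearest centroid. The symbols below are notation introduced here, not taken from the video: $x_i$ are the $N$ data points and $\mu_j$ are the $K$ centroids.

```latex
\text{inertia}(K) \;=\; \sum_{i=1}^{N} \; \min_{1 \le j \le K} \lVert x_i - \mu_j \rVert^{2}
```

With $K = N$ and all points distinct, every point can be its own centroid, so every term is zero. That is the "no spread" case at the far end of the curve.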
I'm not very fond of the word inertia here; it's spread. And what you will see, as you try lots of numbers, lots of different K's, and you report the results of each of those, is that if you plot spread versus K there will be a drop-off. Let's go from K of zero; well, you've got one group as a minimum. Let's have one, two, three, four, five, six, seven, and so on. Now with this dataset above, I think we'd get a big drop-off at three groups. So what we would see here is: at one you've got a large amount of inertia, at two a large amount of inertia, at three we would start to see a big drop-off, and then diminishing returns. So what this is telling you is that around three, perhaps even four, there's a big drop in the degree to which the data is spread within these groups. So let's just try and put that in English, if we can explain why this drop is occurring. So if we imagine, what did we have earlier? We had sheep, we had children, we had things that come in breeds, or genres would be a good one. Let's think in terms of the fairground example, where we had ratings, say. Now what we're doing here when we cluster is asking one question: the people coming into our fairground, do they have things in common which cause us to see more observations in a certain range of ages and a certain range of ratings? In other words, do we see, for example, young children, high rating; old people, low rating; middle-aged people doing other things, say, right? Or in fact, do we see an essentially random distribution? In other words, is it just as likely to be a young person who's upset with the quality of the fairground as an old person who enjoys it? If there was just no connection between age and rating, you would just see this random dispersal, or dispersion, of ratings, and we don't; we see them clustered, grouped together.
Now, as we increase K, what we are saying is: well, if I start by thinking every person is the same, every person belongs to the same group, that's a very, very big group, a very indistinct group. Everyone from 10 to 100. As I increase K, what I'm doing is forcing people into smaller groups which, presumably, have more commonality amongst them, until I get to a certain K which seems to fit the underlying patterns in the data the best. And as I increase K beyond that point, yes, I'm concentrating the commonalities that people in those groups share, but I have much diminishing returns. Like, maybe I take the 10, 11, 12, 13 year olds. Maybe that would be a group with K = 3. Then I go all the way up, sorry, to K = 10. Suddenly I get groups of just 10-year-olds. Is knowing how 10-year-olds rate something, compared to knowing how 10, 11, 12 year olds rate something, is 10-year-old behavior more distinct, more interesting, more valuable to distinguish? Probably not. Probably the 10, 11, 12 group basically behave in the same way. And if I'm just going for more and more groups to partition people into, I'm not really learning anything more. And that's what you see with this drop-off, and this is a common rule of thumb in machine learning, which we can call diminishing returns. There isn't really a principled K, there's no true K, there isn't a "this is the number of groups". It's that at a certain point in the process, when we are trying to understand the data, we turn a dial, and at some point there are diminishing returns from turning it. And we go, well, maybe I've learned the most I can learn at this point, I'll just stop there. And that's the rule of thumb you would use, especially with K-means clustering.
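The diminishing-returns rule of thumb is a judgment call, but one way to make it mechanical is sketched below. This is only one possible heuristic, assumed here for illustration and not something prescribed in the video: for each K, compare the drop in spread going into K with the drop going out of it, and pick the K where that difference is largest. The function name `elbow_k` and the spread values are made up.

```python
def elbow_k(inertia_by_k):
    """Pick the K after which the drop in inertia shrinks the most.

    inertia_by_k maps consecutive K values to their inertia. The score for a
    K is (drop going into K) - (drop going out of K); the largest score wins.
    """
    ks = sorted(inertia_by_k)
    best_k, best_score = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_out = inertia_by_k[k] - inertia_by_k[next_k]
        score = drop_in - drop_out
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Made-up spread values shaped like the curve described above:
# large at K = 1 and K = 2, a big drop at K = 3, then diminishing returns.
spread = {1: 100, 2: 90, 3: 20, 4: 15, 5: 12, 6: 10, 7: 9}
print(elbow_k(spread))  # prints 3
```

A score like this can only suggest a candidate. Since there is no true K, it is still worth looking at the plot and at what the resulting groups mean.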
So, just to restate that, what we're gonna do is take a dataset where we don't know how many groups are in it, try it out with different K's, see where we get diminishing returns, say that's our K, and that gives us our clusters. So we can get around the limitation of having to know what K is at the beginning by just trying lots of different ones, having a rule of thumb for choosing the best, and then selecting that. That's the first limitation, that is, finding K: how do we find K? And as I said, as with most of machine learning, it's a bit of a case of: at some point, it seems reasonable. Now the second limitation here with this algorithm, which is not specific to this algorithm, is that it only gives you centroids, it only gives you these centers of clusters. So let me just show you what I mean by that. If I've got these points here, I get this purple center. So that tells you what the center of the cluster is, and you have the spread, so you can do plus or minus: you can go center, cross there, bit of spread. You can say, well, the center point, the most characteristic point, if you like, of the teen group is maybe age 11 with a rating of nine. So what that tells you is that the mean point in my group of young teens at my fairground is an 11-year-old who rates a film, or a fairground ride or something, 9 out of 10. That's only a characteristic point. What you're not getting is a full distribution, which is to say the full answer, which is what I drew in the clustering video: this purple circle. What we would really like is the purple circle. We wanna know precisely where the boundaries are across the whole range. What about a person who's on the edge of one of these boundaries: should we consider them to belong to this group because they're close to this center, or to belong to some other group because they're also close to that center?
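To make the edge-of-the-boundary problem concrete, here is a small sketch of what you can do when all you have are centers: assign a person to the nearest one and look at how much closer it is than the runner-up. The helper name `assign` and the gap measure are assumptions introduced here, not part of K-means itself. Only the (11, 9) teen center comes from the example above; the second center is invented.

```python
import math


def assign(point, centroids):
    """Return (nearest label, gap between the two smallest distances)."""
    ranked = sorted((math.dist(point, c), name) for name, c in centroids.items())
    (d1, label), (d2, _) = ranked[0], ranked[1]
    return label, d2 - d1


# (age, rating) centers. The "adults" center is hypothetical.
centroids = {"young teens": (11, 9), "adults": (41, 5)}

print(assign((12, 8), centroids))  # close to the teen center, large gap
print(assign((26, 7), centroids))  # equally far from both centers, gap of zero
```

A gap near zero flags exactly the person described above: the nearest-center rule still returns a label, but the centers alone give no real basis for preferring one group over the other.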
Well, if we had the full boundary, the boundary where the machine believes the group begins and ends, then we could just use the algorithm to decide: you go in this group, you go in that group. When we only have the center points, it becomes a harder judgment call. Now there are other algorithms that produce these full circles, and they're far more complicated. Gaussian mixture models can be involved, there's the algorithm called DBSCAN, and there are things to do with hierarchical clustering. In this course, however, we will just stick to K-means to illustrate the idea, and then perhaps in an advanced session we may look at one of those other algorithms.
QA is the UK's biggest training provider of virtual and online classes in technology, project management and leadership.