Practical Machine Learning
This course covers the concept of unsupervised learning within the context of machine learning. You'll learn the fundamentals of unsupervised learning and how it differs from supervised learning. We'll cover the topics of clustering, k-means clustering (and its limitations), and dimensionality reduction.
- We've looked at clustering, which is one of the major kinds of unsupervised learning. Now I want to look at dimensionality reduction, which is the other. So what's this? This is about taking a dataset of features, with no target here, and compressing it. How do we define compressing? We say that the number of dimensions (you can read this as columns for now) in the dataset we get as the result of this process needs to be lower than the number of dimensions in our old dataset. We call the result the new dataset. So we can think of that as a definitional statement of what dimensionality reduction consists in, all right? So let's define dimension. I gave you a little side point that dimension means column, but that isn't quite right, and we need to say why. Technically speaking, a dimension is a way that an observation can vary independently of the other ways it can vary. Let me give you some examples. The intuitive case is dimensions in space, right? In space we have three dimensions. What does that mean? It means it is possible for me to move this pencil, say, towards you without moving it up and down, and it's possible to move it up and down without moving it towards you. And likewise, side to side. In fact, there are only three ways that I can do that. If I move diagonally, I'm in fact moving a little way to the side and a little way forward, right? So there are three principal ways, up and down, forward and backward, and side to side, in which an observation of the position of the pencil can vary without varying in any of the others. Now, datasets often don't come with their columns being fully independent. In fact, full independence is relatively rare.
So suppose I had to give you the position of the pencil, and suppose I was a little bit naive; suppose I hadn't done high school physics or something. If I had to tell someone where a fly was in the room, or where a pencil was in the room, what I might do, if I didn't know better, would be to take a ruler and measure the position from lots of different objects. Maybe I'd measure the position of the pencil from the corner of the room, maybe from the floor, maybe from my knee, maybe from the television, maybe from lots of different things, right? So maybe I'd give you all these different positions of the fly or the pencil, whatever is in the room: one, two, three, four, five, six. I would say, you know what? It's about two meters from that corner, about three meters from that corner, about five meters from that corner, six meters, two meters, three meters, four meters. Now, as you might be able to tell, there's a lot of repetition in those numbers, right? Why? If I measure this pencil from all the different corners and then move the pencil, it moves relative to all of those corners at once. There's actually duplicate information there. I can reduce that information down to just three numbers without losing any description of the position of the pencil, right? So this is the idea: maybe the columns we're getting in the datasets we collect in some sense repeat information, and we can expunge that repetition if we reduce the number of columns down to ones which are fully independent. Let me give you a little diagram of that. Maybe we have a little fly flying around; all we need is three numbers: the height, the depth, and the width, or something like that. And so what we're going to do is take a dataset which has 100 different columns and reduce it down to one which has three. That would be a form of reduction. We're reducing the number of columns down to a set which is fully independent.
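As a rough sketch of the "100 columns that really hold three numbers" idea, the Python below builds a made-up dataset and counts how many columns vary independently. It assumes, purely for illustration, that each recorded column is a linear mix of the three true coordinates plus a fixed offset; real ruler distances from objects in a room are not linear like this, so treat it as a simplification. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observed positions of the fly (height, depth, width).
positions = rng.uniform(0.0, 5.0, size=(200, 3))

# Assumption: each of the 100 recorded columns is a linear mix of the three
# true coordinates plus a fixed offset (a simplification of "measured from
# the corner, the floor, the television, ...").
mixing = rng.normal(size=(3, 100))
offsets = rng.normal(size=100)
measurements = positions @ mixing + offsets

# Centre the columns so the fixed offsets do not count as an extra dimension.
centred = measurements - measurements.mean(axis=0)

# The rank counts how many columns vary independently of the others.
# An explicit tolerance ignores floating-point round-off.
independent_dims = np.linalg.matrix_rank(centred, tol=1e-8)
print(measurements.shape[1], "columns,", independent_dims, "independent dimensions")
```

Under that assumption the 100 recorded columns carry only three independent dimensions, which is the sense in which the extra columns repeat information.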
Now it might also be useful for us to go beyond the number of fully independent columns. What does that mean? It means that it might also be useful to lose information. Let's think of a case here. A classic example for dimensionality reduction is with images. So here's an image: a little smiley face. Let me also draw a little frowning face. Now, if our task in the end is to distinguish smiley faces from frowning faces, almost all the information in these images is useless to us. For example, let's look above the eye line. Above the eye line, these images are indistinguishable. So in fact, we could just delete that information from our dataset, and deletion here is a form of dimensionality reduction. We can just delete that and distinguish these kinds of images as well as we could before. So we can get rid of this. Now, here we have technically lost information. It isn't as if information was being repeated in our dataset. It isn't as if someone gave us too many numbers when we only required three. We are actually losing information, for instance about the eyes. But because our task is only to distinguish smiley faces from sad faces, we could probably lose most of this. In this case we only need some center pixels; let's say this center space here is all we need. If the mark is at the top, it's sad, and if it's at the bottom, it's happy. So there are two cases where we can use dimensionality reduction. There's the case where we suspect there's lots of repeated information, and so we're just going to throw away columns which somehow repeat that information. And there's the other case where, in some sense, too much information is present: we don't need it all, and we actually throw some of it away. Before we look at some more examples to give you some sense of application, let's look at how we can do dimensionality reduction.
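To make the smiley-versus-frowning case concrete, here is a small sketch using two invented 5x5 images. The pixel layouts are hypothetical, chosen only so that the faces share their eyes and differ in the mouth. Dropping the top three rows (the eyes and everything above them) throws information away, yet the two faces remain exactly as distinguishable as before.

```python
import numpy as np

# Hypothetical 5x5 black-and-white images (1 = ink, 0 = blank).
smiley = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],   # eyes
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],   # mouth corners turned up
    [0, 1, 1, 1, 0],
])
frowny = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],   # eyes, identical to the smiley
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],   # mouth corners turned down
    [1, 0, 0, 0, 1],
])

# Each image is one observation with 25 columns (one per pixel).
full = np.stack([smiley.ravel(), frowny.ravel()])

# Delete the eyes and everything above them: keep only the bottom two rows.
reduced = np.stack([smiley[3:].ravel(), frowny[3:].ravel()])

# Count the pixels on which the two faces disagree, before and after deletion.
pixels_differing_full = int((full[0] != full[1]).sum())
pixels_differing_reduced = int((reduced[0] != reduced[1]).sum())
print(full.shape, "->", reduced.shape)
print(pixels_differing_full, pixels_differing_reduced)
```

The dataset shrinks from 25 columns to 10, and every pixel that separated the two faces is still there. The deleted columns did hold information (where the eyes are), just not information this task needs.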
So I've already given you one technique, which is just to delete a column, and that is a very powerful, very good technique. But let's look at an algorithmic technique, something more than just a practitioner pressing delete. A core algorithm in this area is known as principal component analysis, or PCA. So how does this work? Let's just remind ourselves of what we're trying to do. We take an input into an algorithm, and we can draw it using the A-notation we had. The input is the features, right? Here I'm going to call these old, meaning before we've processed them. The output of this algorithm is some new set of features, new. And the goal here is that the number of dimensions, roughly the number of columns, in our new set is smaller than the number of dimensions in our old set. Let's take a minute here to look at this formula and compare it to the supervised case, right? In the supervised case, the output of the algorithm is a model, a function basically. Here the output is a dataset. They're very different. The input into a supervised process is a dataset, and the output is a pattern, a model. The input here to PCA is a dataset, and the output is a new dataset, which in principle has a different number of columns that are somehow more helpful to us than those in the original set. Let's explain how we do that. Let's just take a simple dataset that we might do regression on, and look at a couple of features, X1 and X2. Maybe here we go for the sleeping example: heart rate and the length of sleep. This would be something that, for example, Fitbit or some other health tracking company would be interested in for the quality of our sleep: look at heart rate data and the length of sleep data, say. Right, what I'm going to try to do here is get rid of a column. We've got two now. Can we get rid of a column without losing much information?
Let's have a look. What we might find is that heart rate is actually quite correlated with your length of sleep. So that's how it looks visually. Numerically, let's have a couple of columns, X1 and X2, heart rate and length. For the sake of making my calculations easy, let's just say this is going to be two, two, three, three. Well, I mean, if we want to make it realistic, we could even say 62, 62, 63, 63, et cetera. Right. Maybe we multiply the length by five, we add one to it, that sort of thing, right? For the sake of simplicity, let's go with that. Now, what you can see immediately looking at this is that if length is always equal to heart rate, we can just delete length, because it's the same information, right? That gives you a sense of how we can compress things. But suppose there's a bit of a kink in it. There is a little bit of difference, and maybe we want that little difference. What are we going to do? Well, we're going to look at this dataset here and basically draw a regression line through it, here in orange, and then we're going to treat this regression line as a new axis. I'll explain what that means in just a second. So we have this regression line, and then we draw another line perpendicular to it, and we're going to treat these two new lines as new axes. To understand that idea, that we can remeasure data using new axes, I think we have to take a step back. The goal here with PCA is to produce a new set of columns. So we've got these columns here, X3 and X4, represented as axes, and these are new axes. And when we look at the data points and how they fall on these axes, we'll actually have new numbers for these observations. Let me make that point visually. If I look at, for example, this first black point on the heart rate axis, let's say the heart rate here starts at 60.
On the heart rate axis, let's say that point is at 61. And likewise for length of sleep, let's say this again starts at 60 down here and the point is at 61, so the point is at 61, 61. But having now drawn X3 and X4, let's see what we get in X3. Let's put that on the same scale, so start at 60. Okay, so maybe this point is still 61 in X3. But now look at how the point measures vertically in X4. It's far closer to the axis. If you look at that orange line, it's basically on the axis, right? Say we're measuring from zero, why not, so this is about 0.5. So the original point was quite high above the axis in the original column, but in X4 it's quite close to the axis. This applies to all of these data points. Using the original black lines, they are quite far along in X1 and quite far along in X2. But when we use X3 and X4, and let me just rotate the visual for you, you can see that there's very little variation above the axis in X4. What does that mean? It means that if I project all of these points down onto just X3, in other words if I ignore how high they are, you will notice that they are still distinct from each other. So that means that the only information I need about this dataset now is essentially the information in X3. Yes, if I kept track of their heights in X4, that would distinguish them a little bit more. But they are sufficiently distinct in X3 that I can throw away X4. So how is that being achieved? Well, what we have done is use a regression line, which somehow captures the information of two variables, and we have used a regression score, that is, the position along that line, to summarize those two variables. That gives us X3. Then we have this X4 column as well. And X3, because it is the result of this regression, contains information from both of those variables.
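Here is a minimal sketch of that remeasuring step for two columns, using made-up heart rate and sleep length readings on the lecture's illustrative scale. Two assumptions are worth flagging. First, the new axes are found with a singular value decomposition of the centred data, which is a common way to compute PCA; the line it finds is a best-fit line through the cloud, close in spirit to the regression line in the drawing but not identical to an ordinary least-squares fit. Second, the new columns are measured from the mean of the data rather than from 60.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical readings: x1 = heart rate, x2 = length of sleep, both on the
# lecture's illustrative 60-ish scale, strongly correlated with a little noise.
x1 = rng.uniform(60.0, 70.0, size=50)
x2 = x1 + rng.normal(scale=0.3, size=50)
X = np.column_stack([x1, x2])

# Centre the data, then find the new axes with a singular value decomposition.
# Each row of `axes` is one new axis, written in terms of the old columns.
centred = X - X.mean(axis=0)
_, _, axes = np.linalg.svd(centred, full_matrices=False)

# Re-measure every point against the new axes: column 0 is X3, column 1 is X4.
new_columns = centred @ axes.T
x3, x4 = new_columns[:, 0], new_columns[:, 1]

# How much of the total spread of the points is visible in X3 alone?
share_in_x3 = x3.var() / (x3.var() + x4.var())
print(f"variance along X3: {x3.var():.2f}, along X4: {x4.var():.4f}")
print(f"share of total variance kept by X3 alone: {share_in_x3:.3f}")
```

With data this strongly correlated, nearly all of the spread sits in X3, which is why projecting onto X3 and ignoring X4 still leaves the points distinct.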
And so the distinctiveness of the original data points, which required two columns, now requires one. Now, there are a few technical mathematical things we may wish to say here, but I think we'll leave those technical areas to a different video. I just want to point out some of the ideas that have gone on. The first idea is a change of basis, or a change of axes. What we had in the original dataset was heart rate and length. What we have in the new dataset is actually combinations of heart rate and length, which we call X3 and X4. So we have changed the basis: changed the columns, changed the system of measurement we are using to record our observations. And in this new system of measurement, it turns out that most of the variation between our points can be measured just horizontally. And so in fact we could throw away the vertical information, retain only one column, and still retain almost all the information that was present in the original dataset. Now, mathematically this is actually known as a rotation, and you can see here why it is a rotation: we are choosing new axes that are rotated relative to our dataset, right? You can see here we rotate it. So that's the algorithm. I would like to give you a little bit of a physical intuition for this as well. Let's see if I can do that. The idea here is that, physically speaking, let's say the cluster of points that we have all lie along the edge of this pencil. The idea is that maybe the system of measurement we were using, heart rate, whatever units those are in, wasn't ideal. It duplicated a lot of information. Let's say I choose a three-dimensional set of axes and I measure this pencil in three dimensions. I measure, let's say, along my thumb, along one finger, and along another finger, and that gives me three numbers. And what you will notice here is that this pencil does not change its depth.
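The change-of-axes idea can be checked numerically. The sketch below, again on invented two-column data, verifies that the new axes are perpendicular and of unit length, that distances between observations are unchanged when they are remeasured, and that keeping only the first new column loses little. One caveat: the axes returned by a singular value decomposition may amount to a rotation combined with a flip of one axis, which does not affect any of these checks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-column dataset, centred on its mean.
x1 = rng.uniform(60.0, 70.0, size=30)
X = np.column_stack([x1, x1 + rng.normal(scale=0.5, size=30)])
centred = X - X.mean(axis=0)

# Rows of `axes` are the new axis directions expressed in the old columns.
_, _, axes = np.linalg.svd(centred, full_matrices=False)
rotated = centred @ axes.T

# Perpendicular, unit-length axes: axes @ axes.T is the identity matrix.
is_orthonormal = np.allclose(axes @ axes.T, np.eye(2))

# Distances between observations are unchanged by the change of axes.
old_dist = np.linalg.norm(centred[0] - centred[1])
new_dist = np.linalg.norm(rotated[0] - rotated[1])
same_distance = np.isclose(old_dist, new_dist)

# Keep only the first new column, then map back to the old columns to see
# how much of the original dataset was lost.
approx = rotated[:, :1] @ axes[:1]
relative_error = np.linalg.norm(centred - approx) / np.linalg.norm(centred)
print(is_orthonormal, same_distance, round(float(relative_error), 3))
```

Nothing is lost by the change of axes itself; the only loss comes from the later decision to drop a column.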
So it's not moving this way. You're only moving along the pencil: up, and to the side. So there is no depth information. But if I have three axes that are all askew, then to know where this tip is, I would need a point on this axis, a point on this axis, and a point on this axis. Because the axes are askew from the pencil, I would need all three numbers for the position. If I move these axes onto the pencil, so that I have just a horizontal measurement and just a vertical measurement, and the origin of the two axes is at the base of the pencil, then I can just say this is zero, zero, say, and that is one, one, and I don't need depth. I had depth in my dataset, but it was essentially an illusion. It appeared as if I needed it, because how would I say where this is if these are my axes? Perhaps it was a little bit this way, a little bit that way, a little bit that way. But by moving my axes, mathematically speaking rotating them, I can eliminate the depth information and use only these two. And what we have done here mathematically is pretty much identical. Now of course, there we're dealing with heart rates and lengths and all that sort of stuff, so it is in a metaphorical sense that we are rotating. But it corresponds exactly to the physical case, by analogy. So to wrap up then, let's look at the output of PCA and be clear about what we're doing. When we do PCA, what we're going to do is input some original set of columns, X1, X2, X3, and so on. We will get as output a new set, X1 prime, X2 prime, X3 prime, and so on. On the first pass, what the algorithm will do is just give us the same number of columns as we had in the original, so we're just rotating at first. But they will be ranked in terms of how distinctive our observations are using only that column of information.
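The pencil picture can be simulated as a sketch. The points below are hypothetical: they vary along the pencil, a little across it, and not at all in depth. They are then measured in an arbitrary set of askew axes, generated here from a QR decomposition purely for illustration. A first pass of PCA via the singular value decomposition returns three new columns, ordered from most to least variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical points on the pencil, described in axes aligned with it:
# they vary along the length, a little across the width, and not in depth.
along = rng.uniform(0.0, 1.0, size=100)
across = rng.uniform(0.0, 0.05, size=100)
depth = np.zeros(100)
aligned = np.column_stack([along, across, depth])

# Measure the same points in axes that are askew from the pencil
# (an arbitrary orthonormal set of axes from a QR decomposition).
askew_axes, _ = np.linalg.qr(rng.normal(size=(3, 3)))
measured = aligned @ askew_axes.T

# First pass of PCA: same number of columns out as in.
centred = measured - measured.mean(axis=0)
_, singular_values, axes = np.linalg.svd(centred, full_matrices=False)
new_columns = centred @ axes.T

# Variance of each new column; the SVD returns them from largest to smallest.
variances = singular_values**2 / (len(measured) - 1)
print(new_columns.shape)
print(variances)
```

The third variance comes out as zero up to floating-point round-off, so the third new column can be deleted with nothing lost, and the second is small enough that dropping it loses very little.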
So in other words, the first column will be the most informative column, the second column will be the second most informative, and the third column will be the third most informative. In this case, with this pencil, when I've rotated the axes, it will give me output in three axes, but what I will find is that the third axis, depth, is all zero. So it has ranked them in terms of how distinctive the points on the pencil are: one, two, and three. And the last axis in its ranking will contain the least variation in the points we are observing. What that allows us to do then is remove these later axes, just delete them, and retain the most useful axes for prediction. Before summarizing, I'd just like to discuss a few limitations of PCA. The most important one here is that the new axes we get are almost always, not always, but almost always, less explanatory, less interpretable, than the originals. In the originals, we had heart rate and sleeping. We had age and film rating. We had the length of the cat's whisker and the size of its paw. These are physical aspects of the observations we are making, things we can explain to others. When we perform a PCA, what we are doing is using correlations across columns to reduce them. And the way we do that is by assembling new columns from combinations of old columns. That's what a regression line is: it's just a combination. When we combine length of the fur and size of the paw, when we combine a person's age and how they've rated a film, when we combine the number of days in their insurance policy and the size of their insurance claim, the column we get out is uninterpretable. What does it mean to have 30% of one column, 20% of another column, and 10% of another column all added together to produce a new value? It doesn't mean anything, in fact.
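To see what "a combination of old columns" looks like in practice, the sketch below prints the weights behind the first new column for an invented dataset. The column names, the hidden common factor, and all the numbers are hypothetical. Standardising each column before the decomposition is a choice made here so that units do not dominate; it is not something the lecture specifies.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dataset with three named, physically meaningful columns,
# all driven partly by one hidden common factor so that they correlate.
names = ["age", "film_rating", "claim_size"]
shared = rng.normal(size=200)
X = np.column_stack([
    40 + 10 * shared + rng.normal(scale=3, size=200),
    3 + 0.8 * shared + rng.normal(scale=0.5, size=200),
    500 + 120 * shared + rng.normal(scale=60, size=200),
])

# Assumption: standardise so no column dominates purely because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, axes = np.linalg.svd(Z, full_matrices=False)

# The first new column is a weighted sum of all the old columns.
weights = axes[0]
first_new_column = Z @ weights

for name, weight in zip(names, weights):
    print(f"{weight:+.2f} x {name}")
```

Each value in the new column is some amount of age plus some amount of film rating plus some amount of claim size. It is useful as a compressed number, but hard to describe to anyone as a physical quantity.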
So in this new basis, after this transformation, after we've chosen these new axes, it's probable that the new axes don't really correspond to anything physical or interpretable. They're just optimizations, numerical optimizations that give us a different set of numbers that contain the original information, but in a way that can't really be interpreted. So PCA is a good device for compressing data for prediction, because in prediction we only care about the y. We only care about the result we're getting. But it can be a very bad device to use in any situation where, at every step of the learning process, you need to explain how the data is involved in the prediction, how it gets used, and what impact it has. I mean, suppose you do a PCA on a dataset and you use 10 columns reduced from 100. Now, those 10 columns are peculiar combinations of the original 100. And then a lawyer asks you, why is this the prediction you're making? You really have no answer to that question. So PCA can destroy your ability to interpret or explain a dataset. Although, in fact, it can also improve your ability to interpret.