Practical Machine Learning
The course is part of this learning path
Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. In this module, you’ll learn more about exactly what it is and what its capable of, before we move into more detail about the nearest neighbours algorithm, hyper parameters, and distance functions and similarity measures. We’ll end the module off with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.
- Let's now talk about how the K nearest neighbors algorithm works, and in particular, perhaps have an interlude about similarity and distance, which are the notions that K nearest neighbors is appealing to, in order to compare points, and select the ones of interest. So let's talk about similarity first, as a kind of prelude, and then come on to K nearest neighbors and how it's going to use that. So similarity, or distance. So distance here is the key word. What we're gonna do is look at certain formula for distances that come from, essentially, kind of geometry and you use those formula to give a geometrical interpretation to data sets, and by doing that arrive at a very simple formula for comparing points, comparing observations. So it sounds more complicated than it is, but let's have a look. So let's start by looking at some distance functions. Now we've already seen some in this course. One particular one is the Euclidean distance formula, and here we are. So suppose I have a point, you know, the two axes, an axis, let's say X1, and axis X2. Now when we're comparing points, we're usually comparing features. Since we don't know, so you know, the general use of distance forming machine learning is concerning comparing observations, which are, here, multiple features of something, so, but not always, but in this case, let's consider it to be two Xs, two features. So what we're gonna do is put a couple points on here and ask what the distance is between them. So if I put a point, you know, I'll do it, I'll just do A and B here, A and B. Now, there's many ways of answering that question, of what the distance is between them. I mean, you know, I could try, you know, if I were restricted to only drawing arcs, if I had to draw lines that look like that, that would be one way of computing a distance. If I was restricted to a grid, suppose I draw on here a grid, one, two, three, four, five, six, seven, eight, nine, say and then again, and you know and I was restricted just to these sort of grid lines, well then where would I have to go? Well I'd have to go, well, sort of along up, along up, along up, along up, and so on along particular grid, in order to get from A to B. So in that grid case, it's more like a road network, especially an American road network, where you kind of just drive over houses. So there's sort of, there's the intuitive obvious distance, which is a sort of crow-flies distance like that, but then there are other kinds of geometrical constraints we can place on our travel. That means that we arrive at different distances. So what is this one, the first one? The first one is called Euclidean, Euclidean, and that's after Euclid, who is a famous ancient mathematician, with who geometry, who arguably invented geometry. In any case, so the Euclidean distances, this straight line distance between two points. Now we have another word for this as well, which is the L2 Norm. You might think that's a strange phrase for it. The norm here means kind of distance or measurement of some kind. So that's a distance measurement. And L2 will, we can see where the two will come from in a second. So, okay, what's the formula for this? Well, so what we're looking for here is some way of taking our numbers in X1, taking our measurements in X1, taking our measurements in X2 for both points, and then coming up with a formula for the distance between them. So let's just be precise about that, you know, so the point A is just gonna be some numbers. You know, there's gonna be a number for X1, a number for X2, so let's say it was, you know three, well, let's maybe do it more visually accurately. So if this is one two three, four, five, six seven eight nine, and this is one, so A is at 1,1 and B, I believe, let's say B is at 9,9. I think that makes things simple. So A is at one, one and B is at nine, nine. Maybe, we could do different numbers if you wanted to. Maybe you could say even two etc, right? So what we're looking for is a way of taking these numbers and coming up with an answer. So the distance function, we can call it, D, if we wish, is going to take in point A, take in point B, and somehow give us a number, which describes the distances. So what is that? Well, we're gonna do it, we're gonna say, so what's that? It's just a triangle formula, you know, the Pythagorean formula, where we say, "Well, if this is the base, "that's the height of the triangle." Yeah then luckily H here is may be a little... then this will be. So whether you got some, you know, technically, we might say the diagonal there is the hypotenuse, but we're ignoring all of that language, base, height, and distance. Those are good letters. What is the actual formula here? Well, it's the base squared plus the height squared is equal to the distance squared. Now we're not interested in the distance squared, we're interested in the square root of that, or in particular, just the distance. So what's that? That's where the distance, then, is equal to the square root of the base squared plus the height squared, right? Now, the remaining question really is what is the base? What is the base? What is the height? Well, the base is just this number over here, minus that number over there, so you can see here the base is eight. And what's the height? Again, same idea. One, so nine minus one, you can see the height here is eight as well. So that's going to be, let's put this in, so let's say it's B, I'd say it's the, well, it's the base, so the base is the first component here, right? So it's B1 minus A1. Let's say this is A1, A2, A1, A2. So B1 minus A1, that's the base. That's gonna be squared, and it's gonna be plus, so plus B2 minus A2 squared, and then we square root that. That's, you know, that's the answer. That's the answer. So that's the formula for a Euclidean distance or the L2 norm. So where does two come from? Well, it's because we're using a square in the distance formula. Okay, now there's one final piece of notation here that I want to introduce you to, if you're not familiar with it already, and that's tidying up the square root a little bit. It might not be, Might not considering it tidying, but anyway, what you can do here is you can say, you know B1 minus A1, squared, plus B2 minus A2, squared, sorry, rough plus, sorry. Oh no, no, no. This is the, this is the base bit. That's the height bit, and then we're adding those together, yeah? Right, now, what's the notation here? And that's two raised to the one-half power, one-half power, and that means square root. So if you're raising something to a power, let's say you're raising it to four, then the inverse operation is one over four. And that really is 0.25, but no one would ever write it that way, really, but you could. So I, what I mean to say there is it really means one divided by four, so it's not some milline notation device. And you might, you might see this, so one-half power. So that just means square root here, so the opposite of a square, square root. Again, if you have a cube root it'd be one over three, quadratic, you know, fourth root, one over four. So that's a distance formula, and that's the L2 distance formula. So we could say that's the distance for L2 in terms of A and B. So we're treating A and B as vectors here, right? So they're lists of numbers into which we're taking indices, or we could consider them as sort of genuine arrows in the geometrical sense. So, ya know, if I had a point over here, the point, let's say over here, you know, then that would be the distance between them, But I could consider, you know, A as being an arrow from the origin to A, and B as being an arrow from the origin to B, and so if I wanted to know the distance between them, well remember from our discussion of linear algebra, that this vector between them, let's do it in a different color, I've got red all over the place, so there's a vector between here, let's call that vector here, C, you know, C, it's going to be B minus A, I believe. B minus A, well, so that's going to be C. That vector, C, has the same length as this distance. So you know, we know the distance formula here, from these two points, right? So you could think of that as the base of your triangle, the bottom, and then there's the height of the triangle, and so on. So that distance, so C is the vector between them, and then we could say that the length of C also has this formula. So that length is just that formula there. So this is one notion of distance, and this will plug in to K nearest neighbors as a way we can compute similarity between points. There's a couple of others. The final measure worth mentioning is one that we've seen before, so I won't go through it again, and that's this notion of cosine similarity. So that's cosine similarity. And we've seen the formula there, which is the angle between two points understood as vectors to those points. So if we have, you know, point A, and point B, we can treat those numbers, where A is higher in this case, so it'd have a higher second number, so it'd be three, one, three, say. And B's a little further along, let's say it's two, on much lower, I'd say zero. We can understand those pointers as, actually, it's almost like pairs of points and there's a point here at zero, zero, and there's a point at where they are, and then we can draw an arrow between those two, so we have a vector to that point. And then we can say, "Well, what is the angle between these?" and that angle there has a formula of, well, cosine of that angle is the dot product divided by the length of the vectors, so sort of normalizing the vectors by their length, so, you know, the vectors we're actually comparing are not A and B, they're actually just, it's A, where we only have a length of one and B only where there's a length of one. So there's a one here and a one here, because we're scaling them, so the definition of that there is A divided by the length of A, and that gives A a, that gives A, a length of one, so it gives A a length of one. Right, so, okay, and then if you wanted the angle, often you can just sort of compute this number and use that number, so it isn't the angle itself, but it's, it's, you know, reasonably just, it's a ranking by this number, is about as good as ranking by the angle. So if you were to compute similarities, you could use this number rather than the angle, but if you wanted the angle, it will be the inverse cosine of that, Which would be, no, we can do it this way, or you could use, another word for the inverse cosine here is the arc cosine, and this is gonna do the opposite of what cosine does. So we've discussed that a little bit, as well. That cosine has this shape, so if you are at this point, you know, so if you have this point, you can look for what point you are there and vice versa. So with that being discussed already, so that's, I'll just give you a preview that you can use the angle as well to compute similarity. As far as K nearest neighbors is concerned, you know the intuitive thing here is just the Euclidean distance. That's kind of the distance I've been using in the visuals really, that if I draw a graph on here, let's say it's a classification problem, so here we've got some green points, here we've got some red points, and I put a test point on here, then the question is, you know, "Which is the K closest points?" Let's say K here is some number that's convenient for the visual. So suppose I draw, you know, so the idea then is to compute what the, so what the algorithm is going to do then is compute the distance to every point on the plane, and then select the key ones. So, you know, so the distance to here, distance to there, distance to there, distance to there, distance to there, and so on, every single point, using the Euclidean formula if we set it up to use the Euclidean formula, and then it's going to choose the closest. So let's just tidy that up, and do a simple example. So if we have a point here, two points here, maybe do one over there, and we put a test point on, what we're going to do is consider all of the points first, to compute their distances. So here's a distance, here's a distance, there's a distance, there's a distance, there's a distance. So now this here, this one, you know, is the base height, base height. So it's that formula there of, you know, suppose this were five and this is, well let's say that's six and seven. Then we've got, for this distance, we've got, you know, six minus five, squared in the base. In the height, we've got, let's say this is five again, and, well, let's say six again, well, I can be doing just like a different number, five, six, seven. Then in the height, we've got seven minus five squared, and then square rooted. So the first one is one, the second one is two, so that's the square root of three, which I'm guessing is about 1.5, 1.6-ish. So if we do if we do three, square root, 1.7, okay 1.73. And just to make the point earlier about the power notations, go to scientific, go to three, X to the Y. So the next number here is power. If I put 0.5 here which is a half, then we get 1.73 as well, for both of those performing the square root. That's 1.73, that's 1.73. And then I'm not gonna do the remainder, you can have a go yourself if you wish, let's say that were four, and let's say this were four, why not? Then you're gonna rank those points by those distances. So let's say this comes out to 1.8. Let's say this comes up to 1.6. I'm gonna have two, why not? So if wanna select the K nearest, using this formula, we're gonna set, we've ranked them by their Euclidean distances by their L2 norms, so we know 1.6 is the closest. Well, if we wanted K equals one, then the closest would be this point, and so the prediction would be green, or whatever the green case is, a positive case, say, so we make that positive. If we want K equals two, then the two closest points are red and green. So it'll be a random prediction. And the three closest points are two red, so two red and one green are the K equals three case, and then that would be a prediction of red, or the negative case, say, because that would be the most common within the top endpoints. So the formula for the model here, then, our F hat is taking in a point, and the model is parametrized by the entire data set, meaning that all the model will use is that entire data set. So, that's a capital X and Y, or we can give it capital D to make that distinction clear, I don't know. Let's maybe try capital D for now. The data set. And so what's the model? The model is to find the minimum of some distance measure, let's say the L2 distance measure, or in general, D, let's say L2, considering, you know, the Xs in the data set, and our X to find the minimum entry in the data set by comparing our X and all of the Xs in the data set, and the thing we want to return is the Y here, and that's the pretty good definition. So the way we read this formula here, is to say the minimum Y for which, let's write that down, the minimum Y, for which L2 is a minimum. L2 what? L2 on or comparing the data set Xs versus the unknown x, or the new x, the unseen point, where we don't know the Y. So here is where you would plug in your different distance function, whether it be L1, L2, probably able to use cosine similarity as well, if you wish.
About the Author
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.