Practical Machine Learning
Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part one of the module on supervised learning, in which you’ll learn about exactly what supervised learning is and what its capable of, before moving into more detail on the nearest neighbors algorithm.
Part two of supervised learning can be found here and covers hyperparameters, distance functions, similarity measures, logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.
If you have any feedback relating to this course, please contact us firstname.lastname@example.org.
Let's now consider the nearest neighbors algorithm for classification, and maybe say something a little more illuminating or helpful about what it means to have these distinctions in the input and how we might have a kind of grouping within certain, say age ranges or something like that. And how therefore the K nearest neighbors algorithm can, do better than even clearly regression at some times. So let's start there. So nearest neighbors for classification.
Now hopefully maybe you can see that the approach will be very similar. You know, it's a very, very simple approach. Let's think about a model of one dimension in the feature input or feature space. And let's think about a model of two dimensions. So if I have a model here and it has one input, then I'm gonna get you know, prediction based on that one input. That seems fine. Let's say x one. And I might have a model. Also, another f let's call this f one and f two, f2 let's say could have two inputs, x one and x two.
Now, so just some examples here. Let's say in the first case, we're looking at predicting with this classification of course, we should have discrete options. So for each case, let's do one as a binary and one as a multiclass. So the first one is a multiclass, just a bit of difference here. So here we're gonna say the tangent is gonna be we could do red, green, blue, we could do dog, cat, mouse, we can do as many options we wanted. Let's do some cities gives you a sense that we can expand it quite a lot. Already in countries, countries are the easiest for me to do abbreviations.
So let's do UK, France and Germany, then you can see you know dot dot dot U.S and so on. And now that's the first case. Let's say the second case, we'll just do a binary one for just for the sake of simplicity binaries is the most common variety of classification. So here we just say, you know, like and dislike, or maybe or something like that, so like. Oh, well let's do another different kind of example. Let's say fraud and not fraud. So fraud, and not fraud is a common example. And you can imagine so you can imagine, I'll give you some setups, once you decide what the features are, you can work backwards as to what kind of problem would lead to those features, right? So fraud not fraud. Let's do fraud be the positive.
Let's do fraud be the positive case and not fraud be the negative. And let's do these, let's just call these zero, one, two or something. Zero, one, two. All right, so let's think about what feature we might have in this case. We've got x one is going to be well, what would what could we be classifying into countries? We could be classifying people, we could be classifying products or into the origins of the product. Let's classify people into countries. And let's do that based on please save last traveled possibly, or something even more obscure or different, perhaps maybe there. I don't know whether we didn't do it based on age because we better than random because different countries have different mean ages.
Well, I don't let's do let's do the price of same product. So maybe price of steel or price of some good, and maybe we could tell whether it's come whether it's come from factories in the UK or Germany or something like that, based on the price of the product. So that's something, right? Just for the sake of making the example quick. And then x one and x two here, let's have fraud not fraud. Now what on what basis can we judge fraud? Let's say the day since the insurance was purchased days from purchase, right? So if you bought an insurance piece of insurance for a laptop or something, and you bought it z in a day ago, then you and you claim, then it's much more likely to be fraud than if you bought it three years ago, two days from purchase.
And number two, let's say the age of the claimant. So possibly younger people maybe, I mean I don't know, maybe it's people of certain ages are more likely to commit fraud. I don't know about that depends on the kind of kind time of talking about right. So right, let's have a look then. So we got two examples. Now when we join a classification problem, the target is we visualize that as a color rather than an axis and that's because the target is discrete. So let's have a look at the first one here. So on the first one, we only have one line, which is the x one which is the price. So there's one well I should do this a different color maybe I don't know, let's do blue.
So here we have pound it says the price which is pound sign, quite like doing blue usual signs wherever possible. So nicely abbreviated. So there's that we just have one axis. So what was what does it mean then to have different countries based on price here, we're giving color. So we could do red, green, blue or red, black blue, I suppose. So if we say the UK is red, these could be UK examples. And we say that maybe France is green, but do it in crosses just to give you a couple of differences there's gonna be France has a quite a wide price range for its products. And possibly, Germany has a very, maybe Germany is quite expensive, say and they're all quite narrowly together. So for whatever reason, you know, perhaps German manufacturing produces high quality goods and therefore we seeing high price for them, yeah. Except for the side, right. So where are we now then?
So to solve that, right, so to solve a classification problem, or is to learn some model f hat, which allows us in the future, to tell or to predict what group something will be in, if we don't know it. So let's put on a little purple dots somewhere, let's say a circle. That's gonna be our little unknown point. This is the unknown. Now what I mean by unknown, there is no way we could say unseen. There's a lot of words you've got confusing about this. But what we mean is that, when we're coming to predict it, we don't know what it is, what its target will be, but as we know it features, so it's the so it's the person or the product or whatever, for which the features are known. You know, we know what age they are or something. But we don't know what kind of whether they are committed fraud or something. Okay, so good. So it's the unknown point.
Now, if it solved this problem, what we'd be able to do is tell which group you belong to. So how do we solve it using the nearest neighbor approach? So for the nearest neighbor approach, where we just follow the title of the algorithm really, right, we find the nearest neighbor. So in this case, you could think of drawing circles of, sort of increasing size. And if we're looking for the nearest neighbor, then we're looking for one historical observation, which is the closest to the one we've seen. So maybe I'll draw them in purple maybe. So if I have my circle here, and I said it was an actual circle, then clearly that, you know, the first I include there is a green and if I go if my circle goes a little larger, because it goes to that big, then there's an argument for saying, that, you know, in this bigger circle, I begin to include, let's say German points, rather than just French ones.
Right, okay, so that gives us a launching point to talk about K nearest neighbors, which I'll preview for now, but I wanna refer back to what we've just done for a second as well. So what is K nearest neighbors? Well, in the K nearest neighbors, the K is the number of neighbors we considering. That's a quote considered neighbors, all right. Now, so in the first circle, we have one neighbor we're considering. Suppose I just consider three neighbors, so maybe K equals three, odd shade K, but whatever. Now in this inner loop we have, let's say, one neighbor. So maybe in this outer loop, we have one, two, let's say three neighbors. So, okay so how do we resolve the problem with one neighbor?
Well, it's just whichever the label is in our database for that labor for that point. And for this other one, we have to kind of take a vote if you like. So these two points sort of vote green, this point votes black and green still wins. So for most K's here, so if I go three K's, five K's, 10k's, eleven, twelve. For most of those, we're gonna see this point be classified green. Of course, it comes a point. Now let's say I make my circle this big. So it is sort of goes all the way around the entire data set, then it won't be green anymore, because it will because the number of neighbors is all points. And so what will be the K here, suppose we say K equals all points? Well, we go a few red one, two, three, four, five, six red, one, two, three, four, five, six, seven green. One, two, three, four black. Let's make them some more red. So if I say there's one, two, three, four, five, six reds, it makes seven, eight reds.
Well, now if we let each every point vote on the color, red wins because there are eight reds. So we can have this behavior in which there are, in which the prediction changes based on the number of neighbors you're considering, which kind of makes intuitive sense in that, think about just what we're doing here. I'm saying, now go and find me the person who's closest to you and predict what they have. Okay, that seems very sensitive to whichever person happened randomly to be closest to you. But then if I keep considering more and more people, I'm averaging more and more information. And so there's some point maybe where we have a reasonable number of people we're considering. But then if I start considering more and more people, eventually I can see the entire country whenever everyone votes, right? And suddenly, all distinctions wash out, right? All distinctions wash out.
So that, you know where the optimal K is, is defined by the problem and by your data set. So there isn't a general sense of not being K. But there is this behavior or this, you know intuitively when most problems there is this sort of behavior around K where as we were smaller values mean, we can just have a bit more sensitive to randomness, no incidental characteristics of our data set. And as we get to bigger and bigger and bigger value here, as we go towards the number of values in our database, then the quality of our predictions drops because we're washing out all of this local distinctiveness. So that the concerns there are around whether local, and what I mean by local is sort of like, you know, if your unknown point is here, and you got some other points along this range, then maybe the local distinctiveness, distinctiveness, the local distinctiveness this might be more important than the global or it could be vice versa.
It could be actually that, you know, that this we've zoomed in quite a lot here. And actually, if we zoom out points over here, are actually still kind of relevant to understanding what's going on with points over there. So people who are in principle quite different from you may still share your love of films or something like that. So there's the question of, you know, how proximity in the feature affects proximity in the target. Are you, if you're, you know, you can be very different in your characteristics, your features, and yet still share the same target, you know, still have the same preferences for films still have the same price, still have the same conference in country, all right?
People are very different in their ages broke off from the same country. So you can be very distinct in age and still be part of the UK. All right now. In the, let's just wrap up with just considering very quickly here, the case of two features. And I just want to bring this out because where, because it has a slightly different visualizations. In the case of two features, you're gonna draw those features on the axes x one and x two. And they're both axes now. So let's have a look at we've got days from purchasing age as what we have here so we got say days.
And let's say this is age now. So the vertical here isn't the y-axis that's a tangent, there is no access for y. So it's just a number or label. So I'm gonna put draw that visualize that as color. And on the so okay, what does this look like in two dimensions? Well, we've got fraud or not fraud. Let's draw fraud in red. So maybe closer a few days from purchase young, let's say a lot of people over here commit fraud, some older people do commit fraud, maybe far away from purchase.
And young, maybe there are a couple of instances of fraud and maybe there's one or two, two up there as well. So you can you know, this is gonna be close to purchase. This is gonna be you know young. And that's you can see there at the corner that there is a kind of commonality or there's a sense of you know, what we see in our data set is that that region is where most of the fraud is happening. And the other regions are kind of empty. So let's look. So let's put not fraud on. So let's put not fraud everywhere sort of everywhere else, right? So, and occasionally, you know, that some claims will just, you know, some claims will just happen to be close to purchase. It's very plausible to buy insurance on a laptop and break laptop very quickly. And I've done that personally.
So you know, occasionally they'll be not fraud. Okay, so, right. Okay, now what now, what do we do here? Well, again if we have an unknown point, suppose that unknown point is here, well, if you consider one neighbor, maybe this is the closest. So if I just think about increasing the distance, and drawing a circle to include one neighbor, maybe I haven't positioned a single neighbor so helpfully but if there was a neighbor there say, you know, there's the circle would go here. And if I wanna consider you know, five neighbors, I just keep increasing until I hit five. So got one, two, three, four and then I got five middle.
So that would be maybe five neighbors there. And then then it goes on to vote right. Now if I put a point over here, again, we're gonna increase the circle, that's I want to include three neighbors, and maybe this is my, you know let's try and draw it symmetrically. So maybe that's, maybe that's my three neighbors to that point there approximately, and then that might be voted red. And of course, you know, some slight variations here, are gonna cause a big difference. So if I put my point, if I draw two shapes, this is a circle then there's a triangle here.
Now if my point is this triangle, my point is this triangle, probably I'll be voted red. If my point is the circle, and considering three neighbors, probably my point will be voted green. And my even if I include one red point, it's gonna be mostly green. Now, that' a little bit sort of unintuitive if you think through that a little bit, because what it means is that you know if this is, you know, seven days from purchase, and this is eight days of purchase, it means that one day can make a very significant difference to whether we predict you are committing fraud or not.
So we know what's going on there? Well, possibly for this problem, a K equals three might be the wrong K, you know, because it's too sensitive to these local changes in our input, maybe you know, so if we considered five neighbors, then maybe this would never this small difference here wouldn't produce such a large difference in the prediction. So you know if we consider five, then over here, we would go maybe one, two, three, four, five say, and this would still be red. So let's leave our discussion there.
And I think before we move on to thinks through this approach in more detail, it's important to have a bit of applied understanding of how this works. So we'll go through the Python demo and then come back to this topic, okay, thank you.
An Overview of Supervised Learning - Nearest Neighbours Algorithm - How the K Nearest Neighbours Algorithm Works - Hyper Parameters - Part 1 - Hyper Parameters - Part 2 - Hyper Parameters - Part 3 - Distance Functions and Similarity Measures - Logistic Regression - The Method and Workflow of Machine Learning, and Evaluation - The Train-Test Split
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.