Hyper Parameters - Part 2
Difficulty: Beginner
Duration: 1h 52m
Students: 851
Ratings: 3/5
Description

Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.

Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.

If you have any feedback relating to this course, please contact us at support@cloudacademy.com.

Transcript

The long and short of it is that we get three different models, right? Call them F three, F four and F five, coming out of these algorithms. They're all basically remembering the entire data set; the only difference is how many elements you choose before you take your mode, before you find your most common choice. So the question is: if I'm choosing K, what's the right K, and how do I know it's right? That comes down to the data set and the problem. So let's think about a classification problem again, with a couple of features, and let's explore, think through, what happens as we vary K. So, now let's draw these on.
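To make the mechanics concrete, here's a minimal sketch of K nearest neighbours classification in Python: the model just remembers the training set, and a prediction is the modal label of the K closest points. The toy data and the `knn_predict` helper are illustrative assumptions, not code from the course.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Predict the modal label among the k nearest training points."""
    # Sort training points by Euclidean distance to the query point
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Take the labels of the k closest points and return the most common one
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: (days since policy purchase, age of customer) -> label
train = [((2, 19), "fraud"), ((3, 21), "fraud"), ((6, 23), "fraud"),
         ((40, 45), "not fraud"), ((55, 52), "not fraud"),
         ((30, 38), "not fraud"), ((48, 60), "not fraud")]

print(knn_predict(train, (5, 20), k=3))  # -> "fraud"
```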

So we've got two features again. Along here is the days since the policy was purchased, and up top we have the second one, which is the age of the user, or the customer I should say. And what we saw is that there was a fair amount of fraud here, say, and most of everything else wasn't fraud. That could be the setup we've got. So imagine this region is really about young people, people who are quite young. Maybe the minimum policy age is 18, so maybe this is just 18 to 25 or something. Maybe this data set happens to come from a community close to a university, and maybe around that university it's quite wealthy, say. And so, kind of circumstantially, it turns out that most of the fraud is committed by these people. But not really because they're young: it's coincidental, just because the data set happens to include non-wealthy young people, and those non-wealthy young people are at university.

So maybe the real distinction is wealth, but it's captured here by age because of the way that community works. Right, okay, in any case, here's the data set, so let's see what happens as we think through K. Let's choose an even K first, to make an initial point. Suppose I choose a K of two, or four, something of that kind. Here's the problem with those sorts of numbers. You can choose them, but suppose I put a dot here and consider its two nearest points: there isn't a consistent modal colour, a most common colour, between those two points. There's one red point and one green point. So how do you resolve that? Well, typically algorithms will resolve it randomly; they'll just pick one. Now, that isn't good for a machine learning solution. We would really prefer the solution to be deterministic.

So if I put a point there, every time I try to do the prediction, I want the outcome to be the same. We don't want randomness in our predictions; it's not helpful. The predictions might be guesses, but we don't want them to be different every time we predict. So we might program it manually to always choose red, or in other words to always predict fraud. Why would we do that? Well, maybe a fraud prediction is a tip-off to an investigator, so if the algorithm can't resolve the tie, with equal numbers on both sides, then maybe we just say: predict fraud, because it's going to be investigated by a person anyway. And maybe we want to over-count the number of cases of fraud.
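As a sketch of that deterministic tie-break, here's one way to extend the helper above so that a tied vote always resolves to "fraud". The rule itself, breaking ties towards fraud, is from the transcript; the function name and surrounding code are illustrative assumptions.

```python
def knn_predict_tiebreak(train, query, k, tie_label="fraud"):
    """Deterministic variant: if the top two labels tie, resolve to tie_label."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    counts = Counter(label for _, label in by_distance[:k]).most_common()
    # An even K can split the vote evenly; break the tie towards tie_label
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]

# With k=2, the two nearest points here have different labels: predict fraud
print(knn_predict_tiebreak(train, (20, 30), k=2))  # -> "fraud"
```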

So that could be a reasonable solution: keep an even number, and always resolve ties to fraud. That could be something. But possibly we actually choose an odd number, and if we choose an odd number then there's no issue around voting or taking the mode, because with, say, three points there's always going to be a point which resolves the tension. So if I put a point here, and the circle includes these three points, we have a red and a green, but a second green breaks the tie, and so this gets coloured green: it gets predicted green.

Right, okay, so typically speaking we're going to prefer odd numbers. It's not a necessary rule, but typically we'll prefer odd numbers. Let's just explore how odd numbers are going to work here. Let's do three, then skip over to maybe 21, and then consider larger and larger numbers. So with K equals three, what I'm going to do, to think through how K works, is predict a colour, a category, whether the person has committed fraud, for every possible pair of features. Now, there's an infinite number of possible pairs, alright? Because these are real numbers: I could have an age of 18.1 and zero days, or 18.101 and zero days. That's a very large number of numbers.

So rather than every pair of inputs, what I do is choose a granularity. I say: 18.1, 18.2, and so on through all the 18s; and for days, let's say every half a day is my granularity, so 1.0, 1.5, and so on. That gives me a massive amount of data. So I generate a data set that looks like this, with every possible point within some approximation, and then I run the algorithm, which gives me my model, and I try that model out. I say: okay, run through all of these test points, all of these hypothetical cases, and tell me what kind of person each one is. Is this a person committing fraud or not?
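Here's a minimal sketch of that grid idea, reusing the toy data above: generate every (days, age) pair at a chosen granularity and classify each one. The ranges and step sizes are illustrative assumptions.

```python
import numpy as np

# Hypothetical granularities: every half day, every tenth of a year of age
days = np.arange(0.0, 60.0, 0.5)
ages = np.arange(18.0, 60.0, 0.1)

# Every (days, age) combination, within this approximation
grid = np.array([(d, a) for d in days for a in ages])

# Ask the k=3 model about every hypothetical case
predictions = [knn_predict(train, tuple(point), k=3) for point in grid]
print(len(grid), "hypothetical points classified")
```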

So let's give that a go. If I put on some purple points here: any point I put in this region is going to be coloured red, and you can see that because if I put a point here, the three closest points are red. If you follow that line of reasoning, the points stop becoming red around here, where a point would be coloured green because two of its three neighbours are green. So rather than drawing all these thousands or millions of points on the graph, why don't we just colour the region where points are going to be predicted red? Maybe it's sort of that region there. It'll be a bit rough, so there's going to be some red here, red there, maybe red here. Oops, I've gone a little too far, but maybe we'll put some red points on that sort of justify it. And then let me colour where the green points are going to be, so let's do the green.

So that's kind of the picture that emerges. Another way of visualising it is to draw the boundary between where red and green are, so there's the boundary. And you can see here that it's got this sort of snaking feel to it. Why is that? Well, because if I consider a point here, in purple again, I do get green; if I put a point here, I actually get red. So the boundary is quite rough, because there are small variations: if I go here, I get green; if I go there, I get red, because there are more red points closer to there.
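That colouring-the-regions exercise is exactly what a decision-boundary plot does. Here's a sketch using scikit-learn's KNeighborsClassifier and matplotlib over the grid from above; the colour map and axis labels are my own choices.

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

X = np.array([point for point, _ in train])
y = np.array([1 if label == "fraud" else 0 for _, label in train])

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict over the whole grid, then colour the red and green regions
Z = model.predict(grid).reshape(len(days), len(ages))
plt.contourf(days, ages, Z.T, alpha=0.3, cmap="RdYlGn_r")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="RdYlGn_r")
plt.xlabel("Days since policy purchased")
plt.ylabel("Age of customer")
plt.show()
```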

Okay, so, fine. This is sort of what happens with K equals three, and the reason three is important here will become clear in just a second. So that's K equals three; let's increase K and see what happens. Let's think through how increasing K will change this picture, starting with some of the problematic rough areas. If we zoom in just a little bit here, we can see that if I were to increase K quite a bit, maybe my point here would be green rather than red. Why is it red now? It's red because of the red point here. Suppose there are some red points all around here.

Say there are at least three red points in this region; if I draw them, those are the three red points. Now, if I increase K to say 21 or 51 or something, I'm going to consider a much larger region containing 21 points. Let's draw that in purple. Maybe the region is this region here, and suppose it contains 21 points, most of which are going to be green. So what's going to happen to this region of red as I increase K? It's going to disappear and become green. This whole region will become green, because there aren't enough red points there to survive a K of 21.

So we get this all being green. And if we look at what's happening to this boundary, it's becoming smaller, tighter around the red, because the red is more clustered in this area. And because the green is more diffuse, spread over a larger region, this red boundary gets nice and smooth and rounded in this corner, because the local variations no longer cause the kinks. Why did we have a kink here? Because of the red point there and the green point there. But if we're considering much larger regions, then the prediction won't change between here and here; it will change much more gradually along the edge. So if I quickly redraw the sketch for a high K, it would be something like a smooth red region, with everything else green.
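To see that smoothing directly, you can fit the same kind of model with a small and a large K and compare the regions side by side. The synthetic data below, a tight fraud cluster and a diffuse non-fraud cloud, is an assumption chosen to mirror the sketch; the toy set above is too small for K = 21.

```python
rng = np.random.default_rng(0)

# A larger hypothetical data set: a tight cluster of fraud among young,
# recent policies, and non-fraud spread over a much wider region
fraud_pts = rng.normal(loc=[5, 21], scale=[3, 2], size=(40, 2))
ok_pts = rng.normal(loc=[35, 42], scale=[15, 10], size=(160, 2))
X_big = np.vstack([fraud_pts, ok_pts])
y_big = np.array([1] * 40 + [0] * 160)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [3, 21]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_big, y_big)
    Z = clf.predict(grid).reshape(len(days), len(ages))
    ax.contourf(days, ages, Z.T, alpha=0.3, cmap="RdYlGn_r")
    ax.scatter(X_big[:, 0], X_big[:, 1], c=y_big, cmap="RdYlGn_r", s=10)
    ax.set_title(f"K = {k}")  # K=3: rough, kinked boundary; K=21: smooth
plt.show()
```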

So within this region here we get a nice smooth red boundary, the edges are fairly smooth, and the green region is everywhere else, again smooth along the edges. Now the question is: which solution is the better one? Do we want this rough boundary in our solution or not? And there is no correct answer to this question, really; there's no true answer in that sense. The scientific question here, which is really the investigative question, is: why is this irregular boundary here at all? The answer to that question can help us a bit. If there are some red points in this region for good, predictable, reliable reasons, then maybe we keep it. So what does that mean? We have to think about the problem here, because the problem determines whether this is the right thing to include or not. So let's think through that.

Lectures

An Overview of Supervised Learning
Nearest Neighbours Algorithm
Nearest Neighbours Algorithm for Classification
How the K Nearest Neighbours Algorithm Works
Hyper Parameters - Part 1
Hyper Parameters - Part 3
Distance Functions and Similarity Measures
Logistic Regression
The Method and Workflow of Machine Learning, and Evaluation
The Train-Test Split

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Alongside studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.