Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.
Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
So now that we've learned about a couple of algorithms, linear regression and K nearest neighbors, both capable of regression, I think we've learnt enough to evolve our understanding of the method and workflow behind machine learning, and to introduce, or expand upon, how we approach evaluation. Let's start with a recap of what we're doing. So, what's the workflow? It's fairly hollow at the moment: we obtain data somehow.
So, maybe we say capital D is our data set, and it contains some Xs and some Ys. What do we do next? Well, we put this data set into whatever approach we're using. If that approach has hyperparameters, we can try lots of values of K: we run the algorithm on the entire data set with K equals three, then again with K equals five, and we can expand on that however we want. And for each of these runs we get a model, don't we?
So we get an f hat for three and an f hat for five. Maybe one of these is the best, so let's put a little star on it. But how do we put that star on it? How do we know which one is the best? We need to think about how we're going to evaluate a model. That's the question here: evaluation. Let's explain the problem first, and then we'll come to the solution of how to evaluate a model.
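As a rough sketch of that workflow, assuming scikit-learn and a made-up data set purely for illustration, we might fit one K nearest neighbours model per hyperparameter value; the open question is how to score them:

```python
# One model per hyperparameter value, each fitted on the entire data set D.
# Hypothetical data and scikit-learn are assumed here purely for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # hypothetical features
y = np.array([10.0, 12.0, 15.0, 21.0, 30.0, 41.0])        # hypothetical targets

models = {}
for k in (3, 5):                                    # the hyperparameter values to try
    models[k] = KNeighborsRegressor(n_neighbors=k).fit(X, y)  # an f hat for each K

# The open question: which of models[3] and models[5] gets the star?
# To answer that we need some way of evaluating each model.
```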
So, evaluating a model. Here's one thing we could do. We could take all of the data we already know about, all of the Xs and all of the Ys in the historical data set, and use the algorithm, or more precisely the model, to predict the Ys for all of the Xs where we already know what the Ys were. Let me show you what I mean. Let's look at a database table: we've got X and we've got Y. Which industry shall we use? We've got retail, finance, health, sports, media, internet. Let's go for the internet, and try to predict hits based on some kind of marketing score.
The idea here is that we have pages with different SEO scores. Maybe this is a search engine optimization score for a page, a measure of how well that page is optimized. Perhaps we've got an expert in to give us a score: "this page is nine out of ten", or something like that. Or maybe we use another system, such as the page's ranking in the Google search results for the relevant term, so if you come tenth we give you a score of ten. And then we connect that score to how many hits the page gets. That's an interesting little example, so let's put some values in. Let's say we've got pages with scores of nine, eight, seven, seven, and eight, and for each a number of hits.
Let's say a higher score is better. For the nine, say we've got 1,000 hits. For the first eight let's go for 1,300, which is a little suspicious, but fine; the sevens should be lower, so say 500 and 400; and for the last eight, a little less than the nine, so let's go for 800. Okay, those are some points. Now, what we could do with our approach is obtain a model and then use those Xs to predict those Ys. We already know what the Ys are, but we can do this anyway. Let's say my algorithm is just standard linear regression and I put in my data set, which in this case is just one X and one Y. There are no important hyperparameters here; with any of these approaches there are always things you can fiddle with, but let's just say nothing important.
Now the algorithm is going to give us a model, an f hat. Let's say our f hat here, just in terms of X, multiplies X by 100 and then adds a little bit, so say 100 times X plus 10. That could be the model. A little review here: what the algorithm is actually doing is fixing the 10 and the 100, because in the Python approach the formula is just an off-the-shelf, standard function. So, how can we evaluate this? Well, as I said, we could try to predict the things we already know: we put in nine and get 910, we put in eight and get 810, and so on down the rows. These now become our prediction column.
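Here's a small sketch of that prediction column in Python, using the illustrative 100 times X plus 10 model from the board (an actual least-squares fit on these five rows would choose somewhat different coefficients):

```python
# The SEO-score / hits table, and the illustrative model f_hat(x) = 100 * x + 10
# (an actual fitted line would have different coefficients), used to predict
# the Ys for the very Xs we already have.
scores = [9, 8, 7, 7, 8]               # X: SEO scores
hits = [1000, 1300, 500, 400, 800]     # Y: observed hits

def f_hat(x):
    return 100 * x + 10                # the illustrative model from the board

predictions = [f_hat(x) for x in scores]
print(predictions)                     # [910, 810, 710, 710, 810], the prediction column
```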
And then we could get the total error for all of these predictions, couldn't we? If we talk about loss, the loss here would be the squared difference: the prediction minus the observation, squared, and it doesn't matter which order, because squaring ignores the sign. I'm not going to give the precise loss, but if we take the square root of each loss, for the sake of writing it out easily, it comes to about 100 for the first row, then about 500, about 200, about 300, and about 10, since we're basically just taking differences. We're doing pretty well on that last one. So what we could use is the total of this column, the total loss, and that could be our error: we could use the total loss as the evaluation, or the performance.
The total here is about 1,100. What does 1,100 mean, can we interpret it? Well, because we took the square root of the loss, it is interpretable: each entry is really just this value minus that value, and since the Ys are hits, the units of the column are hits. So let's take the average: we've got five rows, so divide 1,100 by five, which is about 200. What that says is that, on average, we're about 200 hits off from the data set we've already seen. Okay, so that's something.
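The same arithmetic as a short sketch, again with the illustrative model; the absolute differences are the rounded square-root-of-loss values read off the board:

```python
# Squared-error loss per row, then its square root (an absolute difference, in hits),
# and finally the total and the average of those differences.
scores = [9, 8, 7, 7, 8]
hits = [1000, 1300, 500, 400, 800]

def f_hat(x):
    return 100 * x + 10                         # illustrative model from above

abs_errors = [abs(f_hat(x) - y) for x, y in zip(scores, hits)]
print(abs_errors)                               # [90, 490, 210, 310, 10], roughly 100, 500, 200, 300, 10
print(sum(abs_errors))                          # 1110, about 1,100 in total
print(sum(abs_errors) / len(abs_errors))        # 222.0, about 200 hits off on average
```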
So what we could do here is say: here's our prediction function, and we expect it to be plus or minus 200 hits from the truth. There are several problems with that approach, though. One problem is that this number seems unlikely to be a good estimate of the error we'll get when we come to deploy the model, and that's what we're really after when we evaluate a model. I don't really want to know how good it is on the things I've already seen. Who cares about that? I already know what the Y is, I don't need to predict it, so predicting a Y here is completely irrelevant for the purposes of solving a problem; there's no problem to solve here at all.
I don't care how well I've done on the things I've already seen; I want to know how well I'm going to do on the things I haven't seen. By asking the model to predict things it's already seen, I'm giving it the best opportunity it could possibly have, so this error is really a fantasy error. In deployment I'm almost certainly going to be much further off than 200, because the model will be trying to predict in regions it hasn't seen. Let me show you what I mean; let's try to visualize this. Say the historical data is in black, and we'll draw the model in blue this time.
So there's the model. What are we actually doing with these predictions? The predictions all lie along this line. And all we're really saying is: compared to this historical data, we're within plus or minus 200 on average, so draw a band at plus 200 and minus 200, and most points fall within that range. Fine. But what if, in the deployment phase when we come to predict things, there are points over here, outside the region we've seen? There probably will be, to some degree. We're likely to see more variation in the out-sample, the data we haven't seen, than in the in-sample. It depends on how we've collected the data and where we're deploying the model.
If we collect every possible image of every possible dog in the entire world, then it's unlikely we would see greater variation in the dogs we haven't seen than in the dogs we have seen. But if we collect data about a UK supermarket, all of it in the UK, and then come to deploy in France, the behavior in France might be different, and there could be points quite distinct from the ones we've seen before. So when we come to predict things, we should expect the observations we're making to be a little different from the observations we've already made, and we need some way of getting at that.
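To make that a bit more tangible, here's a tiny sketch with a single, purely hypothetical deployment observation from a region of scores we never collected (the numbers are invented for illustration only):

```python
# The in-sample figure said "about 200 hits off on average". A hypothetical new
# page, with a score outside the range we collected, can easily sit much further
# from the model's prediction, and nothing in the in-sample figure warns us.
def f_hat(x):
    return 100 * x + 10            # the illustrative model from above

new_score, new_hits = 3, 50        # hypothetical out-of-sample observation
error = abs(f_hat(new_score) - new_hits)
print(error)                       # 260, already worse than the roughly 200 in-sample average
```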
Let's summarize that. Problems with this approach. Problem number one: the error estimate is way too good, because we haven't accounted for the variation in the out-sample being greater than what we've seen. Maybe we say: variation in the future may be unaccounted for. That's problem number one; can we do better than that? Problem number two is more severe, and it's best illustrated with K nearest neighbors. Here's an interesting set-up for K nearest neighbors: what I'm going to do is choose K equals one.
Let's use the same data set as above, and let's predict a Y hat with K nearest neighbors where K is one. I'm going to ask the model to predict the Y hat for things it's already seen. So what is it going to do? I put in the nine, and it just looks up the closest X in the database, which is that same nine, and it predicts 1,000. Maybe you can start to see the problem here. For the eight, there's a chance it predicts 800 or 1,300, but let's just say it chooses 1,300. For the seven, maybe it goes the other way.
It tries to find the closest point, and both of the sevens are equally close, so maybe it predicts 400 for one and 500 for the other, or maybe it just predicts the same value for both. And here is the problem: there's essentially no error. Why? Because what K equals one does, when we ask it to predict things it's already seen, is simply find the thing it's already seen. So what's problem number two? It's usually possible for the approach we take, for an algorithm, to produce a model that remembers all the data. Algorithms can often tune models to remember all of their input.
So we can't use that same input to estimate the error, because the model already knows everything about it. We can easily find ourselves in a situation where it appears we have an amazing solution with perfect performance, perfect accuracy or whatever, when actually it's just telling us things we already know, and when we come to use it in the future, it goes completely wrong.
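Here's a minimal sketch of that memorisation problem, assuming scikit-learn and using hypothetical, distinct scores so every row has a unique nearest neighbour (with the duplicated sevens and eights from the table above, tie-breaking would leave a small residual error, but the conclusion is the same):

```python
# A 1-nearest-neighbour regressor evaluated on the very data it was fitted to:
# it just looks each point up in its memory, so the measured error is zero.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[9], [8], [7], [6], [5]])       # distinct hypothetical SEO scores
y = np.array([1000, 1300, 500, 400, 800])     # hits

knn = KNeighborsRegressor(n_neighbors=1).fit(X, y)
y_hat = knn.predict(X)                        # predicting points the model has memorised

print(y_hat)                                  # [1000. 1300.  500.  400.  800.]
print(np.abs(y_hat - y).sum())                # 0.0: "perfect" performance, telling us nothing new
```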
Lectures
An Overview of Supervised Learning - Nearest Neighbours Algorithm - Nearest Neighbours Algorithm for Classification - How the K Nearest Neighbours Algorithm Works - Hyper Parameters - Part 1 - Hyper Parameters - Part 2 - Hyper Parameters - Part 3 - Distance Functions and Similarity Measures - Logistic Regression - The Train-Test Split
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Alongside his physics studies and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.