Cross-Validation - Part 1
1h 52m

Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.

Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.


We have looked at various considerations that come into why you would choose one algorithmic approach over another: the kind of data that's available, the prediction task that you have, and the quality of the model. I want to press on the third one in more detail. In other words, is it possible to be more systematic, or systematic at all, in comparing the quality of the predictive models that various algorithms will produce, and thereby select the best of them for the job, given some performance criteria? And the answer is yes.

The technique here is called cross-validation. I'll explain "cross" in just a second, but "validation" here is the idea of justifying, or validating, the algorithmic choice we have made. To do this, I think we need to look at some problems that can occur in the training of models, that is, in producing models by training with data, and think through how performance measures can be a guide to how well we are doing. So let's start with that. Let's take a step back, choose a specific algorithm and some specific data, and work through the lines of justification and analysis. Here, let's go for a classification task with two features, where we want to distinguish, say, fraud from not fraud.

So, two features. Say this dataset is about insurance claims: feature one is days since purchase of the insurance agreement, and feature two could be the age of the claimant or the amount of the claim. Let's go for amount, in pounds. Alongside this feature space we will have a y column, a historical y: whether they committed fraud or not. Let's do that in red and green: red for people who have committed fraud, green for people who haven't. The people who have probably committed fraud are those around zero days since purchase. If you've just purchased your insurance agreement and you make your claim, there's a certain probability that it isn't valid.

So let's put fraud down here: low amount, early days; high amount, early days. Then, in green: long term, low amount, very unlikely to be fraud, right? If you purchased an insurance agreement years ago and you're claiming, let's say, 50 pounds, maybe it's a vacuum, I don't know. Let's put some green in here again; there's probably some validity there. Of course, you have a bit of blending in between. There's going to be some fraud all across the space, and there are going to be some genuine claims everywhere as well. Now, the question is, can we distinguish between these? A naive straight-line approach would draw a line, I don't know, here, say, with fraud on one side and not fraud on the other. Rather than do that, we'll take a different approach.

Let's use the K nearest neighbors algorithm. Recall how K nearest neighbors works: for our prediction, for a point, let's put that point here in purple, we choose the K nearest neighbors in the historical dataset and predict for our unknown point what their class was, what their label was: whether they committed fraud. So if my unknown customer is in purple here, I'll go, okay, there's one fraud over here and there are two non-fraudulent claims over there. Suppose I say K equals three. Then I would label this point green: I would predict not fraud.
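
As a rough sketch of the K equals three vote just described, here is one way it could look in plain Python. The function name `knn_predict`, the Euclidean distance, and the five claims are all illustrative assumptions rather than anything from the course, and the features are left unscaled to keep the example short.

```python
import math
from collections import Counter

def knn_predict(history, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest historical points."""
    # Sort the historical (features, label) pairs by Euclidean distance to the query.
    by_distance = sorted(history, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up claims: (days since purchase, claim amount in pounds) with a known label.
history = [
    ((2, 900), "fraud"),
    ((5, 40), "fraud"),
    ((400, 50), "not fraud"),
    ((380, 120), "not fraud"),
    ((700, 60), "not fraud"),
]

# The "purple point": its three nearest neighbours are two genuine claims and one fraud.
prediction = knn_predict(history, (350, 80), k=3)
print(prediction)
```

With two genuine claims and one fraud among the three neighbours, the vote comes out as not fraud, which mirrors the green label in the example.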

Let's take that approach, and let's start with K equals one. What we're going to do for this point is find the closest neighbor in the historical dataset and predict for our unknown point what that neighbor did. Now, how do we assess the quality of the model that we get out of this? Remember that with K nearest neighbors the model is basically just the historical dataset. So the predictive model, f of x, really just has our historical x and y data put into it. To write it in English first, we're saying: select the y for which the unknown x is most like, or most similar to, one of the historical x's. In more mathematical formalism, we would say: take the argmin of the distance between the historical x and our unknown x, which is the distance or similarity measure, and then return the y for which that distance is minimal. So that's basically the model. How good is this model? Here it is just one point, and we're looking for the single closest neighbor, since K equals one.
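
The "argmin, then return the y" description can be written down almost word for word. This is only a minimal sketch: `one_nn_predict`, the variable names, and the choice of Euclidean distance as the similarity measure are assumptions made for illustration.

```python
import math

def one_nn_predict(X_hist, y_hist, x_unknown):
    """K = 1: return the label of the historical point closest to x_unknown."""
    # argmin over historical indices of the distance to the unknown point.
    nearest = min(range(len(X_hist)), key=lambda i: math.dist(X_hist[i], x_unknown))
    # Return the y for which that distance is minimal.
    return y_hist[nearest]

# Made-up historical data: (days since purchase, amount in pounds).
X_hist = [(1, 800), (3, 30), (500, 45), (650, 200)]
y_hist = ["fraud", "fraud", "not fraud", "not fraud"]

label = one_nn_predict(X_hist, y_hist, (480, 60))
print(label)
```

Notice that nothing is fitted here: the "model" is just the stored historical data plus a distance function.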

So how good is this model? Well, here's a problem in assessing the quality. Suppose I take my historical dataset and I train with it. So I put that dataset into the algorithm, we could notate it like this, I suppose, and out comes a model. That's the training phase. Now let's see what problems occur if I try to use that same dataset in validating or verifying the quality of the solution. How do I verify the quality? Well, we need to come up with some performance measure. Let's call this performance.

Now, this is domain specific, meaning that it could be a profit-and-loss measure. You might have a complicated formula that says: if you get the fraud right and it's a big claim, consider that to be great performance; if you get the fraud right and it's a low claim, don't even bother. We don't care about 10 or 20 pound claims; if they are fraudulent, we don't really care too much. We care about getting the big claims right. So it can be a very specific, balanced measure that tries to maximize the profitability of the business. What we're going to do is be naive and just say that our performance is the number of times we get it right out of the total. So, accuracy, basically.
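
To make the contrast concrete, here is a small sketch of both kinds of measure. `accuracy` is the naive count described above. `amount_weighted_score` is a hypothetical stand-in for a business-specific measure, invented here purely for illustration: it weights each correct prediction by the size of the claim. The labels and amounts are made up.

```python
def accuracy(y_true, y_pred):
    """Number of times we get it right, out of the total."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def amount_weighted_score(y_true, y_pred, amounts):
    """Hypothetical measure: share of total claim value that was classified correctly."""
    value_right = sum(a for t, p, a in zip(y_true, y_pred, amounts) if t == p)
    return value_right / sum(amounts)

y_true = ["fraud", "not fraud", "fraud", "not fraud"]
y_pred = ["fraud", "not fraud", "not fraud", "not fraud"]
amounts = [20, 50, 5000, 100]  # pounds; the one mistake is on the 5000 pound claim

acc = accuracy(y_true, y_pred)
weighted = amount_weighted_score(y_true, y_pred, amounts)
print(acc, weighted)
```

Three out of four predictions are right, so accuracy looks reasonable, but the single miss is on the big claim, so the value-weighted score is very low. That gap is the reason a business might prefer a domain-specific measure.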

So you could use accuracy, a naive measure. Basically: the number of times we say fraud or not fraud and get it right, out of the total. That's the accuracy for getting the label right. Maybe that's 99% or something. Now, what we're going to do is try to compute the accuracy of our model. How do we do that? Well, if we feed our historical dataset into the model, so this is now predicting, we get out our estimates for y, which in this case is fraud or not fraud. Now, here's the issue. If K equals one, that is to say, if the algorithm just looks into the historical dataset, finds the closest data point in terms of the features, and reports its y, whether it committed fraud or not, what's going to happen when you put in the same dataset to test it? We're just going to look up the very historical point you're giving it and report that very same point's label. Let's be very clear about this. Let's go to the visual.

The historical dataset here is labeled on the screen. If I just put in this point to be predicted, the algorithm already remembers that point. So what it is going to do is look it up and report its color, whether or not that person committed fraud. And so the accuracy of K nearest neighbors with K equals one will be 100%. The score, or the performance, we get will be 100%, because the algorithm cannot make a mistake on the historical dataset: it is just looking up that historical dataset.
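
This is easy to check for yourself. In the sketch below the labels are assigned at random, so there is nothing real to learn, and yet K equals one still scores 100% when it is evaluated on the data it was given. The data, the seed, and the helper name `one_nn_predict` are assumptions for illustration, and the result relies on no two points having identical features.

```python
import math
import random

def one_nn_predict(X_hist, y_hist, x):
    # Label of the single closest historical point (K = 1).
    nearest = min(range(len(X_hist)), key=lambda i: math.dist(X_hist[i], x))
    return y_hist[nearest]

rng = random.Random(0)
# 50 made-up claims: (days since purchase, amount in pounds).
X = [(rng.uniform(0, 1000), rng.uniform(0, 5000)) for _ in range(50)]
# Labels are pure noise, unrelated to the features.
y = [rng.choice(["fraud", "not fraud"]) for _ in X]

# "Validate" on exactly the data the model already holds.
preds = [one_nn_predict(X, y, x) for x in X]
train_accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(train_accuracy)
```

Each point is its own nearest neighbor at distance zero, so the lookup simply returns the label it was given.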

You give it the same dataset that it has already seen. It can't make a mistake: a hundred percent accuracy. Now, that's a catastrophe for us, because what we want is some kind of realistic measure of how well we're going to do. Is K equals one a good choice of model? Should we be using a different model, whether that's logistic regression, neural networks, or whatever it may be? It would appear not, because we've got a hundred percent, but clearly a hundred percent isn't right.

When we come to use this with a dataset that it hasn't seen before, that accuracy is likely to drop, possibly a long way, and we have no idea how low it's going to get. So what do we need to do? What we're going to do is perform a split on the data, but we need to take care in understanding how that splitting process is going to work. What we saw at the outset is that, methodologically, what we do at the beginning of the project, once we've prepared the data and cleaned it and so on, is split the historical dataset into a test set and a training set. And if you recall, the role the test set plays is as a last, unseen dataset that can give us an estimate of the out-of-sample performance.

What does that mean? It means that once we've arrived at the model we are going to deploy, the best possible model, we still need a dataset to give us a sense of how this model is going to perform in the real world. Now, here's the problem: if we use that test set to choose between the algorithms we're using, then that test set will have been seen. So we can't use the test set to choose between approaches, between models. That test set has to be there at the end of the project, unseen.
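
For reference, the initial split could be sketched like this. The function name, the shuffling with a fixed seed, and the 20% test fraction are all assumptions; the course does not state a proportion for the test set at this point.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold back a fraction as the unseen test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled rows are held back; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 historical claims
train, test = train_test_split(rows)
print(len(train), len(test))
```

Everything that follows, including choosing between models, should touch only `train`. The `test` rows are set aside until the very end.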

So what we're going to do is split the training set. We're going to split it into subsets; I'll say more about what I mean by that shortly. Let's say five subsets. For now it doesn't matter how many ways you split it, but let's say, for the sake of argument, we split it five ways. So we'll have a training set T, and out of it T1, T2, T3, and so on up to T5. Each of these subsets will contain 20% of our training set. What we are going to do is try the model on each of those subsets and report the accuracy just on that subset. So, to be clear, in the first split we will have 80% of the data in for training, and 20% in for testing.

Testing here we call validating; that's just to say computing the performance. So you compute the performance on the 20% the model hasn't seen, having shown it the other 80%. And then we sweep through: choose a different 80%, and a different 80%, and so on, until the algorithm has seen all of the data in training and every data point has been used in validating. We'll walk through how that works, and I'll show you the advantages and disadvantages of the approach.
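
The sweep over the five subsets can be sketched with nothing more than index slicing. The generator name `five_fold_splits` is made up for this example, and it assumes the rows are already in random order so that contiguous blocks are a fair 20% each.

```python
def five_fold_splits(n_rows, n_folds=5):
    """Yield (training indices, validation indices) for each of the folds in turn."""
    indices = list(range(n_rows))
    fold_size = n_rows // n_folds
    for f in range(n_folds):
        start = f * fold_size
        # The last fold absorbs any leftover rows when n_rows is not divisible by n_folds.
        stop = n_rows if f == n_folds - 1 else start + fold_size
        validation = indices[start:stop]               # the hidden 20%
        training = indices[:start] + indices[stop:]    # the other 80%
        yield training, validation

splits = list(five_fold_splits(100))
for training, validation in splits:
    print(len(training), len(validation), validation[0], validation[-1])
```

Across the five rounds each row lands in the validation part exactly once and in the training part four times.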

So the summary is this. What we are looking for is a method, an approach, a heuristic that is analytical, that gives us numerical results, and that gives us a way of choosing between possible approaches, models, algorithms. What we're going to do is take our training dataset, which has already been split off from the test set, and split it lots of different ways, showing 80% and hiding 20%. Then we're going to look at the performance across those different splits of our training dataset as a guide to which algorithm or approach will be the best. So now we'll move on to how we do that.
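
Putting the pieces together, the whole procedure might look roughly like the sketch below. It compares three candidate models, K nearest neighbors with K of 1, 3 and 5, by their average accuracy over five splits of the training data. Everything here is an assumption for illustration: the function names, the Euclidean distance on unscaled features, and the synthetic claims, which are generated with a toy rule (early claims are mostly fraud) plus some label noise.

```python
import math
import random
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k training points closest to x.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(X, y, k, n_folds=5):
    """Average accuracy over n_folds rounds, each hiding a different block of rows."""
    n = len(X)
    fold_size = n // n_folds
    scores = []
    for f in range(n_folds):
        start = f * fold_size
        stop = n if f == n_folds - 1 else start + fold_size
        X_val, y_val = X[start:stop], y[start:stop]                  # hidden 20%
        X_tr, y_tr = X[:start] + X[stop:], y[:start] + y[stop:]      # shown 80%
        hits = sum(knn_predict(X_tr, y_tr, x, k) == t for x, t in zip(X_val, y_val))
        scores.append(hits / len(X_val))
    return sum(scores) / len(scores)

# Synthetic training set (the separate test set is assumed to be held back already).
rng = random.Random(1)
X, y = [], []
for _ in range(100):
    days, amount = rng.uniform(0, 1000), rng.uniform(0, 5000)
    label = "fraud" if days < 200 else "not fraud"   # toy rule, not real data
    if rng.random() < 0.15:                          # flip 15% of labels as noise
        label = "not fraud" if label == "fraud" else "fraud"
    X.append((days, amount))
    y.append(label)

results = {k: cross_val_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Because every score is computed on rows the model was not shown in that round, K equals one no longer gets a free 100%, and the numbers can be used as a guide when choosing between the candidates.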

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.