Cross-Validation - Part 3
1h 52m

Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.

Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.

If you have any feedback relating to this course, please contact us at


We've seen the technique of cross-validation, which is this sweeping through the data set 20% at a time: training on one part, verifying on the other. And the idea is that by doing that you will be able to get a good estimate of how the model generalizes from the training data set, regardless of which part of that data set you choose. So if I train on the low part, then it generalizes to the high part. If I train on the high part, then it generalizes to the low part. It's giving us this sense of genuine generalization, not this artifact generalization.

That artifact is the illusory generalization which you get from overfitting the data. If you overfit to the random variations in the original set, then yes, you've perfectly captured the original set, but the model doesn't generalize. It's accurate to your history, but it's inaccurate to your future. So validation is this process of capturing how well the model's going to generalize. What I would like to do now is just fit that into the workflow. What is the workflow? Well, let's have a look.

Coming into our supervised learning problem here is the historical dataset. And we're going to split that into a testing set and a training set. And the idea behind that is that the testing set is gonna be this always-unseen data set that allows us to have a genuine estimate, as best we can, of how the model will perform in the real world. Now what we do with our training set is we then split this as well. Note that the original set is split more or less randomly. In other words, in the train-test split you don't just go, "Well, I'll have the first 20 rows in my test set," because the first 20 rows might have something in common with them.
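To make the random split concrete, here is a minimal sketch in plain Python. The function name `train_test_split`, the 80-20 ratio and the seed are assumptions chosen for illustration, and the rows are just stand-in integers rather than real data.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then hold out the last test_fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 historical rows
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting is what stops the test set from being "the first 20 rows", which might share something in common.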

And you want the test set to be a good representation of your training set: sampled randomly, but a good representation of it. With the validation approach, we're gonna be deterministic: 20, 20, 20, 20, and sweep through the whole thing. That's the idea. So with this one, we will have a training part and a validation part. Here it's train and val, 80-20, and we'll sweep through five ways, you see. Now in each of these splits, we put the data into the algorithm and we get a different model in each case. Algorithm one, algorithm two, algorithm three: different hyperparameters, let's say a different K, and so different outcomes, different models.
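The deterministic five-way sweep can be sketched as follows. This is only an illustration under assumed names (`k_fold_indices` is hypothetical); it works on row indices so that each contiguous 20% block takes one turn as the validation part.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (train_idx, val_idx) pairs, using each contiguous block as validation once."""
    fold_size = n_rows // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_rows if fold == n_folds - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, val_idx

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

With 10 rows and 5 folds, every row lands in exactly one validation block and in the training part of the other four.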

Now the idea is that we will choose the best one by optimizing for the validation score, the average validation score. So we'll choose the best of these, and this will be the best model. Let's say that my best model here is three. Now, here's where we take a step back. What we were doing with this validation set is not actually trying to find the perfect model. We were trying to find the best algorithm, and here's why: once we finish this process, we will take the very same algorithm. Let's say it was k-nearest neighbors, let's say it was a quadratic regression, let's say it was a neural network. And what we do is we retrain the algorithm on the training set.

So, not on only a split of it, but on the whole training set, and then verify it with the test set. So we will go, okay, three was the right approach. Let's now put the training set, the whole training set, into three, and arrive at the best model. We can call that F star, with star here meaning best. And then we will compare that model with the test set. In other words, we'll put the test set into the model. We will get out of this our predicted test Ys, so estimates for our test Ys, and compare them with our known test Ys.
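The two steps, choosing by average validation score and then retraining on the whole training set before scoring on the test set, might look like this in plain Python. Everything here is an assumption for illustration: the data is made up, the algorithm is a simple 1-D k-nearest-neighbours classifier, and the candidate values of K are arbitrary.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training rows nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(fit_rows, held_out, k):
    """Fraction of held-out rows whose label k-NN predicts correctly."""
    hits = sum(knn_predict(fit_rows, x, k) == y for x, y in held_out)
    return hits / len(held_out)

def mean_validation_score(train_set, k, n_folds=5):
    """Average accuracy over a deterministic sweep of contiguous validation folds."""
    fold_size = len(train_set) // n_folds
    scores = []
    for fold in range(n_folds):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = train_set[start:stop]
        remainder = train_set[:start] + train_set[stop:]
        scores.append(accuracy(remainder, validation, k))
    return sum(scores) / n_folds

# Made-up data: label is 1 when x > 5, with roughly 15% of labels flipped as noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    y = int(x > 5)
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

# Rows were generated in random order, so slicing acts as a random 80/20 split.
train_set, test_set = data[:80], data[80:]

# Step 1: pick the hyperparameter with the best average validation score.
candidates = [1, 5, 15]
cv_scores = {k: mean_validation_score(train_set, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Step 2: "retrain" on the whole training set (k-NN simply stores it)
# and compare predicted test Ys with known test Ys.
test_accuracy = accuracy(train_set, test_set, best_k)
print(cv_scores, best_k, test_accuracy)
```

Note that the test set is never touched while choosing K; it is used once, at the end, on the model refitted with the whole training set.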

And this comparison here will give us our general best-case performance. Alright, so what we've got here now is a model which has been trained on all of the training data and then tested on the test set, and that performance measure for the test set is then the final measure we will use for gauging the quality of that model. Now we could end there. We could say, look, F star is our best model. Typically we do not, though, because at the beginning, the training data set was 80% and the test set was, say, 20%. You can choose 70-30, whatever you choose; let's say 80-20. And since the training set was 80%, the modeling process, the production process, the algorithmic process has not seen 100% of the data.

It's only seen 80%, so that we could verify it on the other 20%. What we're going to do now, then, before we deploy, so think about it, before we actually use this thing, is retrain this algorithm on the entire historical dataset to produce a final model, which now basically has no quality measure against it. But hopefully it's a little bit better than the partially trained model, which we could put a quality measure on using the test set. The idea behind having this final model, which has seen everything, is that if you show it a hundred percent of the data, it will do a little better than if you show it only 80% of the data.

And if you recall, the only reason we hid 20% was so that we'd have some test performance measure anyway; there is no sense in deploying a partially trained model. So we will deploy a fully trained model, but we will report as its quality score the score from its partial training, and hope that it will do a little better. Now, in any case, the performance we report is usually an optimistic number. That is to say, in a real-world deployment, the real world is always a little bit more different than our training set can simulate.
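As a rough sketch of that last step, the snippet below scores a partially trained model on the test set and then refits the same algorithm on everything for deployment. The algorithm here is a simple least-squares line rather than anything from the course, and the data, the names `fit_line` and `mean_squared_error`, and the 40/10 split are all assumptions for illustration.

```python
import random

def fit_line(rows):
    """Ordinary least squares for y = a*x + b on (x, y) pairs; returns a predictor."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mean_squared_error(model, rows):
    """Average squared gap between predicted and known Ys."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Made-up historical data: y = 3x + 1 plus Gaussian noise, generated in random order.
rng = random.Random(2)
historical = []
for _ in range(50):
    x = rng.uniform(0, 10)
    historical.append((x, 3 * x + 1 + rng.gauss(0, 1)))

train_set, test_set = historical[:40], historical[40:]  # 80% / 20%

partial_model = fit_line(train_set)                           # has seen 80% of the data
reported_score = mean_squared_error(partial_model, test_set)  # the score we report

final_model = fit_line(historical)  # has seen 100%; deployed, but has no unseen data left to score it
print(reported_score)
```

The reported number belongs to `partial_model`; `final_model` is the one that ships, on the hope that seeing the extra 20% makes it a little better.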

So even by hiding 20%, we are probably not capturing the distinctive ways the real world will vary from what we have seen. So even with the cross-validation approach, you're still getting an optimistic picture of how performance will work out in the real world. Things are likely to change, things are likely to be a little bit different than you thought, and the accuracy will drop. The performance measure will drop. So all we are really doing anyway is reporting something optimistic. But the key idea here is that we first train the model on the training dataset to test it, and then retrain it on everything before we deploy.

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.