The Train-Test Split

Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.

Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.



So to solve these problems, we change the initial step in our approach. If you recall, step one was just to take the data that we have over here and put it into some algorithm or some approach. We're gonna make a modification here, so let's have a look at that. What we're gonna do is not show it all the data. Okay, so maybe you can see already how that's going to help. So we've taken our data set, x and y. What we're gonna do here is just split this into two data sets: a data set for training, and a data set for testing. And the training data set is all that we will actually show to the algorithm.

The algorithm never sees the testing set. Testing goes off in another direction. So to explain what's gonna happen here: with the training set, we will obtain a model, and then with the testing set we will ask this model to predict everything in the testing set. These are things that we already know, but because the algorithm hasn't seen them before, it isn't susceptible to the "I've remembered everything" error. So it's as if it's predicting in the real deployment phase, the phase where we know the features but we don't know the target, 'cause it hasn't seen the target, hasn't trained on it. So we're gonna solve that problem, and if we choose the testing set well, maybe if we choose it at random, then maybe we can capture the bulk of the variation in that testing set, which the algorithm can then give us a better estimate for. If the algorithm is trying to predict for unseen things, that's gonna give a more realistic error than when it's predicting for things it's already seen.

Okay, so we split this, and maybe we say we take 20% on this side and 80% on this side, possibly at random. Now, this splitting approach is a practitioner thing. Really, this whole thing is part of the approach now; you can think of it as a phase of the approach. And it comes down to some considerations as to the data set and your problem. If you have a lot of data, maybe you can say, "Well, I'll just test on 20%." If you don't have a lot of data, you may need to increase that percentage, because the amount of data in your test set needs to be enough that you're getting a high-quality evaluation.
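As a rough sketch of the random 80/20 split just described, here is a small standard-library version. The function name `split_train_test`, the toy data and the seed are assumptions made for illustration, not something from the lecture; in practice a library helper such as scikit-learn's `train_test_split` would typically do this job.

```python
import random

def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Randomly split paired features x and targets y into train and test parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # shuffle so the split is random
    n_test = max(1, round(len(x) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical toy data: 10 feature values and their targets.
x = [float(i) for i in range(10)]
y = [2.0 * v + 1.0 for v in x]

x_train, x_test, y_train, y_test = split_train_test(x, y, test_fraction=0.2, seed=42)
print(len(x_train), len(x_test))
```

Shuffling the indices rather than the values keeps each feature paired with its own target, and fixing the seed makes the split repeatable.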

Now, in a sense, it may not matter too much what this split is because, to preview the end of the process, when we're done we will actually retrain the model. So we will produce a different model from the one we're gonna evaluate, and that one we will show the entire data set to. So there's this intermediate phase where the model has been tuned with a smaller amount of data, and that's so we can evaluate it. But since we have all of this extra data, we're actually just gonna retrain it so that it can be better informed. We won't have an evaluation score for the one that's seen everything; we just sort of hope that it's come out better, which it probably has if it's seen more data. We have the evaluation score on the less well-trained one, but it still gives us some insight into how things are gonna perform in the future.

Right, okay, so there's step one, and we could call this step two, for splitting. And then we've got step three here, for finding the model. That's probably a step five, and step four there would be evaluating the model. So how do we evaluate the model? Well, training comes out with a model, an f hat of x with whatever parameters and things, then we put the testing set into that model, and then we score on that. Score or evaluate. And we can use the same formula, if we want, that we were using for our loss, so we could use a squared error formula or something like that, or an accuracy formula for classification. We're gonna go into the detail around scoring in full in the section on model selection, and in statistics as well, but for now I wanna give us a general workflow around this rather than the detail behind each step.
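To make the "find the model, then score it on the testing set" steps concrete, here is a minimal sketch using a one-nearest-neighbour classifier and an accuracy score. The data values, labels and function names are hypothetical, chosen only to illustrate the idea.

```python
def fit_nearest_neighbour(x_train, y_train):
    """'Training' for 1-nearest-neighbour: remember the training pairs."""
    pairs = list(zip(x_train, y_train))

    def f_hat(x_new):
        # Predict the label of the closest training point.
        nearest_x, nearest_y = min(pairs, key=lambda p: abs(p[0] - x_new))
        return nearest_y

    return f_hat

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data, already split: the model only ever sees the training part.
x_train = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
y_train = ["a", "a", "a", "b", "b", "b"]
x_test = [1.2, 3.8, 6.8]
y_test = ["a", "b", "b"]

f_hat = fit_nearest_neighbour(x_train, y_train)

train_score = accuracy(y_train, [f_hat(x) for x in x_train])
test_score = accuracy(y_test, [f_hat(x) for x in x_test])
print(train_score, test_score)
```

Because this model simply remembers its training data, its score on the training set is perfect. Only the score on the unseen testing set says anything about how it behaves on new entries.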

So okay, what does this look like in terms of data sets? Well, it's a relatively simple difference we're making. If this is our incoming data set, and here are our entries, then 80% will be for training, 20% will be for testing, and then we predict only on the 20%. And then if we have some scoring system, we won't call it loss, 'cause loss is a feature of training. It's how the algorithm tunes the parameters: the loss is where you vary the parameters, keeping the whole data set fixed, and find the best model. We'll call it a score because we're not varying any parameters or anything anymore; we're not computing loss on a model.

We're just evaluating it. It can be the same formula, but it's a very different idea and a different aspect of the approach, where we're not changing anything. So then we're just gonna score on these entries here. So if this was just y minus y hat for each entry, we could square it, take the total, and then maybe take the square root of that total, just like we did before; the square root of that total can be the score that we use. Right, so a few remarks before we look at the practical side of this. One remark is that when you make this split between training and testing, the important thing, and it perhaps hasn't been emphasized enough in this diagram, is that there's a very big wall between your training side of things and your testing side of things. So let's make that as clear as we can.
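The score described above (take y minus y hat for each test entry, square it, total it, then take the square root) can be sketched as follows. The function name and the numbers are illustrative assumptions. Note that the more common root mean squared error also divides the total by the number of entries before taking the root.

```python
import math

def root_total_squared_error(y_true, y_pred):
    """Square each residual y - y_hat, total them, then take the square root."""
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total)

# Hypothetical known targets for the test entries and the model's predictions.
y_test = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

score = root_total_squared_error(y_test, y_pred)
print(score)
```

Nothing is tuned here: the predictions are fixed, and the function only measures how far they are from the known targets.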

So if the data's coming in and you do what's called a "test-train split" or a "train-test split", then really everything that happens on this training side should continue as usual, and testing only comes in when you're done. So let's say here we could write "done" on all of our approaches and all of that sort of thing. When we have what we think is the best approach, the best model, only then do we actually come and test it. And the point there is that we must ensure that the algorithms we use have never seen the test data. That's very, very important, 'cause if they see any aspect of it, then we're gonna have real trouble getting a reliable estimate for that out-of-sample error.

So the goal here is to get that out-of-sample error, right. It's that sense of, "Well, there's the in, which we have seen, and there's the out, which we have yet to see," and we somehow need to estimate that. And we're not gonna get a good estimate of that if there's any sense in which the model has seen the test data before. So the general approach here might be to take this split and save the test data to a completely different area, either a different table in the database or a different file completely, at the very, very beginning of the project. Then only at the end of all the training activity do you come back, take that test data, and use it.
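One way to picture "save the test data to a different file at the very beginning" is the sketch below. The file names, the toy rows and the use of a temporary directory are assumptions for illustration; a real project would write the hold-out file to wherever it keeps protected data.

```python
import csv
import random
import tempfile
from pathlib import Path

def hold_out_test_set(rows, train_path, test_path, test_fraction=0.2, seed=0):
    """Shuffle rows once, then write train and test rows to separate CSV files."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    parts = ((test_path, shuffled[:n_test]), (train_path, shuffled[n_test:]))
    for path, part in parts:
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(part)
    return len(shuffled) - n_test, n_test

# Hypothetical rows of (feature, target); a temporary directory stands in for real storage.
rows = [(i, 2 * i + 1) for i in range(20)]
workdir = Path(tempfile.mkdtemp())
train_path = workdir / "train.csv"
test_path = workdir / "test_holdout.csv"

n_train, n_test = hold_out_test_set(rows, train_path, test_path)
print(n_train, n_test)
```

All training activity would then read only `train.csv`, and the hold-out file would be opened once, at evaluation time.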

So that test data goes away at the very beginning and can't be used, so that you're quite certain, when you come to evaluate, that your evaluation is right. Now of course, when you are done, when the evaluation is out of the way, so you've gone down here and you've come back and you've evaluated, at this point we move on to talk about deployment. And between the evaluation step and the deployment step, you're going to take your best approach, show it everything, and then use that model to predict. Alright, so you're gonna redo the training. So there's a few little moving parts here, and I think this is mostly a practical conversation.
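The evaluate-then-retrain step between evaluation and deployment might look like the following sketch. It uses a simple least-squares line as a stand-in for "your best approach"; the data and names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data, already split into training and testing parts.
x_train = [0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 9.0]
y_train = [1.1, 2.9, 5.2, 7.0, 10.8, 13.1, 17.0, 18.9]
x_test = [4.0, 7.0]
y_test = [9.1, 14.8]

# Evaluation: this model has seen only the training data.
eval_model = fit_line(x_train, y_train)
test_score = mean_squared_error(y_test, [eval_model(x) for x in x_test])

# Deployment: the same approach retrained on everything.
# There is no held-out score for this model; test_score is our estimate.
final_model = fit_line(x_train + x_test, y_train + y_test)
prediction = final_model(10.0)
print(test_score, prediction)
```

The score belongs to `eval_model`, while predictions in deployment come from `final_model`, which is the trade-off described above.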

So hopefully, when you see the practical side of this, we can look at the steps in detail, and we are gonna come back to concerns around the workflow in a whole section on that. What I wanna bring out now, in this supervised learning section, is just this idea that there is this split occurring, there's this evaluation step occurring, and there's this workflow around taking data, splitting data, training on data, and then testing with data, which has to occur as part of the solution to the problem so that we have a good estimate for what the error is.


An Overview of Supervised Learning - Nearest Neighbours Algorithm - Nearest Neighbours Algorithm for Classification - How the K Nearest Neighbours Algorithm Works - Hyper Parameters - Part 1 - Hyper Parameters - Part 2 - Hyper Parameters - Part 3 - Distance Functions and Similarity Measures - Logistic Regression - The Method and Workflow of Machine Learning, and Evaluation

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.