Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.
Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
We have just seen an example of the need for validation, in the k-nearest neighbors K equals one approach I showed you. What we saw there is that with a K equals one solution, you can't really use the historical dataset to assess performance: because the model simply remembers the historical dataset, its performance will be 100%, and that's not a helpful number.
We need an accurate understanding of how well it's going to do. And that phenomenon is actually quite common in machine learning: many algorithms have a dial, a hyperparameter like K, which you can adjust to increase the complexity of the model and its capacity to remember, and thereby diminish your ability to understand how it's going to perform out of sample, in the real world. What does that mean? It means, effectively, that you can jerry-rig machine learning algorithms to remember data, and therefore, if you evaluate on data the algorithms have already seen, you get a false sense of confidence, because they have just remembered what they've seen. So we need an unseen dataset: the validation set, which is going to allow us to choose between different approaches.
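If you'd like to see that memorisation effect in code, here is a minimal sketch using scikit-learn on an invented toy dataset (purely illustrative, not the course's own data):

# A minimal sketch of the memorisation problem: K = 1 nearest neighbors
# scored on the very data it was trained on (invented toy dataset).
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)  # K = 1: each training point is its own nearest neighbor
knn.fit(X, y)

# Evaluating on data the model has already seen gives a perfect,
# and therefore uninformative, accuracy of 1.0.
print(knn.score(X, y))

That perfect score tells us nothing about real-world performance, which is exactly why an unseen validation set is needed.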
Now, the k-nearest neighbors example was pretty good at demonstrating that point, but I think we should move to a regression problem to get clear on lots of the benefits of the cross-validation approach. So let's do that. For the regression problem, let's choose an infectious disease problem: on the horizontal axis we have T, the number of days since the first 100 cases, and on the vertical axis we have the count of the number of cases. What we're trying to do with this problem is, of course, predict the number of cases at a particular point in time T in the future. Okay, so let's look at a dataset. Here is a realistic-looking dataset.
Now the question is, what is the right curve for this? What is the right curve? The different choices of curve are the different choices of approach that we need to choose between. So we're going to use cross-validation, this technical, methodological, numerical approach, to give us a sense of how a choice of model is going to improve our performance in delivering our product objectives. What I can do here is choose, naively, a linear approach, a straight line; or I can go for a quadratic; or, let's see if I can draw this for you, a cubic, which looks like that; or a quartic, which will be a steeper U shape, I should think. Or let me go all the way up to, oh I don't know, x to the power of 16 or something.
I mean, x to the power of 16, in black here, looks something like this: it's very difficult to draw at that point, but it has lots of these little wiggles. That's about right. And what I'm doing here is increasing a hyperparameter of the model, and that hyperparameter is the leading power of x in the model. Let me just show you what I mean by that. Our model here, let's call it f_N, is a model which has a term in x to the N, plus x to the N minus one, plus x to the N minus two, all the way down to x to the one, and then plus an intercept.
Now, the linear model, the straight-line model, has x to the power of one plus the intercept, so it's ax + b, as we've seen. The quadratic model is ax² + bx + c; the cubic model is ax³ + bx² + cx + d; and so on. You can go all the way, as we did in black there, to something like x to the 12, with lots of terms all the way down to the intercept. What we're doing is dialing up the leading power of x, and that is the model complexity parameter.
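As a minimal sketch of that complexity dial in code (using numpy on an invented, roughly exponential cases-per-day dataset, purely for illustration):

# Fitting polynomials of increasing degree: the hyperparameter is the
# leading power of x (invented data, not the course's own).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
t = np.arange(30.0)                                  # days since the first 100 cases
cases = 100 * 1.15 ** t + rng.normal(0, 50, t.size)  # growth plus measurement noise

for degree in (1, 2, 4, 16):
    model = Polynomial.fit(t, cases, deg=degree)     # f_N(t) = a_N t^N + ... + a_1 t + a_0
    mse = np.mean((model(t) - cases) ** 2)
    print(degree, round(mse, 1))                     # training error keeps shrinking as the degree rises

Notice that the training error alone always favours the most complex model, which is precisely the problem.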
So we can make the model increasingly complex, and the question for us is which is the best model. As you can sort of see if I zoom in on this dataset, there are many reasons to suppose that the black line is the best. It looks mad, but on a naive approach it would seem the best. Why? Because the black line is going through more of the points across the whole space than any other line. The red line, the linear one, is far away from everything. The green line, the quartic, is a little further away from the later points. The quadratic line, in purple there, looking at this, feels like the best line, but how can we formalize that? How can I make that precise? And why does the black line appear to do so well but look so bad? Let's analyze the black line for a second.
What does the black line say? The black line is going to go off bouncing down here as well, since that depends on the power of x, but it could do that. So what is the black line saying? It's saying, "Look, if this is, say, day 10 and you go to day 12, then you have a jump in cases." You go from a hundred to, let's say, 300, but then on day 13 you go down in the number of cases. That doesn't make any sense. An infection, an epidemic, a pandemic, is not something where in two days you shoot up, in one day you shoot down, in two days you shoot up, in one day you shoot down. This sinusoidal pattern is unphysical: it doesn't match real-world infection spread. It matches our data, and it matches our data for essentially incidental reasons.
Okay, maybe on day 13 we didn't have as much testing as we did on day 12, so there's an artifact: the count appears to go down, but we know it's really going up, and the dip is just an artifact of the measurements we'd taken. We shouldn't be capturing these variations in our model, because they're not going to generalize. When I look to the future, 20 or a hundred days from now, I don't want the model bouncing around, because the real count would never really bounce around; that incidental measurement effect wasn't the phenomenon I was trying to capture. So what do all the other lines tell you? All the other lines just keep increasing, and it's perhaps only the purple line that gives you the right sort of increase for the historical data you've seen.
So the black line is unphysical, it doesn't make any sense, but it is the one that appears to be the best given the dataset we have. How can we select the purple line even though the black line appears to be the better one? Well, this is of course where the topic of cross-validation comes in. What we are going to do is show that the model which generates the black line, if you look at smaller pieces of this dataset, 80/20, 80/20, 80/20, fails across these subsets of the historical dataset. Let me just show you what I mean by that.
So here's the dataset; let's go back to this and now draw a line, a line which will put 20% on the right for validation and 80% on the left for training. Now, if we were to fit a high-degree polynomial model just to the left-hand side, it would do something like this. In other words, because it has not been fit to the validation side, when it generalizes, it generalizes completely incorrectly, so there is this huge error gap on the validation set. Immediately, immediately, this is great: what we've done is split the data into a training set and a validation set, and we've found that the validation score drops very low compared to, for example, what would happen if I tried to do the same thing with a quadratic model.
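Here is a minimal sketch of that single 80/20 split in code, holding out the most recent 20% of days for validation (again on an invented dataset, purely illustrative):

# One 80/20 split: train on the first 80% of days, validate on the last 20%
# (invented data, illustrative only).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
t = np.arange(30.0)
cases = 100 * 1.15 ** t + rng.normal(0, 50, t.size)

split = int(0.8 * t.size)                            # left 80% shown for training
t_train, t_val = t[:split], t[split:]                # right 20% hidden for validation
y_train, y_val = cases[:split], cases[split:]

for degree in (2, 16):
    model = Polynomial.fit(t_train, y_train, deg=degree)
    val_mse = np.mean((model(t_val) - y_val) ** 2)
    print(degree, round(val_mse, 1))                 # the high-degree fit tends to go badly wrong here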
So if I look at a quadratic model, let me just use the same visualization here and show it the left-hand side. What it's going to do is something like this, perhaps. Now, that's still pretty bad, right? Because it only fits the left-hand-side data, it's going to be a little way off on the validation set, but it's going to do far less badly than the high-degree model, N equals 12 or whatever. Now, that alone isn't going to satisfy us, because it might be that the high-degree polynomial just happens to do poorly in this specific range of days, or whatever range of the feature we are using. So maybe, say, the high-degree polynomial just doesn't do so well at modeling older people.
Maybe it's just that set of people, and the quadratic appears to do better, but maybe it only does better with certain people too. So we're going to systematically sweep through the full range of data, hiding and showing a different 20% each time. On one run we hide this 20%; on the next run, let me just draw a line across the chart, it's this 20%: that's run one, then run two, then another 20% for run three, right? And we can choose as many splits as we want; typically it would be between five and ten. If we choose five, it'll be 20%, 20%, 20%, 20%, 20%; with ten, it will be 10%, 10%, 10%, and so on.
And as we hide each 10%, we are showing the model the other 90%; so we hide 10%, show the rest, then move along, hide the next 10%, show the rest, and so on. The showing phase, of course, is the training phase, and the hidden piece is the validation phase, which is when we compute the performance. So we compute the performance on the hidden data and we train on the shown data. The idea is that by sweeping the validation set through all the data points, we allow the model to do best where it does best and worst where it does worst; that is to say, it can't just happen, by accident, to do particularly well on the validation phase.
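In code, that sweep is just k-fold cross-validation; here is a minimal sketch using scikit-learn's KFold (invented data and illustrative names again):

# Sweeping the hidden 20% across the whole dataset with 5-fold cross-validation.
import numpy as np
from numpy.polynomial import Polynomial
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
t = np.arange(30.0)
cases = 100 * 1.15 ** t + rng.normal(0, 50, t.size)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)             # five runs, each hiding a different 20%
for run, (train_idx, val_idx) in enumerate(kfold.split(t), start=1):
    model = Polynomial.fit(t[train_idx], cases[train_idx], deg=2)   # train on the shown 80%
    val_mse = np.mean((model(t[val_idx]) - cases[val_idx]) ** 2)    # score on the hidden 20%
    print(f"run {run}: validation MSE {val_mse:.1f}")

Each run trains on the shown portion and scores on the hidden portion, exactly as described above.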
Now, when we're doing cross-validation with Python, say, what we'll get as a result of the process is a score for each of the validation splits, let's say five of them, and we can also get a standard deviation, that is, how variable the scores were. So when we come to assess a model, let's say the quadratic model, maybe we do that in purple, we'll run cross-validation on the dataset, let's call it D, and we will get back our five scores. Let's say it's 90% on the first one, 80% on the second, then 85, 90, and 80, say. That will be for the quadratic model, where N equals two, okay.
Or whatever hyperparameter you're using. Now, if you look at the highly curvy, high-degree polynomial model, say N equals 16 or something, we might get a fluke: 90% on the first run, where it happens to do very well in a certain region, but in every other region it's going to do very poorly, so on lots of the other runs we'll be getting 70%, 20%, 60%, 50%. And then we can see that the average, which is what we will be using for the final score, is the best guide. So let's say the average for model one is about 88%, just eyeing it, and the average for model two, the N equals 16 model, you want to keep that in there, is, I don't know, 52%.
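Here is a minimal sketch of that comparison using scikit-learn's cross_val_score, with each candidate model built as a pipeline of PolynomialFeatures and LinearRegression (an illustrative setup on invented data, not necessarily the exact code used in the course):

# Comparing two candidate degrees by their cross-validation scores
# (invented data; scores are R^2 by default for regressors).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
t = np.arange(30.0).reshape(-1, 1)                        # feature matrix: days since first 100 cases
cases = 100 * 1.15 ** t.ravel() + rng.normal(0, 50, 30)

for degree in (2, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, t, cases, cv=5)       # one score per validation split
    print(degree, scores.round(2), "mean:", scores.mean().round(2), "std:", scores.std().round(2))

Whichever model has the better average across the splits is the one we would carry forward.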
Which model is the best one? Well, the one which scores best across all of the splits of the data: the high-degree polynomial model or the quadratic model? Here, N equals 2, the quadratic model, does the best. So let's now review cross-validation and its advantages, and say something about how it fits into the methodology behind supervised learning: how we get to the best model, and where cross-validation fits into that workflow.
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.