Supervised Learning - Part Two
Supervised learning is a core part of machine learning, and something you’ll probably use quite a lot. This course is part two of the module on supervised learning. It takes a look at hyperparameters, distance functions, and similarity measures. We’ll wrap up the module with logistic regression, the method and workflow of machine learning and evaluation, and the train-test split.
Part one of supervised learning can be found here and introduces you to supervised learning and the nearest neighbors algorithm.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
So we've considered linear regression and k-nearest neighbors; now let's consider logistic regression. Linear regression is a regression algorithm, and k-nearest neighbors is both a regression and a classification algorithm. Logistic regression is only a classification algorithm, which is a little bit unusual given that it is named regression. So why is it called regression? Well, it's called regression because you first perform a regression before performing a classification. I'll show you what I mean by that. Let's just remind ourselves what classification is. That's the case where our target is a discrete number, so plus one or minus one or something like that: a binary classification. If we visualize this, we can draw two features, feature one and feature two, and put some points on: fraud, not fraud, or like, dislike, however you wish to say it. To solve this problem is to come up with some model which allows you to classify a particular point into a particular group. Now one can visualize the classification boundary as a line that separates the feature space into two regions. So there's a region which is green, which you might call the positive region or whatever you like, and then a region which is red, which is the negative region. When we're dealing with classification, sometimes we call the feature space a decision space, because you're making a decision about where things are. And rather than calling the line a trend line, as in the case of regression, we call it a decision boundary. So the model here is the decision boundary, or implies a decision boundary. Now, the way the model works could be, as in k-nearest neighbors, just to give you a prediction for each point.
So the way you work out what the decision boundary is, in the case of k-nearest neighbors, is just to try every point and see what color you get, or what prediction you get. In the case of logistic regression, you actually learn the boundary itself, and then you ask: is the point above the boundary or below it? So in one case the boundary is kind of implicit. It's not there in the model; the model just says here's this color, here's that color. But in the case of logistic regression, you learn the boundary itself, and the way you get the prediction is to ask whether the point is above or below the boundary. Okay, so it's called regression because you're actually finding that line. But how does logistic regression get solved, then? Let me show you the case of a one-dimensional logistic regression. Here the model has some parameters, a and b, which we'll leave undefined for now, but the model itself has a nice formula, which is 1 / (1 + e^(-(a x + b))). This is quite interesting: the use of the parameters is linear. In other words, the parameters only multiply x and add to it, but the model is highly nonlinear. So if I visualize the model, here is x, and here is the output. Now, in this case, the output isn't going to be a prediction. I'll just call it a score, and you'll see why in a second. So we have an output s, and the shape of the model is an S shape. In general it'll have an S shape; this is a sort of zoomed-in portion, and it would continue, flattening out at the edges. So it's called a sigmoid. Sigmoid just means looking like an S. So it's a sigmoid curve.
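As a minimal sketch of the scoring function just described, the formula can be written in a few lines of Python. The function name `score` and the parameter values below are illustrative assumptions, not something given in the lecture; they are chosen only to show the S shape.

```python
import math

def score(x, a, b):
    """Sigmoid scoring function: 1 / (1 + e^-(a*x + b))."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

# Hypothetical parameters, picked only to show the shape of the curve.
a, b = 1.0, 0.0

# The score rises from near 0, passes through 0.5, and flattens out near 1.
for x in (-6, -2, 0, 2, 6):
    print(x, round(score(x, a, b), 3))
```

Note that a and b only enter through the straight-line expression a*x + b; the nonlinearity comes entirely from wrapping that expression in the sigmoid.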
Right, so all we're going to do to solve the problem using logistic regression is put points onto the surface and then have the algorithm figure out the best parameters, just like linear regression. So we just put on some points. Suppose x here is heart rate, and our target y is whether we had a good or bad sleep. We could code that as plus one and minus one, but often when we're using logistic regression we'll use one and zero, for reasons that'll become obvious in a second. So say good is one and bad is zero. Then I want you to treat this vertical axis as y, so this will be one and that will be zero, and then we'll just put the points on. Let's say we have a few observations of good over here, and then a few observations of bad, mostly up here. So, just interpreting this diagram: the vertical axis is still our y, with one and zero, and we put on the points that we see. We see some goods in our dataset for this range of heart rate. Say this starts at 30 and goes up to 60, so between 30 and 60 most people are having a good sleep. And once you get to 60 to 120, say, then you start having a bad sleep. Of course, there would probably be occasional people having a bad sleep within that lower range still, and maybe people having a good sleep in the higher range. So maybe we could put a point here also, just to say that there are some people who had a good sleep there. That would be in purple according to the colors I was using: good is purple and bad is red. So we have one feature here, which is this x, and then we have yes and no, or good and bad, as one and zero in the y. So we're using one and zero as the numbers to represent these.
And if we use one and zero, what we can do is interpret this vertical axis as a probability. Probability of what? The probability of good: the probability that we have good as our observation, given that we have a particular heart rate. So if this is zero for a given heart rate, we can interpret that as zero probability of good, that is, a bad sleep. And the higher we go along this curve, the higher the probability. So that's the logistic regression model, which is to say: let's take this binary one-and-zero data and interpret it as a probability. If we do that, then we can fit this probability curve to it, and we can interpret the height of the curve at a point as a probability. So let's say I have an unknown point, and x for that point is a heart rate of 60. The question then is, what is my prediction for y? Well, I put in 60 and I go to the curve, and this tells me that the score here is 0.7. Then I can decide what I want to do with that 0.7. The obvious interpretation is a 70% chance of y being good. So the probability of a good sleep, given that we have a heart rate of 60, is 70%, or 0.7. I might think that's quite a high chance that it was good. And if I say anything above 0.5 is a yes, so above 0.5 I predict a good sleep and below it I predict a bad sleep, well, then I would predict that it's a good sleep. So the model isn't really just this function. It's whether this is more than 0.5. The function on its own is a score.
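The step from score to prediction described above can be sketched as follows. The values of a and b are hypothetical: they are chosen here only so that a heart rate of 60 scores 0.7, matching the number read off the curve in the lecture, and a is negative so that higher heart rates give a lower probability of good sleep.

```python
import math

def score(x, a, b):
    """Sigmoid scoring function, read as the probability of a good sleep."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def predict(x, a, b, threshold=0.5):
    """Return 1 (good sleep) if the score reaches the threshold, else 0 (bad sleep)."""
    return 1 if score(x, a, b) >= threshold else 0

# Hypothetical parameters, constructed so that score(60) is 0.7.
a = -0.1
b = 6.0 + math.log(0.7 / 0.3)

print(score(60, a, b))    # probability of good sleep at a heart rate of 60
print(predict(60, a, b))  # score is above 0.5, so the prediction is 1 (good)
print(predict(90, a, b))  # score is below 0.5, so the prediction is 0 (bad)
```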
So we think of this formula as being the score, or the probability score, you could say. The model is whether the score is above 0.5. If it's above or equal to 0.5, we predict good. There are several things you could do mathematically here to give some notation to this. One reasonable thing is to say: good if the score is at least 0.5, bad sleep otherwise. That would be perfectly fine as a definition. Another thing we could do is take the sign of the score minus 0.5. That would be positive if the score was above 0.5 and negative if it was below, which gives us a plus or minus one. You could do it that way as well. Right. So this is another approach to classification, which we'll review in just a second. We'll look at logistic regression in a statistical context later on, when we consider statistics. For now, I think it's just going to be another algorithm in our toolbox as a way of solving the classification problem. So, to try and finish off with some intuition behind this: again, the idea is that we interpret this y as a kind of probability. We saw someone who certainly had a good sleep, certainly had a good sleep, certainly had a good sleep, and then certainly didn't, certainly didn't, certainly didn't. Then we come up with this formula for the curve, and we fit the curve to the dataset. And then wherever the curve tells us the halfway point is, that's where we predict yes or no. Let me just go through that once more. If I have some feature, and I have some points that come out at one on this axis and some points that come out at zero, what I do is find a curve that goes between those two sets of points, and then I find the halfway point. So there's 0.5.
I sort of draw my boundary here, and then you think of that as the actual model. So the model itself, f hat, is actually this red line, because that's the point at which the dataset splits: above 50% probability on one side, below 50% probability on the other. So the red line is the model. This thing here, the black line, is the scoring function. That's the score, and then this is the model. The model here, quite interestingly, is actually a linear model. It's just a straight line. It's just saying that for things above I predict the positive case, and below I predict the negative case. Now, logistic regression has several virtues, one of which is that it's nice and simple. But you can also start to tune where you place your tolerance for how probable you think something needs to be before you classify it one way or another. You can tune that and have some control over how the algorithm performs under different tolerances to risk. Maybe this is a loan, and you'd want to be, let's say, 95% sure that someone wouldn't default on your loan before you gave it to them. We're going to consider that tuning of logistic regression in a section about model tuning, but for now I'll just observe that it's here. And just to be precise about the solution: this is solved the same way as linear regression, in the sense that the model here is the scoring function plus the tolerance, but the scoring function has these parameters a and b, and all we're doing is finding a and b. With linear regression, you could interpret that as just drawing lots of straight lines and choosing the best; here you could think of it as drawing lots of S curves and choosing the best. You'd draw lots and lots of curves and choose the one that happens to fit these data points the best. But we'll start at random.
And then we'll tune the position of the curve using the gradient of the loss, with the loss as a kind of guide to whether our curve is good or bad. And then we settle at a point where it fits most of the data well, and that will be our scoring function. Then we take the output of that function and compare it with whatever probability threshold we're interested in, and that will give us our prediction: yes or no, good or bad, up or down. In two dimensions, we draw two straight lines. So, two dimensions, x one and x two. Let's say we have some fraud down here and some not fraud up there. What we're hoping to get with the model is a cutoff point of 50% probability in x one and a cutoff point in x two. Then we get a straight line going across there and a straight line going across there. And we say, if you're below this line and below that line, then we predict fraud, and if you're above this line and above that line, we predict not fraud. So we would be able to split the surface into two using two straight lines like that. And we would learn an S curve here. You can see the S curve in x one kind of does that, and the S curve in x two is sort of rotated a bit. Let me draw it in purple. It's kind of in reverse, like that-ish. The split points will be here and there, and so you get this region here and that region there. But the takeaway from this 2D case is really just to emphasize that the model itself is a straight line. So it's actually a linear model. The model itself is just a straight-line cut in the surface, separating it out into two square regions. But the scoring function is nonlinear.
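Returning to the one-dimensional fitting procedure described at the start of this passage (start with random parameters, then adjust them using the gradient of the loss), here is a minimal sketch. Several things are assumptions rather than details from the lecture: the toy heart-rate data is made up, the loss is taken to be the standard log loss, the rescaling of heart rate is an implementation choice to keep the steps stable, and the step count and learning rate are arbitrary.

```python
import math
import random

def score(x, a, b):
    """Sigmoid scoring function: 1 / (1 + e^-(a*x + b))."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def log_loss(xs, ys, a, b):
    """Average log loss; lower means the curve fits the 1/0 labels better."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(score(x, a, b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit(xs, ys, steps=5000, learning_rate=0.1, seed=0):
    """Start from random a and b, then follow the gradient of the loss downhill."""
    rng = random.Random(seed)
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    start = (a, b)
    n = len(xs)
    for _ in range(steps):
        errors = [score(x, a, b) - y for x, y in zip(xs, ys)]
        grad_a = sum(e * x for e, x in zip(errors, xs)) / n  # d(loss)/da
        grad_b = sum(errors) / n                             # d(loss)/db
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return start, (a, b)

# Made-up toy data (not from the lecture): heart rate and sleep quality (1 = good, 0 = bad).
heart_rates = [35, 42, 48, 55, 58, 65, 72, 80, 95, 110, 52, 88]
good_sleep = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

def rescale(hr):
    """Centre and shrink heart rate so gradient steps stay well behaved (an assumption)."""
    return (hr - 65.0) / 20.0

xs = [rescale(hr) for hr in heart_rates]
(a0, b0), (a_fit, b_fit) = fit(xs, good_sleep)

print("loss before:", log_loss(xs, good_sleep, a0, b0))
print("loss after: ", log_loss(xs, good_sleep, a_fit, b_fit))
print("score at 40 bpm: ", score(rescale(40), a_fit, b_fit))
print("score at 100 bpm:", score(rescale(100), a_fit, b_fit))
```

The fitted curve is then used exactly as before: take its output for a new heart rate and compare it with the chosen probability threshold.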
So this lets us pick up on kind of nonlinear behavior in the scoring function. And then we can combine multiple regressions, so one logistic regression next to another, for x one, x two, and even x four, x five, x six, to build these kinds of nonlinear boundaries. So it isn't that the boundary itself is a single straight line. It's got this big bend in it, this kink, which isn't a straight line, because we're composing together solutions for different variables. So that's an interesting idea. I think we'll wrap up there on logistic regression. We'll consider it more in the tuning section, and more in the statistics section, where it's a more appropriate kind of model. It's slightly more mathematical, and it needs a bit more detailed mathematical analysis to really understand the complexities behind it, how we can tune it, and the implications there. For now, let's see it as an example of another machine learning algorithm, this time just for classification.
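As a small preview of the tolerance idea raised earlier with the loan example, the sketch below shows how the same score can lead to different decisions depending on where the tolerance is set. Everything here is hypothetical: the parameters, the input value, and the reading of the score as the probability that a loan is repaid.

```python
import math

def score(x, a, b):
    """Sigmoid scoring function: 1 / (1 + e^-(a*x + b))."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

def approve(x, a, b, tolerance=0.5):
    """Approve only if the score (read as probability of repayment) reaches the tolerance."""
    return score(x, a, b) >= tolerance

# Hypothetical parameters and applicant feature value.
a, b = 1.0, 0.0
applicant = 2.0  # scores roughly 0.88

print(approve(applicant, a, b, tolerance=0.5))   # True: 0.88 clears a 50% bar
print(approve(applicant, a, b, tolerance=0.95))  # False: 0.88 does not clear a 95% bar
```

The fitted parameters a and b stay the same in both calls; only the tolerance changes.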
An Overview of Supervised Learning - Nearest Neighbours Algorithm - Nearest Neighbours Algorithm for Classification - How the K Nearest Neighbours Algorithm Works - Hyper Parameters - Part 1 - Hyper Parameters - Part 2 - Hyper Parameters - Part 3 - Distance Functions and Similarity Measures - The Method and Workflow of Machine Learning, and Evaluation - The Train-Test Split
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.