Linear regression
1h 45m

To design effective machine learning, you'll need a firm grasp of the mathematics that supports it. This course is part one of the module on maths for machine learning. It will introduce you to the mathematics of machine learning, before jumping into common functions and useful algebra, the quadratic model, and logarithms and exponents. After this, we'll move on to linear regression, calculus, and notation, including how to provide a general analysis using notation.

Part two of this module can be found here and covers linear regression in multiple dimensions, interpreting data structures from the geometrical perspective of linear regression, vector subtraction, visualized vectors, matrices, and multidimensional linear regression.



- Okay, so let's talk about linear regression. Recall that this is a form of supervised machine learning. So we have the thing we're trying to predict, y, which is the target, and the thing we know, x, which is the feature. The target in the case of regression is a real number; that's what makes it regression. And the "linear" in linear regression is about the prediction function we use. The prediction function here is f hat. That's the function we use to predict y: it takes our feature x and gives us our y hat, and the prediction function has to be linear. What that means is it's only going to use multiplication by a constant and addition of a constant. Why are those operations called linear? It's because, roughly speaking, if you use only those operations, then when you visualize the result you get a straight line. So the thing we will learn will be a times x plus b. There's going to be some number a and some number b; you can multiply x by a, and you can add the constant b to it, but you can't do anything else: you can't square x, for example. In this case a and b are both going to be real numbers. But x, of course, will be an entire column of numbers, while a and b will be specific numbers. I'll show you what that means now.

So that's the general setup. Let's show what we're talking about by drawing a graph with axes, and let's use a problem as well. The classic thing here, I guess, would be predicting profit, but let's take predicting someone's grade. So we're trying to predict someone's grade, and I'll write that on there. That grade will be the y.
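To make the "multiply by a constant, add a constant" idea concrete, here is a minimal Python sketch. The values of a and b and the sample x values are made up purely for illustration and do not come from the course. It shows that equal steps in x give equal steps in the prediction, which is what a straight line looks like.

```python
def f_hat(x, a, b):
    """Linear prediction: multiply by a constant, then add a constant."""
    return a * x + b


# Hypothetical values, chosen only for illustration.
a, b = 2.0, 1.0
xs = [0.0, 1.0, 2.0, 3.0]  # a small "column" of feature values

y_hats = [f_hat(x, a, b) for x in xs]

# Differences between consecutive predictions: constant for a straight line.
steps = [later - earlier for earlier, later in zip(y_hats, y_hats[1:])]

print(y_hats)
print(steps)
```

Every entry of `steps` equals a times the spacing of the x values. Squaring x inside `f_hat` would break that property, which is why it is not allowed in a linear prediction function.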
And the x we're using here could be, let's say, the number of years in education, or their last test result, or how long they've been studying. So let's call it hours of study for a particular test. Let's draw the problem setup. The data we've observed, the historical data, the training data, let's draw that in black. So x here is going to be hours, and y here is going to be grade. Let's do y out of 100, so 100% on the exam. For x we could use total hours or hours per week; let's just put 100 on the axis for the sake of simplicity. So, 100 hours. My sense is that if you're studying 100 hours for an exam, you're probably not getting 100%; that's probably someone who's struggling a little bit. So let's say most people study around 50 hours and get pretty high marks, maybe in the 90 percents. The people who study more than that are not necessarily getting a lot of extra benefit from all that extra study. And then there are probably some people down here who are getting high marks even though they're not studying as long.

But for the sake of a simple example, most people are going to be trending downwards: the lower the number of hours, the lower the grade. And a grade of zero is a real possibility: someone who just didn't turn up to anything could in fact get zero on the exam. Okay, good. When we solve the linear regression problem, what that amounts to is drawing a straight line through this data. So let's draw it in red, and there's a solution. What is that line? What does it mean to draw it? Well, that line is the solution.
What does it mean to solve the machine learning problem? Well, it means to arrive at a b, which is the intercept, and an a, which is the slope: to have arrived somehow at an optimal a and an optimal b that give you good predictions. What I've drawn here is almost one to one, so someone who spent 50 hours gets just above 50. So maybe we say that a is about 1: a hundred hours, a hundred percent. I don't know, let's go for 0.9, so if you spent a hundred hours, you get roughly 90. And for b, let's say very few people are getting zero, so mostly we're starting from, say, 10 points or five points on the exam. Let's say b is five points, so b is a kind of minimum grade.

So let's write this prediction function out. We've got this little f hat here, which takes in an x, and the formula is 0.9, for a, times x, which is whatever we put in for a person, plus five. Let's run through a very quick example. Suppose I spent 10 hours studying for an exam: 0.9 times 10 gives me nine marks, and nine marks plus five is 14. So that's going to be 14 marks, or 14 percent. That's pretty aggressive. I guess most people are spending 50 hours, so this is like a pilot exam or some really complicated thing, maybe some university exam, or something really big that requires you to spend quite a large number of hours on it. Okay, so let's talk about how this problem gets solved. In other words, how do we find an optimal a and an optimal b? Well, we need to introduce a loss.
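The prediction function from the walkthrough can be checked with a few lines of Python. This is only a sketch: the name `f_hat` mirrors the spoken "f hat", and the constant names `A` and `B` are illustrative, holding the guessed slope 0.9 and intercept 5 from the example.

```python
A = 0.9  # slope guessed in the walkthrough
B = 5.0  # intercept guessed in the walkthrough (a kind of minimum grade)


def f_hat(x):
    """Predicted grade (out of 100) for x hours of study."""
    return A * x + B


# 10 hours of study: 0.9 * 10 = 9 marks, plus 5 = 14 marks.
grade_for_10_hours = f_hat(10)

# The same function applied to a whole column of hours.
hours = [10, 50, 100]
predicted_grades = [f_hat(h) for h in hours]

print(grade_for_10_hours)
print(predicted_grades)
```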
So as part of the setup of the problem, we say there's a loss that takes in a prediction y hat and an observation y and gives us some error. In this case the loss we will use is the mean squared error loss: for each point it's y hat minus y, squared, and the total loss is the sum of that over every point. You may even take an average if you want to, so the sum of the loss of each point divided by n; that's the "mean" in mean squared error. It doesn't really matter whether we take an average or not, because what we're doing is minimizing it. So there is a goal here: to solve the problem, we minimize the total loss, which we call L.

And how are we going to do that? Well, we've got to somehow select a different f hat. The notation here is constructed like this: you write "minimize", then the objective, and then what you minimize the objective by varying. So there are two boxes to fill in. The objective is a variable, so you can change what the objective is, and what you vary is up to you as well. So this is what we're going to change, and that's what we're going to try to minimize. Okay, so how is that going to work? Let's start by rephrasing the loss function just a little bit, so it's clearer how the prediction function f hat relates to the objective. So let's look at the loss function. The loss of y hat and y equals y hat minus y, squared. What's going on here? Well, remember that y hat is just what we get when we run our prediction function on something.
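As a rough sketch of the loss just described, the per-point squared error and its mean can be written in Python as follows. The function names and the three prediction and observation pairs are invented for illustration.

```python
def squared_error(y_hat, y):
    """Loss for a single point: (prediction - observation) squared."""
    return (y_hat - y) ** 2


def mean_squared_error(y_hats, ys):
    """Total loss: the squared errors summed, then divided by n."""
    if len(y_hats) != len(ys) or not ys:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(squared_error(p, t) for p, t in zip(y_hats, ys)) / len(ys)


# Hypothetical predictions and observed grades.
y_hats = [14.0, 50.0, 95.0]
ys = [10.0, 55.0, 90.0]

# Squared errors are 16, 25 and 25; their sum is 66 and their mean is 22.
total = sum(squared_error(p, t) for p, t in zip(y_hats, ys))
mse = mean_squared_error(y_hats, ys)

print(total, mse)
```

Dividing by n rescales every candidate's loss by the same positive factor, so the sum and the mean are minimized by the same choice of prediction function.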
Right, so the first step in rephrasing the loss is to say that it actually depends on f hat, and then it becomes clearer that by varying f hat we would change the loss. Okay? Now I want to keep going with this notation a little bit so we can make it even clearer. One thing I want to introduce is a different way of talking about the arguments of a function. So let's scoot back up and look at our prediction function. Our prediction function has two kinds of parameter, even though you can only see one in this notation. We've got the parameters a and b, and we've got the parameter x. Whenever I run this, I need to give it a particular person's hours of study, let's say 10. But then I've also got what programming code might think of as default arguments: I could have chosen anything for a and b, but they're preset to 0.9 and 5.

So f hat has x, which is just an ordinary variable, and it also has two fixed parameters, a and b, so that's a x plus b. This notation means that on the left-hand side we have the variable, which is a column of numbers that we will put in, and on the right-hand side we have the parameters. You could think of them as arguments, but we call these parameters, and they are just fixed numbers, whereas x is a variable, and we use a semicolon to distinguish between them. Now, this is just some helpful notation for the reader. It tells the reader that when you're reading this, you should treat a and b as being fixed and x as changing. Now, you could swap these around, putting a and b on this side and x on that side.
And all that would mean is that when you read the formula, you think of x as the thing which is fixed and a and b as the things which are changing. The formula itself doesn't say that anything is fixed or anything is changing; you have to interpret it, and the notation gives you a guide on how to interpret it. You can interpret it in both ways, right? Let's go back down to the loss and plug in this fuller definition. So that's the loss of the prediction function, where a and b are the things we're going to plug in, and then y. Written out again on this side, that's f hat of x, with a and b, minus y, all squared. This is still phrasing things in a general way. But we know what f hat is going to be: it's a linear function of x. So rather than leaving it like that, let's spell out the full formula for this particular linear regression problem: it's going to be a x plus b, minus y, all squared. Cool? We could keep going with that, expand the brackets and so on, but we'll leave that for now.

Okay, so there's a formula for the loss: a x plus b, minus y, squared. a x plus b is our prediction, y is our observation, and a and b are the things we can change. So we could now rewrite our minimization goal. The goal is to minimize the loss by changing f hat. How do we do that? Well, in this example, that's the same as minimizing the formula above, a x plus b minus y, squared.
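One way to mirror the two readings of the formula in code is sketched below in Python. The bare `*` in each signature plays the role of the semicolon: what comes before it is treated as the thing that varies, and what comes after it as fixed. The function names and the data are hypothetical, chosen only to illustrate the idea.

```python
def f_hat(x, *, a, b):
    """Prediction read as a function of x, with a and b held fixed."""
    return a * x + b


def loss(a, b, *, xs, ys):
    """Mean of (a*x + b - y)^2, read as a function of a and b,
    with the observed data xs and ys held fixed."""
    return sum((f_hat(x, a=a, b=b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data: hours of study and observed grades.
xs = [10.0, 50.0, 100.0]
ys = [20.0, 45.0, 90.0]

# Same data, two different choices of (a, b).
loss_guess = loss(0.9, 5.0, xs=xs, ys=ys)  # squared errors 36, 25, 25
loss_worse = loss(0.5, 5.0, xs=xs, ys=ys)  # squared errors 100, 225, 1225

print(loss_guess, loss_worse)
```

With the data fixed, changing a and b changes the loss, which is the sense in which a and b are "the things we can change".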
And how are we going to minimize that? We're going to change a and b. Note the key thing here: remember what those are, the slope and the intercept. By changing a and b, we can minimize that formula.
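To illustrate what "minimize by changing a and b" means, here is a small Python sketch that simply tries a grid of candidate slopes and intercepts and keeps the pair with the lowest loss. The brute-force search is an assumption made for illustration, not necessarily the method the course goes on to use, and the data is hypothetical: it was constructed to lie exactly on the line 0.9 x + 5 so that the expected answer is known.

```python
from itertools import product


def loss(a, b, xs, ys):
    """Mean squared error of the line a * x + b on the data (xs, ys)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data, constructed to lie on y = 0.9 * x + 5.
xs = [10.0, 30.0, 50.0, 70.0, 90.0]
ys = [14.0, 32.0, 50.0, 68.0, 86.0]

# Candidate slopes 0.0, 0.1, ..., 2.0 and intercepts 0, 1, ..., 10.
candidate_as = [i / 10 for i in range(21)]
candidate_bs = [float(j) for j in range(11)]

# Vary a and b over the grid; keep the pair with the smallest loss.
best_a, best_b = min(
    product(candidate_as, candidate_bs),
    key=lambda ab: loss(ab[0], ab[1], xs, ys),
)

print(best_a, best_b, loss(best_a, best_b, xs, ys))
```

A grid like this only finds the best of the candidates it is given, so it is a way to see the objective at work rather than a practical solver.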

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.