1h 45m

To design effective machine learning, you’ll need a firm grasp of the mathematics that supports it. This course is part one of the module on maths for machine learning. It will introduce you to the mathematics of machine learning, before jumping into common functions and useful algebra, the quadratic model, and logarithms and exponents. After this, we’ll move on to linear regression, calculus, and notation, including how to provide a general analysis using notation.

Part two of this module can be found here and covers linear regression in multiple dimensions, interpreting data structures from the geometrical perspective of linear regression, vector subtraction, visualized vectors, matrices, and multidimensional linear regression.

If you have any feedback relating to this course, please contact us at


- So now we're ready to move on to calculus, a key area of mathematics that you need to get to grips with to understand machine learning. What's calculus about? It's about change, and especially about rates of change. How does this come in? What's a simple way into it? Well, as we saw with the update rule earlier, if we want to update some parameter, let's say A, and get a better one, we could take our existing A, which maybe we start at random, and subtract from it some change: some change in the loss when we vary A. It's this part here that we need calculus to understand. Where do we get this rule from? What is this rule all about? How does it work in principle?

To answer that, we have to take a step back and think about changing things in general. To get to grips with calculus, this analysis of the rates of change of functions, let's consider a function. Let's choose X squared, so F of X equals X squared. Let's put some sample points into that and see what happens. If we put in zero, one, two, three, four as our inputs, what's the output? Zero goes to zero, that's fine. One goes to one. Two goes to four. Three squared is nine. And four squared, that's 16. So a little preview of the issue here: if I'm changing the input by one step each time, what does the output change by? It's obviously not the same each time. The amount it's changing by is itself changing. Zero goes to one, so that's a change of one. One goes to four, so that appears to be three.
Four goes to nine, so that appears to be five. And 16 minus nine is seven. So there's something interesting going on with these steps: they're themselves increasing by two each time, aren't they? These are the kinds of issues we're interested in: the rate of change.

Before we get to that point, though, let's just diagram X squared. Put some axes on. Let's say this is zero. We'll call the horizontal axis X, and the vertical axis F of X, the output of the function. On one axis we need one, two, three, four, and on the other we need to get to 16. So let's mark off the vertical axis: one, two, three, and so on up to 12. Let's leave it there. And then a similar thing on the horizontal, same size: one, two, three, four. Now put the points on. It's pretty steep. So zero, zero. Then one, one. Then two, four, so we see a big jump here. Then three, nine. And then four, 16, which is sort of off up here somewhere. Let's just try drawing this freehand; it's a little tight on the diagram. It comes down like that on the other side, and goes up in a curve like that. That's all I'll do for now. It's all a bit shifted, but whatever. That gives us something to work with.

So let's think about the question we have. The question in calculus is: I'm making steps in this direction, I'm varying the input of the function, so what's happening to the output? What's happening on this side? Well, it's not steady. It's not constant, is it? It's not changing by the same amount. There's a change here of one. Then we're climbing vertically here by three.
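The step-by-step changes read off the table can be checked with a few lines of Python. This is only a minimal sketch and not part of the lecture itself; the names `f`, `first_differences` and `second_differences` are made up for illustration.

```python
def f(x):
    """The example function from the lecture: f(x) = x squared."""
    return x ** 2


inputs = [0, 1, 2, 3, 4]
outputs = [f(x) for x in inputs]

# Change in the output for each +1 step in the input.
first_differences = [b - a for a, b in zip(outputs, outputs[1:])]

# Change in those changes: how quickly the steps themselves grow.
second_differences = [
    b - a for a, b in zip(first_differences, first_differences[1:])
]

print(outputs)             # [0, 1, 4, 9, 16]
print(first_differences)   # [1, 3, 5, 7]
print(second_differences)  # [2, 2, 2]
```

Each input step is the same size (plus one), yet the output steps keep growing, and they grow by a constant two.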
Then we're climbing vertically by five, and then by seven. So the climbs go one, three, five, seven. Each one is bigger than the last, and it's actually increasing by the same amount each time: it's doing a plus two. It's a plus two each time, but the climb itself is getting bigger and bigger. One, three, five, seven. Right, so that's how much it's increasing vertically. And notice that for each one of those vertical increases, we're making the same movement horizontally. We're just doing a plus one here, a plus one, a plus one.

Okay. So before we try to give this some formalism, let's recontextualize this in machine learning. What questions are we asking about change? Well, actually we're asking questions about the loss. With linear regression, as we mentioned earlier, the formula for the loss is the squared error. It's a loss where we change A and B and hold X and Y fixed. So we hold the data points fixed, but we change A and B. The formula is AX plus B, which gives us our prediction, minus Y, which tells us how far away we are from the truth. And then we square it. If you squint at that formula, you can see that as we change A, it's going to give you an A squared term. So if I diagram that with A along here, and call this L for loss, that'll be L of A as the output. Now choose some figures for B, X and Y. These are going to be fixed numbers, because we're varying A and holding the data point fixed. Let's suppose I put in five for X, and as for the Y at this point, well, let's say five was the age of the user, or something: five years old.
And how much money do we make selling sweets to this person? Well, we made ten pounds selling sweets to this person, so that's Y, and it'll be fixed. We'll hold B fixed too; let's choose zero to make it simple. So that gives us A times five, plus zero, minus 10, and we square it. A is the thing we're going to vary. Let's say the range for A here is zero to ten. Let's try putting in two: if I put two in for A, that'll be two times five minus 10, squared, so the output of the loss here is zero. What you'd see is that if you increase A in this formula, holding five, ten and zero fixed, the loss goes up quadratically, like the shape we had above. And on decreasing it, you'd get the same on the other side, so you'd have this kind of U-shaped curve, as we had above.

So the question for optimization, for getting better and better values of A, is this: suppose I start at random. Here's my random choice of A, and it gives me this loss, which is a bad loss. Is there something I can do to A to move it down to a point where I get the best I can ever get, the best answer? And look, suppose this is the random start. There are these little updates, and here we're going negative in A, so we're decreasing A. We subtract something: let's say minus five, and minus seven, minus nine, rapidly decreasing A. And as we get closer, we'd hope that these changes would get smaller and smaller, to the point where we hit the best one, and then we stop changing. If we zoom in to the point where we stop, what we hope is that around this area, if we step too far, we come back, and at some point we hit this best point, and then there's no need to change, because the loss isn't changing.
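To see the U-shape for these particular numbers, the loss can be evaluated at a few values of A. This is a rough sketch rather than anything from the lecture; the function name `loss` and its default arguments (X of five, Y of ten, B of zero) are just the figures chosen above, written as code.

```python
def loss(a, x=5.0, y=10.0, b=0.0):
    """Squared error for a single data point: (a*x + b - y) squared."""
    return (a * x + b - y) ** 2


# Vary A while holding X, Y and B fixed.
losses = {a: loss(a) for a in range(0, 6)}

for a, value in losses.items():
    print(a, value)
# 0 100.0
# 1 25.0
# 2 0.0
# 3 25.0
# 4 100.0
# 5 225.0
```

The loss falls to zero at A equals two, where the prediction two times five matches the ten pounds, and rises quadratically on either side.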
So look at the very bottom of this U-shape. If we draw the U-shape as a really big thing here, and I'm right at that very best value of A, then you can see, if I draw it carefully through that point and zoom in even further, that if I move a little higher in A or a little lower in A, the loss isn't changing. It's flat at this point. That gives us our stopping condition. It says: when the loss isn't changing, our parameters are basically not being updated. We're not going to move A around at all here, because the loss isn't changing. A is updated with some percentage of this change, so if there's no change, there's nothing to update it with. So we would say: if the old A is approximately equal to the new A, then stop. That would be a good condition to stop on, because then we'd have the best A.

So that's the background to this. The function we're interested in analyzing in this way is the loss, so we'll be looking at the rate of change of the loss. Okay, so when we come back, we'll do the formalism behind it, give you some more notation, and get further into calculus.
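The update rule and the stopping condition can be put together in a short Python sketch. Several things here are assumptions made purely for illustration and are not from the lecture: the change in the loss is estimated with a small finite-difference step `h`, since the formal derivative has not been introduced yet; `learning_rate` stands in for "some percentage of this change"; `tolerance` decides what counts as approximately equal; and the start is fixed at nine rather than random so the run is repeatable.

```python
def loss(a, x=5.0, y=10.0, b=0.0):
    """Squared error for a single data point: (a*x + b - y) squared."""
    return (a * x + b - y) ** 2


def change_in_loss(a, h=1e-4):
    """Estimate how fast the loss changes as A varies (central difference)."""
    return (loss(a + h) - loss(a - h)) / (2 * h)


def fit(a_start, learning_rate=0.01, tolerance=1e-8, max_steps=10_000):
    """Repeat: new A = old A minus a fraction of the change in the loss."""
    a_old = a_start
    for step in range(max_steps):
        a_new = a_old - learning_rate * change_in_loss(a_old)
        # Stopping condition: the old A is approximately equal to the new A.
        if abs(a_new - a_old) < tolerance:
            return a_new, step
        a_old = a_new
    return a_old, max_steps


best_a, steps_taken = fit(a_start=9.0)
print(best_a, steps_taken)  # best_a ends up very close to 2
```

With these settings each update halves the distance to the best A, so the updates shrink as the curve flattens, and the loop stops once A is no longer really moving.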

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.