Providing a general analysis using general mathematical notation - Part 2
1h 45m

To design effective machine learning, you’ll need a firm grasp of the mathematics that supports it. This course is part one of the module on maths for machine learning. It will introduce you to the mathematics of machine learning, before jumping into common functions and useful algebra, the quadratic model, and logarithms and exponents. After this, we’ll move on to linear regression, calculus, and notation, including how to provide a general analysis using notation.

Part two of this module can be found here and covers linear regression in multiple dimensions, interpreting data structures from the geometrical perspective of linear regression, vector subtraction, visualized vectors, matrices, and multidimensional linear regression.

If you have any feedback relating to this course, please contact us at


So let's just wrap up and give you some notation to go away with. Rather than use this formula as it stands, we're going to use it in a special way, so let's rewrite it and summarize what we're going to do.

We say that the rate of change at a point has a slightly weird notation: d by dx. You can read this as one block, or one kind of operation, and it is applied to a function. Sometimes I'll use three equal signs to mean "is defined as"; we could even write "definition" on there, but I'll just use three equal signs. So d by dx of f is defined to be the limit of this formula, the change in f divided by the difference, as we dial the difference closer and closer to zero. Evaluating this formula can be tricky in different circumstances, but the limit just means whatever happens to the formula of interest as you decrease the difference down to zero.

Let me give you an example with linear regression. We can just plug in the model we have and take the limit. What's the model? The model is ax plus b, so putting a point in for x and adding delta, we get a times x naught plus delta, plus b, and then we subtract ax naught plus b. That would do it for the model, but let's do it for the loss instead, I think, because that's the actual function we'll be considering.

So just remember what we're talking about here: the loss, where a and b are the variables and the data set is held constant. The formula for the loss is the prediction, which is ax plus b, minus the observation, all squared. The loss is the thing we're actually interested in taking the derivative of, that is, finding the rate of change of. Let's plug that in: it's this formula here with x naught in. Is that right?
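Written out in symbols, the definition and the linear-model example described above might look like the following. This is a sketch in LaTeX, assuming \delta stands for the "difference" or step and x_0 for the chosen point; the transcript only names them in words.

```latex
% Definition of the rate of change (three-bar "defined as" sign)
\frac{d}{dx} f(x) \;\equiv\; \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}

% Plugging in the linear model f(x) = ax + b at the point x_0
\lim_{\delta \to 0} \frac{\bigl(a(x_0 + \delta) + b\bigr) - \bigl(a x_0 + b\bigr)}{\delta}
  = \lim_{\delta \to 0} \frac{a\,\delta}{\delta}
  = a
```

For a straight line the deltas cancel completely, so the rate of change is just the slope a at every point.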
With x naught in, we're missing a y here, so let's do this again; let me just tidy this up a bit, because that was a little bit wrong. Where are we going with this? It's a, and then into the x we put x naught plus delta, and then we add b and subtract y, and that is squared. That's the first term. Then we take away the second term, which is just ax naught plus b minus y, also squared, and divide through by delta. And we ask the question: what happens as delta goes to zero?

To actually solve this problem, all we need to do is expand all of these brackets, tidy everything up, find out where the deltas are, and then see what happens as we set delta to zero. Now, this is a bit tricky, so you can follow along if you like, or you can skip to the end. Let me try to do it.

Let's break this into two pieces: this is piece number one, and this is piece number two. In the first piece we've got ax naught plus a delta plus b minus y, all squared. Squaring that is actually a bit complicated, and there are some simplifications we could make, but I'll just go through it. Let's group b minus y together and imagine that it is one number, because there's nothing interesting going on there. What we have then is one bracket, ax naught plus a delta plus b minus y, times another bracket, ax naught plus a delta plus b minus y. Now we multiply, taking the three terms one, two, three. The first term gives us a squared x naught squared, plus a squared x naught delta, plus ax naught times b minus y. Now I'm going to do likewise, one, two, three, for the a delta term, so let's go to the first term here again.
So that gives a squared delta x naught, and, okay, let's get rid of all that; it's way too much. Now that we've got the formula laid out, let's plug a simpler example function into it instead. The example we had above was f of x equals x squared, so let's plug this into the formula above and evaluate it.

So, f of x naught plus delta: we put x naught plus delta into x squared, and what do we get? We get x naught plus delta, all squared; we take away x naught squared, and then divide through by delta. Then we take the limit of this as delta goes to zero, which means: open up all the terms, see where things go, and then set delta to zero.

So let's see what we've got. If I expand the brackets, we get x naught squared plus 2x naught delta plus delta squared, minus x naught squared, all divided by delta, in the limit as delta goes to zero. If you want to follow that: squared means multiply by itself, so it's this times this and that times that, but you also have this times that and that times this, so you have two lots of x naught delta. Then there's x naught times itself, delta times itself, and the minus x naught squared from over here.

Looking at this, we can see that the x naught squared term and the minus x naught squared term cancel. We're left only with terms that have deltas in them, which is quite interesting, because it means we can actually perform the division: we divide through and cancel a delta from each. What we then have is the limit, as delta goes to zero, of 2x naught plus delta.

Okay, so how do we interpret that formula? If we go back to the curve we had earlier and look at our two points, this point over here and this point over here, then this one is x naught (so that should really have x naught written there) and this one is x naught plus delta.
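The algebra above says that, for x squared, the slope between x naught and x naught plus delta is exactly 2x naught plus delta. A minimal Python sketch to check this numerically follows; the function names and the choice of x0 = 3 are hypothetical, picked only for illustration.

```python
def f(x):
    """The example function from the transcript: f(x) = x squared."""
    return x ** 2


def secant_slope(func, x0, delta):
    """Slope of the line through (x0, func(x0)) and (x0 + delta, func(x0 + delta))."""
    return (func(x0 + delta) - func(x0)) / delta


x0 = 3.0  # hypothetical point, chosen only for illustration
for delta in [1.0, 0.5, 0.25, 0.125]:
    slope = secant_slope(f, x0, delta)
    # For x squared the algebra predicts 2 * x0 + delta for every step size.
    print(delta, slope, 2 * x0 + delta)
```

As delta shrinks, the printed slope and the prediction move together towards 2 * x0, which is the value the limit picks out.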
Well, for x squared, the formula for the slope of the line between those two points is double the first point plus the step. That's interesting, and in particular, what we can do is decrease the step so that it gets smaller and smaller and smaller: here, here, here, here. Imagine me drawing lots of lines there until we get to this tangent line, which is the limit of those lines: in this picture, the slope shrinks down towards it. That is the line where this delta term goes to zero. And so the formula for the slope of the tangent, or the formula for the rate of change at the point, is two times the point we're on. That's cool. So the rate of change of x squared is 2x.

We want to give this the correct name. This is the derivative: the derivative of f of x, where f of x is equal to x squared, and we can call that df by dx. Again, we read these symbols as all belonging together: d by dx is pronounced "dee by dee x". It's one operation, and it's the operation of going through this analysis, so it basically means "take this limit". If you take the limit of the change in f divided by the step size, you get 2x. So the rate of change of f at a point, for any point x, is 2x.

Cool. So what does this do? It gives us a system for working out, in the machine learning context, how to compute the change in the loss and put it into the optimization formula we had. When we say update a by some amount, that amount is going to be a percentage of the change in the loss. Well, what we're actually going to do is take a percentage.
We can leave it at that for now: a percentage of d loss by da, which is how much the loss changes as we change a. And maybe, to give it even more mathematical flavor, call that percentage lambda, so the amount is lambda times dL by da. The full formula is then that a is updated to be a minus that amount. So that's the update formula. And this last piece, dL by da, is the process, as we've just seen, of taking that limit.

Now, there are textbook ways of taking this limit, so you don't actually have to go through this analysis every single time. There are recipes that allow you to get to the answer pretty quickly. But the general idea is that you ask what the derivative of the loss is, and then you take a little bit of that, and that will be your update for your parameter, your parameter a.

All right, cool. So I think what we should do now is have a look at some Python to put some meat on the bones of this, make it a little bit concrete, and show you how all of this gets used, to give you a sense of the practical use of this mathematical information.
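The update rule described above can be sketched in Python. This is a minimal sketch, not the course's own code: the function names, the single data point, the starting values and the value of lambda (called lam here) are all hypothetical, and the derivative is approximated with a small fixed delta rather than an exact limit.

```python
def loss(a, b, x, y):
    """Squared error for one data point: (prediction - observation) squared."""
    return (a * x + b - y) ** 2


def dloss_da(a, b, x, y, delta=1e-6):
    """Approximate dL/da with the difference quotient, holding b, x and y fixed."""
    return (loss(a + delta, b, x, y) - loss(a, b, x, y)) / delta


def update_a(a, b, x, y, lam=0.1):
    """One update step: a is replaced by a minus lambda times dL/da."""
    return a - lam * dloss_da(a, b, x, y)


# Hypothetical single data point and starting values, chosen only for illustration.
x, y = 2.0, 7.0
a, b = 0.0, 1.0
for step in range(50):
    a = update_a(a, b, x, y)

# With b held at 1, the loss is smallest when 2a + 1 = 7, so a should approach 3.
print(a, loss(a, b, x, y))
```

For this loss, the textbook recipe gives dL/da = 2x(ax + b - y), which the difference quotient approximates closely when delta is small, so in practice the limit does not have to be worked out by hand each time.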

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Alongside studying physics, and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.