Finding the Model with Linear Regression - Part 2
1h 30m

Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn't capable of. This course is part two of the module on machine learning. It covers unsupervised learning, the theoretical basis for machine learning, finding the model with linear regression, the semantic gap, and how we approximate the truth.

Part one of this two-part series can be found here, and covers the history and ethics of AI, data, statistics and variables, notation, and supervised learning.

If you have any feedback relating to this course, please contact us at


Okay, now here's a problem. If I try to say how good or bad this red line is for my rental problem, my problem is this: all the times I get my prediction wrong because I underestimate are going to hide all the times I get it wrong because I overestimate. Let me show you what I mean by that. Suppose I overestimate by one minute five times, so each error is plus one minute: plus one, plus one, plus one, plus one, plus one. Those are errors due to overestimating, right? And suppose I get just one person wrong by underestimating them by five minutes: that's minus five, an error due to underestimating. Add those together and you get zero. So if I just naively add all these errors together and ask, well, how wrong am I in total? In this situation it comes out that I'm doing perfectly fine, which is obviously not right, is it? These errors shouldn't cancel each other out; they should all add up. So we can either just ignore the sign or, as mathematicians often do, square things. Here we're going to square each of these errors, which makes every error positive: the negative error gets magnified and made positive, so the total error now is not going to be zero. It's going to be one, one, one, one, one, and then five squared is 25, so 30 in total. Now you may worry, if you thought the error somehow meant something true and important and real, that we're making things up here just by squaring: what does it mean to say minus five minutes squared? It doesn't mean anything.
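The cancellation problem described above can be sketched in a few lines of Python. The numbers here are the toy values from the transcript (five overestimates of one minute, one underestimate of five minutes), not real rental data:

```python
# Five overestimates of +1 minute each, and one underestimate of -5 minutes.
errors = [1, 1, 1, 1, 1, -5]

# Naively summing signed errors: the underestimate cancels the overestimates,
# wrongly suggesting the line makes no mistakes at all.
naive_total = sum(errors)  # 5 + (-5) = 0

# Squaring each error first keeps every mistake positive, so nothing cancels.
squared_total = sum(e ** 2 for e in errors)  # 1 + 1 + 1 + 1 + 1 + 25 = 30

print(naive_total)    # 0
print(squared_total)  # 30
```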
We've also kind of exaggerated the error from the negative case: we've made that one error worth 25, while doing essentially nothing to the other errors. Doesn't that make what we're doing somehow biased? Isn't it statistically suspicious to square things, seemingly arbitrarily, just to get a positive number? Well, yes, it actually is a little bit suspicious, and it has certain implications. Squaring tends to give too much influence to outliers, so this formula can give us a line that considers outliers a little too much, which we might not think is right. But what we're doing here is making approximations, and that's it. Machine learning is not the art of discovering the truth, unfortunately. It isn't science; it is really just a system of trying to do the best you can to approximate something which you probably can't approximate very well anyway, often from limited or poor information. And in practice, squaring the error doesn't really make much difference. Whether you call the error 30 or three, as long as the error is small, you're doing better than if the error is big, and that's the key thing about this notion of loss, or notion of error. The kicker is that we don't really care what value the error has; we just care that the value is small.
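The point about outliers can be made concrete with a small comparison, again on made-up numbers: under absolute error, one ten-minute mistake counts the same as ten one-minute mistakes, but under squared error the single outlier dominates:

```python
small_errors = [1] * 10   # ten small mistakes of 1 minute each
outlier = [10]            # one large mistake of 10 minutes

# Absolute error treats the two cases equally.
abs_small = sum(abs(e) for e in small_errors)   # 10
abs_outlier = sum(abs(e) for e in outlier)      # 10

# Squared error makes the single outlier ten times as important.
sq_small = sum(e ** 2 for e in small_errors)    # 10
sq_outlier = sum(e ** 2 for e in outlier)       # 100

print(abs_small, abs_outlier)   # 10 10
print(sq_small, sq_outlier)     # 10 100
```

This is why a line fitted by minimising squared error gets pulled noticeably towards outlying points.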
So what we do here, then, when we're computing the total loss for this line, which we sometimes write as a capital L, is take a mean: one over the number of points there are, times the sum of the individual losses for every point. You can think of L1, L2, L3 for all the points, or LA, LB, LC, LD for the different points. Because each individual loss is a squared error, that mean gives us the mean squared error, which tells us, on average, how well our line is doing. Right, okay, so that's something. We have this notion of loss that tells us how well we're doing at each point, and we compute a total loss that tells us how well our line is doing, but we still haven't explained how on earth to get the damn line in the first place. Well, one thing you might do is just have the computer try maybe thousands of lines, compute the total loss of each, and then choose the best. Let me show you what I mean by that. Here's some historical data; here's a line, there's another line, there's another line, there's another line, and so on. Maybe the computer just keeps trying thousands of different kinds of lines and goes, oh, wait a minute: when I compare all of these to the historical data set, maybe, let me just highlight something, maybe that line there has the minimum total loss compared to those black points, so it just chooses that one. That's a genuine algorithm. That's going to kind of possibly work.
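That try-thousands-of-lines idea can be sketched directly. This is a minimal illustration, not the course's actual implementation: the historical data here is invented to roughly follow y = 2x + 1 with some noise, and candidate slopes and intercepts are drawn at random:

```python
import random

def mse(slope, intercept, xs, ys):
    """Mean squared error of the line y = slope*x + intercept over the data:
    L = (1/n) * sum of squared errors, one per point."""
    n = len(xs)
    return sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in range(n)) / n

# Hypothetical historical data, roughly y = 2x + 1 plus noise.
random.seed(0)
xs = [x / 10 for x in range(50)]
ys = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]

# Try thousands of candidate lines; keep the one with the lowest total loss.
best = None
for _ in range(10000):
    m = random.uniform(-5, 5)
    b = random.uniform(-5, 5)
    loss = mse(m, b, xs, ys)
    if best is None or loss < best[0]:
        best = (loss, m, b)

loss, m, b = best
print(f"best line: y = {m:.2f}x + {b:.2f}, MSE = {loss:.3f}")
```

The slope and intercept it finds should land close to the 2 and 1 used to generate the data. Random search like this only really works in low dimensions; better algorithms are the subject of the supervised learning section.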
In two dimensions, anyway, because you can just try out some lines; we'll leave other algorithms to the section on supervised machine learning. For now, what I want us to do is just be aware of loss as a notion, along with estimates, features, targets, and the notation behind them, so that we're comfortable with the setup. And the setup is what? Well, let's have a quick look. We have a feature x, we have a target y, we have this historical data set, and somehow we're going to find a relationship, f hat, which allows us to estimate, for example here, the value of a point when we don't know what the real point is. It stands in for where our prediction is going to be: y hat. And as I said, the way we're going to find that good line is by using this notion of loss that tells us how close we are. Now, there are lots and lots of algorithms for computing, calculating, coming up with this red line, and each will give you a slightly different one, in fact, so there isn't really one true way of doing it. But I think we now have all of the notions that really underpin the setup of machine learning, so we can talk a little bit more about the issues in the background. What are we actually doing? Is this prediction? Is this estimation? It doesn't look very principled; it looks like drawing lines through stuff and seeing what happens, and in fact, yeah, it kind of is.


Unsupervised Learning - The Theoretical Basis of Machine Learning - Finding the Model with Linear Regression Part 1 - The Semantic Gap - Approximating the Truth Part 1 - Approximating the Truth Part 2

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. While studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.