The Theoretical Basis of Machine Learning
1h 30m

Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn’t capable of. This course is part two of the module on machine learning. It covers unsupervised learning, the theoretical basis for machine learning, finding the model with linear regression, the semantic gap, and how we approximate the truth.

Part one of this two-part series can be found here, and covers the history and ethics of AI, data, statistics and variables, notation, and supervised learning.

If you have any feedback relating to this course, please contact us at


So in this video, I want to talk more about the theoretical basis of machine learning, and give you a sense of the conditions under which machines can learn well. Hopefully that will also give you some insight into when they can't. It's really important that, as a practitioner, you are very aware of the quality of your data and of those conditions. Otherwise you're going to produce systems that appear to be useful but may actually be quite dangerous, or catastrophically unprofitable, if used blindly and naively.

So let's talk about the setup. For most of this time, and probably for most of the time people discuss machine learning, we're going to be talking about supervised learning. The reason is that machine learning mostly means prediction. There are other techniques, clustering and so on, but really we're thinking about prediction. So supervised learning is going to be the big umbrella under which we analyze most of how machine learning works.

So with supervised learning, what's the underpinning here? Let's just recap some notation. We have x, which is a feature. Then we have capital X, which is just several features, so multiple columns; it's a way of notionally shrinking all of those columns down into one symbol. We have y, which is our target. We've got y-hat, which is our guess for our target, our guess for y. And we've got f, which is the relationship between y and x.

Now, here's where we need to be more subtle in how we understand what's going on. When we solve a supervised learning problem, we imagine that the world is like y = f(x), meaning that there is some genuine relationship out there between the things that we're interested in.
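To make the notation concrete, here is a minimal Python sketch. All of the numbers, the second feature column, and the function f_hat are invented for illustration; nothing here comes from the course's own data.

```python
# Toy data, invented purely to illustrate the notation.
x = [18, 20, 35, 50]          # x: one feature (age), one value per person

X = [                          # X: several features, one row per person
    [18, 2],                   # columns: age, visits per week (hypothetical)
    [20, 5],
    [35, 1],
    [50, 3],
]

y = [3.0, 6.0, 4.5, 2.0]       # y: the target (minutes on the website)


def f_hat(age):
    """A made-up estimate standing in for the unknown relationship f."""
    return 0.1 * age + 1.0


y_hat = [f_hat(age) for age in x]   # y-hat: our guesses for y

for guess, actual in zip(y_hat, y):
    print(f"guess={guess:.1f}  actual={actual:.1f}")
```

The true f never appears in the code, because in practice it is the one thing we do not have.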
So let me give you some examples. Say x is a person's age. Then y, the thing we're trying to predict for that person, could be how much they'll spend in our shop, so that could be a profit. It could be how they'll rate a film, just a rating out of 10. Or it could be how long they'll spend on our website, which is a different sort of thing, isn't it? So let's say y is time on website.

Now what I imagine is that, in the world, there's some reality to these two variables. People have ages, so we're pretty okay on that one. People do spend time on a website; that seems fine. And then there's a real, genuine relationship between those two things. So let me just write that down: we've got a real relationship that determines y from x.

Now here's where things start to go wrong a bit, because there really isn't such a careful or precise relationship between age and time on website. In other words, if I put 18 into this little function, how long is an 18-year-old going to spend on my website? I don't know, maybe three hours, maybe three minutes. If I put in 20, maybe I get six minutes. But could I imagine a situation where I have an 18-year-old and they spend six minutes? Yes, I could. What does that mean? It means that there isn't actually a single principled, correct, true way of connecting someone's age to how long they're going to spend on my website. There isn't this precise, genuine, true relationship where, for every given age, I can know that in reality this is how long they will spend. So we have to come up with some way of approximating that.
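The point that one age can go with many different times can be checked mechanically. The sketch below uses invented (age, minutes) observations and a hypothetical helper, is_function_of_age, to test whether each age maps to exactly one time.

```python
from collections import defaultdict

# Hypothetical observations of (age, minutes on website); values are invented.
observations = [(18, 3.0), (18, 6.0), (18, 180.0), (20, 6.0), (20, 4.0)]

# Group the observed times by age.
times_by_age = defaultdict(list)
for age, minutes in observations:
    times_by_age[age].append(minutes)


def is_function_of_age(groups):
    """Return True only if every age is paired with a single distinct time."""
    return all(len(set(times)) == 1 for times in groups.values())


print(dict(times_by_age))
print(is_function_of_age(times_by_age))
```

With data like this the check fails, which is the situation described above: age alone does not pin down a single time on website.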
Well, let's set that issue aside for the moment. We'll come back to it, but let's just imagine that there is some nice connection: 18 goes to three, so 18-year-olds spend three minutes. Maybe we could tidy this up a little, for example by talking about average time on website, to make it kind of one to one. So maybe there is some kind of truth out there.

What's the problem we're solving? Well, this f here is the true relationship. x is true and y is true, and what I mean by true is that it's reality; we're not guessing anything. So x is a true, known thing, y is a true, known thing, and f is the true relationship that connects them both. What we do in supervised learning is come up with an f-hat that gives us a y-hat, depending on some x. So we're estimating f, the relationship.

I'll give you an example in regression. Say this is the true connection between age and time; we could draw it however we like. That's the true connection, that's really how it works. Then maybe my estimate is just a straight line, the red line. That's my f-hat. If this f were really true, then all of the points I could ever measure would definitely be on the true line, because it's a true, real, genuine relationship. So the true line gives me my y, and my red estimate gives me this little y-hat, which for the same position in x is down here. That's obviously worse than the truth; it's a bit below here, maybe a bit above elsewhere. So there's some error involved.

Okay, so this is the true relationship, which I don't know. I don't know how age really relates to how long someone spends on my website. There are billions of people on the planet.
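As a rough sketch of the picture being drawn, the code below pretends we know a true relationship f and compares it with a straight-line estimate f_hat. Both functions and their coefficients are invented for illustration; in a real problem f is unknown and f_hat would be fitted from data rather than picked by hand.

```python
def f(age):
    """Pretend 'true' relationship: an invented curve, in minutes on site."""
    return 10.0 - 0.005 * (age - 40) ** 2


def f_hat(age):
    """Straight-line estimate with hand-picked slope and intercept."""
    return 0.05 * age + 6.0


for age in (18, 40, 60):
    y = f(age)          # what the true relationship says
    y_hat = f_hat(age)  # what our estimate says at the same x
    gap = y_hat - y     # negative: estimate is below; positive: above
    print(f"age={age}  y={y:.2f}  y_hat={y_hat:.2f}  gap={gap:+.2f}")
```

With these made-up choices the straight line sits below the curve at some ages and above it at others, which is the error the lecture goes on to measure.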
And the amount of data I would need to know the true relationship is this: everyone on the planet would need to have come to my website and spent whatever time they wanted to spend on it, and I would need to have recorded it for all these billions of people. That would be kind of the true relationship. Even then you might wonder: what if someone new is born and they behave a little differently from other people? Maybe your relationship is still a bit out of date. So it's really hard to even have a sense of what the true relationship is, but we can at least imagine there is one, and we can see here that our red line is a little way out.

Okay, so what's the goal of supervised learning? Just to make the red line as good as possible. How do we say that? Well, there are lots of ways of saying it. One way is that we want to minimize how far away our estimate is from the truth. So there's one formula. And, let's try a different color, it is in fact just the same to say: minimize how far away my prediction is from the truth.

There are bars there, and you can read those bars as saying "in either direction". Those bars mean the absolute value. If the prediction is three and the truth is four, prediction minus truth is minus one. So I'd be out by minus one, but if I put bars around that, it gives me one; it ignores the sign. And if the prediction were four and the truth three, that'd be an error of plus one, and with the bars that's still one. So these bars are just saying that, regardless of whether I'm above the line or below the line, I want the distance between these two things to be as small as possible. You can see on this diagram that in this region the red line is actually above.
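The arithmetic with the bars can be checked in a couple of lines. This is a small sketch using Python's built-in abs; the helper name absolute_error is chosen here for illustration.

```python
def absolute_error(prediction, truth):
    """Distance between prediction and truth, ignoring the sign."""
    return abs(prediction - truth)


# Prediction 3, truth 4: the signed difference is -1, the absolute error is 1.
print(3 - 4, absolute_error(3, 4))

# Prediction 4, truth 3: the signed difference is +1, the absolute error is still 1.
print(4 - 3, absolute_error(4, 3))
```

Either way the error is one, so being above or below the truth counts the same.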
So this will be like saying that the prediction here is three but the truth is two. And over here the red line is lower: maybe the prediction is, I don't know, 50 and the truth is 55 or something. So in one case your prediction minus the truth is minus five, and in the other it's plus one. But we don't really care which side the gap is on; we want the gap to be as small as possible.

Okay, so this formula sort of defines the goal of supervised machine learning. The goal is to somehow find this estimate function f-hat which is as close as possible to the true function f.
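One way to turn "as close as possible" into something computable is to average the absolute gaps over several points and prefer the estimate with the smaller average. The averaging step and the second candidate below are assumptions added for illustration; the lecture itself only talks about the absolute gap between a prediction and the truth.

```python
def mean_absolute_error(predictions, truths):
    """Average of |prediction - truth| over all points."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)


truths = [2.0, 55.0]

candidate_a = [3.0, 50.0]   # the two predictions discussed above: gaps +1 and -5
candidate_b = [2.5, 54.0]   # a hypothetical second estimate, invented here

errors = {
    "a": mean_absolute_error(candidate_a, truths),
    "b": mean_absolute_error(candidate_b, truths),
}

# The preferred estimate is the one with the smallest average gap.
best = min(errors, key=errors.get)
print(errors, best)
```

Here candidate_a has gaps of 1 and 5, so its average gap is 3, while candidate_b averages 0.75 and would be preferred under this measure.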


Unsupervised Learning - Finding the Model with Linear Regression Part 1 - Finding the Model with Linear Regression Part 2 - The Semantic Gap - Approximating the Truth Part 1 - Approximating the Truth Part 2

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.