Issues with Machine Learning – Approximating the Truth - Part 2
1h 30m

Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn’t capable of. This course is part two of the module on machine learning. It covers unsupervised learning, the theoretical basis for machine learning, finding the model with linear regression, the semantic gap, and how we approximate the truth.

Part one of this two-part series can be found here, and covers the history and ethics of AI, data, statistics and variables, notation, and supervised learning.

If you have any feedback relating to this course, please contact us at


But let's suppose that... Let's suppose that this function does exist, right? So what does that mean? It means that actually, maybe, that there is a... With this information we have, that we can actually solve this problem. So let me give you an example of that. I don't know... I mean the progression of a disease, right? That would be a possible example. Like, here's a formula for how a disease progresses, so here's the truth, right? So the truth is... And "x" here is number of days, so "x" here would be... Sort of days from, say, first case. So maybe we have this for the UK. And the formula would be something like... Some number, "a," times two to the power of some other number, "b"... Maybe even plus "c" or something like that. So that would be a very general formula, and the "b" there, that little "b," kind of gives you a doubling rate, approximately. So you can... And then "a" is something, and "c" is something, so, you know, that's a possible formula for the truth of a disease spreading, infection rate spreading. So we just put some numbers in there, so if I put zero in there, what that gives us is - You've put zero in for the "b" - Sorry, where's "x" gone? That should be next to the "b," like that, you see. It means... Just to make this simple, let's just say "a," "b," and "c" are - Let's get rid of "c"; let's just say that's zero, just for the sake of a quick calculation. This is zero. And then let's just say "a" and "b" are one, so if "a" and "b" are one, if I put zero in for "x," that would just be two to the power of zero. If I put in one day - Or let's put two days in. If I put two days in, that'd be two to the power of two; this would be four. If I put three days in, four days in, five days in, you can see it's gonna go two to the power of three is eight. Then two to the power of four is - Well, eight times two is 16, and then you're at 32, and then you can see that it's going to quickly get bigger. 
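The quick calculation above can be sketched in Python. This is only an illustration: the function name `true_cases` is made up here, and the formula is read as a times two to the power of b times x, plus c, with a = 1, b = 1 and c = 0 as in the spoken example.

```python
def true_cases(x, a=1.0, b=1.0, c=0.0):
    """Hypothetical 'truth': a * 2**(b * x) + c, where x is days since the first case."""
    return a * 2 ** (b * x) + c


# Put days 0 to 5 into the formula, with a = b = 1 and c = 0.
values = [true_cases(day) for day in range(6)]
print(values)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Each extra day doubles the value, which is why the numbers get big so quickly.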
If you draw the graph of that, you get this sort of, that's not quite that shape but sort of that shape-ish. Map. Okay, so what's the point of all that? The point of all that is... Let's just imagine a situation which - sort of a true answer, and this would be a pretty good one, right? So if I were approximating the truth here... If I was approximating the truth, I'd be coming up with an approximation, "f" hat of "x", my approximation is gonna be... Well, let's say it's... Well, if I use a straight line, that would be catastrophic, right? That would be a terrible approximation. Let's just do it. So if I say, "Here's my approximation..." So let's just say that's gonna be, like, two "x" plus one. There's the formula. So that would be really bad, okay? It'd be really bad, but... Maybe it would work in some situations. Maybe it would be all right. I mean this particular one, this particular red line, might not be so good. Let's think of a situation in which it would be, maybe, all right. Well, down here this is kind of flat. So that's maybe... Maybe let's say that's our approximation. So "f" hat of "x," that's just... That's "x" plus one; that's a reasonably accurate formula. Now, okay. So there is a truth; there is. We have the approximation; there it is. What can go wrong here? Well, maybe you can start to see from this diagram what can go wrong. Well, the approximation does really really well - Let's maybe put it in green, actually. It does really well in the early days of the disease. So in this region here, let's say this is up to day 100, it's really - Or even day 10 or something, I don't know. It's doing really well. Nothing's changing very much. I think it's growing very gradually; it looks kind of flat. And if you look at cases, you know, in the wild, the early days of an infection, you get, sort of, three cases one day and then seven cases another day. It doesn't seem very... It seems just chilling, you know, up and down. 
Then it goes catastrophically wrong, right? So maybe we say it's okay up to day 10. And then we're gonna go - Then the thing takes off, right? So in this region, if we use our red line to do our predictions, we have this massive, massive... catastrophe really, right? So if we're using this flat red line to plan for disease progression, and we... And if that's the red line we use, then around day 10, all of our planning system's gonna go completely wrong, and the whole thing will fail. Right, okay. So how do we characterize the issues here, then? Well, okay, good. How do we characterize the issues here? Well, what we can say is that, within some region... Within some region - Let me just draw this a slightly different color. Maybe... Zoom in a little bit. Come on, all right. Within some region, sort of this region here, it's all right, actually. So there's a green region, isn't there? There's a sort of green region. Then there's a red region. So what is the green region? Let's give you some words for these regions. So this is the green region, and we've got this sort of red region. In the green region, our training data, which is the data we use to draw the red line... So if I were to redraw this diagram in a slightly bigger way. Presumably, like this is the data we've sort of seen. We haven't yet seen the future, so if I do the future just off here, that's gonna be the future; we haven't yet seen that. But we have seen up to this line. And then what we do is, we fit a straight line to it, and we go, "Okay, well, it looks flat, so it is flat." Right, so within this sort of green region, we have what we can call the training data, or we can call it the "in sample." So this is sort of training data, "training data set," or the "in sample." Remember, this idea of a sample... 
Maybe we'll formalize it at a slightly different time, but we can think of a sample - We can think of, you know - If there's millions of cases, or if there's thousands of genuine cases, this sample is just some kind of subset of those cases. You can think of "sample" as meaning kind of like a subset of the total number of cases. So it's a little - You've got this little subset of, maybe, 20 cases you've looked at, and, over the number of days you've looked at them, they just seem flat. Now, what's the out - What's the other kind of thing? Well, the other thing is the out sample. And that's the rest of the truth, basically. So that's everything else. And when I say everything, I mean future, past, everything. I don't mean just now. The out sample is everything we could possibly know about how this disease is gonna change. So the out sample is the whole data set, so that's sort of, like... Well, it's everything we haven't seen. So everything we haven't seen is this thing going off in the future; we haven't yet seen that. So we've got this in sample, things we have seen or the training data set. In sample's a better way of thinking about it right now. I think a sample of stuff that we have. And the out sample, the stuff that we haven't yet seen, everything, right? Okay, good, so we've got these two, sort of, regions of the graph or regions of the problem, in sample, out sample. So we can define the issues here just in how these two things relate to one another. So, you know, what could happen? What could happen? Well, the in sample could look like... The out sample. We could do it. So, in this case, if I say, you know, "This is the things I've seen. Here's my straight line." Well, in the out sample, the stuff, let's say, over here, the future could just continue like that, right? So it could be that the in sample, the stuff that we're gonna see early on, does look like the out sample. And if that's the case, case one... If that's the case, great! 
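To make the in sample and out sample idea concrete, here is a rough sketch under assumed numbers. The "truth" is the exponential from earlier with an assumed b of 0.25 so that the early days look fairly flat; the in sample is taken to be days 0 to 10 and the out sample days 11 to 40. The helper names (`fit_line`, `mean_abs_error`) are invented for this illustration, and the straight line is fitted with ordinary least squares, which is one possible way of "drawing the red line."

```python
def true_cases(x, a=1.0, b=0.25, c=0.0):
    """Assumed 'truth' for illustration: a * 2**(b * x) + c."""
    return a * 2 ** (b * x) + c


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between the line's predictions and the truth."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# In sample: the days we have seen. Out sample: the future we have not seen.
in_days = list(range(0, 11))
out_days = list(range(11, 41))
in_cases = [true_cases(d) for d in in_days]
out_cases = [true_cases(d) for d in out_days]

# Fit the straight line using only the in sample.
slope, intercept = fit_line(in_days, in_cases)

in_error = mean_abs_error(in_days, in_cases, slope, intercept)
out_error = mean_abs_error(out_days, out_cases, slope, intercept)
print(f"in-sample error:  {in_error:.2f}")
print(f"out-sample error: {out_error:.2f}")
```

With these assumed numbers the line tracks the in sample closely and misses the out sample by a wide margin, which is the "looks flat, so it is flat" failure described above.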
We can do machine learning; things work smoothly without intervention. If we wanna look at this a little bit more formally - Maybe see if I can give you some... a slightly more technical way of putting it possibly... If we imagine that... If you're gonna call this, the in sample, "x" - Maybe do it a capital because there's lots of - possibly lots of columns, but - Let's say "x" in and "x" out. Well, that's gonna be "xy" in, actually, so the actual stuff here... Let's call this "the data set for the in" and "the data set for the out." The data set for the in is gonna have some distribution. Let's call that "p." And the data set for the out is gonna have some distribution. Let's call that "p" as well. So "p in" and "p out." And what we need, really, is that "p in" is sort of gonna be, pretty much, "p out." Now, if that's a little bit too technical for you, what this notation means is just exactly what I've said, so the distribution here, this little distribution term, just means how the data's laid out. You can see here that this is just all flat, and this one here's all flat, and this is sort of just a more technical way of saying what we need - The layout of this data to be the same layout here of the distribution... Okay, if that's the way things are, great. What that means, intuitively, is that the place that we have collected the data from and how we have collected the data is actually going to reflect the thing we're trying to predict, now and later. There's not going to be any sort of variation. As above, with the cats and dogs, I mean if it really is the case that all of our dogs are photographed in mountainsides, sure. Then we can actually use that system because when we trained it, when we learned the model, when we drew our red line, we used whether the photograph had white stuff in it, okay. But when we come to actually use the system in the future, all photographs of dogs will have little white patches in them anyway, so it works perfectly. 
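One crude way to ask whether "p in" is pretty much "p out" is to compare a simple summary of the two samples. The sketch below is an assumption-laden illustration, not a formal test: it only compares means, measured in in-sample standard deviations, and the function name `mean_shift` and the case counts are made up.

```python
from statistics import mean, stdev


def mean_shift(in_sample, out_sample):
    """Gap between the two sample means, in units of the in-sample standard deviation."""
    return abs(mean(out_sample) - mean(in_sample)) / stdev(in_sample)


# Made-up daily case counts for illustration.
flat_in = [3, 7, 5, 4, 6, 5, 3, 6]                            # what we have seen
flat_out = [4, 6, 5, 3, 7, 5, 6, 4]                           # a future that looks like the past
exploding_out = [40, 90, 200, 450, 1000, 2200, 5000, 11000]   # a future that takes off

similar = mean_shift(flat_in, flat_out)
different = mean_shift(flat_in, exploding_out)
print(f"flat future:      {similar:.2f} standard deviations")
print(f"exploding future: {different:.2f} standard deviations")
```

A small shift is consistent with the in sample looking like the out sample; a very large one says the layout of the data has changed. Two distributions can share a mean and still differ, so this is a first look only.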
If the future looks like the past... If when we use it and the area in which we deploy it, the situation of its deployment, looks like the situation that's training, you know, great. So another way of saying that would be, maybe, rather than in and out, we could even say "the data set for the train..." So train there. And we could do "data set for the prediction," so that would be another way of saying "in" and "out" kind of. Like, the sample of in is the training time sample, when we're just finding the red line. The data set of the out is the prediction time, when we're doing predictions. Right, okay, good. Well, what happens if it doesn't - The in does not look like the out. There's some really big difference. So in this case here, you can see that, really... The out is very very different than the in. Well, that's potentially catastrophic. "Cat-as-trophic" or whatever, right? That's gonna - If you just turn on a system, you let it chug away and do its thing, it'll cause chaos. So it doesn't mean we can't - It actually doesn't mean we can't do anything in this case. But it means that, at the very least, what we would be - The way we would use the machine learning system, when the data we had looks different than it will in the future, is we might use it in this assistive way still. In this assistive way in the sense that maybe what we could do here is... Possibly have a human operator, possibly the machine says... Well, in this case, the machine says, "There'll be 10 deaths or 10 infections, 10 infection, 10 infection, 10 infections." And then the human operator starts things - Things explode here, and the human operator then steps in and says, "Well, actually, the model is now no longer working correctly. Let me investigate." And maybe even that breakage, that failure of a model to predict... That can be an investigative tool itself. 
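The assistive setup described above, where the machine keeps predicting and a human steps in once the predictions stop matching reality, can be sketched as a simple check. The function name `first_alert`, the tolerance of 5 and the observed counts are all hypothetical; the constant prediction of 10 follows the spoken example.

```python
def first_alert(predicted, observed, tolerance):
    """Return the index of the first step where |observed - predicted| exceeds tolerance, else None."""
    for step, (p, o) in enumerate(zip(predicted, observed)):
        if abs(o - p) > tolerance:
            return step
    return None


# The machine keeps saying "10 infections"; the made-up observations take off.
predicted = [10, 10, 10, 10, 10, 10]
observed = [9, 11, 12, 18, 35, 80]

alert_step = first_alert(predicted, observed, tolerance=5)
print(alert_step)  # 3
```

At that step the model is no longer tracking what is happening, which is the point at which the human operator would investigate rather than let the system keep running on its own.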
So the fact that the future is different somehow, that isn't accounted for in the in sample, those things that we've seen, that can be a tool. And even when... Even when the assistive system, the system that the human operator is thoroughly verifying and working with - Even when that is getting less and less accurate, that can still have some use, so it might not be as extreme as this case here, right? So what could happen? A less extreme version would be... Well, I don't know. Maybe that's the truth, and... What could we do here? This is what we have. So if that's the truth, then this is what we have. This is what we've learned, okay. So here we're doing bang-on. And as time goes on - It doesn't need to be time; it could be "as the age of the user increases" or "as the time goes on," or whatever it may be, we get worse and worse. Worse and worse. And this is... This is a sign of something really going wrong that our data doesn't tell us anymore about what the problem is like, really. But, you know, can we still make money over here? Can a human operator still use this? Possibly. You know, when things - If we started using it here, and we were like, "Oh, everything seems fine," and we were monitoring the system, and, in our monitoring, we saw a greater and greater divergence between the predictions we were making and, kind of, what we've observed to have happened... We predicted this, and then we observed this happened, and it was a bigger difference than what we predicted from what we've actually seen happening, that's when the human operator obviously has to step in and go, "We need to make a decision about whether this is useful or not." Right, okay. So let's summarize where we are then. So these are all really important issues to be very mindful of as a practitioner. The first one is the semantic gap. 
"Does the information I have contain - Is it the right source of information, even, to distinguish the things I'm gonna distinguish," and, you know, in most cases I would say it probably isn't... Probably isn't sufficient information anyway. That doesn't mean it's not possible to do a very good job; it just means that... There will be mistakes inherent in the system, and they can't - You can't really get beyond that because the data you have is insufficient. Then the second question is, "Okay, well... Even if I could do it with the data, are there still problems?" Oh yeah, sure, there are still problems. Do you think the kind of data you have now is the right kind of data? Fine. But, in the future, or in some other environment in which you're going to use this system, will that data look different, right? Maybe some things are a matter of visual information. Maybe, you know, a stop sign, right? A stop sign. If I wanna figure out whether a sign is a stop sign or a go sign, do I need anything else other than a photograph? I don't think so. I mean because a sign is almost defined as being visual information. So if I'm trying to distinguish between stop signs and go signs, you know, all I need is a photograph. Fine. But is there something that can go wrong? So the semantic gap is kind of not a big issue anymore. It's just visual stuff; a photograph is fine-ish, probably. Is there something that can go wrong here? Well, maybe I'm gonna use this system in another country, where their stop signs are different. So I actually have the right kind of information; it's visual information. It could, in principle, solve the problem, but, actually, when I come to use that kind of information, it turns out I didn't have the right set of it. Like I had only my in sample, and what I needed was more of the out sample. The out sample is, obviously, the infinite size of stuff that I could have. So issues abound there, and what this can end up meaning is that, actually... 
What can only be built in many cases, or some cases, is assistive systems. Not necessarily automated systems, that can just continue making decisions for themselves, but assistive systems, which require human interaction. And the human there is far more intelligent, far more capable, than the machine. And that's this really, I think, important takeaway point from this section is that a machine learning system is very dumb. It's very subject to the mistakes of the practitioner, their choice of data, their priorities, the way they consider the problem to go, and it is very easy for someone with no statistical knowledge, no knowledge of machine learning, no knowledge of programming or anything of that kind, to come along and say, "Well, actually, this is wrong," because they know the problem better. They know the environment of the problem better. They know their customers better, the films better, the genres better, whatever it may be. And it's very very important... Well, as a society, but also as a practitioner, that we don't end up in situations where, merely because we have labeled some system intelligent, that we believe it is intelligent because it isn't. It isn't. And such a notion will lead to really quite... Possibly quite catastrophic results. Massive loss of money at the very least, but possibly also, in cases of disease spreading, possibly massive loss of life. The difference between models which are accurate and models which are inaccurate is a really big difference.


Unsupervised Learning - The Theoretical Basis of Machine Learning - Finding the Model with Linear Regression Part 1 - Finding the Model with Linear Regression Part 2 - The Semantic Gap - Approximating the Truth Part 1

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.