Issues with Machine Learning – The Semantic Gap
1h 30m

Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn't capable of. This course is part two of the module on machine learning. It covers unsupervised learning, the theoretical basis for machine learning, finding the model with linear regression, the semantic gap, and how we approximate the truth.

Part one of this two-part series can be found here, and covers the history and ethics of AI, data, statistics and variables, notation, and supervised learning.

If you have any feedback relating to this course, please contact us at


Right, so let's have a look at some issues and problems surrounding machine learning, now that we have some idea of what's going on with it. We've talked about the weak AI perspective, in which we have rules that are specialized by data. We've looked a little at breaking down the problem in terms of target and features: you're predicting one thing or another, or you're trying to analyze connections or patterns between things. Let's look at this a bit more formally, from a more statistical perspective. What is the machine doing? And what are some problems with it?
So let's set this up one more time. This is the supervised ML setup: you've got a target and you've got features, and we're trying to estimate something. How? We're trying to find an estimate function, f hat, which takes what we know, x, and gives us a guess, y hat. We imagine that somehow there's a real function out there, f. We don't know what that is, so what we use instead is a data set, which is just the x's and y's. We hope that this data is somehow drawn from the true function, so that when we use it to approximate, everything works out.
So we've got the target, y, and the feature, x, and we call y hat the estimate for the target. The f hat gets a technical name: I'm going to call it the model. The f without the little hat we can call the true function. People call this the ground truth, or the true model: whatever it is, it's the thing we're trying to model.
While we're here, we can see how to plug this into the weak AI notion. Go back to the beginning of all this setup, where we talked about tuning rules. What are we doing here?
If we build an AI system, we end up saying something like: if the model, given what we know, gives us a prediction of some kind, say it comes out as a positive number, then we say it's cancer, or that we turn left, or that we like the film, or whatever it is. Maybe we won't put cancer next to the word "like"; let's go for something a bit less callous. Say we predict "like", and otherwise we predict "dislike". So there's a way of slotting this model into the algorithmic picture of "if this, then this, else this", which is the model of computation that we have. Computation is just algorithmic like that, and the statistics just fits into the algorithm. It fits into the if/else structure, and it gives us the values and the tests, the propositions, that allow us to tune the system to fit some data set. So that's what we're doing: finding this model, finding these values, building this little algorithm.
Now, some issues. Let's talk about this model, because that seems to be the heart of the matter. The model is f hat; it takes x as the input and gives us our predictions. As an example, a linear model would be something like 2x plus 3, and what we would be doing to solve the machine learning problem is coming up with the 2 and the 3 that give us our y hat, our prediction. You put in 10 years old, you get 23 for your rating, or whatever it is. That's the only problem: coming up with this model. We'll talk about these numbers in more detail later, but they are the parameters of the model.
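As a rough illustration of the two ideas in this passage, here is a minimal Python sketch of the linear model 2x plus 3 slotted into an if/else rule. The function names `f_hat` and `predict_label` are made up for this example, and the parameters 2 and 3 are simply the ones from the spoken example, not values learned from any data.

```python
def f_hat(x):
    """Linear model with parameters 2 (slope) and 3 (intercept)."""
    return 2 * x + 3


def predict_label(score):
    """Hypothetical decision rule: positive output means 'like', else 'dislike'."""
    if score > 0:
        return "like"
    else:
        return "dislike"


rating = f_hat(10)            # put in 10 years old
print(rating)                 # 23
print(predict_label(rating))  # like
```

In a real system the 2 and the 3 would be found by fitting to a data set; here they are fixed by hand only to show where the model sits inside the rule.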
These are the things you have to find; they are the pieces of the model that you're fitting, however you want to say it. Now, what are we hoping for with this whole approach? Let's think about this. What's x? x is a feature, right? Let's choose as a feature something a little more interesting than we've chosen before: an image. How do you have an image as a feature? Well, what we could do is treat it as a grid of numbers. So in blue I've put a little grid over the image, and that gives us pixels, basically. If a pixel is black we put a 1, and if it's empty or white we put a 0. So inside this little area here we have a 1. The numbers are 1, 1, 1, 1, 1, 1, 1 where you have the little smiley face, and everything else is 0. It doesn't really matter exactly how that plays out; you get the general idea.
So what is this? The image isn't a little x, because it's not just one number now; it's a big capital X. It's many features, many numbers. This capital X is a matrix: ones where the smiley face is drawn and zeroes everywhere else. And what's our model going to be? Well, we could have a linear model. What we could do is find some number to multiply each of these entries by: 2 times this entry, 3 times that entry, 4 times that entry, whatever. Some of these numbers may be negative and some positive. If you multiply each entry by its number, add everything together, and you get zero or above, maybe that tells us the face is happy. I'll just write that down for you.
So maybe we say our y here is going to be either +1 or -1, and we're going to take the pixels as a row: X1, X2, X3, and so on. These are all going to be zeroes and ones, remember, on or off, so this is like 0, 1, 0, 0, 0. You multiply them by some numbers, 2, 3, 4, -1, -2, -3, and add them up. The question is: if all of this together gives more than zero, we predict that the face is happy; if it's less than zero, we predict that the face is sad. That's the thing we've learned. That's what the model is. Take a grid of numbers, do your multiplications, do your additions. If it comes out positive, it's happy. That's how we're going to predict happy or sad for our images.
So you can imagine a problem where what you have is a set of images of different kinds. Rather than happy or sad, you could have dogs and cats; that's a classic one. So maybe here we have, well, I don't really know what I'm drawing here, the face of a mouse of some kind possibly, and here we have a little cat or something. You can imagine a big set of images, and each one comes in as a particular big capital X: capital X1, capital X2, and so on for the different images. The idea is that the machine is somehow going to distinguish between them.
Let's look at some axes again. If we consider just one pixel, what we might hope is that there's one pixel, let's say this little pixel here, such that if that pixel is on it's a cat, and if that pixel is off it's a dog. If all of the cats have that pixel on and all of the dogs have that pixel off, then we don't need to consider any of the other pixels. Since it's a binary classification problem, we could solve the entire classification problem just by looking at that one pixel.
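The happy/sad model described above (flatten the grid, multiply each pixel by a number, add up, check the sign) can be sketched in a few lines of Python. The 3x3 images and the weights are invented purely for illustration; they are not learned, and treating a score of exactly zero as "sad" is an arbitrary choice made for this sketch.

```python
def flatten(image):
    """Turn a grid of 0/1 pixels into a single row X1, X2, X3, ..."""
    return [pixel for row in image for pixel in row]


def predict_face(image, weights):
    """Return +1 (happy) if the weighted pixel sum is positive, else -1 (sad)."""
    pixels = flatten(image)
    score = sum(w * x for w, x in zip(weights, pixels))
    return +1 if score > 0 else -1


# Hypothetical weights, one per pixel (these are the model's parameters).
weights = [2, 3, 4, -5, -2, -3, 1, 1, 1]

# Two made-up 3x3 images: 1 = black pixel, 0 = white pixel.
smile = [[1, 0, 1],
         [0, 0, 0],
         [1, 1, 1]]
frown = [[1, 0, 1],
         [1, 1, 1],
         [0, 0, 0]]

print(predict_face(smile, weights))  # 1  (score is 2 + 4 + 1 + 1 + 1 = 9)
print(predict_face(frown, weights))  # -1 (score is 2 + 4 - 5 - 2 - 3 = -4)
```

Learning, in this picture, means searching for the weights that make the sign come out right on as many labelled images as possible.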
Is it on or off? The model there would just be something like this: if this pixel was X351, pixel 351, you would say that if that pixel is more than zero, if it's on, then it's a cat, and if it's not, if it's off, we say it's a dog. It would be a very simple thing. But presumably it's going to be some combination of pixels: a bit of this pixel, a little bit of that pixel, and if they're all on then the image is probably a cat. And it wouldn't be perfect; there would be a mixture of cats and dogs, and we just need to do the best we can. So that's an interesting problem to consider: a set of cat images and a set of dog images. Can we come up with some way of combining all of these pixels into a number that gives us a sense of cat or dog? That's the setup.
What's the problem? What are the limitations of this supervised machine learning approach, given this setup? Well, what's this model? The model is a little f hat, a little prediction function. Let me just zoom back in there. That model takes in an input, a big image X, and gives a prediction for whether that image is a cat or a dog. Let's characterize this model. What is it? It's something that takes an input, photographs or images, I'll say images, to an output, which you could write as plus or minus 1, the little set {+1, -1}. That's a way of defining the function: it goes from the set of images to the set {+1, -1}, taking every image to a plus or minus 1. That's what the function f hat does.
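To make the single-pixel rule and the "images to {+1, -1}" signature concrete, here is a small Python sketch. The image size, the helper `make_image`, and the four labelled examples are all invented for illustration; the last example is deliberately a dog with pixel 351 switched on, to show that such a rule would not be perfect.

```python
PIXEL_INDEX = 351   # the pixel singled out in the lecture
NUM_PIXELS = 400    # assumed image size (for example a 20x20 grid, flattened)


def make_image(on_pixels):
    """Build a flattened 0/1 image with the given pixel indices switched on."""
    pixels = [0] * NUM_PIXELS
    for i in on_pixels:
        pixels[i] = 1
    return pixels


def classify_by_one_pixel(pixels):
    """Map an image to +1 (cat) if pixel 351 is on, otherwise -1 (dog)."""
    return +1 if pixels[PIXEL_INDEX] > 0 else -1


# Made-up labelled data: (image, label) with +1 = cat, -1 = dog.
dataset = [
    (make_image([351, 10]), +1),
    (make_image([351, 42, 99]), +1),
    (make_image([10, 20]), -1),
    (make_image([351, 7]), -1),   # a dog that happens to have pixel 351 on
]

correct = sum(1 for pixels, label in dataset
              if classify_by_one_pixel(pixels) == label)
accuracy = correct / len(dataset)
print(accuracy)  # 0.75
```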
Now, for this to work out, in other words to get a good estimate for every image, or to get the truth for every image when we're checking whether it's a cat or a dog, recall that there really needs to be a function f such that when you give it any image, it tells you whether it's a cat or a dog. What we're trying to do is estimate this true function. Here's the problem: it's very unlikely that such a function exists. Reality is not such that you can take any image and tell whether it shows a cat or a dog. Let me show you concretely what I mean. Here's an image: let's put some little ears on, and let's put a little tree in the way. Now, probably it wouldn't be a tree, maybe a little bush, otherwise it would be a very large dog. This image could be either a cat or a dog. In other words, it would be possible for me to take a camera, arrange a scene, and produce an identical photograph with a dog and with a cat in it. What does that mean? It means that a photograph is insufficient data to tell whether something is a cat or a dog. What I've done here is occlude the animal, so I can't tell what it is.
What other problems could there be with a photograph? Well, maybe I could shave a dog or a cat so that they look identical. Or I set things up with a particularly fluffy dog, or a particularly aggressive-looking cat, and the two photographs, one taken with a chihuahua, one taken with a little cat, look identical. Occlusion is more plausible, right? If we're just looking at the ears, you could think that could be either a cat or a dog.
The point is this: photographs are insufficient data for telling whether something is a cat or a dog, and in fact they can never be sufficient. Even with an infinite number of photographs you couldn't know. If I had all the photographs I could possibly take in the entire universe, at every moment, of all dogs in every position and all cats in every position, there would still be a photograph, just like the one I've drawn here, which is ambiguous, which doesn't resolve the matter. Visual information does not, on its own, distinguish cats from dogs.
What else could I do? Suppose I give you an infinite number of two-dimensional photographs. You'll tell me that's still not enough, because there is no such relationship: an f of a photo does not tell you, cannot tell you, whether something is a cat or a dog. That function doesn't exist. As a side point, if we're trying to estimate this thing, not existing is a pretty tricky property for it to have, right? That tells us that maybe we're in trouble, that maybe we can't build a system using photographs to tell cats from dogs. Maybe we can, maybe we can't. We have to think about that.
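The occlusion argument can be shown with a toy Python sketch. Everything here is an assumption made for illustration: the 3x3 "scenes", the bush mask, the `take_photo` helper, and the example classifier. The point is only that two different scenes can produce the same pixels, and any function of the pixels must then give the same answer for both, so it is wrong for at least one of them.

```python
def take_photo(scene, occluder):
    """Render a scene to pixels; wherever the occluder is 1, the camera
    sees the occluder (pixel value 1) instead of the scene behind it."""
    return [[1 if occ else pix for pix, occ in zip(scene_row, occ_row)]
            for scene_row, occ_row in zip(scene, occluder)]


# Two made-up scenes that share the same "ears" (top row) but differ below.
cat_scene = [[1, 0, 1],
             [0, 1, 0],
             [0, 1, 0]]
dog_scene = [[1, 0, 1],
             [1, 1, 1],
             [1, 0, 1]]

# A bush covering the bottom two rows of the frame.
bush = [[0, 0, 0],
        [1, 1, 1],
        [1, 1, 1]]

photo_of_cat = take_photo(cat_scene, bush)
photo_of_dog = take_photo(dog_scene, bush)
print(photo_of_cat == photo_of_dog)  # True: the two photographs are identical


def some_classifier(photo):
    """An arbitrary example of a function from photo to label."""
    return "cat" if sum(map(sum, photo)) > 5 else "dog"


# Identical input forces identical output, whatever the function is,
# so the classifier is necessarily wrong about one of the two scenes.
print(some_classifier(photo_of_cat) == some_classifier(photo_of_dog))  # True
```

Swapping in any other deterministic classifier does not change the outcome, because the ambiguity lives in the data and not in the choice of model.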


Unsupervised Learning - The Theoretical Basis of Machine Learning - Finding the Model with Linear Regression Part 1 - Finding the Model with Linear Regression Part 2 - Approximating the Truth Part 1 - Approximating the Truth Part 2

About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. While studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.