Issues with machine learning – approximating the truth - Part 1
Practical Machine Learning
Machine learning is a big topic. Before you can start to use it, you need to understand what it is, and what it is and isn’t capable of. In this module we’ll start with the basics, introducing you to AI and its history. We’ll discuss the ethics of it, and talk about examples of currently existing AI. We’ll cover data, statistics and variables, before moving on to notation, supervised and unsupervised learning. Finally, we’ll finish by going into some depth on the theoretical basis for machine learning, model and linear regression, the semantic gap and how we approximate the truth.
- So let's try and bring this together, taking this example of photographs and the problem of classifying them into cats and dogs. Let's take that problem and use it to illustrate a couple of the issues around machine learning: when it might fail, when we might struggle to perform well, or to get a good model that solves the problem accurately or efficiently, however you wish to phrase it. Right, so let me give you the two headings of what these problems are, and we'll go through each in turn.
So the first one, which sort of naturally follows from the conversation that we've had just now, is the semantic gap. That's what we've been talking about really: the kind of distance, metaphorical distance in a sense, the gap between the data we have and what we're trying to learn. I want to know about cats and dogs, but all I have is photographs. Is a photo sufficient information to understand the distinction between a cat and a dog? No, it isn't. So there's some kind of a gap between, on the one hand, the relevant features or properties that objects have in the world (in the case of cats and dogs: DNA, bone structure, behavior) and, on the other hand, the data we use to encode or represent them. Visual information does not fully encode those properties. That's the first issue, and we'll talk about it in more detail.
The second issue is approximating the truth. And just to give you a bit of a preview of that idea, it has to do with how well we are doing in coming up with this model. If we knew what the truth were, we could do a good job, but since we don't, what things can go wrong? All right, so we can think of it this way if we wanted to, right? The semantic gap says, well, actually, maybe this function, this relationship between the data we have and what we want to know, maybe it doesn't exist. And that's the problem with the semantic gap.
Part two here is, okay, well, suppose it does exist. Suppose, in fact, I give you information about someone's heart rate, and you can tell me, I don't know, what can you tell from someone's heart rate? Their blood pressure? Probably not, but you could imagine some kind of relationship a scientist might investigate, in which the data that they are collecting is in fact sufficient for predicting or understanding or explaining the thing of interest. So, if I tell you the voltage in a circuit, and the current, you can tell me the resistance of pieces of that circuit, without any more information. And that's true. So, the first one is: maybe the relationship we're trying to approximate doesn't exist. And the second one is: well, what goes wrong, even if it does exist? Okay, so let's talk about each of these in turn.
So, the semantic gap. As I said, this one is really about the data we have not encoding the necessary sort of information that we'll need to solve the problem. So let me try and make that as intuitive as I can. It's a relatively philosophical issue, but it's actually a fundamentally important one. In what sense is it philosophical? Well, I think when you try to solve a machine learning problem, or a problem in science or statistics, or any area where you're collecting data about the world, one of the things you need to think about is: what is the world really like? What properties of things actually distinguish them?
So let's take a more extreme example than cats and dogs, perhaps. Say I'm trying to predict whether two people will fall in love. So I have this model, and my inputs into the model are some data, data from person one, data from person two, and what I want to know is: will they fall in love? Plus one for yes, minus one for no, and that would be my prediction. So it'd be a y, with a little love heart, like that, you see. Now, what data am I gonna put in there? That's a real question, right?
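To go back to the circuit example for a moment, here is a minimal sketch of a case where the data really is sufficient. It assumes the piece of the circuit is a simple ohmic resistor, so that Ohm's law, R = V / I, applies. The function name `resistance_ohms` and the numbers are illustrative assumptions, not something from the lecture.

```python
def resistance_ohms(voltage_volts: float, current_amps: float) -> float:
    """Resistance of one component from the voltage across it and the
    current through it, assuming Ohm's law (R = V / I) holds."""
    if current_amps == 0:
        raise ValueError("current must be non-zero to compute a resistance")
    return voltage_volts / current_amps


# Illustrative values only: 12 V across the component, 2 A through it.
print(resistance_ohms(12.0, 2.0))  # 6.0
```

Here the function from inputs to answer exists and is known exactly, so there is no semantic gap to worry about. The love example is the opposite situation.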
I mean, one thing I could try is, well, let's not use the same symbol here, but we could say, the number of text messages they send to each other each day. Maybe we imagine they've gone on a date or a series of dates. Maybe this is a dating website, and we're collecting data somehow from their handsets, or we're not actually looking at text messages, we're looking at the frequency of messages on our site, okay? So this is a dating website setup. We've got one input, which is messages from person one to person two, and then the other one is gonna be messages from person two to person one, right? So, okay, fine. And what I'm trying to do is predict whether they'll fall in love.
Well, that's mad, right? That's just mad. Imagine a world in which you could predict whether two people would fall in love, knowing only the number of messages one person is sending to the other and vice versa. I mean, what kind of world would that be? That would be a world in which human emotion, human communication, human connection, was simply a function of, well, how often you talk to each other. Is that right? I mean, it doesn't seem right. There's a massive semantic gap between that information and whether people fall in love.
But it doesn't mean that we can't actually produce a model like this, right? We can make this prediction. The prediction probably won't be very accurate, but it will probably be more accurate than a coin flip. I mean, the rate at which people send messages to each other is probably relevant; it can be used to make better-than-random guesses at the issue. The photograph case is maybe a little more tricky to see. But in this case, hopefully you can see there's this massive gap between the information we're using and the actual relevant features of the world that actually distinguish things.
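A rough sketch of that "better than a coin flip, but not very accurate" point, using entirely made-up data rather than anything from a real dating site. Everything here is an assumption for illustration: the hidden `compatibility` value stands in for all the things about two people that the site never observes, the message counts only partly reflect it, and `predict` is a hand-picked threshold rule, not a trained model.

```python
import random


def simulate_pairs(n_pairs, seed=0):
    """Return (messages_1_to_2, messages_2_to_1, y) tuples of synthetic data."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # Hidden property of the world that the model never gets to see.
        compatibility = rng.gauss(0.0, 1.0)
        y = 1 if compatibility > 0 else -1  # +1 love, -1 no love
        # Daily message counts: weakly driven by compatibility, mostly noise.
        m12 = max(0, round(10 + 2 * compatibility + rng.gauss(0.0, 4.0)))
        m21 = max(0, round(10 + 2 * compatibility + rng.gauss(0.0, 4.0)))
        pairs.append((m12, m21, y))
    return pairs


def predict(m12, m21, threshold=20):
    """Guess +1 (love) when the pair exchanges more messages than the threshold."""
    return 1 if m12 + m21 > threshold else -1


def accuracy(pairs):
    """Fraction of pairs for which the threshold rule matches the label."""
    hits = sum(predict(m12, m21) == y for m12, m21, y in pairs)
    return hits / len(pairs)


pairs = simulate_pairs(5000)
print(f"accuracy from message counts alone: {accuracy(pairs):.2f}")
```

In this toy setup the rule beats random guessing because the message counts carry a little of the hidden signal, but it stays well short of perfect because most of what decides the label is simply not in the inputs.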
And in my view, my personal view, machines will never bridge that gap, in the way that we currently program them anyway. I don't think machine learning is an activity of bridging that gap. In other words, what is it to not have a semantic gap? That's to do science, basically, right? I mean, why is that? Well, if I want to figure out under what conditions people fall in love, that would be many lifetimes of research. You'd need hundreds of researchers from extremely diverse fields, neuroscience and possibly even sociology, psychology, empirical psychology, and who knows what else, all coming together, giving you lots of different insights into different layers of the problem. And most of the information they'd collect wouldn't be useful, and it would take a very long time to figure out what actually was relevant and what wasn't.
And that question of relevance, and how it connects to the problem, really requires a sort of person to be there, right? You need to sort of be in the environment, be trying out lots of different things, see that something's failed, try different ones. You need to be in this kind of responsive relationship to the environment to figure out which bits of the information are relevant. Machines are not in that situation. Certainly machine learning is not a technique to put them in that situation. It's a technique in which the data has already been produced, and it's far less than an infinite amount of data. It's a relatively small amount. Even billions of images still might not be enough to describe, for example, all the roads in the world and all of their situations and all of their circumstances. So, of all the visual information you could have, what the machine ends up seeing or analyzing is a relatively small percentage of all the information it could have.
And then secondly, the problem is, of course, that the information you've provided isn't actually that relevant. It's a little bit relevant in a sense, but it is not semantically relevant, in the sense that visual information is not sufficient for describing the key parts of the problem. So machines are just trapped in that, right? They're just fed this information, they analyze it, and they produce models that try to classify cats, dogs, love, not love, whether the car should turn left or right or whatever. They are not exploratory devices. They're not curious, they're not scientific, they're not trying to form hypotheses and refute them and think through problems. They're sort of trapped within a kind of blind repetition of the things that they have seen before, generalized a little. And if that information is not sufficient to actually solve the problem, then, yeah, that's that.
So the semantic gap is one really important thing to consider: is it even possible to take this information and solve the problem? And oftentimes, in this sort of more principled sense, it won't be. And so the question then is, well, if I can build a system that does in fact correctly classify cats and dogs, why is it doing that? How is it doing that? Let me preview one issue and then we'll move on from this one. So there's a case here of a machine learning system taking in pictures of cats, however many we have or want to have, and pictures of dogs. And it turned out that, I'm not sure how I wanna draw a dog, but whatever, it turned out that all the pictures of the dogs were shot on a sort of mountainside, so they all had snow in them. And the way the machine was solving this problem of classifying cat and dog was basically by saying: well, if there are large white regions in the image, it's a dog. Well, that's not right.
As soon as you put an image in there that doesn't have white regions, but contains a dog, you will get the wrong answer. And in fact, I think, we need to be kind of attentive to the fact that actually, there isn't a wrong answer anyway. Okay, what we could do is, we could make sure that the system is looking at the forehead, and the eyes and the visual structure. And that would give us a better system: if you put a different kind of dog in, hopefully it would still work. But because vision information is insufficient for solving the problem, there will be some number of cases where this doesn't work at all.
And then the question for you as a practitioner is, okay, let's think about those cases. Where is the system being deployed? Where is the system being used? Are we likely to encounter those cases? I mean, using snow as a guide to whether something's a dog is a perfectly reasonable solution, so long as the situation in which you deploy the system always has dogs on a snowy background. If that's always going to be the case, there's no need to really care that much. But it does highlight that what we're not doing here is somehow using this background notion of dog that we have as practitioners and as human beings. The machine is not using this notion and then somehow applying it to images; it has no such notion. It is just using pixels, ratios of pixels and combinations of pixels to distinguish things. And that, in principle, doesn't solve the problem. Okay, so that's part one.
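The snow shortcut can be sketched in a few lines. This is a hand-written stand-in for what the learned model was effectively doing in the story, not the actual system: the tiny 4x4 grayscale "images", the `white_level` of 200 and the `threshold` of 0.3 are all made-up values for illustration.

```python
def white_fraction(image, white_level=200):
    """Fraction of pixels at least as bright as white_level (0-255 grayscale)."""
    pixels = [p for row in image for p in row]
    return sum(p >= white_level for p in pixels) / len(pixels)


def classify(image, threshold=0.3):
    """Shortcut rule: large white regions mean 'dog', anything else means 'cat'."""
    return "dog" if white_fraction(image) > threshold else "cat"


# Training-like examples: every dog photo happens to have a snowy background.
dog_on_snow = [
    [255, 250, 240, 255],
    [255, 90, 80, 250],
    [245, 85, 95, 255],
    [255, 255, 250, 245],
]
cat_indoors = [
    [60, 70, 65, 80],
    [75, 120, 110, 70],
    [65, 115, 125, 60],
    [80, 70, 75, 65],
]
# Deployment example: a dog with no snow anywhere in the frame.
dog_on_grass = [
    [40, 60, 55, 45],
    [50, 90, 80, 60],
    [55, 85, 95, 50],
    [45, 50, 60, 40],
]

print(classify(dog_on_snow))   # dog
print(classify(cat_indoors))   # cat
print(classify(dog_on_grass))  # cat (wrong: the rule only ever looked at snow)
```

The rule is perfect on the first two images and fails on the third, which is why the deployment question matters: the shortcut is only safe if every dog the system will ever see is standing on snow.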
About the Author
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.