Practical Machine Learning
Selecting the right machine learning model will help you find success in your projects. In this module, we’ll discuss how to do so, as well as the difference between explanatory and associative approaches, before we end on how to use out-sample performance.
- Let's now turn to the concern of model quality and performance, and how we're gonna select and choose between models. So let's call this performance. The question of performance. And really, it's about out-sample performance. We don't care how well the model does on the things we've already seen. We wanna know how well it's going to do on what we've yet to see.

So if you think about the world as being one big blob of data, like that, then there's gonna be a region of that data set that we have seen: the in-sample. Probably much smaller than that, to be honest, right? If this is all the people who have ever committed fraud that we could ever possibly want to know about, we've probably only seen this small group of people. That would be our in-sample. So all of this is our out-sample. This is essentially all the people who we would even be considering.

What we would hope is that in the in-sample, there are some people we see that have committed fraud, and those are in red, and there are some people who haven't committed fraud, and those are in green. What we would learn is some decision boundary that goes through what we have seen, but it's meant to generalize, so it's also meant to split the out-sample in two as well. So if we see someone in the future, where we don't know what their status is, we would hope that if the boundary we learned in the in-sample classifies them as a person committing fraud, they are also a person who commits fraud in the out-sample. In other words, that the pattern continues to hold.

Now, if the world actually looks very different in the out-sample, so maybe in the out-sample the truth is actually more like that, so this is actually the truth, say, and that's just what we predicted, then what's gonna happen? Well, there's a certain region, this region here, where we're gonna predict that people have committed fraud when they haven't.
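To make the in-sample and out-sample idea concrete, here is a minimal sketch. It assumes a synthetic data set from scikit-learn's `make_classification` as a stand-in for the fraud example (the lecture uses no real data here), and the choice of a decision tree, the 10% "seen" fraction, and all variable names are illustrative assumptions rather than anything taken from the course.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for "the world": label 1 = fraud, label 0 = not fraud.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# We only ever see a small region of the world (the in-sample).
X_in, X_out, y_in, y_out = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

# Learn a decision boundary from what we have seen.
model = DecisionTreeClassifier(random_state=0).fit(X_in, y_in)

in_sample_acc = model.score(X_in, y_in)      # performance on what we've seen
out_sample_acc = model.score(X_out, y_out)   # performance on what we haven't

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_out, model.predict(X_out)).ravel()

print(f"in-sample accuracy:  {in_sample_acc:.3f}")
print(f"out-sample accuracy: {out_sample_acc:.3f}")
print(f"over-predicted fraud (false positives):  {fp}")
print(f"under-predicted fraud (false negatives): {fn}")
```

The false positives counted here correspond to the region where the learned boundary predicts fraud but the truth says otherwise.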
We've said that people above this line are people who committed fraud and people below the line are people who haven't. There's this region here where we over-predict fraud. So that question of over-prediction or under-prediction is a sane one. Now, before we go into these issues in detail, we're gonna consider, just at a high level, the nature of classification and how issues like this arise.

So let's just take a visualization from the sklearn website again, in which we're comparing classifiers. Let me zoom that just a little bit. So here is a classifier comparison, and let's explain how this works. On this side here, we have our data set, so that's kind of like this circle I've drawn here, but drawn on a square, which actually makes a lot more sense. People sometimes like to draw sets as circles, but mostly we visualize data sets as squares, don't we? With a graph. So here's the data set coming in, and you can think of the true function which separates these as a half-moon function, so in truth this data set kind of has that shape in the red and the blue. Let me zoom in there a little bit perhaps. I'm not sure how else you'd want to draw that, but you could see it maybe just as a kind of curve doing that as well, I suppose. So that's the first one, data set number one. Data set number two is circular, so its true function is a circle in the middle: the outside is red and inside here is blue. And then on the final one, I think this is just a linear decision boundary. So on the left we have blue, on the right we have red.

Now, going across this visualization, we have different algorithms. So we have algorithm one, algorithm two, algorithm three.
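The three data sets described above can be sketched with scikit-learn's synthetic generators. This is a rough reconstruction, not the exact code behind the visualization: the sample size, noise levels, and seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles, make_classification, make_moons

# Data set one: two interleaved half-moons, with some noise.
X_moons, y_moons = make_moons(n_samples=100, noise=0.3, random_state=0)

# Data set two: an inner circle of one class surrounded by a ring of the other.
X_circles, y_circles = make_circles(
    n_samples=100, noise=0.2, factor=0.5, random_state=1
)

# Data set three: two blobs, one per class, that a straight line can roughly split.
X_linear, y_linear = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)

datasets = {
    "half-moon": (X_moons, y_moons),
    "circular": (X_circles, y_circles),
    "linear": (X_linear, y_linear),
}

for name, (X, y) in datasets.items():
    print(f"{name}: X shape {X.shape}, classes {sorted(set(y.tolist()))}")
```

Each data set has two features, which is what lets it be drawn as a square with a graph.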
So in algorithm one, we have the nearest neighbors algorithm, which is familiar to us, and what this algorithm has done is learnt this particular boundary by considering a particular number of neighbors. You can see where it's going to classify red and where it's gonna classify blue. So this is quite an interesting boundary, it's quite irregular. It's learnt that there's a little pool of blue there, and it's very, very insensitive to the reds here, which is interesting. What's it done in the circle? In the circle it's kinda learnt the middle a bit, and here, where it should be quite a straight line, it's very curvy. That's because it didn't have a lot of data, so it's hard for it to decide exactly where things are.

Right, so you've got nearest neighbors, and then going across, you have lots of different algorithms: you've got an SVM, a radial SVM, a Gaussian Process, a Decision Tree, Random Forest, Neural Network, Naive Bayes. We'll be having a look at some of these throughout the course, but what I want us to observe right now is how radically different these decision boundaries are. I mean, a world in which the data is distributed this way, according to the boundary that this algorithm, the SVM, has learned, is a very different world than this one.

Think about it: if these were meant to be explanatory models, if this was a scientific process, consider what this particular solution says. This solution says that in this variable, and in that variable, the truth is that people below this point are red and people above that point are blue, and that's that. So that's the sort of hard ground truth, the scientific explanation. There will be a very, very specific point at which the world changes. Things below this very specific point are red and things above it are blue.
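One way to put a number on how different these boundaries are is to fit several classifiers to the same data and count how often they disagree across the whole plane. The sketch below assumes a half-moon data set and three of the algorithms mentioned; the hyperparameters, the grid resolution, and the `disagreement` measure itself are illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.3, random_state=0)

models = {
    "nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "linear SVM": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# A regular grid covering the square in which the data set is drawn.
xs = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 100)
ys = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 100)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Each model's prediction at every grid point describes its decision regions.
grid_preds = {name: m.fit(X, y).predict(grid) for name, m in models.items()}

# Fraction of the plane on which each pair of models gives a different answer.
names = list(models)
disagreement = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        disagreement[(a, b)] = float(np.mean(grid_preds[a] != grid_preds[b]))
        print(f"{a} vs {b}: disagree on {disagreement[(a, b)]:.1%} of the grid")
```

All three models were trained on identical points, so any disagreement comes purely from the shape of boundary each algorithm is able to draw.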
It's true of certain phenomena. I mean the boiling of water, for example: if this were 100 degrees centigrade, it would be liquid below and a gas above. So there are sharp-edged things in the world, but compare it to this model. This is a very, very different sort of model. I mean, over here things are red and they gradually become blue, and red and blue are in slightly different places. The boundary doesn't seem so hard, and where it's very confident it has this very circular structure, whereas this is a very hard, straight-line boundary. And likewise in the k-nearest neighbors, the boundary is very different. All these algorithms are coming up with very different models, very different solutions to the problem. So this is a particular model, that's a particular model, and they're very different.

None of these is true. This is not science, this is just machine learning, and these are really just associative models; they're not trying to be explanatory. The question then is, "Which is better? Which performs better?" Well, since we know the true distribution of the data here (this data was generated using a half-moon distribution with some noise, this one was generated using a circular distribution with some noise, and likewise a linear split), we can sort of compare that with what we get. We can say, well actually, maybe a Gaussian Process is better for the circular one, and maybe for the half-moon one a Gaussian Process is also pretty good, so a Gaussian Process is doing very well here. But then for the linear one, it goes all too curvy. The layout is much more black and white than that, with a very sharp boundary, so what we'd like here maybe is the Decision Tree, or possibly this Neural Network. Although that seems to have got it wrong, it seems to have given that boundary too much of a slope. So maybe we say the Decision Tree is actually the most accurate algorithm here. According to the truth of it, of course.
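Because these data sets are synthetic, "which performs better?" can be checked directly: generate a large sample from each generator, let each model see only 100 points, and score it on the rest. The sketch below is a rough illustration under assumed settings (sample sizes, noise levels, hyperparameters, and the four models chosen), so the ranking it prints need not match the one read off the visualization.

```python
from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

N = 2100  # 100 points the model sees, 2000 it does not

worlds = {
    "half-moon": make_moons(n_samples=N, noise=0.3, random_state=0),
    "circular": make_circles(n_samples=N, noise=0.2, factor=0.5, random_state=0),
    "linear": make_classification(
        n_samples=N, n_features=2, n_informative=2, n_redundant=0,
        n_clusters_per_class=1, random_state=1,
    ),
}


def make_models():
    """Return fresh, unfitted models so each world starts from scratch."""
    return {
        "nearest neighbors": KNeighborsClassifier(n_neighbors=3),
        "radial SVM": SVC(kernel="rbf"),
        "Gaussian process": GaussianProcessClassifier(1.0 * RBF(1.0), random_state=0),
        "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    }


scores = {}
for world, (X, y) in worlds.items():
    X_seen, X_unseen, y_seen, y_unseen = train_test_split(
        X, y, train_size=100, stratify=y, random_state=0
    )
    for name, model in make_models().items():
        # Fit on the seen points only, score on the unseen ones.
        scores[(world, name)] = model.fit(X_seen, y_seen).score(X_unseen, y_unseen)
        print(f"{world:10s} {name:18s} out-sample accuracy {scores[(world, name)]:.3f}")
```

This comparison is only possible because the generating process is known and can be sampled as much as we like.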
So, we can think of this truth as being this entire data set here. We can think of the data it's seeing as giving it this particular boundary, and then we know what the true function is, so we can sort of look at the true function here and go, "Well actually, it should be like this or that." But in general, we don't know how the data was generated, so we have to use very novel forms of testing and evaluation to come up with what the best solution is going to be.
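When the generating process is unknown, one common way to estimate out-sample performance from the data we do have is k-fold cross-validation. The lecture does not name a specific technique at this point, so treat the sketch below as one possible approach; the data set, the two candidate models, and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Pretend this is all the data we have and that we don't know how it was made.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

candidates = {
    "nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Each fold is held out once, so every score is measured on points
# the model did not see while it was being fitted.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5)
    for name, model in candidates.items()
}

for name, fold_scores in cv_scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(std {fold_scores.std():.3f} over {len(fold_scores)} folds)")
```

The mean held-out score gives a basis for choosing between models without ever seeing the true function.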
About the Author
Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.
His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.