Module 4 - Model Selection
Model Selection

Selecting the right machine learning model will help you find success in your projects. In this module, we’ll discuss how to do so, as well as the difference between explanatory and associative approaches, before we end on how to use out-of-sample performance.


With the process of machine learning sketched at this point, we can now turn to looking at a key part in more detail: model choice, or model selection. It goes by both names. Selection perhaps has a connotation of an automatic choice among candidates, whereas model choice has maybe a more manual set of connotations. So, what do we need to think about in terms of model choice? Which model are we going to choose? Well, let's just list some that we've seen so far.

For regression, we have seen linear regression and we have seen K-nearest neighbors. For classification, we have seen logistic regression and we have seen K-nearest neighbors. And just recall that the regression part of logistic regression doesn't mean that the problem you are solving is a regression problem; it just means that the technique you are using to solve the problem is regression. In any case, how do we choose?
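To make that point about logistic regression concrete, here is a minimal sketch using only the Python standard library. The function names and the hand-picked weight and bias are assumptions for illustration, not anything from the course materials; a real model would learn its parameters from data.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """The 'regression' step: a linear score passed through the sigmoid."""
    return sigmoid(weight * x + bias)

def predict_class(x, weight, bias, threshold=0.5):
    """The classification step: turn the probability into a label (0 or 1)."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0

# Hypothetical, hand-picked parameters; a real model would learn these.
WEIGHT, BIAS = 2.0, -3.0

labels = [predict_class(x, WEIGHT, BIAS) for x in [0.0, 1.0, 2.0, 3.0]]
```

The technique is a regression onto a probability, but what comes out at the end is a category, which is why the problem being solved is classification.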

Even with this set of four, we have a choice between approach one and approach two in each case. And approach two, the K-nearest neighbors approach, is really a whole family of options, because K can be any whole number. Well, not any number: it can only go up to the size of your data set. But there are lots and lots of options there, aren't there? K equals 1, K equals 2, K equals 3, all the way to K equals the number of data points. So, how do we choose between these approaches?
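As a rough sketch of how quickly the options multiply, the snippet below lists the candidates for a task. The function `candidate_models` and its string labels are hypothetical, made up for this illustration; they are not objects from any library.

```python
def candidate_models(n_samples, task):
    """List the candidate approaches seen so far for a given task.

    task is "regression" or "classification"; the returned names are
    plain labels for illustration only.
    """
    if task not in ("regression", "classification"):
        raise ValueError("task must be 'regression' or 'classification'")
    if task == "regression":
        approach_one = "linear regression"
    else:
        approach_one = "logistic regression"
    # K can be any whole number from 1 up to the number of data points.
    approach_two = [f"{k}-nearest neighbors" for k in range(1, n_samples + 1)]
    return [approach_one] + approach_two

candidates = candidate_models(n_samples=5, task="regression")
```

Even with only five data points there are already six candidates, and the list grows with every row you add to the data set.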

Well, there are several things you need to be aware of here. Number one is what our data determines. We can think of the data as having a role in determining our possible space of options. So, if you think of all the options that we have as a set of all possible approaches or algorithms we could take, the data is gonna leave us just some number of them. That's the data saying you can't use this kind, this kind, or this kind.

Consideration number two is a related aspect of the problem. Not necessarily what data do we have, but what do we want to do with that data? What do we want to do with the model? Are we merely predicting something, or are we trying to explain it? That's a consideration we haven't talked about so far, but I think here it will become very important. So let's call this concern explanation, and I'm gonna put it here versus the alternative, which is, maybe, association. That's the merely predictive use of an algorithm.

Just to preview this concern: this is a question of whether it matters that we're using coincidences, and of whether it matters that we can explain how we arrived at the prediction. If we can't explain how we arrived at a prediction, there may be legal implications to using it, you see. So that's one example. And of course in science we are generally not content with merely associative models; we are interested in explaining things, not just predicting them, so we'll look at that more later. And then the third set of concerns is really to do with the quality of the model.

So maybe we say quality; the word we would use more precisely, I suppose, is performance. That is, are certain approaches going to give us better performance when we come to use them in the deployment phase? When we come to predict for the out-of-sample data, are we going to get better performance with a certain approach? So, I think this is probably the right ranking, maybe, maybe not. Well, let's think about that, right? If we don't have data, there are lots of things we can't do. So yeah, probably number one in importance is the data. It doesn't matter whether you're trying to explain something or not, or whatever you're trying to do: if you don't have the data, you can't do it. So data really is number one there.
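One way to picture out-of-sample performance is to hold some data back and score each candidate on it. The sketch below does this for K-nearest neighbors regression in plain Python. The data points are made up, and the helper names and the every-third-point split are assumptions chosen for brevity, not a recommended procedure.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, test, k):
    """Score a k-nearest-neighbors model on points it has not seen."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return sum(errors) / len(errors)

# Made-up (x, y) pairs lying roughly on y = 2x; purely illustrative.
data = [(0, 0.1), (1, 2.2), (2, 3.9), (3, 6.1), (4, 8.0),
        (5, 9.8), (6, 12.1), (7, 14.2), (8, 15.9), (9, 18.0)]

# Hold out every third point as the out-of-sample set.
test = data[2::3]
train = [point for point in data if point not in test]

# Compare a small K, a middle K, and K equal to the training set size.
scores = {k: mean_squared_error(train, test, k) for k in (1, 3, len(train))}
best_k = min(scores, key=scores.get)
```

On this toy data the middle value of K scores better than either extreme, which is the kind of comparison the performance concern is about.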

Now, ranked second here is whether you are going for something explanatory or something associative. Do you need to be able to account for what you're going to do, and how comfortable are you using features that seem to be related to the target when you can't explain why they're related? How much do you care about that? That is the second concern, because it's going to limit your choice of algorithm as well. And then the final concern, which is the one we'll spend quite a lot of time on, though it isn't maybe the most important one on this list, is the quality or the performance of the approach. That's where the art of machine learning can really come in. Concerns one and two just set the problem in many ways. With concern three, this notion of quality and performance, that's where the art comes in: the heuristics, the way of thinking about stuff that can really make or break a solution.

So, okay, good. Let's talk about each of these in turn, and then we'll spend probably most of the time talking about performance. We may come back to explanation, actually. So, number one: data. How does data make or break something here? What I'm gonna do is include a visualization from the scikit-learn (sklearn) library, and that's going to give us some options around understanding how this data works. So there's that visualization. Let's just increase the size a bit, and let's take a look.

So, these are the questions surrounding our data set. Let's do this in purple maybe. We start over here and we go, okay, well, how much data do I have? If I've got more than 50 samples, then I can continue with my analysis. If I don't, then using the standard techniques of machine learning, I need more data; I can't do anything with those standard techniques. There are alternative techniques that are more like, let's say, genuine statistics, where the machine isn't doing as much work for you, so to say. Those can be helpful in the low-data case. You would call those, for example, Bayesian machine learning, and that's where you build a model by hand rather than having the algorithm give you one, and then you use this small amount of data as a guide to tuning it. You build it by hand in this case because you maybe know how the model should look, because you know what the environment of the problem is going to be, and so you can say, well, it's definitely going to be this.

So a case of that, where you have a small amount of data, may be infectious disease. If you know the process is going to be exponential, you don't need that many points, maybe, along the curve to get a sense of the underlying rate. And once you have a sense of the underlying rate, you can go straight into plotting the exponential, and then you know what the actual function is. So in the low-data case, there are more, let's say, upfront statistical techniques that you can use. In the case of off-the-shelf machine learning, you really can't do much if you don't have enough data.
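Here is a minimal sketch of that idea: if we assume up front that the process is exponential, a handful of points is enough to estimate the rate. Note that this uses ordinary least squares on the logarithm of the counts, a simpler stand-in and not a Bayesian method, and that the case counts and the function name `fit_exponential` are made up for illustration.

```python
import math

def fit_exponential(times, counts):
    """Fit counts ~ initial * exp(rate * t) by least squares on log(counts).

    Taking logs turns the exponential into a straight line,
    log(count) = log(initial) + rate * t, so the slope is the rate.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    slope_num = sum((t - mean_t) * (v - mean_log) for t, v in zip(times, logs))
    slope_den = sum((t - mean_t) ** 2 for t in times)
    rate = slope_num / slope_den
    initial = math.exp(mean_log - rate * mean_t)
    return initial, rate

# Hypothetical case counts on four consecutive days, doubling each day.
days = [0, 1, 2, 3]
cases = [10, 20, 40, 80]

initial, rate = fit_exponential(days, cases)

# With the rate in hand we can evaluate the curve at any later time.
forecast_day_5 = initial * math.exp(rate * 5)
```

The shape of the model is chosen by hand from knowledge of the problem; the four data points are only used to tune its two parameters.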

Okay, so let's continue. What are we trying to do with this data? This is where, to some degree, the problem enters. Are we predicting a category? Then we go into this side here, so that's classification. If not, we're gonna go for regression, and if we're not trying to predict something, of course, we go down to clustering. Now, notice that for predicting a category we need to have labeled data. That's just the same as saying we need a y. So that's kind of obvious. In the supervised case you need both the features and the target in the historical data set, otherwise you can't solve the problem, right?

So if you're answering no here, you can't do classification. These may seem obvious or trivial things; that you need the target is obvious, maybe, but actually this can be fatal to a project. In terms of the strategic implication, if you don't have the data to solve the problem, there is no magic here: you can't solve the problem. So you actually need that data, and the quality of that data, and whether it exists or not, could make or break a project.

So, let's have a look. Going into classification, we can see that there are certain, let's say, low-data approaches, certain high-data approaches, and certain approaches for text. I'm gonna let you go through this yourself in detail; I'm just gonna highlight the key elements here. So, suppose we don't have labeled data. What can we do? We can use the clustering approach, which is where we just have the features but no target. In this case, you can try to calculate some sense of where categories might be in terms of the feature space, and you can look for that, right?

So you can see what the options are. In regression you've got your low-data approaches, you've got some higher-data approaches, and so on. And then, if you're not predicting something, if you're just exploring a data set, there are certain exploratory techniques you might use, one of which is dimensionality reduction. The role of dimensionality reduction in exploring data comes in if you have lots of columns that are difficult to visualize. This exploratory stuff here kind of just amounts to visualization; the key part of exploration is visualization. And maybe visualizing the data can give you some insight into the problem, so you can go back and solve it some other way.
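The questions we have just walked through can be sketched as a small decision function. This is a simplification of the visualization as described in this lesson, not a reproduction of it; the function name, the parameter names, and the final fallback message are assumptions for illustration.

```python
def suggest_task(n_samples, predicting_category, has_labels,
                 predicting_quantity, just_exploring):
    """Walk the data questions in order and name a family of approaches."""
    # Standard techniques need enough data before anything else matters.
    if n_samples <= 50:
        return "get more data"
    if predicting_category:
        # A category with labels is supervised; without labels we can
        # only look for groupings in the feature space.
        return "classification" if has_labels else "clustering"
    if predicting_quantity:
        return "regression"
    if just_exploring:
        return "dimensionality reduction"
    # Hypothetical fallback for cases this lesson does not cover.
    return "no standard approach"

suggestion = suggest_task(
    n_samples=200,
    predicting_category=True,
    has_labels=False,
    predicting_quantity=False,
    just_exploring=False,
)
```

Notice that none of the questions is about which algorithm performs best; the data and the goal have already narrowed the field before performance is considered.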

So there are lots of different things here, but what I wanna point out is how much the data can determine the choice of approach that you have. In these green boxes are your approaches. So, basically, this is linear regression here. We'll talk about what those words precisely mean, but this is linear regression. And then you've got things like the KNeighbors Classifier over here, so that's K-nearest neighbors. There is a lot visualized here, but you can see that there are gonna be algorithms in each of these boxes that you can use to solve the problem. So have a look at this visualization, and then think through the data sets that are available to you.

Now, okay, so that's data. Let's move on to talk about explanation.


About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.