This course is the second in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
Learning Objectives
- Explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model
- Understand the concept of the bias-variance trade-off and regularized ML models
- Explore linear models for classification and how to evaluate them
- Learn how to choose a model and fit that model to a dataset
Intended Audience
This course is intended for anyone interested in machine learning with Python.
Prerequisites
To get the most out of this course, you should have first taken Part One of this two-part series.
Resources
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Welcome back. In this lecture, we are going to explore an interesting family of models that extends standard linear regression by introducing some bias into the model. This is the family of regularized models, also known as shrinkage models, and they are a popular alternative to the plain linear regression model. How much bias is introduced into the model is controlled by a parameter of the model.
In particular, we are going to explore one member of the family, the Ridge regression model. In general, one way of creating a biased regression model is to add a penalty to the sum of squared errors since, as you know, the objective function is to minimize the sum of the squared errors. By accepting some bias, we can often reduce the variance enough to make the overall mean squared error lower than that of an unbiased model. The comparison is typically with the standard regression model, since ordinary least squares is defined as the best linear unbiased estimator.
Don't worry about this statistical concept; it just means that the solution obtained by an ordinary least squares model has minimum variance across all unbiased estimators. The Ridge model is built on top of this concept: we have a trade-off between the simplicity of the model, in terms of the estimated coefficients, and its performance on the training set. How much importance the model places on simplicity versus training set performance is specified by the parameter alpha.
Notably, increasing alpha in a Ridge model forces the coefficients toward zero, which decreases training set performance but may help the model generalize well to the test set. Typically, as the alpha parameter goes up, model complexity goes down, but the test score should, at least for a while, go up. Regularized models, such as the Ridge model, depend on the alpha parameter, which needs to be estimated, and in this lecture we are going to understand how to estimate it.
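In formula terms (a standard formulation, not shown explicitly in the video), ordinary least squares minimizes only the residual sum of squares, while Ridge adds an alpha-weighted penalty on the squared size of the coefficients:

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
```

With alpha equal to zero this reduces to ordinary least squares; as alpha grows, the penalty pushes the coefficients toward zero.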
So, the obvious question is: how do we evaluate these models? First, we need to find the optimal value of alpha that allows us to estimate the model, but we also have to pay attention to the fact that Ridge solutions are not equivalent under scaling of the inputs, so typically we need to center and scale the predictors so that they are on the same scale. In other words, Ridge regression regularizes the linear regression by imposing a penalty on the size of the coefficients. The coefficients are thus shrunk toward zero and toward each other, but when this happens and the independent variables don't have the same scale, the shrinkage is not fair.
Two independent variables with different scales will contribute differently to the penalty, because the penalty term is the sum of the squares of the individual coefficients. To avoid this kind of problem, the independent variables are centered and scaled so that each has a variance equal to one.
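As a minimal illustration (a hypothetical toy matrix, not part of the lecture's notebook), scikit-learn's StandardScaler performs exactly this centering and scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two predictors on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 1500.0],
              [4.0, 3000.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column is centered at (roughly) zero
print(X_scaled.std(axis=0))   # each column now has unit variance
```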
So, let us move on to a Jupyter notebook, like the one that you see on the screen. I have already initialized the Boston data and split it into a train and a test set. Now, from the linear model module we import the Ridge model, and the procedure is the same as with any standard estimator: we initialize the Ridge model, setting the argument normalize equal to true in order to avoid an unfair shrinkage effect. We then fit the Ridge on the training data, predict on the test set, and store the output in the y_pred variable.
Then we compute the training performance. Since this is a linear model, it's going to be based on the R-squared metric. We define r2_train as the Ridge score on x_train and y_train, and here is the result on the training set. We can also compute the R-squared on the test set: we call the score method with x_test and y_test and print the test performance stored in the r2_test variable.
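Here is a minimal sketch of those steps. It assumes a recent scikit-learn release, so it swaps the Boston data (no longer shipped with the library) for the California housing data and replaces the normalize=True argument (removed from Ridge in newer versions) with an explicit StandardScaler step; the scores you get will therefore differ from the ones shown in the video.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the Boston data used in the video.
X, y = fetch_california_housing(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler centers and scales the predictors, playing the role of
# the normalize=True argument used in the video.
ridge = make_pipeline(StandardScaler(), Ridge())
ridge.fit(x_train, y_train)
y_pred = ridge.predict(x_test)

# For regressors, score() returns the R-squared metric.
r2_train = ridge.score(x_train, y_train)
r2_test = ridge.score(x_test, y_test)
print(f"R^2 train: {r2_train:.4f}")
print(f"R^2 test:  {r2_test:.4f}")
```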
So, this is the regression score with normalization. We can also look at the effect of regularization on the coefficients, namely the effect coming from the alpha argument, by performing the following operation. We import NumPy as np, initialize a Ridge model with normalize equal to true, and define alphas as a logspace with values between minus three and three and 10 possible values. We then run a for loop over the alphas and proceed as follows.
Inside the loop we set the params; in particular, we set the parameter alpha equal to the loop variable a, fit the Ridge model, which now has two parameters, on x_train and y_train, and get the Ridge score. We append the coefficients we have just estimated to the coef list. Then we build a plot using matplotlib.pyplot, imported as plt. We define a figure and axis using the subplots function, specifying a figsize of 10 by 8 inches, and call plot on the axis, passing the alphas and the coefficients. We set the x label to alpha and the y label to regression coefficient, and since the alphas are in a logspace, we set the x scale to log, just the string "log". To get a better output we also set xlim equal to the result of the get_xlim method. And here is the result.
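Below is a sketch of that loop and plot, reusing the x_train and y_train variables from the previous snippet (with the same StandardScaler substitution for normalize=True):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Scale the predictors once, so every value of alpha sees the same inputs.
x_train_scaled = StandardScaler().fit_transform(x_train)

alphas = np.logspace(-3, 3, 10)  # 10 values between 10^-3 and 10^3
ridge = Ridge()
coefs = []
for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(x_train_scaled, y_train)
    coefs.append(ridge.coef_)

fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(alphas, coefs)  # pass marker="*" to mark each point, as done next in the video
ax.set_xlabel("alpha")
ax.set_ylabel("regression coefficient")
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim())
plt.show()
```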
So, we can also improve the plot by using markers, here a star, so that we get a better sense of where each point lies. This is the result. We can see that the regression coefficients shrink toward zero, since that is exactly the operation Ridge regression performs. In particular, you can see that when alpha is pretty small, we are in a situation similar to the one with the ordinary least squares model.
There the magnitudes of the regression coefficients are quite heterogeneous, and by regularizing the model, meaning by making alpha greater, the coefficients are shrunk toward zero but also toward each other. In particular, you can see that once alpha is greater than one, the regularization effect becomes more evident.
So why should I regularize the model? Well, there are two reasons. On one hand, as we have seen before, it is sometimes better to introduce some bias into the model, since the penalty discourages large coefficients. But more importantly, this way we assign equal weight to each single regression coefficient, so the estimate of each coefficient may have lower variability.
So, if you remember, we got a training performance of about 0.64 here, and that score depends heavily on the splitting procedure but also on the alpha we chose when fitting the Ridge model. By default, alpha is equal to one, and alpha is the parameter that controls the regularization. What we can do, therefore, is perform what is called a grid search cross-validation. This is nothing more than cross-validation, extended so that it also estimates the hyperparameter you are interested in.
So for instance, we are now interested in the alpha hyperparameter; therefore the grid search is going to perform cross-validation, but at each step it is also going to look, inside a given parameter grid, for the value of that parameter that minimizes the objective function. In order to do that we import GridSearchCV from the model selection module, and as I said we specify a param grid, which in this case is a dictionary keyed by the hyperparameter we wish to estimate, in this case alpha, mapped to the grid of candidate values.
In this case, it's the logspace between minus three and three with 10 possible values. Then we store in the variable grid a GridSearchCV object, which takes the estimator, in our case the Ridge, the param grid we have just specified, the number of cross-validation folds we want to use, let's say 10, and the optional return_train_score argument, which we set equal to true.
We then fit the grid object to the training data and finally get a model that has been fitted by grid search cross-validation. Now we can access the best score using the best_score_ attribute; in our case, it's roughly 0.6995. And, not surprisingly, we can also access the best parameter using the best_params_ attribute. This model has been fitted with an alpha of 0.02, which is different from the default, and the score of 0.6995 is, if you remember, higher than the score we got at the beginning of this lecture using the standard procedure.
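A sketch of that grid search, again reusing x_train_scaled and y_train from the earlier snippets; since the data and split differ from the video, the best score and best alpha you see will not match 0.6995 and 0.02 exactly:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Grid of candidate alpha values: 10 points between 10^-3 and 10^3.
param_grid = {"alpha": np.logspace(-3, 3, 10)}

grid = GridSearchCV(Ridge(), param_grid, cv=10, return_train_score=True)
grid.fit(x_train_scaled, y_train)

print(grid.best_score_)   # mean cross-validated R-squared for the best alpha
print(grid.best_params_)  # e.g. {'alpha': ...}
```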
By standard procedure I mean splitting the data and passing the whole training set to the model without performing cross-validation. Note that the Ridge regression model is just one member of the family of regularized models; there exist many others, one of which is the Lasso model, and you can also get a mix of the Lasso and Ridge models, which is called the Elastic Net model. We're not going to cover those in this course, but it's good to know that the family of regularized models includes many other models.
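For reference only (this snippet is not part of the lecture's notebook), both alternatives live in the same scikit-learn module and follow the same estimator API as Ridge:

```python
from sklearn.linear_model import ElasticNet, Lasso

# Lasso uses an L1 penalty; ElasticNet mixes L1 and L2 penalties via
# l1_ratio. Both expose alpha, so they can be tuned with GridSearchCV
# in the same way as Ridge.
lasso = Lasso(alpha=0.1)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
```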
So that concludes this lecture. Here we have looked at the importance of regularized models, in particular the Ridge regression model, and we have also seen how to perform grid search cross-validation, which is particularly useful when we want to perform cross-validation with a model that depends on some hyperparameters. Those hyperparameters must be estimated, and grid search cross-validation does the dirty work for us, performing both the cross-validation and the estimation of the parameters under the hood. Thanks for watching. I'll see you in the next one.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.