Machine Learning Pipelines with Scikit-Learn
This course is the second in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model
- Understand the concept of the bias-variance trade-off and regularized ML models
- Explore linear models for classification and how to evaluate them
- Learn how to choose a model and fit that model to a dataset
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should have first taken Part One of this two-part series.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Welcome back. In this lecture, we are going to implement a simple machine learning pipeline with scikit-learn using linear models. So let's first import pandas, using the convention pd, and we're going to use the read_csv function to read the data stored in the file boston.csv. We then store this into the variable df.
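As a minimal sketch of the loading step: the file boston.csv is not reproduced here, so the snippet reads a tiny in-memory CSV instead. The column names (rm, lstat, medv) and the values are hypothetical stand-ins, not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for boston.csv; with the real file you would
# call pd.read_csv("boston.csv") instead.
csv_text = "rm,lstat,medv\n6.5,4.9,24.0\n6.4,9.1,21.6\n7.1,4.0,34.7\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.head())
```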
Now, once we have the data, we can split the data into a training and a test set, since we are only going to fit the model to the training data. And then, once we have a fitted model, we can test it using an independent dataset that is described by the test set. So to do this, we use the scikit-learn library, and in particular the submodule called model_selection, and we import the train_test_split function. This is a very useful function since it allows us to split the data into a training set and a test set.
So we can, for instance, call this function by passing the data frame, and then we can specify the test size, so let me say 0.2 since we want to apply a strategy of 80% for the training and 20% for the test, and we also set a random state, let's put 42. And this is going to return two objects, train_df and test_df. And we can check this easily by using the shape attribute, the original set contains 506 rows and the train set now contains 404 rows, and therefore the remaining 20% of the original data frame is in the test set.
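The split just described might look like the sketch below. It uses a synthetic 506-row data frame in place of boston.csv, and the column names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

# 80% of the rows go to training, 20% to test; random_state fixes the shuffle.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(df.shape, train_df.shape, test_df.shape)
```

With 506 rows and test_size=0.2, the training set gets 404 rows and the test set the remaining 102.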
Okay, so this is a valid strategy, but sometimes we can go further than the simple splitting procedure we have just described. We can also split the data based on the target and the features. So let's define the set of features as X, which is nothing more than the original dataset with the target column dropped, which is the median value in this case, and we perform this operation by columns, namely by setting the axis argument equal to 1. And then we specify the target y, which in this case is the median value column. Now, since we have split the data into the set of features and the target, we can apply the train_test_split function, but we don't pass the whole data frame anymore, we just pass X and y separately. This function is therefore going to return four different objects, namely X_train, X_test, y_train and y_test. And you can check this easily, since the number of rows in X_train should be equal to the number of rows in train_df, in this case 404, here we go, and y_train should be consistent as well.
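Here is a sketch of the feature and target split, again on synthetic data. The target column name medv is an assumption; the lecture only refers to it as the median value column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X = df.drop("medv", axis=1)  # features: every column except the target
y = df["medv"]               # target: the median value column

# Passing X and y separately returns four objects instead of two.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```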
Okay, so now that we've split the data into the train and the test set, we can fit a simple regression model using scikit-learn. In particular, we'll use the linear_model submodule, we import the LinearRegression class, and we initialize an object that I am going to call reg_all, which is nothing more than an instance of the scikit-learn LinearRegression class. Once I have initialized a LinearRegression object, I can fit that object, and I use X_train and y_train, and this is nice since we are able to fit the model only to the training data with this simple syntax. And this is the output, so I fit a regression model and I got a LinearRegression object, and then, for instance, I can now get the training score. The training score can be obtained by applying the score method, and I pass two arguments, X_train and y_train. By default, the score method of regression models in scikit-learn returns the R-squared, so let's also round this value and store it into the r2_score variable, and then we finally print it, training score is r2_score, oh, there's an extra element there, okay.
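The fitting step can be sketched as follows. The data is synthetic, so the printed score will not match the 0.75 obtained on the Boston data in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X = df.drop("medv", axis=1)
y = df["medv"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)  # fit on the training data only

# score() returns the R-squared for scikit-learn regressors.
r2_score = round(reg_all.score(X_train, y_train), 2)
print(f"Training score: {r2_score}")
```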
Okay, the score is 0.75, meaning that our linear regression model on the Boston data performs pretty well, because this is a high R-squared figure. On the training data, this measure has a lower bound of 0 and an upper bound of 1; on new data it can even become negative, when the model does worse than always predicting the mean. As the R-squared gets closer to 1, the model explains more of the variation in the target, whereas as the R-squared nears 0, the performance of the model gets worse, meaning that your model doesn't explain anything about the underlying data.
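To make the R-squared more concrete, here is a small sketch that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with scikit-learn's r2_score function. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, invented for this example.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))

# Always predicting the mean gives an R-squared of exactly 0.
r2_mean_only = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_mean_only)
```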
So in this case, 0.75 is a good result, and then we can go further and say, well, this was on the training data, so the model has learned some patterns, but how well does it perform on the new data? So, we're going to use the test set that so far we have kept independent from the training set, and in particular, we're going to apply the predict method to the test set, and this is basically going to return the prediction for each single element of the test set for the median value. So here's the result, and we store this into the variable y_pred, and now we can ask, well, how can we evaluate the model?
So remember, in the test set you still know the true label, and therefore, since we are in a regression task, we can compare the predictions with the ground truth. So we can perform the following steps. First, let's compute the R-squared again, this time for the test set: we apply the score method to the fitted model, but now we pass the test set, and here's the result, 0.66, which is still pretty good. But we can use different metrics, so for instance, we can use the mean squared error, and the motivation for this is pretty simple: the mean squared error is the loss function that linear regression minimizes when it is fitted. So we import from the metrics submodule the mean_squared_error function, and then we pass the y_test and the y_pred variables that we just computed. We typically take the square root of that number, so that the error is expressed in the same units as the target, so we use the NumPy square root function, and therefore we also import numpy as np, and here's the result, the root mean squared error on the test set for our model.
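The evaluation on the test set might be sketched like this. It runs on synthetic data, so the numbers differ from the 0.66 reported in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X = df.drop("medv", axis=1)
y = df["medv"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_all = LinearRegression().fit(X_train, y_train)

y_pred = reg_all.predict(X_test)         # one prediction per test row
r2_test = reg_all.score(X_test, y_test)  # R-squared on unseen data
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Test R-squared: {r2_test:.2f}")
print(f"Test RMSE: {rmse:.2f}")
```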
Okay, so as I said, we would like to understand how our model performs by checking the errors our model made in the prediction. So we can define an error variable which is nothing more than the difference between the y_test and the y_pred variables. To give you an idea, we can get the same result as we got with the root mean squared error in the following way. We square the errors, we take their mean by calling the NumPy mean function on the squared errors, and then we apply the NumPy square root, and you get exactly the same result.
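As a quick check of that equivalence, the sketch below computes the root mean squared error both ways on four made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, invented for this example.
y_test = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 35.4])

error = y_test - y_pred

rmse_manual = np.sqrt(np.mean(error ** 2))                  # by hand
rmse_sklearn = np.sqrt(mean_squared_error(y_test, y_pred))  # via scikit-learn

print(rmse_manual, rmse_sklearn)
```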
So errors are quite relevant: not only do they give us the mean squared error, which is used to evaluate your regression model, but it is also important to visually check the distribution of the errors. To inspect that distribution, we first want to produce a plot that allows us to understand how our model performs, and then we will check the actual versus the predicted values with another plot.
So to do so, we import matplotlib.pyplot as plt, and then we initialize a figure and an axis with the subplots function. On the axis we draw a histogram and pass the error variable, then we set the x-label by passing the string Error, and then we show the plot. We can obviously change this too if we want, so we can make it bigger, so let's set figsize equal to 10 and 8, and here's the result.
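A sketch of the error histogram follows, on synthetic data. The Agg backend line is an addition so the snippet also runs without a display; in a notebook you would drop it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("medv", axis=1), df["medv"], test_size=0.2, random_state=42
)
reg_all = LinearRegression().fit(X_train, y_train)
error = y_test - reg_all.predict(X_test)

fig, ax = plt.subplots(figsize=(10, 8))
ax.hist(error)          # distribution of the prediction errors
ax.set_xlabel("Error")
plt.show()              # displays the figure in an interactive session
```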
Okay, it seems our model performs pretty well since we have a peak in the observations of the errors between minus 5 and 5, and you also see we have the maximum around 0. So this performed pretty well but we still have some errors as you can see here from the tail of our distribution. So given this, we can also check the actual prediction versus the actual values, so to do that, we do the following. We create a pandas data frame and we pass a dictionary that has as its key the actual value, the ground truth, basically, which is described by the y_test variable. And then we have the predicted values, and here we have y_pred, and I also reset the index so that I don't have any index coming from the original dataset with me, and then I store this into data_pred_df. And then we basically take this object and use the pandas plot method by specifying the type of the plot, so let's say equal to the string bar, but we do this for the first 10 elements, let's say, so that the plot is readable. Ah, there's a double dot there, which is not necessary, okay, here's the result.
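The comparison of actual and predicted values could be sketched as below. The column labels Actual and Predicted are assumed names, and the data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("medv", axis=1), df["medv"], test_size=0.2, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# reset_index(drop=True) discards the shuffled index inherited from y_test.
data_pred_df = pd.DataFrame(
    {"Actual": y_test, "Predicted": y_pred}
).reset_index(drop=True)

# Plot only the first 10 rows so the bars stay readable.
ax = data_pred_df.head(10).plot(kind="bar", figsize=(10, 8))
print(data_pred_df.head(10))
```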
So actually, it seems that in these first examples we are almost always predicting more than the actual value, and this implies, essentially, that our model is not conservative, so it tends to predict a little bit more than the actual value. So let's check whether this is also shown in the tail, and as we can see, that pattern is not confirmed in the tail, where the predicted values are instead lower than the actual values. So again, this is the prediction error, but we can also get a better understanding of the actual price versus the predicted one by drawing a scatter plot of the predicted values versus the ground truth.
So to do that, we initialize a subplots object and we specify a figsize of 10 and 8 inches again, and then we apply a scatter plot to the axis object by passing y_test and y_pred, those two variables in there, and then we draw a line in a grid from 0 to 50 for both the x-axis and the y-axis, and here we specify the linestyle as well, equal to double dash, and then we basically set the x-label, and we set this as the string Actual value, and then we set the y-label as the string Predicted value, and then we show the results.
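The scatter plot with its reference line can be sketched like this, again on synthetic data and with the Agg backend added only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("medv", axis=1), df["medv"], test_size=0.2, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(y_test, y_pred)
ax.plot([0, 50], [0, 50], linestyle="--")  # points on this line are perfect predictions
ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
plt.show()
```

Points above the dashed line are over-predictions, and points below it are under-predictions.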
So this is the plot. The dashed line is nothing more than a line plotted from 0 to 50, and the points show the relationship between the actual values and the predicted values. So, having said that, a natural question would now be, are we happy with this training score of 0.75? Well, consider that using the 80/20 strategy to split the data was a completely arbitrary choice, and that the score depends on the way in which we split the data.
So the way in which we split the data has an impact also on the model's performance, and that's because the value that we observe here is nothing more than the value that we observe by fitting a model with a particular training dataset. Therefore, if we'd split the data in a different way, we may have got a different value. So the natural consequence is that this score that we observe might be biased somehow by the way in which we split the data. We want to avoid our model being biased with respect to this splitting strategy, and therefore one possibility is to employ cross-validation.
So, cross-validation is a powerful yet simple idea that goes as follows. Instead of splitting the data into training and test and then training the model using this dataset in one go, we split the original training data into K folds, so in this case, we have five. One fold is kept for the test, namely the blue one, and the remaining four folds are used to train the model, so you train the model with these four green folds, and then once you get a fitted model, you test that model on fold 1 and you get a score. Then you repeat the procedure using the same model, but now you pass folds 1, 3, 4 and 5 to the training set, you train that model and once it has been trained, you test it with fold 2 and you get a result. You repeat this procedure for each single split and at the end you get five different scores coming from the same model trained with different data. At the end, you can take the mean of those scores, and that mean is essentially an average of your model performance on the data. Now this is useful because in this way, the estimate is less dependent on the split and each data point is in the test set exactly once.
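To see the folds themselves, here is a small sketch using scikit-learn's KFold class on ten toy samples. KFold is not mentioned in the lecture; it is used here only to illustrate how the indices rotate between training and test.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples, identified by their index

kf = KFold(n_splits=5)
test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    test_indices.extend(test_idx.tolist())

# Every sample lands in a test fold exactly once.
print(sorted(test_indices))
```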
Now another interesting aspect is that cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training the model, it's not only trained but also tested on all the available data, so once you've got these parameters, you are still able to test the data as we did previously. But the interesting fact is that now the score that we get in training is more stable, and is less dependent on the original split on the data we performed.
So, how can I perform cross-validation using scikit-learn? Well, this is pretty direct since we can import from the model_selection submodule the cross_val_score function, and then we pass the model, so reg_all, and the data identified by X_train and y_train. Then we specify the argument cv, which is, by default, equal to 5, and, for instance, we can set it to 10 here. We store this into the variable cv_scores, and here are the results. We get 10 different scores related to the training data, and as we saw before, each single score was obtained using different folds at each single step, and therefore the numbers are different across steps.
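A sketch of the cross_val_score call on synthetic data follows; the scores it prints will differ from those shown in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("medv", axis=1), df["medv"], test_size=0.2, random_state=42
)

reg_all = LinearRegression()

# One R-squared per fold; the model is refitted from scratch on each split.
cv_scores = cross_val_score(reg_all, X_train, y_train, cv=10)

print(cv_scores)
print(f"Average score: {cv_scores.mean():.4f}")
```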
So here we have 0.3, more or less, versus 0.78. It's a big difference, and it shows how much a single score depends on the way we have split the data. And as I said, instead of using those individual numbers, we can get an aggregated number, so let's say a mean, which is less dependent on any one split. We can therefore say our algorithm performs, on average, with a score of 0.6986, which is indeed similar, if you remember, to the 0.75 score that we got at the beginning by fitting the whole training data.
So why is this? It's because here we're using cross-validation, and another important aspect is the cv argument. The cv argument is the number of splits we perform in the cross-validation procedure, and therefore what we can do is the following. We can try to perform cross-validation on a linear regression for different values of the cv argument.
So, for instance, we can perform the following. We write a for loop over a grid of values for cv, so let's say 5, 8, 10, 15, 20, and we check the scores obtained for each single run. In particular, here we pass the elem variable as the cv argument, since we are in a for loop, and then we store the results in a pandas DataFrame called metrics_df. This is populated with another data frame that I called temp_df, which has just two columns: cv, which is the elem variable, and average score, which is the mean of the cv_scores variable. Finally, we append temp_df to metrics_df, and we reset the index.
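The loop over cv values might be sketched as below. One deliberate change from the description above: the one-row frames are collected in a list and combined with pd.concat, because DataFrame.append was removed in pandas 2.0. The column name average_score is an assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for boston.csv (506 rows); column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 506),
    "lstat": rng.uniform(2, 35, 506),
})
df["medv"] = 5 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 3, 506)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("medv", axis=1), df["medv"], test_size=0.2, random_state=42
)

frames = []
for elem in [5, 8, 10, 15, 20]:
    cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=elem)
    # One-row frame per cv value: the number of folds and the mean score.
    temp_df = pd.DataFrame({"cv": [elem], "average_score": [cv_scores.mean()]})
    frames.append(temp_df)

metrics_df = pd.concat(frames).reset_index(drop=True)
print(metrics_df)
```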
And here's the result. Although the difference is not large in this dataset, since the Boston dataset is quite small and just an illustrative example, we see that there is an actual decrease in the average score of the cross-validation. This makes sense, since the larger the number of splits, the smaller the amount of data inside each single fold. The extreme case, when each fold contains just one single data point, is very risky, since each model is tested on just one observation during cross-validation.
So because of this, it is always worth finding a good balance between the number of folds and the number of elements inside each fold, which depends on the number of folds you specify. Typically, a number between 5 and 10 is a good choice, also because choosing a larger value for cv means fitting more models, which takes longer on your local machine. So that concludes this lecture on building a simple machine learning pipeline with scikit-learn.
In the next lecture, we are going to cover one of the most important concepts of machine learning, which is the bias-variance trade-off, so I'll see you there.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.