Machine Learning Pipelines with Scikit-Learn
This course is the first in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Understand the different preprocessing methods in scikit-learn
- Perform preprocessing in a machine learning pipeline
- Understand the importance of preprocessing
- Understand the pros and cons of transforming original data within a machine learning pipeline
- Deal with categorical variables inside a pipeline
- Manage the imputation of missing values
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
To do so, we can import the StandardScaler from the scikit-learn preprocessing submodule. Fitting a scaler in scikit-learn is pretty smooth and pretty easy. You just need to memorize a few steps, which are shared by other transformers as well, so the workflow will be similar across different transformers.
So we initialize the StandardScaler and assign it to the variable scaler, and then we fit the scaler to the Boston data, so we pass the Boston data frame in there. Once fitted, we apply transform to the same data, and we assign the resulting output to the variable x_scaled. The result is as follows. It's a NumPy array, and we can pass it to a pandas data frame in order to read it more easily. And we specify the columns as the list of columns of the original Boston data frame.
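The steps just described can be sketched as follows. This is a minimal illustration, not the course notebook: the Boston data frame is replaced by a small synthetic stand-in, and the column names (CRIM, RM, MEDV) and the name df_all_scaled are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Boston data frame (synthetic values,
# assumed column names; MEDV plays the role of the median value target).
rng = np.random.default_rng(0)
boston = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
    "MEDV": rng.normal(loc=22.0, scale=9.0, size=100),
})

scaler = StandardScaler()             # 1. initialize the transformer
scaler.fit(boston)                    # 2. learn each column's mean and standard deviation
x_scaled = scaler.transform(boston)   # 3. apply the scaling; the output is a NumPy array

# Wrap the array in a data frame so the columns are labeled again.
df_all_scaled = pd.DataFrame(x_scaled, columns=list(boston.columns))
print(df_all_scaled.head())
```

Note that every column is scaled here, including the target column.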
This is the result. So we see that the data set is completely different. And can you spot the issue here? Well, we've scaled the target, the median value, as well, and we should not perform any preprocessing on that variable. We want to transform only the features and not the target. So we can define an x variable containing only the features, that is, the Boston data set without the median value. So all the variables except the target are stored inside the x variable. And then we define a y variable, which is nothing more than the median value column of the Boston data frame. And we now fit the standard scaler only to x. And then we basically replicate the previous steps, so that at the end, we have transformed only the features and not the target. And we store the result in the variable df_scaled.
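A short sketch of splitting features and target before scaling is shown here. The data is synthetic and the column names are assumptions; only the names scaler, x, y, and df_scaled follow the lecture (x is written as X in the code).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Boston data frame (synthetic values).
rng = np.random.default_rng(0)
boston = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
    "MEDV": rng.normal(loc=22.0, scale=9.0, size=100),
})

# Features: every column except the target.
X = boston.drop(columns="MEDV")
# Target: the median value column, left untouched.
y = boston["MEDV"]

scaler = StandardScaler()
scaler.fit(X)  # fit on the features only
df_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

print(df_scaled.head())
print(y.head())
```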
Okay, now let's check the distribution of the data after scaling. To do so, I'm going to replicate the same example as before, using the same code, but now we melt the scaled data. So I'm going to create a new variable called df_scaled_melted, and that's just to clarify that I'm using the scaled data, and I'm going to use the df_scaled variable inside the melt function. Once melted, I can replicate the same logic as before, but now we pass df_scaled_melted to the box plot.
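The melt step might look like the sketch below. The data is synthetic and the column names are assumptions. The plotting call is left as a comment because the transcript does not name the plotting library; seaborn is assumed there purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature data frame (synthetic values, assumed column names).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
})

scaler = StandardScaler()
scaler.fit(X)
df_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

# Reshape from wide to long: one row per (variable, value) pair.
df_scaled_melted = pd.melt(df_scaled)
print(df_scaled_melted.head())

# The long table can then be handed to a box plot function, for example
# (assuming seaborn is installed and imported as sns):
# sns.boxplot(x="variable", y="value", data=df_scaled_melted)
```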
Here is the result. The interesting fact is that the data has now been scaled and, as you can see, every variable is centered close to zero, which is good. However, it looks like there is still intrinsic skewness inside the scaled data. In other words, the standard scaler is good because it puts the features on a common scale, but it still keeps the shape of the original distributions. We could employ advanced transformers that force the data to be Gaussian, but that goes beyond the scope of this course. The most important thing to understand is that these methods work very well, especially in a preprocessing phase. They allow us to transform our data and to prepare a consistent training dataset where each feature carries, more or less, the same weight.
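To make the point about skewness concrete, here is a small numerical check on synthetic data (the column names and the summary table are assumptions for this sketch). Standard scaling only shifts and rescales each column, so the mean becomes zero and the standard deviation becomes one, while the skewness stays the same.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features: one strongly skewed, one roughly symmetric.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
})

scaler = StandardScaler()
scaler.fit(X)
df_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

# Compare location, spread, and skewness before and after scaling.
summary = pd.DataFrame({
    "mean_after": df_scaled.mean(),
    "std_after": df_scaled.std(ddof=0),  # population std, as used by StandardScaler
    "skew_before": X.skew(),
    "skew_after": df_scaled.skew(),
})
print(summary)
```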
So here you see that the features have, more or less, the same weight. So far we have performed the following steps: first, we initialized the transformer, then we applied fit to the data, and then the same data was passed to the transform method to get the scaled data. However, we can do the same thing in a single line by using the fit_transform method.
So let's assign that to an object here. We'll create a new variable, x_scale_02, and we pass this to a pandas data frame. As the columns, we assign the list of columns of x and, as usual, we inspect the first five rows of this new data set. If you look closely, you see that the result is the same as the one obtained by applying fit and transform separately. To check that, we can call the head method on df_scaled, and you can see it's exactly the same. So the take-home message here is the following: you should go for scaling when you are interested in model performance. However, if you need physical interpretability of your model's features, then scaling might not be the best solution.
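The equivalence between the two-step and the one-step approach can be checked with a sketch like this one. The data is synthetic, and the names df_scaled_02 and same are assumptions introduced for the comparison.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature data frame (synthetic values, assumed column names).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
})

# Two steps: fit, then transform.
scaler = StandardScaler()
scaler.fit(X)
df_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

# One step: fit_transform.
x_scale_02 = StandardScaler().fit_transform(X)
df_scaled_02 = pd.DataFrame(x_scale_02, columns=list(X.columns))

print(df_scaled_02.head())
print(df_scaled.head())

# Both routes should give the same numbers.
same = np.allclose(df_scaled, df_scaled_02)
print(same)
```

One practical caveat: fit_transform is a shortcut for data you fit on. For new data, such as a test set, you would normally call transform on the already fitted scaler instead.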
Now we'll conclude this section by introducing a very important family of methods inside scikit-learn called pipelines. Pipelines are managers that take a series of steps and perform all the necessary preprocessing for you. This is very useful, especially when you want to create new features with a feature engineering process, or when you want to build complex model pipelines, such as in natural language processing or computer vision.
To do so, we import the Pipeline object from the scikit-learn pipeline submodule. We then create a new variable, which we'll call transformer, and it's a pipeline containing a series of steps. For instance, the first step is the application of the standard scaler. Each step takes a string as its name, and you can choose that yourself. We've put scaler here, and then we pass StandardScaler, like so. Then we can pass another scaler, say robust, and here we apply the RobustScaler, which we need to import from the preprocessing submodule, like so.
So the nice thing about pipelines is that in just one object we can define multiple steps, and the pipeline is going to take care of which input to pass to each single step. Then we can apply fit to the transformer by passing the x data, and then we transform the x data and store the result in a pandas data frame. We also pass the list of columns of x as columns, and we print the first five rows. Obviously the result is different, since we have passed two transformers in this example.
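A minimal sketch of the two-step pipeline described above follows. The step names scaler and robust come from the lecture; the synthetic data, the column names, and the name df_piped are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature data frame (synthetic values, assumed column names).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "CRIM": rng.exponential(scale=3.0, size=100),
    "RM": rng.normal(loc=6.0, scale=0.7, size=100),
})

# Each step is a (name, transformer) pair; the names are free-form strings.
transformer = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("robust", RobustScaler()),
])

# fit runs the steps in order, feeding each step's output to the next one.
transformer.fit(X)
df_piped = pd.DataFrame(transformer.transform(X), columns=list(X.columns))
print(df_piped.head())
```

Chaining two scalers is mainly a demonstration of the mechanics. Because RobustScaler is applied last, the output ends up centered on each column's median rather than on its mean.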
So to wrap up, in this lecture we have seen different techniques that allow us to preprocess our data before feeding it into a machine learning model. We focused on the StandardScaler class and we've understood its importance. We've also seen the Pipeline object, which is useful for chaining different steps in one instruction. In the next lecture, we're going to explore techniques for dealing with categorical variables. So I'll see you there.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.