Machine Learning Pipelines with Scikit-Learn
This course is the first in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Understand the different preprocessing methods in scikit-learn
- Perform preprocessing in a machine learning pipeline
- Understand the importance of preprocessing
- Understand the pros and cons of transforming the original data within a machine learning pipeline
- Deal with categorical variables inside a pipeline
- Manage the imputation of missing values
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Welcome back. Let's now open a Jupyter notebook, like the one you see here. Please follow the instructions in the GitHub repository for this course to set up the same working environment. Let's investigate a few methods to perform preprocessing with scikit-learn. First we import pandas as pd and use the read_csv function to read the CSV file. I store the result in the variable boston_df, and if I inspect the first five rows, this should be very familiar to us, right? It's the Boston dataset that we saw in the previous lecture.
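The loading step can be sketched as follows. This is a minimal, self-contained version: in the course, you would call `pd.read_csv` on the CSV file from the GitHub repository, while here a small inline sample stands in for the file (column names follow the Boston dataset, but the values are illustrative only).

```python
import io
import pandas as pd

# In the course, the data come from the CSV in the GitHub repo, e.g.:
#   boston_df = pd.read_csv("boston.csv")   # filename is an assumption
# For a self-contained sketch we read a small inline sample instead.
csv_data = io.StringIO(
    "CRIM,RM,AGE,TAX,LSTAT,MEDV\n"
    "0.006,6.575,65.2,296,4.98,24.0\n"
    "0.027,6.421,78.9,242,9.14,21.6\n"
)
boston_df = pd.read_csv(csv_data)

# Inspect the first five rows, as in the lecture
print(boston_df.head())
```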
Let's use the describe method to get a few statistics on the Boston data. For each variable, we can see how the feature is distributed. In particular, age looks quite skewed: its median of 77.5 is greater than its mean, so the distribution has a long left tail. The tax variable is skewed the other way: its median of 330 is lower than its mean, meaning a few large values pull the mean up. For taxes, that actually makes sense, since a larger fraction of the population pays lower taxes. And as for age, it's normal that in an old city like Boston the houses are pretty old, so the majority of the houses sit above the mean age.
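The mean-versus-median comparison can be reproduced with a tiny made-up frame (these are not the real Boston values, just numbers chosen to show the two skew directions):

```python
import pandas as pd

# Illustrative data: AGE clusters at high values with a long left tail,
# TAX has a few large values that pull the mean up.
df = pd.DataFrame({"AGE": [30.0, 70.0, 80.0, 90.0, 95.0],
                   "TAX": [200.0, 220.0, 250.0, 300.0, 700.0]})

# describe() reports count, mean, std, min, quartiles ("50%" is the median), max
stats = df.describe()
print(stats)

# AGE: median > mean  -> mass at high values, long left tail
print(stats.loc["50%", "AGE"] > stats.loc["mean", "AGE"])
# TAX: median < mean  -> a few large values, long right tail
print(stats.loc["50%", "TAX"] < stats.loc["mean", "TAX"])
```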
So let's look at the following chart. This is a correlation plot that shows each feature on the X axis against the target variable, namely MEDV. We have a few remarks to make here. Look at the CHAS variable: it is binary, with points at either zero or one, meaning it can only take two values. Then we have some features that are positively correlated with the median value, such as RM: the higher the number of rooms, the higher the house value. And we have variables that are negatively correlated, such as LSTAT or CRIM.
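The positive/negative correlations can be checked numerically with `DataFrame.corr`. A minimal sketch with illustrative values (not the real dataset):

```python
import pandas as pd

# Toy data: RM rises with MEDV, LSTAT falls with MEDV
df = pd.DataFrame({
    "RM":    [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [30.0, 20.0, 15.0, 10.0, 5.0],
    "MEDV":  [10.0, 18.0, 22.0, 30.0, 45.0],
})

# Pearson correlation of each column with the target MEDV
corr = df.corr()["MEDV"]
print(corr)   # RM is positive, LSTAT is negative
```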
So look at this: you have a negative correlation between median value and crime rate. Furthermore, a few features are expressed in the hundreds, like taxes, while others stay below 100, such as LSTAT. Therefore, we have to find a way to make sure that our model will ingest the data and assign the same weight to those features. You can go further and build a boxplot for each feature, which is useful for better understanding the data we are looking at.
To do so, we import matplotlib, and in particular the pyplot submodule, using the standard plt convention. We also import seaborn as sns. Then we use the pandas melt function to reshape the dataset under consideration into long format, with one (variable, value) pair per row. So in particular, we apply the melt function to boston_df and store the result in the variable data_melted.
Now, we use matplotlib to create subplots, in particular the plt.subplots function. If you have taken the course Data Visualization with Python using Matplotlib from our content library, this should be very familiar to you. Here you just need to know that to create a plot, you typically initialize a subplots object and get back the axes, which is essentially the canvas where you draw your data. We then call sns.boxplot, passing the variable column as x and the value column as y, both coming from the data_melted frame we made before. We set the x label to empty and the y label to the median value, MEDV, which is our target. We show the result with plt.show, and we can control the figsize argument to enlarge the figure, so let's set it to 14 inches by 10 inches. And there you go.
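The melt-then-boxplot step described above can be sketched like this. It assumes seaborn is installed, and uses a tiny made-up frame in place of the full Boston data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe on headless machines
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for boston_df
df = pd.DataFrame({"RM": [6.0, 6.5, 7.0], "TAX": [250.0, 300.0, 400.0]})

# melt reshapes to long format: one (variable, value) pair per row
data_melted = pd.melt(df)
print(data_melted)

# One boxplot per original column, on a single enlarged figure
fig, ax = plt.subplots(figsize=(14, 10))
sns.boxplot(x="variable", y="value", data=data_melted, ax=ax)
ax.set_xlabel("")
ax.set_ylabel("MEDV")  # the lecture labels the y axis with the target's name
plt.show()
```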
Now, looking at this plot, we can get a lot of insights. Let's look at the magnitude first. We have two variables expressed in the hundreds, whereas the rest are shrunk towards zero, and this is risky: if I try to fit a regressor and understand which features best describe the target, those large-magnitude features might affect the model's performance. They could dominate the other features, since they are so much bigger.
Also, look at the skewness. Age and tax are really skewed. Age is skewed to the left, since its median is greater than its mean. Vice versa, tax is skewed to the right, and this makes sense, since just a few people pay higher taxes, whereas the majority of people pay lower taxes. For age instead, we can speculate that in Boston these figures make sense: being an old city, the median house age is greater than the mean.
So looking at this plot, the conclusion is that we would like to homogenize the data before feeding it to a machine learning model. We do not want a single feature to dominate the others, and this plot shows us that scaling is crucial for our purposes. It's important that the data are processed in some way before being fed into a model. We need to scale the data, since the features have different magnitudes; we want to unskew the data, as in the taxes and age examples we saw before; and finally, we want to scale the data so that different features receive equal weight.
A downside of scaling is that we may lose the physical interpretability of the model. Thinking about the taxes we saw before: after scaling, the tax values become numbers that can no longer be linked to a physical interpretation. However, the upside of scaling is that the model's performance will typically improve.
So how do we implement scaling in scikit-learn? We have lots of methods to choose from. They are fitted on the training data, and we're going to focus on several of them in this course. In general, there are a few transformers that perform scaling: the StandardScaler, which is a sort of benchmark and should be used as a baseline; the MinMaxScaler; the RobustScaler; and the Normalizer. All of them are useful, so it's worth investigating each one.
To give you an idea, think about the data shown on this slide: data that can be grouped into two distinguishable clusters, namely green and black. They are pretty simple and, if you look at them, quite homogeneous, ranging on the X axis from 10 to 15 and on the Y axis from zero to 10.
Suppose we wish to apply a scaling procedure to these data; the first choice would be the StandardScaler. The StandardScaler is a transformer that learns the mean and the standard deviation of each feature from the existing data via the fit method, and then applies them to new data via the transform method. In particular, transform standardizes the features using the learned mean and scale. Notice that this scaler shrunk the data to an interval between roughly minus two and two.
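The fit-then-transform behavior of the StandardScaler can be demonstrated on a small illustrative array (the values below are made up, roughly matching the 10-to-15 and 0-to-10 ranges on the slide):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, like the slide's X and Y axes
X = np.array([[10.0, 0.0],
              [12.0, 5.0],
              [15.0, 10.0]])

scaler = StandardScaler()
# fit learns mean_ and scale_ per feature; transform then standardizes
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)            # per-feature mean learned from X
print(X_scaled.mean(axis=0))   # ~0 for each feature after standardization
print(X_scaled.std(axis=0))    # ~1 for each feature after standardization
```

On unseen data, you would call only `scaler.transform(...)`, so the new points are standardized with the statistics learned from the training data.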
Then we have the MinMaxScaler. This is used when we know there is an upper bound for our data. It's not used all the time, but there are cases where this scaler is the right choice. Imagine you're working in biostatistics and you have to predict whether a patient is going to have diabetes or not. A very important feature is, say, the level of cholesterol in the patient's blood. You can be sure, or say with very high probability, that the cholesterol level will never be above 1000. So you typically employ this family of scalers when you are sure that a lower and an upper bound exist in the distribution of the data.
Instead of calculating the mean and standard deviation of each feature, the MinMaxScaler learns the min and max and performs a specific transformation on the data: it maps all the data onto the interval between zero and one on both the X and Y axes. Note that both the StandardScaler and the MinMaxScaler are very sensitive to the presence of outliers. Then we have the RobustScaler. This is similar to the StandardScaler, but instead of calculating the mean and the standard deviation, it looks at the median and the percentiles. It is robust because the median is a nice estimator that is robust to outliers while sharing nice statistical properties with the mean. It is a better choice when we have heterogeneous data.
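The outlier sensitivity mentioned above is easy to see side by side. A minimal sketch with one deliberately extreme value (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One feature with a single large outlier (1000)
X = np.array([[200.0], [220.0], [250.0], [300.0], [1000.0]])

# MinMaxScaler maps onto [0, 1] using the learned min and max;
# the outlier squashes the other points towards 0.
mm = MinMaxScaler().fit_transform(X)
print(mm.ravel())

# RobustScaler centers on the median and scales by the interquartile
# range, so the non-outlying points keep a reasonable spread.
rb = RobustScaler().fit_transform(X)
print(rb.ravel())   # the median (250) maps exactly to 0
```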
Finally, we have the Normalizer. This is typically used with count data, so again it's not used a lot, but it performs a specific transformation based on the L2 norm, applied to each sample vector (each row) rather than to each feature. Note that the Normalizer also allows an L1 norm, which translates into a normalization by the sum of the absolute values of the observed data. It produces a transformation that homogenizes the data with respect to that particular norm, namely the Euclidean (L2) norm or the L1 norm. Okay, let's now move over to the demo environment and try to apply a StandardScaler to the Boston data.
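Both norm options can be sketched in a few lines; note how each row, not each column, ends up with unit norm (the data here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# L2 (Euclidean) norm: each row is divided by its length
X_l2 = Normalizer(norm="l2").fit_transform(X)
print(X_l2)   # first row becomes [0.6, 0.8], since sqrt(3^2 + 4^2) = 5

# L1 norm: each row is divided by the sum of its absolute values
X_l1 = Normalizer(norm="l1").fit_transform(X)
print(X_l1)   # first row becomes [3/7, 4/7]
```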
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.