This course is the first in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Understand the different preprocessing methods in scikit-learn
- Perform preprocessing in a machine learning pipeline
- Understand the importance of preprocessing
- Understand the pros and cons of transforming original data within a machine learning pipeline
- Deal with categorical variables inside a pipeline
- Manage the imputation of missing values
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Welcome back. In this lecture, we explore a few techniques that can be used to impute missing values in a dataset. For various reasons, many real-world datasets contain missing values, often encoded as blanks or other placeholders, such as NaN. Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and, more importantly, that each value carries a precise meaning. A basic strategy for using incomplete datasets is to discard entire rows and/or entire columns containing missing values. However, this comes at the price of losing data, which may be valuable.
So a better strategy is to impute the missing values, that is, to infer them from the known part of the data. Please note that this strategy should be applied only if the imputation makes sense; otherwise we're going to introduce bias into our data. So let's distinguish between two different kinds of imputation algorithms: univariate versus multivariate imputers. I would like to show you the use of those imputers with a toy example, and that's done for you here in this snippet.
So we have this dataset called data_fake. It contains three columns, namely items, age, and cost. And as you can see, we have a few null values in this particular table. Now, one possibility would be to simply drop the rows where we have null values, but in this case we would end up with just one row left. That would obviously not be good, because we would be losing a lot of information and the model would probably underperform as a result. So what we can do instead is apply an imputer algorithm that imputes the missing values from the known part of the data.
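The actual values of data_fake are not reproduced in this transcript, so the sketch below builds a hypothetical stand-in with the same three columns and a similar pattern of missing values. The numbers are assumptions chosen only for illustration. It shows why dropping incomplete rows (or columns) is costly here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lecture's data_fake table.
# The real values are not shown in the transcript; these are illustrative only.
data_fake = pd.DataFrame(
    {
        "items": [np.nan, 2, 3, 4, 5],
        "age": [1, 2, 3, 4, np.nan],
        "cost": [10_000, 11_000, np.nan, np.nan, 15_000],
    }
)

# Count the missing values in each column.
missing_per_column = data_fake.isna().sum()
print(missing_per_column)

# Dropping every row that contains a missing value keeps only complete rows.
complete_rows = data_fake.dropna()
print(complete_rows)

# Dropping every column that contains a missing value is even more drastic here,
# because all three columns have at least one gap.
remaining_columns = data_fake.dropna(axis=1)
print(remaining_columns.shape)
```

With this stand-in table, only a single row survives the row-wise drop, which mirrors the situation described in the lecture.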
Now we have two families of imputers. Univariate imputers are algorithms that impute values in the i-th feature dimension using only the non-missing values in that same feature dimension. By contrast, multivariate imputers use the entire set of available features to estimate the missing values.
So let's start our investigation with the SimpleImputer, which is a univariate imputer. It takes the dataset and, for each single feature, imputes the missing values based on a strategy that you specify and on the actual data. For example, you can fill the missing values with the mean or median of the available values in that column. So from scikit-learn's impute submodule we import SimpleImputer, and we initialize it with two arguments. The first is missing_values, which specifies the placeholder to look for inside the data, in this case NaN. The second is the strategy.
So in this case, we'll set the strategy to median, and that means we are going to fill each missing value with the median of its column. To do that, we apply fit and transform to the data_fake dataset. And here we are. For the cost column, we have imputed the value 11K, which is the median. For the age column, the median coincides with the mean, and the same holds for items. We can do the same by setting strategy to mean instead of median, and of course we're going to get different values now.
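A minimal sketch of the univariate imputation described above, using a hypothetical stand-in for data_fake (the real table is not shown in the transcript, so the values are assumptions). The `statistics_` attribute holds the per-column fill value that the fitted imputer uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for data_fake; values are illustrative assumptions.
data_fake = pd.DataFrame(
    {
        "items": [np.nan, 2, 3, 4, 5],
        "age": [1, 2, 3, 4, np.nan],
        "cost": [10_000, 11_000, np.nan, np.nan, 15_000],
    }
)

# Fill each column's gaps with that column's median.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled_median = median_imputer.fit_transform(data_fake)

# Same idea, but with the column mean as the fill value.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_mean = mean_imputer.fit_transform(data_fake)

# One fill value per column, in the order items, age, cost.
print("median fill values:", median_imputer.statistics_)
print("mean fill values:  ", mean_imputer.statistics_)
```

In this stand-in table the median and mean agree for items and age but differ for cost, because the observed costs are not symmetric around their middle value.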
So now the natural question is: how can I know which features were imputed? This is a good question to investigate, since it's going to come in handy in the future. To do that, we use the add_indicator argument. By default it is equal to False, but we set it to True, run the cell, and now we have new columns. We have three extra columns, and each is a dummy variable denoting whether that feature was imputed or not. If it's equal to zero, then we know that the value was not imputed. And if it's equal to one, then it was.
So we can get a better understanding by passing this object to a pandas DataFrame. You then see that for row zero, the first column, items, was imputed, since its indicator is equal to one. For rows two and three, the last indicator column has value one, and therefore those rows were imputed in the cost column. So this is the SimpleImputer, and it takes only one feature into account to perform the imputation. Sometimes, however, it is better to impute values using information from other columns. And that leads us to multivariate imputers.
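The indicator columns can be sketched as follows, again on a hypothetical stand-in for data_fake. The column names ending in `_was_imputed` are made up for readability; they are not produced by scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for data_fake; values are illustrative assumptions.
data_fake = pd.DataFrame(
    {
        "items": [np.nan, 2, 3, 4, 5],
        "age": [1, 2, 3, 4, np.nan],
        "cost": [10_000, 11_000, np.nan, np.nan, 15_000],
    }
)

# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(missing_values=np.nan, strategy="median", add_indicator=True)
filled = imputer.fit_transform(data_fake)

# indicator_.features_ lists the positions of the features that had gaps at fit time.
flagged = data_fake.columns[imputer.indicator_.features_]
# The "_was_imputed" suffix is a naming choice for this sketch only.
column_names = list(data_fake.columns) + [f"{name}_was_imputed" for name in flagged]

result = pd.DataFrame(filled, columns=column_names)
print(result)
```

Wrapping the array in a DataFrame with explicit names makes it easier to see which cell each indicator refers to.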
Suppose we have two features, let's say age and cost, and a few values for cost are missing. We assume that cost increases with age. Think of a car: the maintenance costs will increase with age, right? So if that holds, then we expect to impute a higher cost for a higher age. To do so, we have two possible strategies. The first one is an iterative imputer based on a Bayesian ridge regression model, which predicts the missing values based on all the observed, non-missing features.
So suppose we want to impute the value of cost for each row where it is missing. The iterative imputer is going to fit a regression model that uses items and age as features. To predict the missing cost, it passes the values of the other features to the model. By default this model is a Bayesian ridge regression, but you can choose a different model to fit if you want. And at the end, it predicts the missing value.
So let's now see how this works in practice. We need to import enable_iterative_imputer from sklearn.experimental, and then from sklearn.impute we import IterativeImputer. We then initialize an IterativeImputer and use fit_transform on the data_fake dataset that we saw before. And we get this result. We get a value for items that is pretty high, we get imputed values for cost that are increasing with respect to age, and we get an imputed value for age. So this makes sense. We can, however, use another imputer known as the KNNImputer. It serves the same purpose as the iterative imputer, but works differently behind the scenes.
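A minimal sketch of the iterative imputer on a hypothetical stand-in for data_fake (the values are assumptions, so the imputed numbers will not match the lecture's output). The `random_state` argument is set only to make repeated runs comparable; with so few rows, scikit-learn may also emit a convergence warning.

```python
import numpy as np
import pandas as pd

# IterativeImputer is experimental: this import must come before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical stand-in for data_fake; values are illustrative assumptions.
data_fake = pd.DataFrame(
    {
        "items": [np.nan, 2, 3, 4, 5],
        "age": [1, 2, 3, 4, np.nan],
        "cost": [10_000, 11_000, np.nan, np.nan, 15_000],
    }
)

# The default estimator is a Bayesian ridge regression; each feature with gaps
# is modelled in turn as a function of the other features.
iterative_imputer = IterativeImputer(random_state=0)
filled_iterative = iterative_imputer.fit_transform(data_fake)

result = pd.DataFrame(filled_iterative, columns=data_fake.columns)
print(result)
```

Only the cells that were missing are replaced; the observed values pass through unchanged. A different regression model can be supplied through the `estimator` parameter.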
So the KNN imputer is based on a k-nearest neighbors model, and it imputes each missing value using the values observed in the K most similar rows. Suppose I want to impute the missing value in the cost column for row number two. The KNN imputer is going to look for the K most similar rows inside the dataset based on the known features, which in our case are age and items. It then takes the average of the corresponding cost values, and that average is used to impute the missing value.
So we import KNNImputer from sklearn.impute. We initialize the KNNImputer as usual, and I set the n_neighbors argument equal to two. That means I'm going to look for the two most similar rows inside the data frame. We then apply fit and transform to the data using the KNN imputer. And here are the results.
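A minimal sketch of the KNN imputer on a hypothetical stand-in for data_fake (the values are assumptions, so the imputed numbers will not match the lecture's output). Similarity between rows is measured with a Euclidean distance that skips missing coordinates.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in for data_fake; values are illustrative assumptions.
data_fake = pd.DataFrame(
    {
        "items": [np.nan, 2, 3, 4, 5],
        "age": [1, 2, 3, 4, np.nan],
        "cost": [10_000, 11_000, np.nan, np.nan, 15_000],
    }
)

# Each missing cell is filled with the average of that feature over the
# two nearest rows that have the feature observed.
knn_imputer = KNNImputer(n_neighbors=2)
filled_knn = knn_imputer.fit_transform(data_fake)

result = pd.DataFrame(filled_knn, columns=data_fake.columns)
print(result)
```

Because each fill value is an average of observed values, it always lies within the observed range of its column. Note that the distance is computed on the raw values, so a column on a much larger scale, such as cost here, can dominate the choice of neighbors unless the features are scaled first.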
So I got different results for age, and let's try to understand why. Before, we were fitting a regression model to the data, whereas here we're looking at the two most similar rows inside the data frame in order to impute the missing value. So, for instance, it's going to look at this row here and also this row. Then it takes the average, which is equal to six, and it does the same for the other values. And that concludes this lecture. We've looked at how to perform imputation of missing values with different strategies in scikit-learn, and we have seen two possibilities: univariate and multivariate imputers. Thanks for watching.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.