Building Machine Learning Pipelines with scikit-learn - Part One

Dealing with Categorical Variables



This course is the first in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.

Learning Objectives

  • Understand the different preprocessing methods in scikit-learn
  • Perform preprocessing in a machine learning pipeline
  • Understand the importance of preprocessing
  • Understand the pros and cons of transforming the original data within a machine learning pipeline
  • Deal with categorical variables inside a pipeline
  • Manage the imputation of missing values

Intended Audience

This course is intended for anyone interested in machine learning with Python.


To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.


The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn


Welcome back. In this lecture, we're going to explore techniques for dealing with categorical variables. To start, let's import pandas as pd, since we need to read some data using the pd.read_csv function. In particular, the data we'll be using is in the restaurant CSV file, and we store it in the variable df. The data describes tips left by customers in restaurants in various cities.

So just to give you an idea, let's inspect the first five rows. This is a toy dataset, so don't worry too much about the values. The important thing is that we have a few variables which are numeric, namely total bill and size, while the others are of type object, such as sex, smoker, the day on which the client went to the restaurant, and the corresponding time, which is encoded with two classes, lunch and dinner. The columns that are not numeric are also called categorical variables, and we have to deal with them before feeding the dataset into our machine learning model.
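The inspection step above can be sketched as follows. The restaurant file itself is not reproduced here, so this is a minimal sketch on a hand-made stand-in: the column names (total_bill, size, sex, smoker, day, time) and every value are assumptions chosen to mirror the columns mentioned in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the restaurant data; all values are made up.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "size": [2, 3, 3, 2, 4],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No", "Yes"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

# Inspect the first five rows.
print(df.head())

# Split the columns into numeric ones and non-numeric (categorical) ones.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Selecting by dtype is only one way to find the categorical columns; with a real dataset it is worth checking the result by eye, since numeric codes can also represent categories.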

So remember, scikit-learn has two main kinds of objects, transformers and estimators. If we want to feed the data into an estimator, we need to pass it as numeric values. Suppose the task is classifying whether a client is generous or not, based on whether or not he or she left a tip. We have different strategies to deal with categorical variables in scikit-learn. A very common one is called one-hot encoding, which can be performed in just a few steps. We import the class OneHotEncoder from the preprocessing submodule, and then we proceed as with any standard transformer.

So we initialize the class and store it in the variable ohe, and then we call fit_transform on the training data. Then, just for readability, we wrap the result in a pandas DataFrame, calling toarray on it first so that it is converted to a dense array. Let's store this in the variable ohe_df and inspect the first five rows.

Okay. So can you spot a potential problem here? Typically, when we use one-hot encoder, we are dealing with at least one categorical variable. Suppose we wish to encode just the smoker column. What we do is the following: we initialize a one-hot encoder, and then we pass the smoker column of df to the fit_transform method. And what we get is the following. By encoding that column, we are basically creating two new columns, corresponding to smoker no and smoker yes. Those are dummy variables. A dummy variable is a variable that can take one of two values, zero or one. So a value of one in smoker yes means that the example has the value yes in the smoker column.
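Encoding a single column, as described above, might look like the following sketch. The five smoker values are made up for illustration, and the output column names (smoker_No, smoker_Yes) are built by hand from the fitted categories rather than taken from the lecture.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up values for a single categorical column.
df = pd.DataFrame({"smoker": ["No", "No", "Yes", "No", "Yes"]})

ohe = OneHotEncoder()

# Double brackets keep the input two-dimensional, which transformers expect.
encoded = ohe.fit_transform(df[["smoker"]]).toarray()

# Name each dummy column after the category it flags.
columns = [f"smoker_{category}" for category in ohe.categories_[0]]
smoker_df = pd.DataFrame(encoded, columns=columns)
print(smoker_df.head())
```

Each row has exactly one 1 across the two dummy columns, marking which category that example belongs to.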

Going back to the encoding of the whole dataset, here we have 46 columns. The problem is that one-hot encoder has encoded even the numeric columns. How can we be sure about that? We can use the categories_ attribute of the fitted one-hot encoder object. At the moment we see just two values, since we last constructed the ohe object with just one column, namely smoker. So let's run the cell above again, and now we can see that the total bill and size columns have also been encoded.
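The check described above can be reproduced on a small scale. This is a sketch under assumed data: a three-column frame with made-up values, where total_bill and size are numeric. Fitting the encoder on the whole frame turns every distinct numeric value into its own dummy column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "size": [2, 3, 3, 2, 4],
    "smoker": ["No", "No", "Yes", "No", "Yes"],
})

ohe = OneHotEncoder()
ohe_df = pd.DataFrame(ohe.fit_transform(df).toarray())

# categories_ holds one array of learned categories per input column.
for column, categories in zip(df.columns, ohe.categories_):
    print(column, list(categories))

# The number of output columns is the total number of distinct values.
n_dummy_columns = sum(len(categories) for categories in ohe.categories_)
print(ohe_df.shape, n_dummy_columns)
```

Here the five distinct bills and three distinct sizes account for eight of the ten output columns, which is the same effect that produces the wide result in the lecture.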

Now, this is not ideal. Think of a dataset with 1 million rows and a numeric column whose values can fall anywhere between minus infinity and plus infinity: every distinct value would get its own column, and your memory would just blow up. So that's not feasible. The natural question is: why should I use one-hot encoder if it doesn't allow me to choose which variables to encode? Well, one-hot encoder is the most efficient method in scikit-learn for creating dummy variables. But unfortunately, it does not let you choose which columns to encode. This means that one-hot encoding works well with a single feature at a time, like what we did above for the smoker column. That generates two columns because there are two possible categories, and the columns are ordered from left to right by sorted category value, so no comes before yes.

So in this case, in the smoker yes column, one is for smoker and zero is for non-smoker. One-hot encoder does not allow you to pass a list of columns to encode. But luckily, to one-hot encode several columns in scikit-learn, we can tell the transformer which columns to process, thanks to the column transformer. This is a very useful tool. In particular, from the compose submodule we import the make_column_transformer function. It looks like a pipeline: it is a manager that does the dirty job for you, that is, it tells Python which columns to transform.

So how does it work? Well, we initialize a one-hot encoder, and then we create an object with make_column_transformer. We need to pass it a list of the columns that we wish to encode, and we do that inside a tuple containing two elements: the first one is the object ohe, and the second is the list of columns. So we can put in here smoker, sex, then day, then time. The nice thing is that we can tell Python not to process the other columns by setting the argument remainder to the string "passthrough". In this way we're iteratively applying a one-hot encoder to each single feature specified in the list. We store this in a variable, ct, and then we apply the fit_transform method to the training data. And let's now inspect the first five rows.
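The column transformer step described above can be sketched as follows, again on an assumed stand-in frame with made-up values and assumed column names. The result of fit_transform is an array (possibly sparse, depending on its density), so the sketch converts it before wrapping it in a DataFrame for inspection.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the restaurant data; all values are made up.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "size": [2, 3, 3, 2, 4],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No", "Yes"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

ohe = OneHotEncoder()

# Encode only the listed columns; pass the remaining ones through untouched.
ct = make_column_transformer(
    (ohe, ["smoker", "sex", "day", "time"]),
    remainder="passthrough",
)

data_new = ct.fit_transform(df)

# The output may be a sparse matrix; make it dense before building a DataFrame.
if hasattr(data_new, "toarray"):
    data_new = data_new.toarray()

data_new = pd.DataFrame(data_new)
print(data_new.head())
```

The encoded columns come first, followed by the passed-through columns in their original order, so the last two columns here still hold the raw total_bill and size values.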

Okay, this error makes sense: data_new is not a pandas DataFrame, so let's convert it. Let's just do that like this. And there you are. A quick inspection shows you that the other columns, such as total bill and size, have not been encoded as dummy variables.

Let's now consider another example. Suppose you work for an insurance company and you need to create a model that deals with motor claims. You build a nice model to predict whether a given accident was fraudulent or not, and that model has many features, such as, for example, the car manufacturer.

So let's build a toy example that goes as follows. We're going to create a toy dataset. We pass to a pandas DataFrame a dictionary whose key is the feature constructor, with the following values passed as a list: Toyota, Audi and Renault, so these are the car makers. We store this in data_toy. Based on that data, we can build a one-hot encoder, and we fit and transform our toy data with it. Like so. Once again, we convert the result to an array and store it in a pandas DataFrame. So basically, we have encoded the constructor into three dummy columns. That makes sense, since we have three categories, that is, three possible values for the car constructor.
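A minimal sketch of this toy example is shown below. The variable names ohe_toy and toy_df are assumptions, as is the choice to label the dummy columns with the fitted categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy dataset with a single categorical feature.
data_toy = pd.DataFrame({"constructor": ["Toyota", "Audi", "Renault"]})

# ohe_toy is a hypothetical name for the encoder fitted on the toy data.
ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(data_toy).toarray()

# One dummy column per category seen during fit, in sorted order.
toy_df = pd.DataFrame(encoded, columns=ohe_toy.categories_[0])
print(toy_df)
```

The columns come out in sorted order (Audi, Renault, Toyota), not in the order the values appear in the data.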

Now, suppose that we get some new data. I'll define a new variable, which we'll call new_data, and we pass to a pandas DataFrame a dictionary that contains two values for constructor. We can put in here, in a list, Toyota again, and Jeep. Now, we fitted a one-hot encoder to some data that had three categories. So what would happen if we try to transform the new data with that one-hot encoder? Let's try it out and see.

So we take the toy one-hot encoder and transform the new data. We get an error, and the error is pretty direct. It says that it found an unknown category, 'Jeep', in column zero during transform. And this makes sense, right? Our transformer does not know what 'Jeep' means, since that value was never observed in the training phase. Therefore it's not able to encode that particular example. So what can we do? Well, one strategy might be to treat that example as not a member of any of the dummy variables created in the fit phase, so it would belong to, let's say, a trash category. Scikit-learn allows you to do that by setting the argument handle_unknown equal to the string 'ignore' inside the one-hot encoder. So let's try that out now.
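The failure described above can be reproduced with a short sketch. The names ohe_toy and error_message are assumptions; the sketch catches the exception only so that the message can be printed without stopping the script.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data_toy = pd.DataFrame({"constructor": ["Toyota", "Audi", "Renault"]})
new_data = pd.DataFrame({"constructor": ["Toyota", "Jeep"]})

# Default behaviour: unknown categories raise an error at transform time.
ohe_toy = OneHotEncoder()
ohe_toy.fit(data_toy)

error_message = None
try:
    ohe_toy.transform(new_data)
except ValueError as err:
    # 'Jeep' was never seen during fit, so the encoder refuses to transform.
    error_message = str(err)

print(error_message)
```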

So we'll replicate this step. Now, if we try to transform the new data, we should be able to handle the new category. So we transform the new data, and we get a sparse matrix. Let's also set the argument sparse equal to False, but then we have to remove the call to the toarray method inside the pandas DataFrame. So here we go. We got an extra row for Jeep made of all zeros, meaning that none of the categories observed in training is suitable to describe that example. So what is the implication of that? Well, we are implicitly telling the machine that the example belongs to another category.
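A sketch of the handle_unknown step follows. The lecture passes sparse=False to get a dense result; newer scikit-learn releases renamed that argument to sparse_output, so this sketch calls toarray on the result instead, which works regardless of which name a given release accepts. The variable names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data_toy = pd.DataFrame({"constructor": ["Toyota", "Audi", "Renault"]})
new_data = pd.DataFrame({"constructor": ["Toyota", "Jeep"]})

# Unknown categories are encoded as all zeros instead of raising an error.
ohe_toy = OneHotEncoder(handle_unknown="ignore")
ohe_toy.fit(data_toy)

# Convert the sparse result to a dense array before wrapping it.
encoded_new = ohe_toy.transform(new_data).toarray()
new_df = pd.DataFrame(encoded_new, columns=ohe_toy.categories_[0])
print(new_df)
```

The Toyota row keeps its single 1, while the Jeep row is all zeros, which is how the encoder signals a category that was not seen during fit.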

So that concludes this lecture on handling categorical variables with scikit-learn. We have seen a very important transformer, one-hot encoder, which creates dummy variables out of categorical variables. In the next lecture, we're going to explore techniques that are used to impute missing values. See you there.

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.