This course is the first in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
Learning Objectives
- Understand the different preprocessing methods in scikit-learn
- Perform preprocessing in a machine learning pipeline
- Understand the importance of preprocessing
- Understand the pros and cons of transforming the original data within a machine learning pipeline
- Deal with categorical variables inside a pipeline
- Manage the imputation of missing values
Intended Audience
This course is intended for anyone interested in machine learning with Python.
Prerequisites
To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.
Resources
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Hello and welcome! My name is Andrea Giussani and I'm going to be your instructor for this course on Building Machine Learning Pipelines with scikit-learn, part one. In this course, we are going to explore techniques that are used to process and transform data in a machine learning pipeline using the scikit-learn Python library.
This course is intended for anyone interested in machine learning, which has become quite the buzzword nowadays, and if you've ended up here, it's likely because you're interested in the topic. For a general introduction to the field, please check out our course Introduction to Machine Learning Concepts, which is available in our content library.
The objective of this course is to get you acquainted with the different preprocessing methods in scikit-learn. At the end of this course, you will be able to: Perform preprocessing in a machine learning pipeline. Understand the importance of preprocessing. Understand the pros and cons of transforming the original data within a machine learning pipeline. Deal with categorical variables inside a pipeline. And finally, manage the imputation of missing values.
As I said, everything will be implemented using scikit-learn, which is a very popular Python library for machine learning. In particular, scikit-learn is an open-source library that supports supervised and unsupervised learning. It provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other methods related to data management and transformation.
In particular, we are going to focus on the family of preprocessors, since those are used in the initial phase of a machine learning pipeline. We are going to get our hands dirty throughout the course, using practical examples and a demo environment to explore the methodologies covered. We'll look at the pros and cons of each method, and when it should be used.
Historically, scikit-learn was developed by David Cournapeau and Matthieu Brucher in 2008. Its first software release dates to 2010, but the official release came in 2013. The motivation for using this library is that it is easy to install and has nice features, as well as good documentation, which is available online. Note that this course requires scikit-learn version 0.20 or higher.
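If you want to verify which version is installed in your environment, a quick check looks like this (a minimal sketch, assuming scikit-learn is already installed):

```python
# Print the installed scikit-learn version; it should be 0.20 or higher.
import sklearn

print(sklearn.__version__)
```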
Before we go on with the course, let's first cover some basic but key machine learning terminology that we will continually encounter throughout the remainder of this course. First, a feature: this is a measurable property of an item or piece of data, also called a variable or column, and each item is described by a number of features. Next, a label: each item or piece of data may be tagged with a known word or phrase, known as its label. Then we have an example, also known as a sample or statistical observation, which is basically a row in our dataset; each example has a set of features, and may also have a label assigned to it. The example here relates to the Boston dataset that we are going to use in the next lecture. It contains different features, and a target variable called MEDV.
Since the first part of the course is based on this dataset, it makes sense to spend some time now looking at how that dataset is structured. Note that we have a few features, such as CRIM, that are not directly observable in nature but can be obtained with some feature engineering. Then we have some self-explanatory and directly observable features, such as AGE and TAX. And then we have some indicators, such as INDUS or CHAS, that can be inferred from other statistics.
We will get back to this dataset as soon as we move on to the demo environment, but for the moment it is important that we understand the terminology that will be used throughout this course. The dataset is available in our course-specific GitHub repository, the link to which can be found in the transcript for this lecture below and also in the course description.
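To get a first feel for the data once you have cloned the repository, you could inspect it with pandas. This is just a sketch: the file name boston.csv is hypothetical, so adjust the path to match the actual file in the repo.

```python
import pandas as pd

# Hypothetical file name -- adjust to the actual file in the course repo.
df = pd.read_csv("boston.csv")

print(df.shape)             # number of examples (rows) and columns
print(df.columns.tolist())  # feature names, plus the target variable MEDV
print(df.head())            # the first few examples
```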
Scikit-learn can be divided into two major categories of classes: estimators and transformers. We will focus on transformers first, since those are used in the dataset creation phase. In particular, transformers are characterized by two methods: a fit method, which is used to learn something about the data, and a transform method, which applies what the transformer has learned to the data. And then we have estimators, which are also characterized by a fit method, used to learn something about the data, paired with a predict method.
Think of a transformer as a utility that converts your data: the transformer learns some patterns from the training data, so that when you ingest new data, those same patterns can be applied to it. In contrast, an estimator is a model. Think of a regression model that is used to estimate a stock price in two months' time. You fit the model using historical data, and then you predict on new data. In particular, you want to predict something that the estimator does not know a priori, but can possibly learn from the training data.
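To make the distinction concrete, here is a minimal sketch of the two interfaces, using StandardScaler as the transformer and LinearRegression as the estimator; the tiny dataset is made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_new = np.array([[5.0]])

# Transformer: fit() learns the column means and standard deviations,
# transform() applies them -- to the training data and to new data alike.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)

# Estimator: fit() learns the model parameters from the training data,
# predict() produces estimates for data it has never seen.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print(model.predict(X_new_scaled))  # roughly [10.]
```

Note that the scaler transforms the new data using the means and standard deviations it learned from the training data; this consistency is exactly what makes transformers safe to reuse inside a pipeline.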
It's important you understand the main difference between the two families. In this course, we are going to cover the family of transformers. The course is structured like this. In the next lecture, we are going to explore a few methods for performing preprocessing, and we will focus on the standard scaler.
In lecture three, we will deal with preprocessing methods for categorical variables, and in lecture four we will investigate methods that are used to impute missing values. So if you're ready, let's get started!
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.