Building Machine Learning Pipelines with scikit-learn - Part Two
1h 9m

This course is the second in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.

Learning Objectives

  • Explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model
  • Understand the concept of the bias-variance trade-off and regularized ML models
  • Explore linear models for classification and how to evaluate them 
  • Learn how to choose a model and fit that model to a dataset

Intended Audience

This course is intended for anyone interested in machine learning with Python.


To get the most out of this course, you should have first taken Part One of this two-part series.


The resources related to this course can be found in the following GitHub repo:


Hello and welcome. My name is Andrea Giussani, and I am going to be your instructor for this course on Building a Machine Learning Pipeline with scikit-learn, Part Two. In this course, we're going to explore techniques that are used to fit a model, and to predict an outcome in a Machine Learning Pipeline using the scikit-learn Python library.

Remember that in scikit-learn, we distinguish between two main classes: transformers and estimators. In Building Machine Learning Pipelines, Part One, we focused on transformers. So, if you've not taken that course already, I strongly encourage you to do so before starting this one. In this course, we will dive into the subject of estimators. In general, an estimator is a class characterized by two methods: a fit method, which is used to learn something about the data, and a predict method, which uses the pattern learned from the training data to predict outcomes for new data.
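The fit/predict interface described above can be sketched as follows. This is a minimal illustration on a tiny made-up dataset, using LinearRegression as a stand-in for any scikit-learn estimator; the numbers are purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: four samples with one feature each, where y = 2x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)                    # learn something about the data
preds = model.predict([[5.0]])     # apply the learned pattern to new data
print(preds)                       # close to 10.0, since y = 2x
```

Every scikit-learn estimator follows this same two-step pattern, which is what makes them interchangeable inside a pipeline.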

In this course, we assume that the data you want to ingest into an estimator has already been preprocessed. We will focus on supervised learning tasks, with two possible applications: regression and classification. We will not explore unsupervised learning techniques here. If you want to understand the difference between these two families of algorithms, I strongly encourage you to try the course Introduction to Machine Learning Concepts, available in our content library. In particular, we will explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model. Through that example, you will be introduced to the concept of the bias-variance trade-off, and therefore to regularized models.

Finally, we will have a look at linear models for classification, and we will look at how to evaluate them as well. In a nutshell, the fitting procedure can be described as follows. Before feeding a model with the necessary data, we typically split the data into train and test sets using an 80/20 split: 80% of the data is used to train the model, and the remaining 20% is kept separate and is used after training to evaluate the model's performance.
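The 80/20 split described above can be sketched with scikit-learn's train_test_split helper; the synthetic data here is just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 samples with a single feature.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```

The held-out 20% is never shown to the model during training, which is what makes the later evaluation an honest estimate of performance on unseen data.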

Then, based on the data and the application we have in mind, we choose a model and fit that model to the training set. This is a very important step, since the supervised learning algorithm analyzes the training data and produces a mapping function, which is called a classifier if the output is discrete, or a regression function if the output is continuous. For example, a classification problem is one where the output variable is a category, such as classifying customers as fraudulent or not fraudulent.

A regression problem, on the other hand, is one where the output variable is a real value, such as the price of a stock in a given month. We then use the trained model to predict the outcome for the test data, and we evaluate the performance of the model on the test set.
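Putting the steps together, here is a hedged end-to-end sketch of the fit, predict, and evaluate cycle on synthetic regression data; the noisy linear signal and the choice of mean squared error as the metric are illustrative assumptions, not part of the course datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3x plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)

# 80/20 split, as described earlier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only.
model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out test set and evaluate.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"test MSE: {mse:.3f}")
```

Because the model never saw the test set during fitting, the reported MSE approximates how the model would perform on genuinely new data.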

Now please note that the two datasets we are going to use in this course are available on the GitHub repository for this course.

The structure of the course is as follows. In the next lecture, we are going to build a standard machine learning pipeline with a linear regression model, and we will investigate an important concept for improving the performance of the training phase, known as cross-validation. In Lecture Three, we will look at the bias-variance trade-off, and you'll learn what overfitting and underfitting mean. In Lecture Four, we will deal with a class of models that are used to prevent extreme situations of either overfitting or underfitting, known as shrinkage models. And finally, in Lecture Five, we will discuss linear models for classification and how to evaluate classification model output.
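As a preview of the cross-validation idea mentioned above, here is a minimal sketch using scikit-learn's cross_val_score on synthetic data; the 5-fold setup and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a clean linear signal with mild noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.3, size=100)

# 5-fold cross-validation: the data is split into 5 folds, and the model
# is trained 5 times, each time validated on a different held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5)  # R^2 by default
print(scores.mean())
```

Averaging the per-fold scores gives a more stable performance estimate than a single train/test split, which is why cross-validation is covered in depth in the next lecture.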

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.