PySpark - How to build a Machine Learning Pipeline

Lab Steps

PySpark - Machine Learning Pipeline

Time Limit: 1h


In this hands-on lab, you will strengthen your knowledge of PySpark, a popular Python library for big data analysis and modeling. You will learn how to create a machine learning pipeline with PySpark, evaluate it with appropriate metrics, and tune the model.

Your machine learning skills will be challenged, and by the end of this lab you should have a practical understanding of how to build data analysis pipelines with PySpark.

Learning Objectives

Upon completion of this lab, you will be able to:

  • Fit a Logistic Regression model in PySpark.
  • Perform cross-validation in PySpark.
  • Evaluate the model's performance.
  • Perform inference on new, unseen data.

Intended Audience

This lab is intended for:

  • Those interested in performing data analysis with Python.
  • Anyone involved in data science and engineering pipelines.


Prerequisites

You should possess:

  • An intermediate understanding of Python.
  • Basic knowledge of SQL.
  • Basic knowledge of the Pandas library.

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.