hands-on lab

PySpark - Preprocessing

Beginner

Up to 1h

469

4.6/5

Description

In this hands-on lab, you will strengthen your knowledge of PySpark, the Python API for Apache Spark and a popular choice for big data analysis and modeling. You will learn how to create a dataset with PySpark and manipulate it using standard filtering and slicing techniques. The lab will challenge your data management skills, and by the end you should have a practical understanding of how PySpark is used to build data analysis pipelines.

Learning Objectives

Upon completion of this lab, you will be able to:

create a Spark Session and load data into a Spark DataFrame;
query data with PySpark using standard SQL;
create a new column in a Spark DataFrame;
perform standard data cleaning: type consistency, filtering, and slicing;
pivot and manipulate a Spark DataFrame.

Intended Audience

This lab is intended for:

Those interested in performing data analysis with Python.
Anyone involved in data science and engineering pipelines.

Prerequisites

You should possess:

An intermediate understanding of Python.
Basic knowledge of SQL.
Basic knowledge of the Pandas library.

About the author

Andrea Giussani

Data Scientist

Students

6,782

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.

Covered topics

Development

Python

Lab steps

PySpark - Data Manipulation

Lab Rules

Lab rules apply
