Data Wrangling with Pandas

This course is part of the learning path Wrestling with Data.
Course Introduction
Difficulty: Intermediate
Duration: 1h 7m
Students: 56
Rating: 3.4/5

Description

In this course, we explore techniques for advanced data exploration and analysis in Python. We focus on the Pandas library and use real-life case scenarios to help you better understand how data can be processed with Pandas.

In particular, we will explore the concept of tidy datasets, the concept of a multi-index and its impact on real datasets, and how to concatenate and merge different Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame and how to plot results with Pandas.

If you have any feedback relating to this course, feel free to reach out to us at support@cloudacademy.com.

Learning Objectives

  • Learn what a tidy dataset is and how to tidy data
  • Merge and concatenate tidy data
  • Learn about multi-level indexing and how to use it
  • Transform datasets using advanced data reshape operations with Pandas
  • Plot results with Pandas
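As a rough illustration of the first two objectives, the sketch below tidies a small wide table with `melt`, then concatenates and merges the result. It is not taken from the course materials: the tickers `AAA` and `BBB`, the `Sector` lookup table, and all values are made up for the example.

```python
import pandas as pd

# Hypothetical wide table: one row per day, one column per ticker.
# Tickers and prices are invented for illustration only.
wide = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"],
    "AAA": [10.0, 11.0],
    "BBB": [200.0, 198.0],
})

# Tidy it: one row per (Date, Ticker) observation, one column per variable.
tidy = wide.melt(id_vars="Date", var_name="Ticker", value_name="Close")

# Concatenate: stack another tidy frame with the same columns underneath.
more = pd.DataFrame({"Date": ["2020-06-03"], "Ticker": ["AAA"], "Close": [12.0]})
stacked = pd.concat([tidy, more], ignore_index=True)

# Merge: attach attributes from a (hypothetical) lookup table via a shared key.
sectors = pd.DataFrame({"Ticker": ["AAA", "BBB"], "Sector": ["Tech", "Retail"]})
merged = stacked.merge(sectors, on="Ticker", how="left")

print(merged)
```

Concatenation stacks objects that share the same columns, while merging joins objects on a key, which is why the two operations are used for different purposes.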

Intended Audience

This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.

Prerequisites

To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.

Resources

The GitHub repo for this course can be found here.

Transcript

Welcome, my name is Andrea Giussani and I am going to be your instructor for this course on Data Wrangling with Pandas. In this course we are going to explore techniques for advanced data exploration and analysis using the Python language. We will make use of the Pandas library, and throughout the course we will focus on real-life case scenarios.

Pandas is probably one of the most important libraries in Python for Data Science and Analytics. It was originally developed by Wes McKinney in 2008, and became a NumFOCUS sponsored project in 2015. In particular, we will explore the concept of Tidy Datasets, the concept of a Multi-index and its impact on real datasets, and the concatenation and merging of different Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame, and how to plot results with Pandas.
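To give a first feel for what a Multi-index looks like, here is a minimal sketch, assuming a small tidy table of closing prices. The tickers `AAA` and `BBB` and all values are invented for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical tidy table; tickers and prices are made up.
df = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime(
        ["2020-06-01", "2020-06-02", "2020-06-01", "2020-06-02"]
    ),
    "Close": [10.0, 11.0, 200.0, 198.0],
})

# Build a two-level index: Ticker is the outer level, Date the inner level.
indexed = df.set_index(["Ticker", "Date"]).sort_index()

# Select every row belonging to one outer-level key.
aaa = indexed.loc["AAA"]

# Select one inner-level key across all tickers with a cross-section.
first_day = indexed.xs(pd.Timestamp("2020-06-01"), level="Date")

# Move the Ticker level into the columns to get one column per ticker.
wide = indexed["Close"].unstack("Ticker")

print(wide)
```

Keeping both keys in the index makes selections by ticker or by date short, and `unstack` shows how a level can be moved between rows and columns.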

Before taking this course, we strongly encourage you to take our Working with Pandas course, the link to which is included in the transcript of this lecture. Note that the materials, as well as the data used in this course, are available at the GitHub repository related to this course.

As a scientist or engineer, you are going to spend an incredible amount of time cleaning and preparing data. Indeed, approximately 80% of the time in a standard data analysis pipeline tends to be spent on these tasks. So a natural question is: why is data manipulation so important? To frame this problem, let us focus on the following dataset.

Each row of this dataset contains information about the closing price and volume for the stocks `GOOGLE`, `MICROSOFT`, and `AMAZON` for each trading day observed in June 2020. Can you spot a potential problem here? Well, there are many.

First of all, the column Date is repeated three times, which adds noise rather than useful information. Secondly, the column names are not unique, which does not help us identify the desired information at a glance. Third, the Close columns differ in magnitude, which might be misleading in a financial data analysis, say if you are interested in investigating which stocks have surged the most. Reshaping the available data into the most appropriate form is therefore a necessary step before starting any statistical analysis.
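The three problems above can be reproduced and addressed in a small sketch. This is only one possible approach, not necessarily the one used later in the course: it assumes the raw table repeats the columns Date, Close, and Volume once per stock, and it uses two made-up tickers (`AAA`, `BBB`) with invented values instead of the real June 2020 data.

```python
import pandas as pd

# Hypothetical stand-in for the dataset described above: the same three
# column names repeated once per stock. All values are invented.
columns = ["Date", "Close", "Volume"] * 2
rows = [
    ["2020-06-01", 10.0, 100, "2020-06-01", 200.0, 50],
    ["2020-06-02", 11.0, 120, "2020-06-02", 190.0, 60],
]
raw = pd.DataFrame(rows, columns=columns)
print(raw.columns.is_unique)  # duplicate names make label-based access ambiguous

# Problems 1 and 2: slice each stock's block by position, label it with its
# ticker, and stack the blocks so that Date appears only once.
tickers = ["AAA", "BBB"]  # assumed order of the blocks in the raw table
blocks = []
for i, ticker in enumerate(tickers):
    block = raw.iloc[:, 3 * i : 3 * i + 3].copy()
    block.insert(0, "Ticker", ticker)
    blocks.append(block)
tidy = pd.concat(blocks, ignore_index=True)

# Problem 3: express each close relative to the stock's first observed close,
# so that stocks trading at very different price levels become comparable.
first_close = tidy.groupby("Ticker")["Close"].transform("first")
tidy["CloseRel"] = tidy["Close"] / first_close

print(tidy)
```

With a relative measure such as the hypothetical `CloseRel` column, a value above 1 means the stock has risen since the first day, whatever its absolute price.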

This course will help you to understand why you should perform data manipulation on your data source, when you should apply techniques to reshape it, and how you should represent your data according to the scientific or business question you have in mind. Before diving into these techniques, it is important to understand the fundamental concepts behind the process of structuring datasets to facilitate analysis.

We will therefore focus on the concept of tidy data in the next lecture. So I'll see you there!

Lectures

  • Tidying a Dataset
  • Merging and Concatenating Tidy Data
  • Multi-Level Indexing
  • Merging Tidy Data
  • Transformation of a Dataset
  • Plotting Results with Pandas
  • Course Summary

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.