Data Wrangling with Pandas
In this course, we are going to explore techniques for advanced data exploration and analysis using Python. We will focus on the Pandas library and work through real-life scenarios to help you better understand how data can be processed with Pandas.
In particular, we will explore the concept of tidy datasets, the concept of the multi-index and its impact on real datasets, and how to concatenate and merge different Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Welcome back. In this lecture, we are going to look at the concept of tidy data. In his seminal paper published in 2014, Hadley Wickham introduced the concept of Tidy Data, a blueprint for how a dataset should be defined in terms of its structure and semantics. To frame the problem, let us consider the following quote from one of the greatest Greek philosophers, Aristotle, in his book Nicomachean Ethics (Book 2). “For men are good in but one way, but bad in many.” This quote translates well into the concept of tidy data: all tidy datasets share the same characteristics, whereas each messy dataset is messy in its own way.
In general terms, a dataset is a collection of values described in terms of variables and observations. In particular: A variable contains all values that measure the same underlying attribute across statistical units. Real-life examples of a variable are the height, gender or age of a person. An observation contains all values measured on the same statistical unit, for example a person, across attributes. In practice, datasets are typically represented as tables made up of rows and columns.
To frame this issue, let us consider the following data structure, which is very common in practice. In this case, each row is uniquely identified by one of three stocks, and each column uniquely identifies a calendar date on which the price was observed. However, there are many ways to structure the same underlying data.
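As a minimal sketch of this first layout, the snippet below builds a small wide table with one row per stock and one column per date. The ticker symbols (`AAA`, `BBB`, `CCC`), the dates and the prices are made up for illustration and are not the data shown in the lecture.

```python
import pandas as pd

# Made-up closing prices, for illustration only (not real market data).
# Rows are stocks, columns are the dates on which the price was observed.
wide = pd.DataFrame(
    {
        "2021-01-04": [10.0, 20.0, 30.0],
        "2021-01-05": [10.5, 19.5, 30.2],
        "2021-01-06": [10.8, 19.9, 29.7],
    },
    index=pd.Index(["AAA", "BBB", "CCC"], name="Symbol"),
)
wide.columns.name = "Date"

print(wide)
```

Note that the dates live in the column headers here, so the `Date` variable is not stored in a column of its own.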
Let us consider the following example. In this case, rows and columns have been transposed. The data values are the same, but the layout is different. Alternatively, the same data can be described in the following fashion. We identify four different attributes, namely `Date`, `Close`, `Volume` and `Symbol`. Each statistical unit is uniquely identified by the aforementioned quadruple.
A natural question is: which structure is the most suitable to represent the data? Well, there is no single right answer. I would say it depends on the business objective we have in mind. Therefore, we have to answer a fundamental question: what is a tidy dataset? Tidying is the process of consistently mapping the meaning of a dataset to its structure.
In particular, as Wickham stated, `tidy data` have the following characteristics: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value has its own cell. So, given these features, which of the previous datasets is consistent with this definition? Table 3 is the tidy version of Table 0.
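The two alternative layouts can be sketched as follows. This is only an illustration under assumed data: the symbols, dates, prices and volumes are hypothetical, and the name `long_format` is ours.

```python
import pandas as pd

# Made-up values, for illustration only.
wide = pd.DataFrame(
    {"2021-01-04": [10.0, 20.0], "2021-01-05": [10.5, 19.5]},
    index=pd.Index(["AAA", "BBB"], name="Symbol"),
)

# Transposed layout: one row per date, one column per stock.
transposed = wide.T
transposed.index.name = "Date"

# Long layout: the four attributes Date, Close, Volume and Symbol
# each get their own column, with one row per stock and date.
long_format = pd.DataFrame(
    {
        "Date": ["2021-01-04", "2021-01-05", "2021-01-04", "2021-01-05"],
        "Close": [10.0, 10.5, 20.0, 19.5],
        "Volume": [1000, 1200, 500, 450],
        "Symbol": ["AAA", "AAA", "BBB", "BBB"],
    }
)

print(transposed)
print(long_format)
```

All three objects carry the same closing prices; only the arrangement of rows and columns changes.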
Each row represents a statistical observation, i.e. the price and volume of a stock, and each column is a variable. This kind of data representation is often called `stacked` or `record` format, since each statistical observation represents a unique record.
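One possible way to move from a wide layout to the stacked format is `DataFrame.melt`. The sketch below assumes a small wide table with made-up symbols, dates and prices, and it covers only the closing price, not the volume.

```python
import pandas as pd

# Made-up closing prices in wide layout: one column per date.
wide = pd.DataFrame(
    {
        "Symbol": ["AAA", "BBB", "CCC"],
        "2021-01-04": [10.0, 20.0, 30.0],
        "2021-01-05": [10.5, 19.5, 30.2],
    }
)

# Turn the date columns into a single Date variable, so that each row
# becomes one observation: the closing price of a stock on a date.
tidy = (
    wide.melt(id_vars="Symbol", var_name="Date", value_name="Close")
    .assign(Date=lambda df: pd.to_datetime(df["Date"]))
    .sort_values(["Symbol", "Date"])
    .reset_index(drop=True)
)

print(tidy)
```

After melting, `Symbol`, `Date` and `Close` are each a column, and the three stocks observed on two dates give six rows, one per observation.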
Note that real datasets often violate at least one of the characteristics of tidy data. A tidy dataset should be the starting point for any analyst who wants to perform more complex operations on the data.
To wrap up, in this lecture we have seen that data in the wild can be really messy. We have looked at the Tidy Data framework, which serves as a benchmark for checking the consistency of a dataset before it is used for more complex operations.
In the next lecture, we are going to look at one family of operations that we might perform on different sources of data: Merging and Concatenating (Tidy) Data. See you there!
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.