Data Wrangling with Pandas
In this course, we explore techniques for advanced data exploration and analysis using Python. We focus on the Pandas library and use real-life scenarios to help you better understand how data can be processed with Pandas.
In particular, we will explore the concept of tidy datasets, multi-level indexing and its impact on real datasets, and concatenating and merging Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Welcome back. In this lecture, we are going to dig into a very important concept that any scientist or engineer should have in their data science toolkit: multi-level indexing.
So let's take a look at this snippet from the last lecture. Remember that all_df was the concatenation of three different pandas objects. To improve the readability of this dataframe, we might set a new index made of the pair `Symbol` and `Date`. To do that, we pass the list of columns we would like to use as the index to the `set_index` method.
In Pandas, multi-level indexing is extremely useful because it allows you to create complex and sophisticated data structures. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as pandas Series or DataFrames. It also improves the readability of the output, since it lets you group results in a meaningful way.
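To make the idea of storing higher-dimensional data in a lower-dimensional structure concrete, here is a minimal sketch. It assumes a one-dimensional Series of closing prices indexed by (`Symbol`, `Date`); the tickers GOOG and MSFT and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-level index: every (Symbol, Date) combination.
index = pd.MultiIndex.from_product(
    [["AMZN", "GOOG", "MSFT"], ["2020-06-01", "2020-06-02"]],
    names=["Symbol", "Date"],
)

# A 1-D Series that nevertheless encodes 2-D data (symbol x date).
# The prices are invented placeholders, not real market data.
close = pd.Series([20.0, 21.0, 10.0, 11.0, 30.0, 31.0], index=index, name="Close")

# unstack() pivots one index level into columns, recovering the 2-D table.
wide = close.unstack("Symbol")
print(wide)
```

The Series and the wide table hold the same information; the multi-level index is what lets the one-dimensional object carry both dimensions.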
How do we set the index? As mentioned before, we call the `set_index` method on all_df. It takes as its argument a list of the columns we want to use as the index, in this case `Symbol` and `Date`. We also force the replacement of all_df in memory with the new one, for instance by passing `inplace=True`. Running this snippet therefore sets the pair (`Symbol`, `Date`) as the index of the dataframe all_df.
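As a rough sketch of this step, the snippet below builds a small stand-in for all_df and sets the two-level index. The tickers GOOG and MSFT, the column names `Close` and `Volume`, and every value are assumptions made for illustration; dates are kept as plain strings for simplicity.

```python
import pandas as pd

# Toy stand-in for all_df (invented values, concatenation order: GOOG, AMZN, MSFT).
all_df = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"] * 3,
    "Symbol": ["GOOG", "GOOG", "AMZN", "AMZN", "MSFT", "MSFT"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "Volume": [100, 110, 200, 210, 300, 310],
})

# Use Symbol (outer level) and Date (inner level) as the index.
# inplace=True modifies all_df itself instead of returning a new DataFrame.
all_df.set_index(["Symbol", "Date"], inplace=True)

print(all_df.head(2))
```

An equivalent form without `inplace` would be `all_df = all_df.set_index(["Symbol", "Date"])`, which some readers find easier to follow.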
A quick inspection of the first two rows shows that the dataframe now has two columns, close and volume, while Symbol and Date form the index of the new dataframe. Can you see a potential problem here? Each row is now indexed by the pair (`Symbol`, `Date`), but the dataframe still keeps its original order, that is, the order in which the dataframes were concatenated.
We might want to sort the index alphabetically. To do so, we call the `sort_index()` method on the dataframe: all_df.sort_index, with `inplace` set to True, so that the all_df dataframe in memory is replaced by the sorted one.
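The sorting step might look like the following sketch. It rebuilds the same kind of toy all_df, with assumed tickers, assumed column names `Close` and `Volume`, and invented values.

```python
import pandas as pd

# Toy stand-in for all_df in concatenation order (GOOG first); values are invented.
all_df = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"] * 3,
    "Symbol": ["GOOG", "GOOG", "AMZN", "AMZN", "MSFT", "MSFT"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "Volume": [100, 110, 200, 210, 300, 310],
}).set_index(["Symbol", "Date"])

# Sort by the index: first by Symbol, then by Date within each symbol.
all_df.sort_index(inplace=True)

print(all_df.head())
```

Sorting is not only cosmetic: pandas may raise an error when asked to take label slices from a MultiIndex that is not sorted, so it is a sensible step before slicing.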
You can see here that Google is no longer the first observation; Amazon is. We can see that from the first five rows.
Slicing a MultiIndex DataFrame is similar to slicing one with a single-level index. Although the logic is the same, note that MultiIndex keys generally take the form of tuples. For instance, to access the row for the stock Amazon on June 1st, 2020, we pass to `.loc` the tuple made of the symbol AMZN and the date '2020-06-01'.
In practice, this means the following: all_df.loc, and inside `.loc` we pass the tuple made up of the symbol AMZN and the date. We retain all the columns of all_df by passing the colon symbol as the column selector. If, instead, we only want the closing price of Amazon on June 1st, 2020, we do the same as before but pass the close column instead of the colon.
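The two lookups just described can be sketched as follows, again on a toy all_df. The column names `Close` and `Volume`, the tickers other than AMZN, and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for all_df, indexed by (Symbol, Date) and sorted; values are invented.
all_df = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"] * 3,
    "Symbol": ["GOOG", "GOOG", "AMZN", "AMZN", "MSFT", "MSFT"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "Volume": [100, 110, 200, 210, 300, 310],
}).set_index(["Symbol", "Date"]).sort_index()

# Full key as a tuple, colon for "all columns": the matching row.
row = all_df.loc[("AMZN", "2020-06-01"), :]

# Same key, but a single column: just the closing price.
close_price = all_df.loc[("AMZN", "2020-06-01"), "Close"]

print(row)
print(close_price)
```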
In general, a MultiIndex DataFrame is made of different levels. In our case, we have two levels, accessible with the following syntax: we take the index of all_df and access its `levels` attribute. The result is a list-like object with one entry per level, each entry holding the unique labels of that level.
The first level, the outermost, is at position 0. The inner level is at position 1, so we put a 1 in place of the 0.
To extract the closing price for Amazon and Google on June 1st, 2020, we can pass a list to the outer level of the index: instead of a single symbol string, we pass a list of symbol strings. We can also use `slice` to filter a MultiIndex DataFrame. To select all closing prices on June 1st, 2020, it is sufficient to pass `slice` for the outermost level.
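A short sketch of inspecting the levels, using a toy all_df with assumed tickers, assumed column names and invented values:

```python
import pandas as pd

# Toy stand-in for all_df, indexed by (Symbol, Date); values are invented.
all_df = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"] * 3,
    "Symbol": ["GOOG", "GOOG", "AMZN", "AMZN", "MSFT", "MSFT"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "Volume": [100, 110, 200, 210, 300, 310],
}).set_index(["Symbol", "Date"]).sort_index()

levels = all_df.index.levels   # one entry per index level

outer = levels[0]              # unique Symbol labels (outermost level)
inner = levels[1]              # unique Date labels (inner level)

print(all_df.index.nlevels)
print(list(outer))
print(list(inner))
```

Note that `levels` holds the unique labels of each level, not one label per row; `all_df.index.get_level_values(0)` would return the row-by-row labels instead.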
In practice, this translates as follows. We access the data with `.loc`, but now, instead of passing a symbol, we want to retain all closing prices observed on June 1st, 2020, so we use `slice` and pass `None` to it. You can use `slice(None)` to select all the contents of a level. You do not need to specify all the deeper levels; they are implied as `slice(None)`.
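The list-based and slice-based selections above might be written as in this sketch. The toy all_df, the tickers GOOG and MSFT, the column name `Close` and all values are assumptions; `pd.IndexSlice` is shown only as an alternative spelling and is not mentioned in the lecture.

```python
import pandas as pd

# Toy stand-in for all_df, indexed by (Symbol, Date) and sorted; values are invented.
all_df = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"] * 3,
    "Symbol": ["GOOG", "GOOG", "AMZN", "AMZN", "MSFT", "MSFT"],
    "Close": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "Volume": [100, 110, 200, 210, 300, 310],
}).set_index(["Symbol", "Date"]).sort_index()

# A list on the outer level: closing prices of two symbols on one date.
two_symbols = all_df.loc[(["AMZN", "GOOG"], "2020-06-01"), "Close"]

# slice(None) on the outer level: every symbol on that date.
all_symbols = all_df.loc[(slice(None), "2020-06-01"), "Close"]

# A label slice includes both endpoints: AMZN through GOOG, inclusive.
amzn_to_goog = all_df.loc[(slice("AMZN", "GOOG"), "2020-06-01"), "Close"]

# pd.IndexSlice offers a more readable spelling of the slice(None) selection.
idx = pd.IndexSlice
same_as_all = all_df.loc[idx[:, "2020-06-01"], "Close"]

print(two_symbols)
print(all_symbols)
```

Passing the column explicitly, as done here, keeps the tuple unambiguous: everything inside the tuple refers to index levels, and the second argument to `.loc` refers to columns.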
As usual, both sides of the slicer are included, since this is label-based indexing.
To wrap up, we have seen how to set a multi-level index in a pandas dataframe, which is easily done with the `set_index()` method. We have also seen how to slice a dataframe with a hierarchical index using `.loc`. In the next lecture, we are going to see how to merge tidy data.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.