Data Wrangling with Pandas
In this course, we are going to explore techniques for advanced data exploration and analysis using Python. We will focus on the Pandas library, using real-life scenarios to help you better understand how data can be processed with Pandas.
In particular, we will explore tidy datasets, multi-level indexing and its impact on real datasets, and concatenating and merging different Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Remember that we had two questions to answer. We saw how to solve the first one; now we focus on the second: can we obtain a unique data structure in which each statistical unit is identified by a unique identifier shared by all the sources? Let us recall the concatenation we previously performed.
For simplicity, I'll just paste the snippet here. This was an example of concatenation by row. However, the `pd.concat()` method allows us to concatenate not just the units of different DataFrames but, more generally, the variables of those structures. If we concatenate by columns, we get the following result: inspecting the column-wise concatenation of the three DataFrames we have in memory, stored in the variable `all_df_columns`, we get this output.
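The original snippet is not shown here, so the following is a minimal sketch of the two concatenation modes. The stock values and DataFrame names (`amzn`, `msft`) are hypothetical stand-ins for the course's real data sources:

```python
import pandas as pd

# Hypothetical stand-ins for the stock DataFrames used in the course.
amzn = pd.DataFrame({"Date": ["2021-01-04", "2021-01-05"],
                     "Close": [159.3, 160.9],
                     "Volume": [88228000, 53110000],
                     "Symbols": ["AMZN", "AMZN"]})
msft = pd.DataFrame({"Date": ["2021-01-04", "2021-01-05"],
                     "Close": [217.7, 217.9],
                     "Volume": [37130100, 23823000],
                     "Symbols": ["MSFT", "MSFT"]})

# Concatenation by row (axis=0, the default): units are stacked vertically.
all_df_rows = pd.concat([amzn, msft])

# Concatenation by column (axis=1): variables are placed side by side,
# so column names like Date, Close, and Volume appear more than once.
all_df_columns = pd.concat([amzn, msft], axis=1)
print(all_df_columns)
```

Note how the column-wise result repeats `Date`, `Close`, `Volume`, and `Symbols`, which is exactly the problem discussed next.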
The main insight from the snippet above is that there is a lot of repeated information in the output. More precisely, we have two kinds of problems. The first is noise: the information given by the `Date` variable is repeated more than once, generating noise within the data structure. The second is so-called unidentifiability: the columns `Close` and `Volume` do not have a unique relationship with the outermost index, causing identifiability problems.
Put another way, the DataFrame we are considering is not tidy at all. To prevent its creation, we can use the `verify_integrity` option of the `concat()` method: if `True`, this prevents the result from including duplicate index values. Setting `verify_integrity=True` raises a `ValueError` warning us that the indexes of the new DataFrame have overlapping values. Indeed, the traceback says that the indexes have overlapping values, and it also says which of those values overlap.
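A minimal sketch of this behavior, using the same hypothetical stand-in DataFrames as before: with `axis=1`, `verify_integrity=True` checks the concatenated columns axis for duplicate labels and fails fast when it finds them.

```python
import pandas as pd

# Hypothetical stand-ins; column names deliberately overlap.
amzn = pd.DataFrame({"Date": ["2021-01-04"], "Close": [159.3], "Volume": [88228000]})
msft = pd.DataFrame({"Date": ["2021-01-04"], "Close": [217.7], "Volume": [37130100]})

# verify_integrity=True makes pd.concat raise a ValueError if the
# concatenated axis would contain duplicate labels.
try:
    pd.concat([amzn, msft], axis=1, verify_integrity=True)
except ValueError as err:
    print(err)  # the message reports which labels overlap
```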
In other words, `Date`, `Close`, `Volume`, and `Symbols` are columns repeated across the three DataFrames. What we can do is aggregate the dataset by a unique row identifier shared identically by all the data sources: in our case, this is the column `Date`. In this way, we remove the redundancy caused by the repetition of the `Date` column within the DataFrame itself.
As for the second problem, we need to name the columns appearing in the DataFrame differently. We can write a custom method to perform this task. For simplicity, we'll paste this method here. Please pause the video and take a few seconds to digest it.
This method, called `new_agg_df`, takes a pandas DataFrame and assigns a unique identifier to each column based on the extreme index value observed. In our case, the extreme index value is the symbol itself. Hence, we append the symbol to each column name of the original DataFrame, and that is done by this operation.
OK, we can now reshape the tidy DataFrame by calling the custom method we just created on each single DataFrame object. For instance, we create three new aggregated objects, as follows. We can get an idea of the output by printing the first five rows of `amzn_agg`: the new DataFrame is indexed by `Date` and has only two columns, each named after the stock symbol we are considering, followed by an underscore and the original column name, namely `Close` and `Volume`.
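The method itself is not reproduced in this transcript, so the following is a plausible reconstruction of `new_agg_df` and its application (the function name comes from the video; its body and the sample data are assumptions):

```python
import pandas as pd

def new_agg_df(df: pd.DataFrame) -> pd.DataFrame:
    """Index the frame by Date and prefix each remaining column
    with the stock symbol, so every column name is unique."""
    symbol = df["Symbols"].iloc[0]
    out = df.set_index("Date").drop(columns="Symbols")
    out.columns = [f"{symbol}_{col}" for col in out.columns]
    return out

# Hypothetical stand-in for the Amazon stock data.
amzn = pd.DataFrame({"Date": ["2021-01-04", "2021-01-05"],
                     "Close": [159.3, 160.9],
                     "Volume": [88228000, 53110000],
                     "Symbols": ["AMZN", "AMZN"]})

amzn_agg = new_agg_df(amzn)
print(amzn_agg.head())  # columns: AMZN_Close, AMZN_Volume, indexed by Date
```

The same call would be repeated on the Microsoft and Google frames to produce `msft_agg` and `goog_agg`.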
We can now merge the three data sources into one using the pandas `merge()` method. This method extends `concat()` in a sense: with `merge()`, we generate a DataFrame from two (or more) DataFrames sharing a few columns (or indexes), whereas `concat()` generates a DataFrame that is simply the result of appending the columns of one DataFrame to another, with the drawback of repeated columns in the result.
The `merge()` function applied to two DataFrames has multiple optional arguments. For instance, consider the following example, where we merge the Amazon aggregated DataFrame with the Microsoft aggregated DataFrame. In particular, by resetting the index on both data sources, we can merge the two DataFrames on a column shared by both, namely `Date`. One of the possible arguments of `merge()` is `on`: a label or list that identifies the key(s) on which we want to merge the data sources; if we are merging two DataFrames, those keys must be found in both.
If we run this snippet, we see that we have merged the two DataFrames on the `Date` column, combining the Amazon and Microsoft `Close` and `Volume` columns. Interestingly, this function applies an inner join by default, as expressed by the `how` argument, whose default value is `"inner"`.
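A minimal sketch of this merge, using hypothetical aggregated frames shaped like the ones produced earlier (symbol-prefixed columns, indexed by `Date`):

```python
import pandas as pd

# Hypothetical aggregated frames, indexed by Date with symbol-prefixed columns.
dates = pd.Index(["2021-01-04", "2021-01-05"], name="Date")
amzn_agg = pd.DataFrame({"AMZN_Close": [159.3, 160.9],
                         "AMZN_Volume": [88228000, 53110000]}, index=dates)
msft_agg = pd.DataFrame({"MSFT_Close": [217.7, 217.9],
                         "MSFT_Volume": [37130100, 23823000]}, index=dates)

# Reset the index so Date becomes a regular column, then merge on it.
# how="inner" is the default, so only dates present in both survive.
merged = pd.merge(amzn_agg.reset_index(), msft_agg.reset_index(), on="Date")
print(merged)
```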
In general, we can merge two different DataFrames on two differently named columns. We use the arguments `left_on` and `right_on` in place of `on` when we want to merge on two different key names.
An example is given by the following syntax: we merge the Amazon aggregated dataset with the Microsoft aggregated dataset on two different key columns that, in this case, happen to have the same name. In general, however, this is a more flexible syntax for defining the merge operation.
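A sketch of the `left_on`/`right_on` variant, again on hypothetical stand-in data; here both keys are named `Date`, but they could differ:

```python
import pandas as pd

# Hypothetical aggregated frames with Date as a regular column.
amzn_agg = pd.DataFrame({"Date": ["2021-01-04", "2021-01-05"],
                         "AMZN_Close": [159.3, 160.9]})
msft_agg = pd.DataFrame({"Date": ["2021-01-04", "2021-01-05"],
                         "MSFT_Close": [217.7, 217.9]})

# left_on / right_on name the key column in each frame separately,
# which is useful when the two key columns have different names.
merged = pd.merge(amzn_agg, msft_agg, left_on="Date", right_on="Date")
print(merged)
```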
In particular, since our DataFrames share the same index, we can set both `left_index` and `right_index` to `True`: this tells pandas to match on the indexes when joining the information. This translates into the following snippet: we take the Amazon aggregated DataFrame, but this time we do not reset the index; instead, we merge directly on the index by setting `left_index` and `right_index` to `True`.
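A minimal sketch of the index-based merge on the same hypothetical frames:

```python
import pandas as pd

# Hypothetical aggregated frames sharing the same Date index.
dates = pd.Index(["2021-01-04", "2021-01-05"], name="Date")
amzn_agg = pd.DataFrame({"AMZN_Close": [159.3, 160.9]}, index=dates)
msft_agg = pd.DataFrame({"MSFT_Close": [217.7, 217.9]}, index=dates)

# Merge directly on the shared index; no reset_index() needed.
merged = pd.merge(amzn_agg, msft_agg, left_index=True, right_index=True)
print(merged)
```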
We then chain a second merge operation on the Google aggregated data object by specifying the same arguments, namely `left_index` and `right_index` equal to `True`. We store this object in the variable `all_df_agg` and print its first five rows. You can see that we now have a unique DataFrame in which each row describes a single trading day made up of different observations for the stocks under consideration, namely `Close` and `Volume`. Please check the online documentation for more details.
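The chained merge described above might be sketched as follows; the `all_df_agg` name comes from the video, while the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical aggregated frames for the three stocks, sharing a Date index.
dates = pd.Index(["2021-01-04", "2021-01-05"], name="Date")
amzn_agg = pd.DataFrame({"AMZN_Close": [159.3, 160.9],
                         "AMZN_Volume": [88228000, 53110000]}, index=dates)
msft_agg = pd.DataFrame({"MSFT_Close": [217.7, 217.9],
                         "MSFT_Volume": [37130100, 23823000]}, index=dates)
goog_agg = pd.DataFrame({"GOOG_Close": [86.4, 87.0],
                         "GOOG_Volume": [38038000, 22906000]}, index=dates)

# Chain two index-based merges: each row now describes one trading day
# with Close and Volume for all three stocks.
all_df_agg = (amzn_agg
              .merge(msft_agg, left_index=True, right_index=True)
              .merge(goog_agg, left_index=True, right_index=True))
print(all_df_agg.head())
```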
To wrap up, the take-home message is the following. If the objective is appending by rows, you should go with the `pandas.concat()` function. If, instead, you want to append by columns, the best function is `pandas.merge()`, which merges on the columns in common. Furthermore, `pandas.concat()` with `axis=1` is recommended only when the data sources are intrinsically homogeneous, that is, when the columns of each tidy dataset are self-explanatory by themselves. Therefore, you should not concatenate with `axis=1` when you have columns that are not uniquely identified within each single DataFrame.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.