Data Wrangling with Pandas
The course is part of this learning path
In this course, we are going to explore techniques to permit advanced data exploration and analysis using Python. We will focus on the Pandas library, focusing on real-life case scenarios to help you to better understand how data can be processed using Pandas.
In particular, we will explore the concept of tidy datasets, the concept of multi-index, and its impact on real datasets, and the concept of concatenating and merging different Pandas objects, with a focus on DataFrames. We’ll look at how to transform a DataFrame, and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Welcome back. In this lecture, we are going to perform advanced data reshape operations using pandas. Recall the original `stacked` dataframe that we saw in the previous lecture. In particular, we saw different techniques that allow us to combine different data sources. An example is the concat function we have replicated here. However, there might be sometimes the need to reshape an existing dataframe in terms of its columns and index.
As an example, suppose we wish to do time series analysis with the above dataframe. A better representation of the data above would be one in which date identifies individual observations as index, making each row uniquely identify a single observation. To reshape the data into this form, we use the `pivot()` method.
Pivoting is extremely useful when multiple observations for the same statistical unit are observed: an example is represented by longitudinal data in medical studies, where an individual outcome is observed repeatedly with a specific frequency, or more generally, in a time series. We could reshape the dataset in order to render it more readable and identify, wherever possible, some relationships in the data.
The pandas `.pivot()` method easily allows us to perform the aforementioned reshape operation by specifying three arguments. The first argument is the `index` argument, which represents the column of the original dataframe we wish to have as a new index. For instance, if we take all_df and we apply the pivot method, we can specify the index argument as Date.
Another important argument is the `columns` argument, which represents the column of the original dataframe we wish to have as columns to visualize the data values. In our case, we can set this as the Symbol in order to have each single stock as a column of the pivoted data frame.
Finally, an important argument is the `values` argument. It can take a list or a string, and it represents the columns in the original dataframe that we wish to have in the reshaped one. For instance, we are now just interested in pivoting the closing price , so we are gonna pick the Close column as values. After this operation we sort the index , and we store the new object in the variable pivot_close.
A simple inspection with the head method shows that the data (that are represented here) are now in a different shape: each column now represents a single stock, and each row represents a date. The values that are shown in the table represent the closing price observed in that trading date. Note that if the values argument is omitted, and the original DataFrame has more than one column - in our case Close and Volume- then the resulting pivoted DataFrame will have hierarchical columns whose outermost level indicates the respective value column.
To take an idea of this argument, let us copy and paste the snippet in cell 2 and we are not gonna specify any values argument. We call this object pivot_close_volume and we print the first 5 rows. You see that the resulting data frame has hierarchical columns, and the outermost levels are basically expressed by the columns of the original data frame. Interestingly, the same result could be obtained by applying the `unstack()` method on the input dataframe.
So a natural question is: when should I use the `unstack()`? In general, the `pivot()` method is not recommended to be applied on DataFrames that present a multi-index. You'll get better performance with a multi-level index if you use `unstack()` method. Hence, we should use the latter when we wish to pivot a level of the hierarchical row index to the column axis, producing a reshaped DataFrame with a new innermost level of columns.
Let us consider the following snippet: we create another df, which is basically the input data frame and we set the index to be made of Symbol and Date. We generate a new dataframe, and we call it unstacked_df, which is nothing more than another_df unstacked by the symbol column.
To do that, we unstack by the level of the index provided by the symbol. If we try to inspect the first five rows, we get the following result. We take out the "s" and we get this result here. The opposite of `unstack()` is `stack()`: with `stack()`, it is required to specify the column level we wish to move as a row index. Therefore, we pick the unstacked df and we stack the level that we wish to move from column to row index (in this case Symbol), and we can print the first 5 rows to see that basically, we have the symbol as the inner level of the hierarchical index.
So looking at this, can you identify another reshape here? Well, by stacking the `Symbol` from the columns as an inner-level row index, we might want to change the order of the hierarchical row index. To do so, we use the `.swaplevel()` method: this method is basically used to safely change, when dealing with multi-level index dataframes, the levels of the multi-level index dataframe. This means the following: we store into the variable stacked_df the stacked dataframe, and then we swap the symbol index with the date index . To do so, we apply to the stacked df the swaplevel, which takes 2 arguments: the level we wish to have as outermost level, in this case, Symbol, and we put that first, and then the level we wish to have as inner level, in this case, Date, and we store this object in the variable swapped_df.
So now we print the first rows of the dataset and see that now the symbol describes the outermost level in the hierarchical index, and the date is the inner level. We can improve the resulting output by sorting the hierarchical index. So we apply the sort_index function to swapped_df, and if we look at the first rows, we see that now the data frame has a much better interpretation, since amazon is not repeated as many times as it appears in the data frame, but just once.
Ok, let us now move to another pandas function: melt(). This is used to transform a dataframe from a `wide` format to a `long` format. When we say `wide` we mean a dataframe that has a rectangular shape, with a large number of column values. Instead, a `long` format is when columns' variables are reshaped as observations.
When using `melt()`, it is important to specify the argument `id_vars`, which represents the columns of the original dataframe we wish to reshape. Therefore, we apply the melt of the all_df data frame as follows: we pick the dataframe, and we apply melt, and we specify the argument id_vars, which requires a list of columns we want to reshape - in our case, symbol and date.
Please note that there is another argument, value_vars, which represents the columns of the original dataframe we wish to keep unpivoted. `value_vars` is implicitly obtained by the `id_vars`, since it equals to all columns that are not set as `id_vars`. Therefore, in this snippet, we don’t have to specify it since it is gonna be inferred by the id_vars argument. Running `melt()` on the original df has changed its format, from wide to long, as we can see here. So here we have 132 rows while before we had 66.
\From the previous output, we see that we ended up having two non-identifier columns, called `variable` and `value`. However, those two names can be easily changed by supplying the `var_name` and `value_name` arguments. Please, check the online documentation for further details.
We now move to another important concept; frequently, the output of the pivot method is not sufficient to answer to a particular business (or scientific) question. For instance, suppose your boss asks you to produce a report in which you have to provide the average price in June 2020 of a Portfolio of stocks. The pivot does not allow for aggregating numerical data.
So the question is: how can we achieve this with pandas? In principle, we can obtain the desired aggregated output by using the pivot method. But this requires a little bit of data wrangling, as follows. We pick the original data frame and we apply the pivot method to it.
Note that we set as index the Symbol column and as columns we put the Date. We then create an empty dataframe, and assign mean_pivot to the object. We then add the Symbol column to it. We also add the column average_price column by applying a lambda function that computes the mean for each row of the dataset all_df - for each stock, we compute the mean of price observed for each `Date` which represents a column variable.
Finally, we set the index as Symbol for the new data frame. The pandas function `pivot_table()` can be used to achieve this aggregated result. This is easily achievable with just one line of code using the `pivot_table()` method. This takes a few arguments. To the original dataset, all_df, we apply the pivot_table function which requires the values argument, which is a column or a list of columns to aggregate.
So in this case, the Closing price, the index, which a column or list of columns we wish to have as index of the aggregated dataframe - in this case, Symbol - and finally the aggfunc argument. So this argument is a function or a list of functions we wish to apply for aggregation. So, in this case, we are interested in the mean.
If we run this snippet, not surprisingly, we get the same output as before. However, one thing to note is that the column names of the aggregated column are different among the two outputs. Note that if the argument `values` is not specified, then the aggregation is applied on all the columns present in the original dataframe. So, in other words, we're going to apply to the dataframe all_df a pivot table and this pivot table has only the index and aggfunc specified.
Here we obtained an aggregation for both the volume and close columns that are in the original data frame. Finally, we can pass a list of aggregation functions to the pivot_table. So for instance, we pass as a list the 'mean', 'median', as well as the 'max', 'min', and the ‘std’. If we run this... You can see the snippet computes the close column, the quantities given by the 'mean', 'median', ‘max' value, ‘min' value and ‘standard deviation observed for each single stock.
We conclude this lecture with another useful pandas function: groupby. Now, by the term `group by` we refer to a process that is made of (at least) three steps. First, splitting the data into groups based on some criteria; second, applying a function to each group independently; and third, combining the results into a data structure.
The `group_by()` method in pandas is therefore similar in its objective to the `pivot_table` that we saw previously, but its logic resembles the one used in the SQL language when we use the keyword `GROUP BY` to aggregate a quantity with respect to a few identifiers. The following snippet highlights the syntax of the `group_by()` method. We take the all_df dataset and we group the observations on the Symbol.
After that, we retain just the Close column, and we take the mean of the observed values. The results are shown here. Note that the output is not a pandas dataFrame but a Series.
To transform it into a DataFrame, we apply the method `to_frame()`, as follows. The `agg()` method can be used to perform different aggregations at once. For instance, let us aggregate the results on three quantities: `'mean', 'median'` and `max`. To do that, we basically take the following snippet. So we add the agg method, and we put in mean, median, and max. Please note that now the output is of type data frame since we are aggregating by more than one function, and therefore we have more than one column.
Interestingly, we used the agg method here and we passed a list of functions. Sometimes, computing a specific quantity on a variable does not bring any useful information. Suppose, for instance, we are interested in the `mean` or `median` value for the `Close` price but, instead, only on the `max` observed for the `Volume` column. In other words, we are not interested in computing the `mean` volume, but this is of interest only for the `Close`.
Is it possible to specify this in the `agg()` method? The answer is yes, and it is achieved using a Dictionary, where each key represents the column we want to aggregate on, and the values are the aggregated functions we wish to apply on that specific key. The following snippet produces the desired output. You see that the mean and median have been computed only for the Close column, and the maximum value is retained for each single stock for just the Volume column.
To wrap up, we have seen different ways of transforming a dataframe: pivoting, melting and grouping are categories of transformations that can be easily performed in pandas. In the next lecture, we are going to extend our analytics toolkit by performing data analysis and exploration on the transformed data. See you there.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.