Data Wrangling with Pandas
In this course, we explore techniques for advanced data exploration and analysis using Python. We focus on the Pandas library and work through real-life scenarios to help you better understand how data can be processed with Pandas.
In particular, we will explore tidy datasets, multi-indexing and its impact on real datasets, and concatenating and merging different Pandas objects, with a focus on DataFrames. We'll also look at how to transform a DataFrame and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Welcome back. In this lecture, we are going to use some pandas functionality to explore the available dataset. In particular, we focus on some data visualization techniques that give us a first look at the data we are dealing with.
Pandas has a plotting API that is built on top of the matplotlib API. In particular, the plot() method on Series and DataFrame is just a simple wrapper around the well-known matplotlib plot function. We therefore import matplotlib.pyplot under the alias plt. The following snippet only prevents unnecessary warnings in the console, so you don't have to pay too much attention to it.
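The setup snippet itself is not reproduced in this transcript. A minimal sketch of what such a setup typically looks like is given below; silencing warnings through the standard-library `warnings` module is an assumption, not necessarily what the lecture uses, and the small demo Series is made up.

```python
import warnings

import matplotlib.pyplot as plt  # conventional alias for the pyplot interface
import pandas as pd

# Assumption: silence all warnings for the session, as is common in notebooks.
warnings.filterwarnings("ignore")

# Series.plot() delegates to matplotlib and returns a matplotlib Axes object,
# so the result can be customized further with the usual matplotlib calls.
ax = pd.Series([1, 3, 2], name="demo").plot()
ax.set_title("A pandas plot is a matplotlib Axes")
# plt.show()  # uncomment when running as a plain script
```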
Now, let us consider the `pivot_close` dataframe we saw in Lecture six. In that dataset, the index is given by the `Date` column, and each column represents a stock. For simplicity, I will paste here the snippet that produces that dataframe. When calling plot, the x-axis comes from the dataframe index. Also, note that it is good practice to plot dataframe objects with a single index, in this case `Date`. Hence, we plot the `pivot_close` dataframe as follows.
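The snippet from Lecture six is not included in this transcript, so the sketch below rebuilds a dataframe of the same shape from made-up prices. The tickers, the column names `Symbol` and `Close`, and every value are hypothetical placeholders, not real market data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up closing prices in long (tidy) format: one row per date and stock.
long_df = pd.DataFrame({
    "Date": pd.date_range("2020-06-01", periods=3).repeat(3),
    "Symbol": ["AMZN", "GOOG", "MSFT"] * 3,
    "Close": [2470.0, 1430.0, 182.0,
              2475.0, 1440.0, 184.0,
              2480.0, 1436.0, 185.0],
})

# Reshape to wide format: Date becomes the index, one column per stock.
pivot_close = long_df.pivot(index="Date", columns="Symbol", values="Close")

# The index supplies the x-axis; each column is drawn as its own line.
ax = pivot_close.plot(title="Closing price")
ax.set_ylabel("Close")
# plt.show()  # uncomment when running as a plain script
```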
Can you spot a potential problem here? Well, the data cannot be compared in absolute terms! Indeed, the magnitude of the closing price is not homogeneous, which means we are comparing thousands with hundreds. As a result, the Microsoft time series is squashed down towards zero, because the scale of the y-axis is determined by the stocks whose price is expressed in thousands.
It is a good habit to standardize data before performing any data analysis. There are many ways to do that. When dealing with financial data, we are typically interested in the relative change in price, so that we can easily compare stocks. We therefore create a custom function that converts the absolute prices into their corresponding relative change with respect to the first price observed. This can be easily done as follows.
We create a custom function called daily_change, which takes a single row as its argument and returns the relative change with respect to the first price observed. We then apply this function to the pivot_close dataframe and store the resulting relative prices in a new object called new_df.
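The body of daily_change is not shown in this transcript. One plausible version, assuming "relative change" means each price divided by the first price minus one, is sketched below on made-up prices. Note that `DataFrame.apply` works column by column by default, so the argument named `row` actually receives the whole price series of one stock.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up closing prices, one column per stock (not real market data).
pivot_close = pd.DataFrame(
    {"AMZN": [2470.0, 2475.0, 2480.0],
     "GOOG": [1430.0, 1440.0, 1436.0],
     "MSFT": [182.0, 184.0, 185.0]},
    index=pd.date_range("2020-06-01", periods=3, name="Date"),
)


def daily_change(row):
    """Relative change of each price with respect to the first one observed."""
    return row / row.iloc[0] - 1


# apply() passes each column to daily_change and reassembles the results.
new_df = pivot_close.apply(daily_change)

# Every series now starts at 0, so the stocks share a comparable scale.
ax = new_df.plot(title="Change relative to the first observed price")
```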
Now the interpretation is much simpler: we can see that Google is the stock that slowed down in June 2020, whereas the other two stocks showed a positive trend. This is confirmed in the following plot. We can go further and compute the daily returns using the pandas pct_change method. This method simply computes the percentage change between the current and a prior element in a dataframe. For further details, please check the online documentation.
We define a custom function called daily_returns, which takes a single argument, row, and returns the percentage change in that row using the pct_change function. The argument 1 is the number of periods to shift, so we compute the relative change between each element and the one immediately before it. We then apply the daily_returns function to the pivot_close dataframe and store the result in a new object called daily_df. Finally, we plot the new dataframe by calling the plot method on daily_df.
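The description above can be sketched as follows. The function and variable names come from the lecture, while the prices are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up closing prices, one column per stock (not real market data).
pivot_close = pd.DataFrame(
    {"AMZN": [2470.0, 2475.0, 2480.0],
     "GOOG": [1430.0, 1440.0, 1436.0],
     "MSFT": [182.0, 184.0, 185.0]},
    index=pd.date_range("2020-06-01", periods=3, name="Date"),
)


def daily_returns(row):
    """Percentage change between each element and the one just before it."""
    return row.pct_change(1)


daily_df = pivot_close.apply(daily_returns)

# The first date has no prior element, so its returns are NaN.
ax = daily_df.plot(title="Daily returns")
```

Since `pct_change` is also defined on DataFrame, `pivot_close.pct_change(1)` gives the same result without the wrapper function.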
Transforming the data gives us another interpretation. We see that the daily returns are largely driven by the same exogenous events, except for an event in the last ten days of June 2020, which affected the Google stock significantly. Hence, we see that data manipulation is crucial to get insights from the data we have. However, note that there are many ways to normalize the data, and the right technique depends strictly on the application at hand.
Now, consider the following data structure. Suppose we are interested only in the total daily volume observed. In particular, we wish to obtain a stacked series, where each layer identifies the daily volume for each stock. This is easily achievable in pandas using the plot.area method.
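The data structure shown in the video is not part of this transcript. The sketch below assumes a wide dataframe of daily volumes, here called `pivot_volume` (a hypothetical name), filled with made-up numbers.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily volumes in millions of shares (not real market data).
pivot_volume = pd.DataFrame(
    {"AMZN": [4.0, 3.5, 3.0],
     "GOOG": [1.5, 1.0, 2.0],
     "MSFT": [30.0, 28.0, 25.0]},
    index=pd.date_range("2020-06-01", periods=3, name="Date"),
)

# Area plots are stacked by default: each layer is one stock, and the top
# edge of the stack is the total volume traded on that day.
ax = pivot_volume.plot.area(title="Daily volume by stock")
ax.set_ylabel("Volume (millions)")

# The same total, computed explicitly for reference.
total_volume = pivot_volume.sum(axis=1)
```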
In principle, the above output is fine, but we can express it in a different way, using stacked bars. Each bar represents the total volume observed on a trading day, split by stock. In pandas, we use the plot.bar method with the argument stacked set to True to get a stacked-bar version of the area plot. Note that this is preferred since it is more concise and much easier to interpret.
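A minimal sketch of the stacked-bar version, again on a hypothetical `pivot_volume` dataframe with made-up volumes. Converting the dates to strings is an optional extra that is not taken from the lecture; it only keeps the tick labels short.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily volumes in millions of shares (not real market data).
pivot_volume = pd.DataFrame(
    {"AMZN": [4.0, 3.5, 3.0],
     "GOOG": [1.5, 1.0, 2.0],
     "MSFT": [30.0, 28.0, 25.0]},
    index=pd.date_range("2020-06-01", periods=3, name="Date"),
)

# Bar plots treat the index as categories, so plain date strings give
# tidier tick labels than full timestamps.
bars = pivot_volume.copy()
bars.index = bars.index.strftime("%Y-%m-%d")

# stacked=True piles the stocks on top of each other: one bar per trading
# day, whose full height is the total volume for that day.
ax = bars.plot.bar(stacked=True, title="Daily volume by stock")
ax.set_ylabel("Volume (millions)")
```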
This concludes the lecture on data visualization with Pandas.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.