Advanced Pandas Data Analytics
This course takes a look at some of the lesser-known but highly useful methods that can be used in Pandas for advanced data analytics. We'll explore the methods available to you in Pandas to make your code more efficient through evaluating expressions and conditional iterative statements.
We'll also look at methods for time series and windows operations and how these can be used for analyzing datetime objects.
This is a hands-on course that is full of real-world demonstrations in Pandas. If you want to follow along with the course, you can find everything you need in the GitHub repo below.
If you have any feedback on this course, please write to us at firstname.lastname@example.org.
- Perform iterative operations in Pandas to make your code more efficient
- Learn about evaluation expressions and how to use them
- Perform time series data analysis using a variety of methods
- Data scientists
- Anyone looking to enhance their knowledge of Pandas for data analytics
To get the most out of this course, you should already have a good understanding of handling data using Pandas. We recommend taking our Data Wrangling with Pandas course before embarking on this one.
The GitHub repository for this course can be found here: https://github.com/cloudacademy/advanced-pandas-for-data-analytics
Welcome back. In this lecture, we are going to look at methods necessary to tackle data analysis based on time series.
You should remember that we have a column called dteday in our dataframe. That column is of the object type, as we can see from a simple inspection using the info method.
When dealing with time series, it is better to have a datetime reference. On the one hand, Pandas operations involving datetime objects are more efficient, which helps the code perform complex operations faster.
On the other hand, many functions make datetime operations easy, so it is good practice to convert a date column from the object type to datetime using the Pandas to_datetime function. Here we apply pd.to_datetime to each element of the dteday column using an anonymous function, and we assign the result back to the same column.
A simple inspection with the info method confirms that dteday is now of type datetime.
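As a minimal sketch of the conversion just described, the snippet below uses a tiny made-up dataframe in place of the course dataset (the values are assumptions for illustration only). It converts dteday element by element with an anonymous function, and a comment shows the vectorized alternative.

```python
import pandas as pd

# Made-up stand-in for the course dataset: dates stored as plain strings.
df = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-03"],
    "cnt": [985, 801, 1349],
})
print(df["dteday"].dtype)  # a string/object dtype before conversion

# Element-wise conversion with an anonymous function, as in the lecture.
df["dteday"] = df["dteday"].apply(lambda x: pd.to_datetime(x))

# Vectorized alternative that avoids the Python-level loop:
# df["dteday"] = pd.to_datetime(df["dteday"])

print(df["dteday"].dtype)  # a datetime64 dtype after conversion
df.info()
```

On large columns, passing the whole column to pd.to_datetime is usually the faster of the two options.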
Also, when working with time series, it is good practice to set the datetime column as the index of our dataframe. To do that, we call the set_index method on the data, specifying the column we wish to use as the index (in our case dteday), and we perform the operation in place with inplace=True.
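A short sketch of this step, again on made-up data rather than the course dataset: once dteday is the index, rows can be selected by date labels.

```python
import pandas as pd

# Made-up data; dteday is already a datetime column here.
df = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "cnt": [985, 801, 1349],
})

# Move dteday from a regular column to the index, modifying df in place.
df.set_index("dteday", inplace=True)

print(type(df.index).__name__)  # DatetimeIndex

# Date strings can now be used for label-based slicing (both ends included).
subset = df.loc["2011-01-02":"2011-01-03"]
print(subset)
```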
Ok, now we can look at various Pandas functionalities to deal with time series.
First, we look at a method that computes the percentage change between the current and a prior element in a dataframe.
More precisely, this is done by the Pandas pct_change method, which by default computes the change relative to the previous row. This is useful for comparing relative changes across a time series.
To better understand when it is useful to apply this method, let us consider the series of registered vs unregistered users, identified by the columns registered and casual.
We want to compute the percentage change in those series, and possibly plot them.
To do so, let us build a custom function that does the job for us, and call it daily_change.
We then use it to compute, for each row, the percentage change from the previous row in the column casual. This information is useful since we might want to scale the data in a way that does not affect our analysis.
We take the casual column, apply the daily_change function to it using an anonymous function, and assign the result to the new column casual_pct of our dataframe. We do the same for the registered column and assign the result to the new registered_pct column.
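The transcript does not show the body of daily_change, so the version below is a hypothetical one: a thin wrapper around pct_change that takes a whole column. The data is made up for illustration.

```python
import pandas as pd

# Made-up daily counts indexed by date.
dates = pd.date_range("2011-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"casual": [100, 110, 99, 99], "registered": [200, 300, 150, 300]},
    index=dates,
)

def daily_change(series):
    """Hypothetical helper: change relative to the previous row."""
    return series.pct_change()

df["casual_pct"] = daily_change(df["casual"])
df["registered_pct"] = daily_change(df["registered"])
print(df)
```

Note that pct_change returns a fraction rather than a value multiplied by 100, so a rise from 100 to 110 appears as roughly 0.10. The first row is NaN because it has no previous row.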
We finally plot the two series. We use Pandas and show the result for a restricted time window by filtering observations from Jan 1st 2011 to Dec 31st 2011. We keep only the casual_pct and registered_pct columns and plot them, putting registered_pct on the secondary y axis through the secondary_y argument. We also set the mark_right argument to False, grid to True, and figsize to (10, 8), that is 10 by 8 inches.
Obviously, before running this cell we have to import matplotlib.pyplot as plt and then we show the plot.
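A sketch of the plotting call, assuming the window is the calendar year 2011 and using a few made-up rows in place of the real data. The arguments secondary_y, mark_right, grid and figsize are the ones named in the lecture.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data that straddles the start of 2011.
dates = pd.date_range("2010-12-30", periods=6, freq="D")
df = pd.DataFrame(
    {
        "casual": [100, 110, 99, 120, 90, 95],
        "registered": [200, 300, 150, 300, 310, 280],
    },
    index=dates,
)
df["casual_pct"] = df["casual"].pct_change()
df["registered_pct"] = df["registered"].pct_change()

# Keep only 2011 and the two percentage-change columns.
window = df.loc["2011-01-01":"2011-12-31", ["casual_pct", "registered_pct"]]

# registered_pct goes on the right-hand y axis; mark_right=False stops
# Pandas from appending "(right)" to its legend label.
ax = window.plot(
    secondary_y=["registered_pct"],
    mark_right=False,
    grid=True,
    figsize=(10, 8),
)
plt.show()
```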
Another useful method is diff. Unlike pct_change, this computes the first discrete difference of the elements in absolute rather than relative terms. In other words, it calculates the difference between a dataframe element and another element of the dataframe, by default the element in the previous row.
So, for instance, if we want the difference between consecutive elements in the cnt series, we just apply the diff method to that series and store the result in a new column of our dataframe called cnt_diff.
To see the results, let's show the first 10 rows of the columns cnt_diff and cnt, as follows.
As expected, the first row of cnt_diff is null (NaN), since there is no previous value to compare with. The other rows are consistent with the cnt values, so we can be satisfied with the result.
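The check can be sketched on a few made-up cnt values, small enough to verify by hand:

```python
import pandas as pd

# Made-up counts indexed by date.
df = pd.DataFrame(
    {"cnt": [985, 801, 1349, 1562]},
    index=pd.date_range("2011-01-01", periods=4, freq="D"),
)

# Difference between each row and the previous one (periods=1 by default).
df["cnt_diff"] = df["cnt"].diff()

print(df[["cnt_diff", "cnt"]].head(10))
# cnt_diff is NaN, -184.0, 548.0, 213.0
```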
Note that we don’t always have to compute the difference between the current element and the previous one. For instance, we can compute the difference between the current element and the 5th previous one by specifying the argument periods equal to 5. We store the results in the column cnt_diff_5d.
We expect the first five values of the column cnt_diff_5d to be null, and that is confirmed by a simple inspection with the head method, whereas the differences after that will be computed between the current row value and the 5th previous value.
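A small sketch of the five-period difference, with made-up values chosen so the arithmetic is easy to follow:

```python
import pandas as pd

# Made-up counts; seven rows so that two differences are defined.
df = pd.DataFrame(
    {"cnt": [10, 20, 30, 40, 50, 65, 85]},
    index=pd.date_range("2011-01-01", periods=7, freq="D"),
)

# Each value is compared with the one five rows earlier.
df["cnt_diff_5d"] = df["cnt"].diff(periods=5)

print(df.head(7))
# The first five rows of cnt_diff_5d are NaN,
# then 65 - 10 = 55.0 and 85 - 20 = 65.0.
```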
If, instead, we would like to perform this difference in the other direction (that is between the current row and the following one), we need to proceed as follows. We store the result into the column cnt_diff_following and we apply the diff method by specifying periods equal to -1.
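With a negative period the comparison looks forward instead of back. A sketch on made-up values:

```python
import pandas as pd

# Made-up counts.
df = pd.DataFrame(
    {"cnt": [10, 20, 35]},
    index=pd.date_range("2011-01-01", periods=3, freq="D"),
)

# periods=-1 subtracts the following row from the current one.
df["cnt_diff_following"] = df["cnt"].diff(periods=-1)

print(df)
# cnt_diff_following is 10 - 20 = -10.0, 20 - 35 = -15.0, then NaN.
```

Here the null value moves to the last row, because the final observation has no following row to compare with.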
Pandas contains a compact set of APIs for performing windowing operations - i.e. an operation that performs an aggregation over a sliding partition of values.
Pandas supports 4 types of windowing operations: Rolling, Weighted, Expanding, and Exponentially Weighted windows.
Here we focus on the standard rolling window, which computes a generic (fixed or variable) sliding window over the values.
The use of other windows might depend on the task you want to carry out, and they are mainly used in finance. For more details please check the official documentation.
Windows are useful to compute the moving average (MA for short) of a series of data. The moving average smooths out the investigated series, and the longer the window, the smoother the resulting series. In practice, this operation filters short-term noise out of the original series, which makes the underlying trend easier to see.
Here, we compute the moving average for the total count of observations over time, identified by the column cnt.
To do so, we use the Pandas rolling method, which requires the window argument to be specified. For instance, we might want to compute the five-day moving average. With a window of 5, each value is the average of the current observation and the four before it, so the first four rows are null.
We also do the same for 20 and 50 days as follows. Be careful to change the window argument consistently.
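The three moving averages can be sketched as follows. The data is a made-up ramp from 1 to 60, and the column names ma_5, ma_20 and ma_50 are assumptions for illustration.

```python
import pandas as pd

# Made-up series: cnt simply counts from 1 to 60, one value per day.
df = pd.DataFrame(
    {"cnt": range(1, 61)},
    index=pd.date_range("2011-01-01", periods=60, freq="D"),
)

# One moving average per window length; only the window argument changes.
for window in (5, 20, 50):
    df[f"ma_{window}"] = df["cnt"].rolling(window=window).mean()

print(df[["cnt", "ma_5", "ma_20", "ma_50"]].tail())
```

Each column stays null until a full window is available, so ma_5 has 4 leading nulls, ma_20 has 19, and ma_50 has 49.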
To get an idea of those series, let us plot them all together using the Pandas plot method. We plot 'ma_50' with figsize equal to (10, 8), and we pass the same arguments for 'ma_20' and 'ma_5' as well. Finally, we show the legend and show the plot.
Finally, it is worth mentioning that there are many functions that allow us to deal with datetime values in Pandas. Among those, let's focus on the Pandas date_range() function, which returns a DatetimeIndex of dates at a fixed frequency.
This function accepts several arguments. For instance, if you want to generate dates from Jan 1st, 2020 to Jan 10th, 2020, you just need to specify the start argument, which is the left boundary for generating dates (in our case 01/01/2020), and the end argument, which is the right boundary (in our case 01/10/2020).
We have generated a datetime index series with a daily frequency.
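A minimal sketch of this call, using the same start and end dates as the lecture:

```python
import pandas as pd

# Dates are written month first, so 01/10/2020 is Jan 10th, 2020.
idx = pd.date_range(start="01/01/2020", end="01/10/2020")

print(idx)
print(len(idx))  # 10 dates, one per day, both boundaries included
```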
If we want to generate 5 dates starting from Jan 1st, 2020 we can avoid specifying the end by setting periods=5.
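The same idea with periods in place of end can be sketched as:

```python
import pandas as pd

# Five consecutive daily dates starting on Jan 1st, 2020.
idx = pd.date_range(start="01/01/2020", periods=5)

print(idx)  # runs from 2020-01-01 to 2020-01-05
```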
If instead we want to generate 5 dates in 5 consecutive months, we specify the frequency argument, denoted by freq, which is daily by default. We change this to monthly by setting freq='M'. This alias means month end, and from pandas 2.2 onward it is spelled 'ME'.
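A sketch of the monthly case follows. Because the month-end alias is 'ME' in recent pandas releases and 'M' in older ones, the helper month_end_range (a hypothetical name, not part of pandas) tries the newer spelling first.

```python
import pandas as pd

def month_end_range(start, periods):
    """Hypothetical helper: month-end dates on both old and new pandas."""
    try:
        # Spelling used from pandas 2.2 onward.
        return pd.date_range(start=start, periods=periods, freq="ME")
    except ValueError:
        # Older releases only know the 'M' alias, as used in the lecture.
        return pd.date_range(start=start, periods=periods, freq="M")

monthly = month_end_range("2020-01-01", periods=5)
print(monthly)
# 2020-01-31, 2020-02-29, 2020-03-31, 2020-04-30, 2020-05-31
```

The dates land on the last day of each month, not on the start date itself.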
Note that multiples are allowed. The following produces 4 dates spaced a quarter apart, with periods equal to 4.
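The transcript does not show the exact frequency string, so the sketch below assumes a three-month multiple of the month-end alias. As before, quarterly_range is a hypothetical helper that falls back to the older spelling when needed.

```python
import pandas as pd

def quarterly_range(start, periods):
    """Hypothetical helper: dates three month-ends apart."""
    try:
        # '3ME' is the spelling from pandas 2.2 onward.
        return pd.date_range(start=start, periods=periods, freq="3ME")
    except ValueError:
        # Older releases use '3M'.
        return pd.date_range(start=start, periods=periods, freq="3M")

quarters = quarterly_range("2020-01-01", periods=4)
print(quarters)
# 2020-01-31, 2020-04-30, 2020-07-31, 2020-10-31
```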
Finally, there is an extra argument called closed that controls whether to include the start and end when they fall on the boundary. The default includes the boundary points on both ends. Since pandas 1.4 this argument is called inclusive, and closed was removed in pandas 2.0.
We use closed='left' to exclude the end if it falls on the boundary.
Otherwise, we use closed='right' to exclude the start if it falls on the boundary.
Here we replace the periods argument with an end argument set to Jan 10th 2020. With closed='left', the last element is removed.
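A sketch of both options, assuming pandas 1.4 or later, where the argument is named inclusive. On older releases the same values would be passed as closed.

```python
import pandas as pd

# 'left' keeps the start and drops the end boundary.
left = pd.date_range(start="2020-01-01", end="2020-01-10", inclusive="left")

# 'right' drops the start boundary and keeps the end.
right = pd.date_range(start="2020-01-01", end="2020-01-10", inclusive="right")

print(left[0].date(), left[-1].date())    # 2020-01-01 2020-01-09
print(right[0].date(), right[-1].date())  # 2020-01-02 2020-01-10
```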
Ok, that concludes this lecture.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.