Advanced Pandas Data Analytics
This course takes a look at some of the lesser-known but highly useful methods that can be used in Pandas for advanced data analytics. We'll explore the methods available to you in Pandas to make your code more efficient through evaluating expressions and conditional iterative statements.
We'll also look at methods for time series and windows operations and how these can be used for analyzing datetime objects.
This is a hands-on course that is full of real-world demonstrations in Pandas. If you want to follow along with the course, you can find everything you need in the GitHub repo below.
If you have any feedback on this course, please write to us at firstname.lastname@example.org.
- Perform iterative operations in Pandas to make your code more efficient
- Learn about evaluation expressions and how to use them
- Perform time series data analysis using a variety of methods
- Data scientists
- Anyone looking to enhance their knowledge of Pandas for data analytics
To the most out of this course, you should already have a good understanding of handling data using Pandas. We recommend taking our Data Wrangling with Pandas course before embarking on this one.
The GitHub repository for this course can be found here: https://github.com/cloudacademy/advanced-pandas-for-data-analytics
In this lecture, we are going to investigate a topic that is really important when programming with Pandas, and that’s code efficiency.
The efficiency of your codebase is crucial for many reasons. Among those, two are worth noting:
- Scaling: as soon as your dataset shape increases, the codebase must be structured so that it scales up well.
- Clean code: sometimes an instruction can be written in different ways. Pandas allows you to write complex instructions in a compact way, thereby retaining efficiency.
An example is a for loop: this conditional iterative statement can get very complicated, and hence it might suffer from efficiency problems if not written well. Think of having a very large dataset, and iterating among the rows meanwhile performing transformations of columns: that operation could get very slow if not written well. Luckily, Pandas provides methods to perform those operations efficiently.
Let’s open a Jupyter notebook like the one you see here on my screen.
We import Pandas using the standard convention, pd, and we read the data using the Pandas read_csv function, and we inspect the first five rows using the head method.
Let’s focus on the columns we see here: the total count of daily rentals is given by the column cnt. The variable dteday describes the rental date, and then we have some self-explanatory variables such as season, year, month, humidity, and windspeed.
The dummy variable workingday tells us whether it was holiday - given by zero - or a workday - given by one. The feature weathersit describes the weather situation observed in Washington DC: those values are mapped in the REAMDE available in the GitHub repo for this course.
Then we have temp and atemp which describe the normalized temperature and the normalized “feels like” temperature in Celsius, respectively.
Finally, the variables casual and registered describe the non-registered versus the registered bike-sharing users.
Now, suppose we wish to iterate over the rows of this dataframe: a way to do that is by using the Pandas.iterrows() method: this is a generator which yields a specific row, identified by the dataframe index, as a Pandas Series. Hence, it iterates over DataFrame rows yielding an index, Series pair.
For instance, if we wish to identify the rows with a total daily count greater than 8500, then we just need to write the following syntax. Firstly, we need to iterate over the dataframe using the iterrows method. Hence, we start with the for keyword followed by the variables index and row which stand for the dataframe index associated to that particular row, stored into a Pandas series denoted by row. That pair made of index and series is taken from the original data and we apply the iterrows method on that. Then, we perform an if clause to check whether the condition on the count column is satisfied. If so, we print out the dataframe index associated to that particular example: the date identified by row in position dteday and the count values.
Iterating over a dataframe is not always the best solution. It could become really inefficient as the number of rows increases, and possibly it might not be required for specific operations. For instance, suppose we wish to compute the mean temperature between the columns `temp` and `atemp`. This operation can be done in several ways.
We firstly create a custom method called get_average_temperature that performs the above operation. This has been done for you in the following snippet.
We then iterate over the rows of the dataframe and we create a new column, called `avg_temp`.
So for index, row in data.iterrows we create a new column called average temperature which is nothing more than the application of the get_average_temperature method and we pass the row and the columns, temp and atemp.
Can you spot a problem here? You may notice that the new column is not present. This is because the creation is not done inplace and therefore we need to create a new dataframe to store the result. As highlighted in the official documentation, you should never modify something you are iterating over. Because it’s not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
To overcome this problem, we now create a new method that performs the above operation for each row, but now we append this new row to a new dataset.
Let’s create my new method which requires as its argument the bike sharing data.
As I said, we perform the above operation for each row but now we append the new row to a new dataframe object called new data, which has the same structure of the orginal dataframe, and as soon as we create that new column, we append that row to the new dataframe.
We finally return the new data.
Here, we used the iterrows method to dynamically create a column inside the Pandas dataframe, called average temperature. We do, however, need to create a new dataframe to which we append the new row.
Now the question is: how long does it take to perform this operation? Is this operation efficient?
To answer this question, we use the well-known python timeit() method, which is used to test how fast your code is. In general, it is used to get the execution time taken for a small code chunk. The library runs the code statement 1 million times (by default) and provides the minimum time taken from the set. It is a useful method that helps in checking the performance of your code.
We now run the method above by passing the bike sharing data and we check the minimum time required to perform that operation. We store the result into the variable timeit_res. The syntax goes as follows. We pass the percentage symbol followed by timeit and we pass the argument output denoted by a lowercase “o” followed by the method we want to test. This might take a few moments. I will speed up the video.
Ok, we see that the minimum time to run that chunk of code was 4.45 seconds.
Although using iterrows is fine – and that is the way in which we perform iterative structure in Pandas – it’s not recommended if you’re aiming to modify the data.
Let me assign the modified data to the variable new_data, and I’ll also show the first five rows.
Hence, if the goal is to modify the starting dataset, the iterrows method is not recommended. To make the code more efficient, we use the apply method.
The Pandas apply method applies a function along an axis of the DataFrame. Objects passed to the function are Series objects the index of which is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). There are two main arguments: function, which stands for the Function that we apply to each column or row, and axis, which is the Axis along which the function is applied – the default is zero which means we apply the function to each column. We can apply the function for each row by setting this argument equal to one.
In our case, we apply the function by row, as follows. We pick the original data and we apply the get average temperature method like this: we use an anonymous function lambda and we pass get average temperature to apply with argument x – which is the row – and the temp and atemp column names, and since we want to perform this operation by row, we specify the argument axis as equal to one.
Finally, we store this new column to the data under the name average temperature apply just to highlight we are using an apply to do so.
A simple application of the timeit function shows that this method is better than a simple Pandas iterrows. To do so, we call the timeit on that chunk of code, and not surprisingly we get an outstanding result: we move from 4.5 seconds to just 8.26 milliseconds!
Before concluding this lecture, I would like to show you another method that is used to perform iterative structures in Pandas: itertuples
As highlighted in the official documenation, the iterrows method does not preserve dtypes across the rows: indeed iterrows returns a Series for each row, but dtypes are preserved across columns for DataFrames. To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns named-tuples of the values and which is generally faster than iterrows.
Do you remember the first application of the iterrows method we saw in this lecture? Well, now let us replicate it using itertuples.
To do that, instead of returning a pair of index series, we just have a series identified by row and we perform the loop using itertuples. We then use index columns (in this case cnt has index 16) and if the condition is satisfied, then we print out the index (now in index 0), the date (in index 2) and the total count.
Note that when using itertuples we do not use column names but instead their corresponding indices. Notably, itertuples is faster than iterrows. This can be checked as an exercise for anyone interested.
That concludes the lecture on conditional iterative structures in Pandas. In the next one, we are going to look at a really cool way to program in Pandas efficiently using the evaluation expression family. See you there.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.