This course takes a look at some of the lesser-known but highly useful methods that can be used in Pandas for advanced data analytics. We'll explore the methods available to you in Pandas to make your code more efficient through evaluating expressions and conditional iterative statements.
We'll also look at methods for time series and windows operations and how these can be used for analyzing datetime objects.
This is a hands-on course that is full of real-world demonstrations in Pandas. If you want to follow along with the course, you can find everything you need in the GitHub repo below.
If you have any feedback on this course, please write to us at support@cloudacademy.com.
Learning Objectives
- Perform iterative operations in Pandas to make your code more efficient
- Learn about evaluation expressions and how to use them
- Perform time series data analysis using a variety of methods
Intended Audience
- Data scientists
- Anyone looking to enhance their knowledge of Pandas for data analytics
Prerequisites
To the most out of this course, you should already have a good understanding of handling data using Pandas. We recommend taking our Data Wrangling with Pandas course before embarking on this one.
Resources
The GitHub repository for this course can be found here: https://github.com/cloudacademy/advanced-pandas-for-data-analytics
Welcome back. In this lecture, we are going to study a family of Pandas functions that are not well recognized among data scientists and analysts. This is the evaluation expression, denoted by the class eval.
Although this function is still not very popular among Data Scientists, it is worth noticing that this is used in the backend of many of the most popular Pandas functions. But what does it actually do?
Well, it is used to evaluate arithmetic expressions involving operations between columns of a given dataframe. Typically, we pass a string to the eval method, which contains the expression we wish to evaluate on our dataframe. The reason is that this method is based on the numexpr backend.
As an example, we might want to compute the average temperature per day. To do so, we can employ the eval on the data as follows.
Firstly, I have created a custom method called adding_average_temperature that basically appends to the data an average temperature column for each day.
In particular, we see that the eval method takes a string made of the expression row in position temp plus row in position atemp, and we divide that number by two. Since, by default, the parser is Pandas, this operation will be applied row-wise.
For more information, please check the official documentation.
We then call the adding_average_temperature method using the bike sharing data and assign it to the variable new_data. This might take a while, so I will speed up the video.
A simple inspection of the first five rows confirms that the new columns have been created using the eval method.
A member of this family is the query method. This is my favourite one: it is based on the `eval()` method, and it is used to filter the dataframe based on a given expression. Since the backend is still numexpr, we need to pass the expression to this method as a string.
For instance, we might want to filter out all the rows that do not match the condition `on the count column, in particular count greater than 5000. This can be easily done as follows. We take the data and we apply the query method and we pass a string condition that says count greater than 5000. It’s as simple as that!
We then assign this new object to the filtered_data variable.
We can easily check if the shape has changed. We write a couple of prints, checking the shape before and after the filtering, as follows.
We see that the query method has filtered out all the rows that do not match this condition.
The reason why I really like this method is that it allows you to pass global variables inside the expression, and this is inherited from the eval method in particular. Suppose we have a condition, and that condition is equal to 5000. We can create a new filtered data variable and we assign the call of the query method to the data and we pass the condition given by count greater than condition. We specify the @ symbol as a suffix of the condition variable.
Finally, we can print after filtering the new object and check that everything went well.
Furthermore, we can even pass multiple conditions in the same expression, as follows. Basically, we can say that the first condition, that we call condition one, is still 5000 and will be applied tp the count column, and then we have condition two that will be applied to the temp column, and we set this equal to 0.5.
Then, we create another filtered data new, call it filtered data new 01, and we require the data to be filtered based on conditions one and two, applied to cnt and temp, respectively.
If we want to check this dataframe, we inspect the first five rows. Temperature is always greater than 0.5 and count greater than 5000.
Note that we could have written the above condition using standard Pandas convention in this way: we filter data with respect to the count column greater than 5000 and temperature column greater than 0.5, using this syntax.
This performs the same operation as the one shown before with the query. Honestly, apart from the performance issue, I didn’t find this one so easy to read, especially when the number of conditions increases. With the query() method, everything gets simpler and more efficient.
We can even pass a list of elements. For instance, we can filter data related to spring and summer as follows. We define a new condition, which is nothing more than a list of two elements, number two standing for spring and three for summer, and then we create a new object called data season that is the data filtered based on the condition on the season column, and we require that the season is one of the elements in the list specified in the condition.
As a final check, we look at which elements of the season column were retained using the drop duplicates method.
That confirms that the query method did the job well for us.
I hope you have enjoyed using the query method as I normally do. That concludes this lecture. In the next one, we will look at various methods used to deal with time series.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.