Working With Pandas
Pandas is Python's ETL package for structured data. It is built on top of NumPy and designed to mimic the functionality of R data frames. It provides a convenient way to handle tabular data and supports SQL-like operations, including group-by and join. Furthermore, it's compatible with many other Data Science packages, including visualization packages such as Matplotlib and Seaborn.
In this course, we are going to explore Pandas and show you how it can be used as a powerful data manipulation tool. We'll start off by looking at arrays, queries, and dataframes, and then we'll look specifically at the groupby function before rounding off the course by looking at how the merge and join methods can be used in Pandas.
If you have any feedback relating to this course, feel free to tell us about it at email@example.com.
- Understand the fundamentals of the Pandas library in Python and how it is used to handle data
- Learn how to work with arrays, queries, and dataframes
- Learn how to use the groupby, merge, and join methods in Pandas
This course is intended for data engineers, data scientists, or anyone who wants to use the Pandas library in Python for handling data.
To get the most out of this course, you should already have a good working knowledge of Python and data visualization techniques.
So the index in a data series is, in fact, something that we can dictate, much like the keys of a dictionary. We can index by anything we want, really: dates, times, objects, whatever we feel like. So, looking at the last index we used, what I want to do is call these people patients.
So what I really want is a list of strings that say patient one, patient two, patient three, patient four, and so on. To do this, I can use a list comprehension. If I write "patient" plus the string of i, for i in the range one to ten, what this is going to give me, hopefully, is the list of strings: patient one, patient two, patient three, and so on.
Now I can re-index my heights series. I can set a new index if I want to, with index equals patients. Now, if I run this and have a look at it, we have an index of patient one, patient two, patient three, et cetera, associated with the values within the collection.
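As a sketch of the step above: the course's actual height data isn't shown, so the values below are made up, but the list comprehension and the custom index work as described.

```python
import pandas as pd
import numpy as np

# Hypothetical height values (cm); the course's real data isn't shown here.
values = [180.0, np.nan, 123.0, 126.0, 204.0, np.nan, 171.0, np.nan, 150.0]

# "patient 1" through "patient 9", built with a list comprehension.
patients = ["patient " + str(i) for i in range(1, 10)]

# A Series indexed by these patient labels instead of the default 0..8.
heights = pd.Series(values, index=patients)
print(heights)
```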
If I just want the values for patients one, two, and three, I can index with those patient strings. I could also ask for all the values between patient four and patient eight, slicing between string indices. We're just creating these labels as keys, but there is an inherent ordering within my keys.
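Label-based lookup and slicing can be sketched like this (again with made-up values); note that, unlike positional slicing, a label slice includes both endpoints.

```python
import pandas as pd

heights = pd.Series([180, 175, 160, 150, 165, 170, 155, 185, 190],
                    index=["patient " + str(i) for i in range(1, 10)])

# Select specific labels by passing a list of keys.
trio = heights[["patient 1", "patient 2", "patient 3"]]

# Slice between string indices; label slicing includes BOTH endpoints.
subset = heights["patient 4":"patient 8"]
print(subset)
```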
Let's have a look at a few of the methods that we have available to us. A simple one is sorting values, which comes in useful for various tasks. sort_values takes a parameter dictating which way I want to sort, and by default it will sort in ascending order.
If I set ascending equal to false, then that will switch the order. NaN values are not ordered, so they are just left by themselves, and we're sorting the rest: 204, 126, 123.
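A minimal sketch of the descending sort, using hypothetical values that include the 204, 126, 123 mentioned above; NaN values are not comparable, so they end up at the end of the result either way.

```python
import pandas as pd
import numpy as np

heights = pd.Series([123.0, np.nan, 204.0, 126.0, np.nan])

# ascending=False reverses the sort; NaNs are left unordered at the end.
ordered = heights.sort_values(ascending=False)
print(ordered)
```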
By default, ascending is equal to true, and this is just a quirk that you need to be aware of. Then we have value_counts, which is very useful: it tells us how many of each value we have. By default, it's not going to consider a NaN as a value.
So I need to set dropna to false; then we also get an idea of how many NaNs we have. By default, it doesn't consider them to be true values. You can use this to do something quickly, like plot with kind equals bar, and get a quick and easy bar plot of the count of each value that I have within my data series.
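The dropna behaviour can be sketched as follows, with hypothetical values; the bar plot is left commented out since it needs Matplotlib and a display.

```python
import pandas as pd
import numpy as np

heights = pd.Series([123.0, 123.0, np.nan, 204.0, np.nan, np.nan])

# Default: NaN is excluded from the counts entirely.
counts_default = heights.value_counts()

# dropna=False keeps NaN as its own category, so we can see how many are missing.
counts_all = heights.value_counts(dropna=False)
print(counts_all)

# A quick bar chart of the counts (requires Matplotlib):
# counts_all.plot(kind="bar")
```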
So Pandas integrates with Matplotlib, which we haven't looked at yet; we're going to look at that later. It integrates very well with Matplotlib, and you can plot quickly and quite easily. You can make it a lot prettier than this, but this is just an example plot here on the screen. Let's have a quick look at value counts.
So value_counts is very good because it shows the count of each value in an index. So here the index is now the values from the series before, and each entry is now the count of that value: you have seven NaNs, you have one value of 123, and so on. If we want to add a single value to our data series, I can assign to a new label, in this case, for example, patient zero, and set it equal to 100. This will add a new entry into my data series.
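Adding a single entry by assigning to a new label can be sketched like this, with made-up values:

```python
import pandas as pd

heights = pd.Series([180, 175, 160], index=["patient 1", "patient 2", "patient 3"])

# Assigning to a label not yet in the index appends a new entry at the end.
heights["patient 0"] = 100
print(heights)
```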
Good, so I've got patient zero. He gets stuck on the end by default, and I can change the ordering if I want. I can also ask how much data is missing, or which of my values are missing, with isna. isna gives me a data series of Booleans saying whether each value is or is not missing. A point of interest: if I put None into the data series as a new entry, it defaults to showing as NaN. So I have NaN and I have None. None is Python's missing-data value, whereas numpy.nan is NumPy's missing-data value, but they are seen as equivalent within Pandas.
So if I ask whether a value is NA by running the isna query, .isna, then both NaNs and Nones are considered missing. So it doesn't matter whether you have a NumPy missing value or a Python missing value; Pandas will pick up on both.
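A minimal sketch of that equivalence, with hypothetical values: both np.nan and None are flagged by isna.

```python
import pandas as pd
import numpy as np

# One NumPy missing value and one Python missing value, among real values.
s = pd.Series([100.0, np.nan, None, 150.0])

# isna() returns a Boolean Series; both np.nan and None count as missing.
mask = s.isna()
print(mask)
```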
I could run value_counts on this as well, on the trues and falses, giving me the count of those which are missing and those that aren't. So what is true plus true? It's two. So what we could do to our data series, to get a count of the missing values, is sum up all the trues, because trues are missing values and falses are not.
So we can add them up; here we get nine missing values when we add the trues together. True is one, so every time we encounter a true, we're adding one to our count, and this is a count of missing values. So heights.isna().sum() — you'll see this very often. This is just the total number of missing values.
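The counting idiom above can be sketched in one line, with made-up values:

```python
import pandas as pd
import numpy as np

heights = pd.Series([180.0, np.nan, 160.0, np.nan, np.nan])

# True sums as 1, so summing the Boolean mask counts the missing values.
n_missing = heights.isna().sum()
print(n_missing)
```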