Pandas is Python's go-to package for ETL and analysis of structured data. It is built on top of NumPy and designed to mimic the functionality of R data frames. It provides a convenient way to handle tabular data and supports many SQL-style operations, including group-by and join. Furthermore, it's compatible with many other data science packages, including visualization packages such as Matplotlib and Seaborn.
In this course, we are going to explore Pandas and show you how it can be used as a powerful data manipulation tool. We'll start off by looking at arrays, queries, and dataframes, and then we'll look specifically at the groupby function before rounding off the course by looking at how the merge and join methods can be used in Pandas.
If you have any feedback relating to this course, feel free to tell us about it at support@cloudacademy.com.
Learning Objectives
- Understand the fundamentals of the Pandas library in Python and how it is used to handle data
- Learn how to work with arrays, queries, and dataframes
- Learn how to use the groupby, merge, and join methods in Pandas
Intended Audience
This course is intended for data engineers, data scientists, or anyone who wants to use the Pandas library in Python for handling data.
Pre-requisites
To get the most out of this course, you should already have a good working knowledge of Python and data visualization techniques.
Resources
The dataset(s) used in the course can be found in the following GitHub repo: https://github.com/cloudacademy/practical-data-science-python
So we're going to have a look at Pandas. Pandas is the go-to data manipulation library for Python. It integrates very heavily with NumPy, so the operations we learned how to do in the previous course with NumPy can be carried out on Pandas objects, because under the hood Pandas stores its data as NumPy arrays. Pandas just wraps them in a slightly nicer, easier-to-use interface, with methods that cover, shall we say, more traditional data manipulation tasks. All of these things are built into data frames, whereas NumPy is very much about computation.
So to demonstrate this, I'm going to pull in some data. We need to import Pandas. The standard alias for Pandas is pd, so we import NumPy as np and Pandas as pd; these are conventions, and you'll see them in just about every program. We're going to start off by looking at data series, which are one-dimensional data manipulation objects. A Series is essentially a list with an advanced indexing object and a vast array of functionality associated with it.
Okay, so how are we going to make this data series? What I'm going to do is create a series called ds_heights. I create it by calling pd.Series, which is the Series constructor, and then I just pass in my heights NumPy array.
I'm going to call flatten on it, because a Series needs a one-dimensional array of data and my heights array is two-dimensional by default, so I just squash it down and we're all good. If this were a data frame, the two-dimensional array would be fine and we wouldn't call flatten, but because it's a Series, we do. And I'm only going to take the first 10 elements of the array; otherwise I'd be printing lots and lots of values to the screen every single time, and that's not ideal.
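As a rough sketch of what this looks like in code (the heights array here is just a stand-in for the two-dimensional NumPy array built in the previous course):

```python
import numpy as np
import pandas as pd

# Stand-in for the 2-D heights array from the previous course
heights = np.random.randint(150, 200, size=(100, 1))

# A Series wants one-dimensional data, so flatten and keep the first 10 values
ds_heights = pd.Series(heights.flatten()[:10])
ds_heights
```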
So we should end up with something that looks like this. Here we can see we have a data type associated with it, which is int32. What else do we get when we print it out? It looks exactly the same when we print it as when we simply output it like this; nothing changes.
So we've got our data type and our values, and then, by default, we have an index associated with those values. So it's more than a directly indexed list: the entries are really a collection of key-value pairs. It's not simply a list; there are keys and values.
We can have a look at what our index looks like by calling ds_heights.index. It just tells us that our index starts at zero, stops at 10, and goes in steps of one. That's the default RangeIndex object we were given when we created this. And we can pull out the values by calling ds_heights.values. What type of object do I get when I call .values? Well, I get an array. It's a NumPy array, which is what we use when we want to pull the raw data out of a data series.
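Continuing the sketch above, the index and the underlying values can be inspected like this:

```python
ds_heights.index         # RangeIndex(start=0, stop=10, step=1)
ds_heights.values        # the raw data as a NumPy array
type(ds_heights.values)  # numpy.ndarray
```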
Now, to pull out the element at index four of my data series, I use square brackets and four. And if I want the final element of my data series, I need to ask for its actual index label, which is the number nine here. Alternatively, I can get the final element out by slicing with negative one and a colon, which gives it back to me as a new data series.
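For example, with the default RangeIndex from the sketch above:

```python
ds_heights[4]     # the value stored under index label 4
ds_heights[9]     # the final value, accessed by its label
ds_heights[-1:]   # the final value, returned as a new Series via slicing
```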
So, building on this, we can also index with booleans. If I index my heights with the list of true and false multiplied by five, remember what happens when we multiply a list by a number: we end up with that list repeated five times. So true-false times five gives true false, true false, true false, true false, true false: ten alternating trues and falses.
I've popped that into the brackets for this data series, and you'll notice a pattern in the index compared with before: the index follows the two times table, because we're getting out every second element. When a data series sees a value of true for a position, it gives us that value back, and if it's false, it doesn't. So it's selectively returning values based on the booleans that I'm passing into it.
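A sketch of that boolean indexing, assuming the same ds_heights as above:

```python
bool_mask = [True, False] * 5   # ten alternating booleans
ds_heights[bool_mask]           # only the positions where the mask is True survive,
                                # so the remaining index labels are 0, 2, 4, 6, 8
```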
Now, that's the raw way of doing this; if you wanted alternating elements out, you could always slice for that as well. Similarly, I can ask a question of my data series: I could ask which of its elements are greater than 160. Asking this hasn't actually done anything to the underlying structure.
If I look at ds_heights, it's the same as it always was; it's still just the same collection of values. But when I ask which of the elements are greater than 160, for each element it decides whether that's true or false.
So we've got a Boolean data type, and associated with each of the keys we now have true or false: whether or not that value is greater than 160. It's effectively a query, and what we get back is a collection of length 10 made up of trues and falses.
So where could we stick this ds_heights > 160? Where can we put it to selectively return only those elements which are greater than 160? Well, if we want to selectively return based on a Boolean, we can put it inside the square brackets, and then I'm essentially querying the data series for those elements that are greater than my threshold, the ones which satisfy the Boolean condition.
So I'm selectively returning based on that. We can also layer more than one query: we can ask which values are greater than 160 and which are less than 200, for example. I need to surround each extra condition with parentheses. These are multiple questions I can ask my data series, and I only get back the data which evaluates as true. Note that we're preserving the index; the key-value mapping is the same. This is a very useful thing to be able to do. We call it masking because we're essentially creating a mask of trues and falses that we lay over our collection: only the things which are true show through the mask, and the things which are false don't. That's a conceptual way of thinking about it.
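In code, the query and the layered conditions might look like this (160 and 200 being the example thresholds used here):

```python
ds_heights > 160                                      # a Series of booleans, one per element
ds_heights[ds_heights > 160]                          # keep only the values above the threshold
ds_heights[(ds_heights > 160) & (ds_heights < 200)]   # each condition needs its own parentheses
```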
Now, we can work on our data series just as we would an array. I could say that those values which satisfy this condition should be multiplied by a weight of 0.967, and that operation would be performed solely on those values; then I could throw the result back into my data. So we can work on this as if it were a NumPy array.
We can take this one step further: for those values which evaluate as true, we can perform an operation, or set them to zero. We can change the actual underlying structure by performing an assignment on the values which evaluate as true. I could, as an example, nullify all of those which satisfy the condition.
So what I'm asking is that everything which satisfies these conditions be set to zero, and we can see that I've actually updated the underlying structure here. This is slightly unusual syntax, and it goes a little against traditional Python style, but it works, and when you're working with something like Pandas you'll find yourself doing this kind of thing quite a lot.
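A sketch of that assignment step, using the same illustrative weight and thresholds mentioned above:

```python
# Scale only the selected values and write the result straight back
# (note this may upcast an integer Series to float64)
ds_heights[ds_heights > 160] = ds_heights[ds_heights > 160] * 0.967

# Or nullify (here, zero out) everything that satisfies both conditions
ds_heights[(ds_heights > 160) & (ds_heights < 200)] = 0
ds_heights   # the underlying Series really has changed
```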
So I could ask a question such as: which of my values are outliers? And I could set them equal to an expected value, the mean, the population mean, or something like that. I can set them to whatever I want.
We also have methods that do similar things for us. I have ds_heights.mask, where I can specify some Boolean condition: I'm going to ask which of my elements are equal to zero, for example. And now we've got NaNs, Not a Numbers, in the result. Not a Number is NumPy's and Pandas' preferred way of dealing with missing information.
If you import a CSV file with empty cells, you'll get Not a Numbers. If you perform calculations that don't evaluate properly, because they get too big or too small, for example, Not a Numbers are going to pop up. If you want to denote missing data, the best way to do it is with Not a Numbers. Note that NumPy treats Not a Number as a floating-point value. Here, the values that were equal to zero have been converted to Not a Number.
So what mask does is take the values which satisfy the condition and, by default, set them to Not a Number. But I can specify what I actually want to change them to when they do satisfy my condition. I could say, for example, that I want them to be given by np.mean of the heights in my data series. And there we go: the mean gets thrown in for each of those entries, and it's the mean of my data series computed including the values which were zero. So that's mask, that's how mask works.
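Roughly, the mask calls described here look like this:

```python
ds_heights.mask(ds_heights == 0)                        # matching values become NaN by default
ds_heights.mask(ds_heights == 0, np.mean(ds_heights))   # or replace them with the mean
                                                        # (a mean that still includes the zeros)
```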
Now, because I haven't reassigned my variable or anything, I've not actually updated the underlying structure; all those zeros are still there. It's worth noting that there are two main ways I could update the underlying structure. I can say that ds_heights is now equal to the result of this expression; that's option number one.
Option number two is to use the inplace keyword argument. Say I want to identify all of my values which are equal to 45.3, for example, and set them back to being Not a Numbers. The inplace argument is set to False by default; setting it to True means update the underlying structure. So I can either reassign the variable, or I can pass an argument to the method I'm applying that says: actually do this to the object. While inplace is left as False, if I run this I get an output, I've masked, I've done all this stuff, but nothing has changed; the underlying structure is exactly the same as it was before.
However, if I specify inplace=True, I don't get any output, because it performs the operation on the object itself and returns nothing. But when I print the series out, I can see that it has now changed. So your two options for updating these structures are inplace=True or variable reassignment, whichever you're most comfortable with.
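The two update options side by side (45.3 is just the illustrative value from this walkthrough; it may not appear in your data):

```python
# Option 1: reassign the variable to the returned copy
ds_heights = ds_heights.mask(ds_heights == 0, np.mean(ds_heights))

# Option 2: mutate the Series itself; the call returns nothing
ds_heights.mask(ds_heights == 45.3, inplace=True)
ds_heights   # the change (if any values matched) is visible here
```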
Let's have a quick look at another method called where. Where is very good at confusing people: where does the opposite of mask. If something satisfies the condition, where will leave it alone; for anything which doesn't satisfy the condition, it will do something to it. It is literally the polar opposite of mask. So here I've got ds_heights, and the condition is that it's equal to np.nan, in which case I want to leave it alone.
Well, if I do this, then everything is going to be turned into Not a Number: anything that satisfies the condition is left alone, and anything that doesn't satisfy it gets turned into Not a Number, so the whole series comes out here as Not a Numbers. This was just to show you that it does the opposite of mask.
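A sketch of the where call, with a small caveat about comparing against NaN:

```python
# where keeps values that satisfy the condition and replaces the rest (with NaN by default)
ds_heights.where(ds_heights == np.nan)

# Caveat: NaN never compares equal to anything, including itself, so this condition is
# False everywhere and the whole Series comes back as NaN. To genuinely keep only the
# missing values you would write ds_heights.where(ds_heights.isna()) instead.
```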