Pandas is Python’s ETL package for structured data. It is built on top of NumPy and designed to mimic the functionality of R data frames. It provides a convenient way to handle tabular data and can perform many of the operations you would otherwise do in SQL, including group-by and join. Furthermore, it's compatible with many other Data Science packages, including visualization packages such as Matplotlib and Seaborn.
In this course, we are going to explore Pandas and show you how it can be used as a powerful data manipulation tool. We'll start off by looking at arrays, queries, and dataframes, and then we'll look specifically at the groupby function before rounding off the course by looking at how the merge and join methods can be used in Pandas.
If you have any feedback relating to this course, feel free to tell us about it at support@cloudacademy.com.
Learning Objectives
- Understand the fundamentals of the Pandas library in Python and how it is used to handle data
- Learn how to work with arrays, queries, and dataframes
- Learn how to use the groupby, merge, and join methods in Pandas
Intended Audience
This course is intended for data engineers, data scientists, or anyone who wants to use the Pandas library in Python for handling data.
Prerequisites
To get the most out of this course, you should already have a good working knowledge of Python and data visualization techniques.
Resources
The dataset(s) used in the course can be found in the following GitHub repo: https://github.com/cloudacademy/practical-data-science-python
Now we're going to move on to DataFrames, which are somewhat more complicated objects. We'll start by defining a simple DataFrame. So I'm creating some students and some scores, and then I'm going to create a variable called exam_results, which is given by pd.DataFrame. It's built in much the same way I constructed a Series, except I'm saying DataFrame instead.
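A minimal sketch of that construction; the names and scores below are placeholders, since the exact values used on screen aren't reproduced here:

```python
import pandas as pd

# Hypothetical students and scores -- placeholders for the values typed on screen
students = ["Alice", "Bob", "Carol", "Dan"]
scores = [72, 85, 91, 64]

# Dictionary keys become column headers; the lists become the column data
exam_results = pd.DataFrame({"student": students, "score": scores})
exam_results
```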
So instead of pd.Series, I have pd.DataFrame: key-value pairs, with the keys becoming column headers and the values going into the actual columns. If we have a look at what exam_results looks like, we see it laid out as a table, and compared to the Series we had before, that's much easier to read. That's how we would create one by hand. What I actually want to do now, though, is load in some data. We're going to load in the Titanic dataset. It's quite a commonly used dataset: it has information about the people who were on the Titanic, whether they survived or not, what class of cabin they were in, and so on.
We're going to pull this in from Seaborn. We need to run import seaborn as sns, and then titanic = sns.load_dataset('titanic').
Now Seaborn is traditionally a visualization library, and we're going to use it as a visualization library later on in this course, but it also comes bundled with a few datasets for creating example visualizations. So it's quite convenient for us to just pull this in from Seaborn. The Titanic dataset is really just a CSV.
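In code, that's just:

```python
import seaborn as sns

# seaborn bundles a handful of example datasets, including the Titanic passenger list
titanic = sns.load_dataset("titanic")
```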
So once we have this, if we look at titanic.head() we should see our data: a dataset containing various columns. The head is just the top five rows; I can ask for the first 10, or the first 100. To get the final few rows we use the opposite of head: tail. Tail gives the final five rows by default, but again we can have as many as we want. We also have sample, which gives us a single row chosen at random, or we could take a sample of 10 rows, again at random, just to see what the data actually looks like. These are all ways of peeking at our data without printing out all 800-odd rows.
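For example:

```python
titanic.head()      # first five rows (the default)
titanic.head(10)    # first ten rows
titanic.tail()      # last five rows
titanic.sample()    # one row chosen at random
titanic.sample(10)  # ten random rows
```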
Okay, I'm just doing the head here. So how would we look for missing elements? How could I find the missing elements in the data, and how could I count them? What I would use is isna and then sum. If I run titanic.isna().sum() over the DataFrame, what I get back are keys representing the columns of my dataset, and the values are the number of missing elements within each column. Two method calls, and I've already got this information.
I could use this to very quickly plot the number of missing values in my dataset. I'm going to throw in a sort_values here, with ascending equal to False, and then finish with a nice bar chart giving us an idea of how much missing data we've got within each column. And we've done that using one, two, three, four method calls chained together, because we know what data type we've got coming out at each step.
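Something like this, chaining isna, sum, sort_values, and plot (the bar plot relies on Matplotlib behind the scenes):

```python
# Number of missing values per column
titanic.isna().sum()

# Chain the same calls into a quick bar chart of missing counts,
# sorted from most missing to least
titanic.isna().sum().sort_values(ascending=False).plot(kind="bar")
```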
So in Pandas, how can we get a statistical summary of the table? Well, we'd use describe. If I call titanic.describe(), what I get is a summary of all the numeric fields.
Now I should mention we can actually describe a data series as well. If I call describe on the heights series from earlier, what I get is a numeric description, because it's numeric data: the count of elements, the mean, the standard deviation, the minimum, and so on.
So if you're a bit lazy and you can't be bothered running any of your NumPy functions to find out what the mean and standard deviation are, you could just pull the mean straight out of the describe output. It's not the recommended way of doing things, but it's another way you can do it.
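For instance, using the Titanic age column to stand in for the heights series (which came from an earlier lecture, so it isn't defined here):

```python
# Summary statistics for every numeric column
titanic.describe()

# describe also works on a single Series
titanic.loc[:, "age"].describe()

# Pulling one statistic straight out of the describe output
titanic.loc[:, "age"].describe()["mean"]
```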
We have a description of all our numeric columns here. We can go further: with titanic.describe() I can use the include parameter, and if I set include to 'all', I can include everything: my numeric columns, my categorical columns, and everything else. It's not all that easy to see what's going on here, but it shows you that you can describe everything in the same table. A lot of it will be missing, though, because some statistics make sense for some data types and not for others.
So I can instead pass in, for example, the object data type, which generally means your strings and the like. These are all of our object columns, and what we get is a count of the number of elements in each column, the number of unique categories in each, the most frequent value, and the frequency with which it occurs. So again, a very useful overview of what's going on in our dataset.
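For example:

```python
# Describe everything in one table -- numeric, object, and category columns together
titanic.describe(include="all")

# Describe only the object (string-like) columns: count, unique, top, freq
titanic.describe(include=[object])
```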
So there is in fact more we can do with titanic.describe(): let's see what else we can include. There's categorical data that we might also want to have a look at, so we can pass in the category data type to select the categorical columns. We get a similar description for these, which in this dataset means class and deck.
It's the same sort of summary as we got for the object columns. Those were string data: just object data, nominal data. Whereas these are categorical data; there's a small distinction in Pandas. We have another description method as well: titanic.info(). What titanic.info() does is tell us how many columns we have, the name of each of them, how many non-null elements each contains, and the data type of each column. We have a mix of integers, objects, floats, and categories.
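For example:

```python
# Describe only the category columns (class and deck in this dataset)
titanic.describe(include=["category"])

# Column names, non-null counts, and dtypes in one overview
titanic.info()
```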
Now there are a vast number of ways that we can read data into Pandas. I've just run a query here that lists every Pandas method involving read. These are all of the various formats that we can read, and we can extend some of them to go a little bit further. As long as your data is in a familiar format, you'll be able to read it.
So we can read things like JSON: we can import JSON directly into tables. We can read HTML, so you can actually point it at a webpage and it will try to pull in HTML-structured data. CSV is the most common one you're going to use. read_csv is quite flexible in that, although CSV stands for comma-separated values, it can in fact read any delimited file: if some files are tab-delimited or use some other separator, it will still read that structure in for you as a table. You can also specify URLs you would like to read from, so you can pull in data from online sources: give it a web address and Pandas will try its best to read the data in the format you've specified. We also have some Excel-specific reading commands. You could use read_csv for that kind of file, but with Excel you often want to specify a particular worksheet to pull data from, and Pandas will integrate with Excel for that. So it's pretty versatile, and that's without even thinking about connecting to SQL databases: you can set up connections, pass queries directly to SQL databases, and pull the data in that way.
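A sketch of the main readers; the file names, URLs, and table name below are placeholders rather than files from the course:

```python
import pandas as pd

df = pd.read_csv("data.csv")                        # comma-separated values
df = pd.read_csv("data.tsv", sep="\t")              # any delimiter via sep
df = pd.read_csv("https://example.com/data.csv")    # read straight from a URL
df = pd.read_json("data.json")                      # JSON into a table
tables = pd.read_html("https://example.com/page")   # returns a list of DataFrames
df = pd.read_excel("workbook.xlsx", sheet_name=1)   # pick a specific worksheet

# SQL needs a connection plus a query
import sqlite3
conn = sqlite3.connect("example.db")
df = pd.read_sql("SELECT * FROM some_table", conn)
```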
So we'll have a look at loc now, and at various methods of pulling data out of our DataFrame. What do we have? We've got loc and we've got iloc. These are the preferred ways of working with DataFrames and Series; they're simply the quickest way of accessing values. They're slightly odd: I want to call them methods, but really they're indexers that use square brackets to pull the data out. So put aside for a moment what you've learnt before about accessing data.
If I wanted to pull out, for example, the first element of the age column, then I can do titanic.loc[0, 'age']: the zero row of the age column. Now, Pandas will also work with more standard access notation, but I'm going to teach you the way that you should do everything. If you choose to get lazy after this, that's fine, but this is the correct way of accessing this data, so that's what we're going to look at.
If I wanted to pull out everything from the age column, then I would pass in a colon for the rows, and then 'age' for the column. This is the proper way to do it: it's the quickest and the best way, and you can be much more specific when you work like this. It only gets more complicated when you want to perform a more complex sub-query and you haven't done it the proper way.
So this way that you see on the screen now is how you should be doing it: you specify rows and columns. Now, what type of data have I got coming out here? I have a Series. An index has been generated, we've got all the values associated with that index, and we get some extra bits of information: we know the name of the series is age, we know its length, and we know that the data type is float64.
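In code:

```python
# Row label 0, column 'age' -- a single value
titanic.loc[0, "age"]

# Every row of the age column -- comes back as a Series named 'age'
titanic.loc[:, "age"]
```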
What else might we want to know? How could we get all of the information about the 10th passenger? What we'll do is go with nine: titanic.loc[9, :]. And what we get is information about that person. We can see they survived, they were in second class, they were female, age 14, and so on. We've got all the information about that person, and we get it back as a Series: the index being the column names, the values being the value for that person in each of those columns.
If I wanted to pull out more data than this, say from the fourth to the ninth passengers, then I would get this back as another DataFrame. It's a sub-DataFrame, and just as with accessing values of a Series, the index is preserved.
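For example:

```python
# All columns for the passenger at index label 9 -- a Series indexed by column name
titanic.loc[9, :]

# Rows with labels 4 through 9 inclusive, all columns -- a sub-DataFrame
titanic.loc[4:9, :]
```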
With loc, what we get is every row whose label we specified. But change this to iloc, and it drops the row at the end, because iloc means integer location: we're not talking about the labels any more, we're talking purely about position, and with positions the end point is excluded. So this one stops one row earlier, which is worth noting.
So I could then ask for the second to the seventh columns of the fourth to the ninth rows. I don't care about the names any more; the names mean nothing now, I'm just going by position. If I tried to do this with loc it wouldn't work: I'd have to name the columns instead, say everything between age and sibsp. Then I'm pulling out from age up to sibsp, and it just so happens that they sit next to each other; I didn't have to know that, but those are the columns I get.
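A sketch of the two slicing styles; the exact row and column ranges used on screen may differ slightly:

```python
# Positional slicing: rows 4-8 and columns 2-6, counting from zero
# (iloc excludes the end point of each range)
titanic.iloc[4:9, 2:7]

# The label-based equivalent names the columns, and with loc both
# end points are included
titanic.loc[4:9, "age":"sibsp"]
```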
Out of interest, sibsp in this dataset stands for the number of siblings and spouses aboard. So we can think of this as SQL SELECT to some extent. We're also able to sort values, and when you sort values you specify the column by which you want to sort; you can also specify a list of columns to sort by.
In this exercise, what I want to find out is the least expensive first-class ticket, so we're going to pull it out of the DataFrame. To warm up: titanic.loc gives me everything from titanic where the pclass column equals three, so that's all the third-class passengers.
Now we'll have a go at doing this for the least expensive first-class tickets. We'll need to query to get only the first class, sort, and then find out where those passengers embarked and whether they survived. For the least expensive, we can use sort_values.
Now we need to specify an extra something with our sort_values: we can sort the values by fare. It's going to be ascending by default, so you've got some people with a fare of zero at the top. And then we should be able to just scroll across to see where they embarked.
So, we've got the least expensive first-class ticket: the fare is zero, and where did they embark? S, which stands for Southampton. We can also sort by multiple columns: say we want fare, and then age. What this means is we're initially going to sort by fare, and then within that we're going to sort by age.
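A sketch of those queries:

```python
# All third-class passengers
titanic.loc[titanic.loc[:, "pclass"] == 3, :]

# Least expensive first-class tickets: filter to first class, then sort by fare
# (ascending by default, so the cheapest fares come first)
titanic.loc[titanic.loc[:, "pclass"] == 1, :].sort_values(by="fare").head()

# Sorting by several columns: by fare first, then by age within equal fares
titanic.sort_values(by=["fare", "age"]).head()
```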
So you can see we've got it broken down into a grouped sort. Now let's have a look at adding columns. We can add a column that's the result of a computation between existing columns.
So what could we look for here? We could create a column called cpya, which is going to be cost per year alive. It's a fairly useless calculation, but it demonstrates what we can do. So I want a cost-per-year-alive column.
Now, how do I create this? What I want to do is take the value of the fare column: give me all the rows from the fare column, and divide them by age.
So I also want titanic.loc with all the rows from the age column. It's a rather simple operation: take all the values from the fare column, divide them by the corresponding ages, row by row, and store the result in a column called cpya. I could call it cost_per_year_of_life, but cpya is a bit easier.
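The assignment looks something like this, sticking to the .loc style used throughout:

```python
# Cost per year alive: fare divided by age, row by row, stored as a new column
titanic.loc[:, "cpya"] = titanic.loc[:, "fare"] / titanic.loc[:, "age"]
```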
I'll have a look at titanic.describe() after this. What I get is information about the cost-per-year-alive column added in here. I now have 714 elements; the reason it's 714 is that some of the ages are missing. I know that the mean cost per year alive is about two pounds, or two of whatever this currency is, for every year that someone has existed. I have the standard deviation.
I have the percentiles and the maximum value. And all I had to do was name two columns and perform a calculation between them. This relies on the two columns being the same length, which of course they are, since they come from the same DataFrame. So that's how we can do things like this.
Okay, now I can also have a look at the most expensive ticket per year alive. To do that, I can use titanic.loc, working with my new column as it now exists: every row from the cpya column, then sort the values with ascending equal to False, and then take the head of that.
So I can see the most expensive ticket per year alive was 164, whatever the currency units are; I think it's pounds here. And it shows you which passenger that was. This means it was either a really expensive ticket or somebody who was really young, and we can in fact see that it was a very young person.
So if I sort the table by this column instead, we can see they were 0.92 years old, and they had to pay a lot of money for that ticket: 151 pounds. Our cpya is pushed right up when we divide by such a small age. A child in first class, but they survived.
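A sketch of both views:

```python
# Cost-per-year-alive values, largest first
titanic.loc[:, "cpya"].sort_values(ascending=False).head()

# Sorting the whole DataFrame by the same column shows the rest of the row:
# the top entry is a 0.92-year-old first-class passenger on roughly a 151-pound fare
titanic.sort_values(by="cpya", ascending=False).head()
```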
So the last thing we're going to look at now is rename. If I want to take titanic and rename my alive column so it's instead called not dead, as an example, then I can simply pass a dictionary mapping the old name to the new name. The only other thing I need to specify is axis=1, and what axis=1 means is columns: it says I'm looking for a column.
If I had axis equal to zero, I would be saying I'm looking for a row called alive that I want to rename to not dead. I could run it and it would be fine; it's just that nothing would happen to my DataFrame, and my column wouldn't update, because there is no row labelled alive. Whereas if I specify axis=1, then we should see that my alive column has now been renamed to not dead.
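Something like this (rename returns a new DataFrame by default, so we assign it back to keep the change):

```python
# axis=1 means 'look for a column called alive'; axis=0 would look for a row label
titanic = titanic.rename({"alive": "not dead"}, axis=1)
titanic.head()
```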