Working with PANDAS


Description

What is Pandas
Pandas is Python’s ETL package for structured data
Built on top of NumPy, and designed to mimic the functionality of R's data frames
Provides a convenient way to handle tabular data
Supports the main SQL-style operations, including group-by and join
Compatible with many other Data Science packages, including visualisation packages such as Matplotlib and Seaborn
Defines two main data types:
pandas.Series
pandas.DataFrame

Series
A generalised array: it can be viewed as a table with a single column
It consists of two NumPy arrays:
Index array: stores the index of each element
Values array: stores the value of each element
Each element has a unique index (ID), held in the separate index array
If we reorder the series, each index moves with its element, so an index always identifies the same element in the series
Indices do not have to be sequential; they do not even have to be numbers
Think of indices as the primary keys of the rows in a single-column table (see the sketch below)
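
A minimal sketch of these ideas (illustrative names and values, not from the slides):

import pandas as pd

# A Series pairs an index array with a values array.
heights = pd.Series([183, 170, 165], index=['alice', 'bob', 'carol'])
print(heights['bob'])          # look an element up by its index -> 170
print(heights.sort_values())   # reordering: each index moves with its element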

DataFrames
A pandas DataFrame represents a table. It contains:
Data in the form of rows and columns
Row IDs (the index array, i.e. the primary key)
Column names (the IDs of the columns)
A DataFrame is equivalent to a collection of Series, with each Series representing a column
The row indices by default start from 0 and increase by one for each subsequent row, but just as with a Series, they can be changed to any collection of objects
Each row index uniquely identifies a particular row; if we reorder the rows, their indices go with them (see the sketch below)
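
A minimal sketch (illustrative data) showing a DataFrame as a collection of Series sharing one row index:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'height': [183, 170, 165],
})
print(df.index)       # default RangeIndex: 0, 1, 2
print(df['height'])   # each column is itself a Series with the same index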

Group By
Group-bys are usually used together with reductions (see the sketch after this list)
Counting number of rows in each group
my_dataframe.groupby(criteria).size()
Sum of every numerical column in each group
my_dataframe.groupby(criteria).sum()
Mean of every numerical column in each group
my_dataframe.groupby(criteria).mean()
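
A minimal sketch of all three reductions on an illustrative dataframe:

import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B'],
    'score': [10, 20, 30],
})
print(df.groupby('team').size())           # number of rows in each group
print(df.groupby('team')['score'].sum())   # sum per group
print(df.groupby('team')['score'].mean())  # mean per group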

Join
Use DataFrame.merge() as a general method of joining two dataframes:
Also works with Series
By default, joins on the columns the two dataframes share; pass left_index=True / right_index=True to join on their indices (primary keys) instead (see the sketch below)
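
A minimal sketch with illustrative data; note that, by default, merge() joins on whichever columns the two frames share:

import pandas as pd

people = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
orders = pd.DataFrame({'id': [1, 1, 2], 'item': ['pen', 'ink', 'pad']})

# Joins on the shared 'id' column; each person matches their orders.
print(people.merge(orders, on='id'))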

Missing Values
Finding the number of missing values in each column (see the combined sketch after this list)
my_dataframe.isna().sum()
Removing rows
my_dataframe.dropna(axis = 0)
Removing columns
my_dataframe.dropna(axis = 1)
Filling with a value
For all missing values: my_dataframe.fillna(replacement_value)
Different value for each column: my_dataframe.fillna({'NAME': 'UNKNOWN', 'AGE': 0})
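
A combined sketch with illustrative data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME': ['Alice', np.nan],
                   'AGE': [30.0, np.nan],
                   'CITY': ['Leeds', 'York']})

print(df.isna().sum())      # number of missing values per column
print(df.dropna(axis=0))    # drop rows that contain a missing value
print(df.dropna(axis=1))    # drop columns that contain a missing value
print(df.fillna({'NAME': 'UNKNOWN', 'AGE': 0}))   # per-column fill values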

Map, Replace, Apply
Map applies a mapping to every element of the dataframe
my_dataframe.map({old1: new1, old2: new2, …})
my_dataframe.map(function)
If we provide map with a dictionary, then any elements not in its keys will be mapped to numpy.nan
Replace applies a mapping only to those elements of the dataframe that are mentioned in the mapping
my_dataframe.replace({old1: new1, old2: new2, …})
Any elements not in the dictionary keys are left unchanged (see the sketch below)
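
A minimal sketch of the difference, shown on a Series (recent pandas versions offer DataFrame.map with the same behaviour):

import pandas as pd

s = pd.Series(['cat', 'dog', 'fox'])

# map() pushes every element through the mapping; anything missing
# from the dictionary's keys becomes NaN.
print(s.map({'cat': 'feline', 'dog': 'canine'}))      # 'fox' -> NaN

# replace() only touches the elements mentioned in the mapping.
print(s.replace({'cat': 'feline', 'dog': 'canine'}))  # 'fox' is unchanged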

Transcript

So we're going to have a look at Pandas. Pandas is the go-to data manipulation library for Python. It integrates very heavily with NumPy, so the operations that we learned how to do in the previous course with NumPy we can carry out on Pandas objects, because Pandas will treat them as NumPy, but with a slightly nicer, easier-to-use interface. And you have methods that give you access to, shall we say, more traditional data manipulation tasks. All these things are built into data frames, whereas NumPy is very much about computation.

So to demonstrate this, I'm going to pull in some data. We need to import Pandas. The standard alias for Pandas is pd. So we import NumPy as np and import Pandas as pd; these are standards, and you'll see them in almost every program. We're going to start off by looking at data series, which are one-dimensional data manipulation objects. They're essentially lists with advanced indexing and a vast array of associated functionality.

Okay, so how are we going to make this data series? Well, I'm going to create a series called ds_heights. I create it by calling pd.Series, which is the Series constructor, and then I just pass in my heights NumPy array.

I'm going to call flatten on it, because Series will get annoyed if I don't: I need a one-dimensional array of data, and my heights array is by default two-dimensional. So if I just squash it down, then we're all good. If this were a data frame, it would be fine and we wouldn't call flatten; but because it's a series, we call flatten. And I'm only going to take the first 10 elements of the array, otherwise I'd be printing lots and lots of things onto the screen every single time, and that's not ideal.
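
The heights array itself isn't shown in the clip, so this sketch uses a stand-in two-dimensional array:

import numpy as np
import pandas as pd

# Stand-in for the lecture's heights data: a 2-D column of heights in cm.
heights = np.array([[170], [155], [160], [180], [150],
                    [200], [175], [158], [162], [190], [168]])

# flatten() squashes it to 1-D (the Series constructor rejects 2-D input),
# and [:10] keeps just the first ten elements.
ds_heights = pd.Series(heights.flatten()[:10])
print(ds_heights)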

So we should end up with something that looks like this. Here we can see we have a data type associated with it, which is int32. What else do we get when we print it out? It looks exactly the same when we print it as when we simply output it like this; nothing changes.

So we've got our data type, and we have our values. And then we have, by default, an index associated with those values. So it's more than a simple list. In fact, the values within it are a collection of key-value pairs. So it's not simply a list; there are keys and values.

We can have a look at what our index looks like by calling ds_heights.index. It just tells us that our index starts at zero, stops at 10, and goes in steps of one. That's the default RangeIndex object we were given when we created the series. And we can pull out the values by calling ds_heights.values. And what type of object do I get when I call .values? I get a NumPy array, which we use when we want to pull the hard data out of a data series.
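
A self-contained sketch with stand-in values:

import pandas as pd

ds_heights = pd.Series([170, 155, 160, 180, 150, 200, 175, 158, 162, 190])

print(ds_heights.index)         # RangeIndex(start=0, stop=10, step=1)
print(ds_heights.values)        # the raw values as a NumPy array
print(type(ds_heights.values))  # <class 'numpy.ndarray'>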

Now, to pull out the element of my data series with index four, I use brackets and four. And if I want the final element of my data series, then I need to call its actual index, which is the number nine here. I can also get the final element out by slicing with negative one and a colon; that gives me the final element of the data series as a new data series.
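
With the same stand-in values:

import pandas as pd

ds_heights = pd.Series([170, 155, 160, 180, 150, 200, 175, 158, 162, 190])

print(ds_heights[4])    # the element whose index is four -> 150
print(ds_heights[9])    # the final element, index nine -> 190
print(ds_heights[-1:])  # the final element as a new one-element Series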

So, based on this, we can also index with a list of Booleans. If I take my heights here and index by [True, False] times five: we've seen what happens when we multiply a list by a number, we end up with that list repeated five times. So if you do [True, False] times five, you get true false, true false, true false, true false, true false. So I've got ten alternating trues and falses here.

I've dropped that into the brackets for this data series. And you'll notice a pattern in the index from before: the index follows the two times table, so we're getting out every second element. When a data series sees a value of true for an index, it's going to give us that value out; and if it's false, it's not going to give us that value. So it's selectively returning based upon the Booleans that I'm passing into it.
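
As a sketch:

import pandas as pd

ds_heights = pd.Series([170, 155, 160, 180, 150, 200, 175, 158, 162, 190])

mask = [True, False] * 5    # ten alternating trues and falses
print(ds_heights[mask])     # only the True positions: indices 0, 2, 4, 6, 8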

Now, this is the raw version of doing this. If you wanted alternating elements out, you could always do it that way. Similarly, I can ask a question of my data series. I could ask which of the elements in my data series are greater than 160. It hasn't actually done anything to the underlying structure.

If I take ds_heights, it's the same as it always was; it's still just the same collection. But if I ask the question, which of the elements are greater than 160? It decides, for each one, whether it's true or false.

So we've got a Boolean data type, and we now have, associated with each of the keys, a true or false: whether the value is or is not greater than 160. So it's actually a query; what comes back is a collection of ten trues and falses.

So where could we stick this ds_heights > 160? Where can we put it to selectively return only those elements which are greater than 160? If we want to selectively return based on a Boolean, then we can put some brackets around it, and then I can essentially query the data series for those elements which satisfy the Boolean condition, the ones greater than my threshold.

So I'm selectively returning based on that. We can also layer more than one query: we can ask which are greater than 160 and which are less than 200, for example. I need to surround each extra query with brackets; I want those which are also less than 200. So I can ask my data series multiple questions, and I only get out the data which evaluates as true. And note that we're preserving the index; the key-value mapping is the same. So this is a very useful thing to be able to do. We call it masking, because we're essentially creating a mask of trues and falses that we put over our collection. Only those things which are true come out of the mask, and the things that are false don't show through. It's a conceptual way of thinking about it.
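
A sketch of the single and layered queries:

import pandas as pd

ds_heights = pd.Series([170, 155, 160, 180, 150, 200, 175, 158, 162, 190])

print(ds_heights > 160)     # a Boolean series: true/false for each key

# Layered queries: each condition needs its own brackets when combined.
print(ds_heights[(ds_heights > 160) & (ds_heights < 200)])  # index preserved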

Now, we can work on our data series as if it were an array. So I could say: for those which satisfy this condition, I want to multiply by a weight of 0.967. Then I would perform this operation solely on those values, and I could throw the result back into my data. So we can work on this as if it were a NumPy array.

We can actually take this one step further: for those values which evaluate as true, we can perform an operation, or we can set them to zero. We can change the actual underlying structure by performing an assignment operation on the values for which the condition evaluates as true. I could, as an example, nullify all of those which satisfy the condition.

So what I'm asking is that everything which satisfies these conditions be set to zero. And we can see that I've actually updated the underlying structure here. Now, this is slightly unusual Python syntax; it goes a little against traditional Python syntax, but it works. And when you're working with something like Pandas, you'll find yourself doing this kind of thing quite a lot.
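
A sketch of both operations (stand-in values again):

import pandas as pd

ds_heights = pd.Series([170.0, 155.0, 160.0, 180.0, 150.0,
                        200.0, 175.0, 158.0, 162.0, 190.0])

# Operate only on the values that satisfy the condition...
ds_heights[ds_heights > 160] = ds_heights[ds_heights > 160] * 0.967

# ...or assign directly to update the underlying structure.
ds_heights[ds_heights > 160] = 0
print(ds_heights)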

So I could ask a question: which of my values are outliers? And I could set them equal to an expected value, the mean, the population mean or something like that. I could set them to whatever I want.

We also have methods that do similar things for us. I have ds_heights.mask, where I can specify some Boolean condition. So I'm going to ask which of my elements are equal to zero, for example. And now we've got Not a Numbers in. Not a Number is NumPy's, and Pandas', preferred way of dealing with missing information.

If you import a CSV file with empty cells, you'll get Not a Numbers. If you perform calculations that don't evaluate, because they get too big or too small for example, Not a Numbers are going to pop up. If you want to denote missing data, the best way to do it is with Not a Numbers. And Not a Numbers are considered by NumPy to be floating-point values. Here, the values that were equal to zero have been converted to Not a Number.

So the purpose of mask is: for those values which satisfy the condition, set them, by default, to Not a Number. But I can specify what I actually want to change them to if they do satisfy my condition. I could say I want this to be given by np.mean of ds_heights, for example. And there we go: it throws the mean in for each of these guys here, the mean of my data series including those values which were zero. So that's mask; that's how mask works.
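
A sketch of mask with its default and an explicit replacement:

import numpy as np
import pandas as pd

ds_heights = pd.Series([170.0, 0.0, 160.0, 0.0, 150.0])

# mask(): values satisfying the condition become NaN by default...
print(ds_heights.mask(ds_heights == 0))

# ...or become an explicit replacement, here the mean (zeros included).
print(ds_heights.mask(ds_heights == 0, np.mean(ds_heights)))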

Now, because I haven't reassigned my variable or anything, I've not actually updated the underlying structure; all those zeros are still there. So it's worth noting that there are two main ways to update the underlying structure. I can say ds_heights is now equal to the result of this; that is option number one.

Option number two is to use the in-place keyword argument. Say I want to identify all my values which are equal to 45.3, for example, and set them back to being Not a Numbers. Inplace is set to false by default; I want to set inplace to true. In place means: update the underlying structure. So I can either reassign the variable, or I can specify an argument in the method I'm applying that says, actually do this to the object. With inplace left as false, if I run this, I get an output; I've masked, I've done all this stuff, but nothing has changed, the underlying structure is exactly the same as it was before.

However, if I specify inplace equals true, I don't get output, because it performs the operation on the object rather than returning anything. But when I print this out now, I've changed it. So your two options for updating these structures are inplace equals true, or variable reassignment, whichever you're most comfortable with.
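
A sketch of both options:

import pandas as pd

ds_heights = pd.Series([45.3, 160.0, 45.3])

# Option one: reassign the variable to the returned copy.
ds_heights = ds_heights.mask(ds_heights == 45.3)

# Option two: inplace=True mutates the object and returns nothing.
ds_heights = pd.Series([45.3, 160.0, 45.3])
ds_heights.mask(ds_heights == 45.3, inplace=True)
print(ds_heights)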

Let's quickly have a look at another method, called where. Where is very good at confusing people: where does the opposite of mask. If something satisfies the condition, where will leave it alone; for those which don't satisfy the condition, it will do something to them. It is literally just the polar opposite of mask. So I've got ds_heights; if an element is equal to np.NaN, then I want to leave it alone.

Well, if I do this, then everything is going to be turned into Not a Number. What where has done is leave alone whatever the condition picked out and turn everything else into Not a Number; and in fact nothing compares equal to Not a Number, not even Not a Number itself, so here the condition picks out nothing and every element is replaced. This was just to show you that where does the opposite of mask.
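
A sketch of where; the last two lines also show why the NaN-equality condition matches nothing:

import numpy as np
import pandas as pd

ds_heights = pd.Series([170.0, np.nan, 160.0])

# where(): the opposite of mask(). Elements satisfying the condition are
# left alone; everything else becomes NaN by default.
print(ds_heights.where(ds_heights > 165))   # keeps 170.0, the rest are NaN

# Nothing compares equal to NaN (not even NaN itself), so this condition
# is false everywhere and every element is replaced with NaN.
print(ds_heights.where(ds_heights == np.nan))

# isna() is the reliable way to test for missing values.
print(ds_heights.where(ds_heights.isna()))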
