Working With Pandas

Part of the learning path: Practical Data Science with Python
What is Pandas?
Pandas is Python's ETL package for structured data
Built on top of NumPy, designed to mimic the functionality of R dataframes
Provides a convenient way to handle tabular data
Can perform all SQL functionality, including group-by and join
Compatible with many other data science packages, including visualisation packages such as Matplotlib and Seaborn
Defines two main data types: Series and DataFrame

Series
A generalised array --- can be viewed as a table with a single column
It consists of two numpy arrays:
Index array: stores the index of each element
Values array: stores the value of each element
Each element has a unique index (ID), contained in the separate index array
If we reorder the series, each index moves with its element, so an index always identifies the same element in the series
Indices do not have to be sequential; they do not even have to be numbers
Think of indices as the primary keys for each row of a single-column table
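The points above can be sketched on a small Series (the names and heights here are invented for illustration):

```python
import pandas as pd

# A Series is two aligned arrays: an index array and a values array.
# Indices need not be sequential or even numeric.
heights = pd.Series([1.75, 1.62, 1.80], index=["alice", "bob", "carol"])

print(heights.index.tolist())   # the index array
print(heights.values.tolist())  # the values array

# Reordering the series: each index stays attached to its element.
reordered = heights.sort_values()
print(reordered.index.tolist())
```

Note that after sorting, "bob" still identifies 1.62 even though the row order has changed.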

DataFrame
A pandas DataFrame represents a table. It contains:
Data in the form of rows and columns
Row IDs (the index array, i.e. the primary key)
Column names (the IDs of the columns)
A DataFrame is equivalent to a collection of Series, with each Series representing a column
The row indices by default start from 0 and increase by one for each subsequent row, but just like in a Series they can be changed to any collection of objects
Each row index uniquely identifies a particular row. If we reorder the rows, their indices go with them
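A minimal sketch of these ideas (the column names and row IDs are made up for illustration):

```python
import pandas as pd

# A DataFrame is a collection of Series sharing one row index.
df = pd.DataFrame({"name": ["Ann", "Ben", "Cal"], "age": [34, 28, 41]})
print(df.index.tolist())    # default index: [0, 1, 2]
print(df.columns.tolist())  # column names: ['name', 'age']

# The index can be changed to any collection of objects, e.g. string IDs.
df.index = ["r1", "r2", "r3"]

# Each column is itself a Series carrying the same index.
ages = df["age"]
print(ages.loc["r2"])       # 28
```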

Group By
Grouping is usually combined with a reduction applied to each group:
Counting the number of rows in each group
Summing every numerical column in each group
Taking the mean of every numerical column in each group
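These three reductions might look like the following sketch (the team/score columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "score": [10, 20, 5, 15, 10],
})

counts = df.groupby("team").size()         # number of rows in each group
sums = df.groupby("team")["score"].sum()   # sum per group
means = df.groupby("team")["score"].mean() # mean per group
```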

Joining
Use DataFrame.merge() as a general method of joining two dataframes:
Also works with (named) Series
By default joins on the columns the two dataframes share; pass left_index=True / right_index=True to join on the indices (primary keys) instead
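A small sketch of both styles of join (the people/scores frames are invented for illustration):

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
scores = pd.DataFrame({"id": [1, 2, 4], "score": [90, 75, 60]})

# merge() inner-joins by default on the columns both frames share ('id' here).
inner = people.merge(scores)

# To join on the row indices instead, set 'id' as the index first;
# join() performs a left join on the indices.
joined = people.set_index("id").join(scores.set_index("id"))
```

The inner join keeps only ids present in both frames; the index join keeps every row of the left frame, filling missing scores with NaN.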

Missing Values
Finding the number of missing values in each column: my_dataframe.isna().sum()
Removing rows:
my_dataframe.dropna(axis=0)
Removing columns:
my_dataframe.dropna(axis=1)
Filling with a value:
For all missing values: my_dataframe.fillna(replacement_value)
A different value for each column: my_dataframe.fillna({'NAME': 'UNKNOWN', 'AGE': 0})
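A sketch putting these calls together (the NAME/AGE data is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NAME": ["Ann", np.nan, "Cal"],
    "AGE": [34, 28, np.nan],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Drop rows / columns containing any NaN (each returns a new frame).
no_missing_rows = df.dropna(axis=0)
no_missing_cols = df.dropna(axis=1)

# Fill per column with a dict mapping column name to replacement.
filled = df.fillna({"NAME": "UNKNOWN", "AGE": 0})
```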

Map, Replace, Apply
map applies a mapping to every element of a series: my_series.map({old1: new1, old2: new2, …})
If the mapping is a dictionary, any element not among its keys is mapped to numpy.nan
replace applies a mapping only to the elements mentioned in it:
my_dataframe.replace({old1: new1, old2: new2, …})
Any element not among the dictionary keys is left unchanged
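The contrast can be sketched on a small Series (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series(["yes", "no", "maybe"])

# map: every element passes through the mapping;
# anything missing from the dict becomes NaN.
mapped = s.map({"yes": 1, "no": 0})

# replace: only the mentioned elements change; the rest stay as they are.
replaced = s.replace({"yes": 1, "no": 0})
```

Here "maybe" becomes NaN under map but survives untouched under replace.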


So there are other ways I can specify transformations I may want to perform on a series. We can have a look at a few of those now. The simplest mapping is called replace and I am going to show you how replace works.

So, we run replace. I want to replace all of the np.NaNs with a zero, and I want to replace all of the 123s, so one two three point zero, with, let's say, 200, for example.

So I'll run this and we should see that all of my missing data, including the NaNs, has been turned to zero, and my value of 123 has been turned into the number 200. So what have I passed in as the object here? I've passed in a dictionary: the key is what replace is looking for and the value is what it's going to replace it with. So you could build up an enormous dictionary of transformations you want to perform on your data, or map things between one value and another, and just run replace on it and it will map all of the values. It's not dynamic; it's just looking for specific values and changing them to specific other ones.
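The call walked through above might look like this sketch (the heights series and its exact values are assumed for illustration, not taken from the actual notebook):

```python
import numpy as np
import pandas as pd

heights = pd.Series([1.75, np.nan, 123.0, 1.62])

# Keys are the values replace looks for; values are what they become.
cleaned = heights.replace({np.nan: 0, 123.0: 200})
```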

We also have something called map. I can use map to perform tasks using functions that I have defined. A simple one would be round. Let's multiply all of my heights by point nine five, so I'm going to get a few floating point numbers, and I'm going to reassign this here, so now I've got some floating point numbers.

If I wanted to round these according to some criteria, I can define a function which will do that for me. If I want to define my_round, there's a round function built in, so I'll take in a value, run round on it, rounding X to two decimal places, for example, and then return that. So the way the function works is: if I run my_round and pass in one point one one one one one nine nine nine, what I get back is that value rounded to two decimal places, and that's simple enough.

If I wanted to pass this function over my data series, that's when I would use map. I would say I want to take my heights data series. I would like to map over it the function round, my round, sorry. I don't need to have brackets here because I'm mapping the function over it. I just pass a reference to this function and then my data series will have its values rounded appropriately. This is passing a function, element-wise, over a data series.
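Putting the last few steps together, a sketch (the heights values are assumed for illustration):

```python
import pandas as pd

# Multiply the heights to produce some long floating point numbers.
heights = pd.Series([1.75, 1.62, 1.80]) * 0.95

def my_round(x):
    # Round a single value to two decimal places.
    return round(x, 2)

# Pass the function itself, no brackets: map calls it element-wise.
rounded = heights.map(my_round)
```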

I could map something like abs, which would take the absolute value of all these things. I can do something silly here like map print over it: what that will do is print out each element, and then every element will become None, because print returns nothing. Any function which takes a single argument we can map over this, np dot square root, say; I can map that over it, taking the square root of all my data.

Now, this is a good moment to have a look at lambdas. If you want to have a unique transformation, a simple one line bit of code that you want to run, you could use something like a lambda for that. What I mean by lambda is simply a function that I don't define to have a name. Lambdas are the nameless functions.

If I wanted to .map this using a lambda, I'm having one value go in and what I want to do with that one value is I want to round it to two decimal places. This line of code is equivalent to mapping the my round function over it. This is the exact same thing. So we have lambda. We have what we're going to call a variable going in and what we want to do to that variable, what we want to return, generally small one line functions that we can just create, carry out and then let vanish. We could make this complicated if we like, but it's not really in the vein of lambda, so it's good to know that we have lambdas available to us.
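The lambda version of the same mapping, as a sketch (values assumed for illustration):

```python
import pandas as pd

heights = pd.Series([1.6625, 1.539, 1.71])

# A lambda is a nameless one-line function: equivalent to mapping
# a named my_round function over the series.
rounded = heights.map(lambda x: round(x, 2))
```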

