Using GROUP BY

Description

What is Pandas
Pandas is Python’s ETL package for structured data
Built on top of numpy, designed to mimic the functionality of R dataframes
Provides a convenient way to handle tabular data
Can perform the most common SQL-style operations, including group-by and join.
Compatible with many other Data Science packages, including visualisation packages such as Matplotlib and Seaborn
Defines two main data types:
pandas.Series
pandas.DataFrame

Series
A generalised array that can be viewed as a table with a single column
It consists of two numpy arrays:
Index array: stores the index of the elements
Values array: stores the values of the elements
Each array element has a unique index (ID), contained in the separate index array
If we reorder the series, each index moves with its element, so an index always identifies the same element in the series
Indices do not have to be sequential; they do not even have to be numbers.
Think of indices as the primary keys for the rows of a single-column table
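
A minimal sketch of a Series and its two arrays; the labels and values below are made up purely for illustration:

    import pandas as pd

    ages = pd.Series([58, 22, 35], index=['alice', 'bob', 'carol'])
    print(ages.index)    # the index array: Index(['alice', 'bob', 'carol'], dtype='object')
    print(ages.values)   # the values array: array([58, 22, 35])

    # Reordering the series keeps each label attached to its value
    print(ages.sort_values())   # 'bob' (22) and 'carol' (35) now come before 'alice' (58)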

DataFrames
A pandas DataFrame represents a table; it contains:
Data in the form of rows and columns
Row IDs (the index array, i.e. primary key)
Column names (ID of the columns)
A DataFrame is equivalent to a collection of Series, with each Series representing a column
The row indices by default start from 0 and increase by one for each subsequent row, but just like Series they can be changed to any collection of objects
Each row index uniquely identifies a particular row. If we reorder the rows, their indices go with them
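
A small sketch of a DataFrame, its row index, and its columns; the data is invented for illustration:

    import pandas as pd

    people = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol'],
                           'age': [22, 35, 58]})
    print(people.index)           # RangeIndex(start=0, stop=3, step=1) by default
    print(people.columns)         # Index(['name', 'age'], dtype='object')
    print(type(people['age']))    # each column is a pandas.Series

    # The default row index can be replaced by any collection of unique labels
    people.index = ['a', 'b', 'c']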

Group By
Group-by is usually used together with a reduction:
Counting number of rows in each group
my_dataframe.groupby(criteria).size()
Sum of every numerical column in each group
my_dataframe.groupby(criteria).sum()
Mean of every numerical column in each group
my_dataframe.groupby(criteria).mean()
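
A quick sketch of these reductions on a made-up DataFrame; the 'city' and 'sales' columns are illustrative only:

    import pandas as pd

    df = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Paris', 'Paris'],
                       'sales': [10, 20, 5, 7, 9]})
    print(df.groupby('city').size())    # number of rows in each group
    print(df.groupby('city').sum())     # sum of every numerical column per group
    print(df.groupby('city').mean())    # mean of every numerical column per group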

Join
Use DataFrame.merge() as a general method of joining two dataframes:
Also works with Series
By default joins on the columns the two DataFrames share; set left_index/right_index to join on their indices (primary keys) instead
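
A sketch of both styles of join with merge(), using invented customer and order tables:

    import pandas as pd

    customers = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['Ann', 'Ben', 'Cal']})
    orders = pd.DataFrame({'cust_id': [1, 1, 3], 'amount': [9.5, 3.0, 12.0]})

    # Join on the shared 'cust_id' column (an inner join by default)
    print(customers.merge(orders, on='cust_id'))

    # Or join on the row indices (the "primary keys") instead
    by_index = customers.set_index('cust_id').merge(
        orders.set_index('cust_id'), left_index=True, right_index=True)
    print(by_index)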

Missing Values
Finding the number of missing values in each column
my_dataframe.isna().sum()
Removing rows
my_dataframe.dropna(axis=0)
Removing columns
my_dataframe.dropna(axis=1)
Filling with a value
For all missing values: my_dataframe.fillna(replacement_value)
Different value for each column: my_dataframe.fillna({'NAME': 'UNKNOWN', 'AGE': 0})
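
A short sketch of these operations on a toy DataFrame with some missing values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'NAME': ['Ann', np.nan, 'Cal'],
                       'AGE': [22, 35, np.nan]})
    print(df.isna().sum())                           # missing values per column
    print(df.dropna(axis=0))                         # drop rows containing a missing value
    print(df.dropna(axis=1))                         # drop columns containing a missing value
    print(df.fillna({'NAME': 'UNKNOWN', 'AGE': 0}))  # a different fill value per column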

Map, Replace, Apply
Map applies a mapping to every element of a Series (a single column)
my_series.map({old1: new1, old2: new2, …})
my_series.map(function)
If we provide the mapping as a dictionary, any elements whose values are not among its keys are mapped to numpy.nan
Replace applies a mapping only to those elements of the dataframe that are mentioned in the mapping
my_dataframe.replace({old1: new1, old2: new2, …})
Any elements not in the dictionary keys will not be changed
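
A sketch contrasting map and replace on a small example Series (values are invented):

    import pandas as pd

    s = pd.Series(['cat', 'dog', 'bird'])
    print(s.map({'cat': 'feline', 'dog': 'canine'}))      # 'bird' becomes NaN
    print(s.replace({'cat': 'feline', 'dog': 'canine'}))  # 'bird' is left unchanged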

Transcript

In this section, we're going to have a look at GROUP BY. So within the Titanic dataset, we can group by using a method called, funnily enough, groupby. And that's 'group by' written without a hyphen in it. When we want to group by something, we choose a categorical column. If I group by class, then what you get out is this thing called a DataFrameGroupBy object.

Now, the reason we get this object is that Python and pandas don't really know what to do with a collection of data frames. They don't really know which one to put out. So you've got to perform a summary operation to be able to get a summary of your groups, for example. So I'm just going to run the count operation. It's one of the simpler ones that we have.
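
A sketch of what's being run here, assuming the Titanic data is loaded from seaborn (the 'class' column name comes from that version of the dataset):

    import seaborn as sns

    titanic = sns.load_dataset('titanic')   # assumed source of the Titanic data
    grouped = titanic.groupby('class')      # a DataFrameGroupBy object, not a DataFrame
    print(grouped.count())                  # non-null entries per column, for each class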

So what we've done is we've grouped by class, and class has the values first, second, and third. Then what we're doing is counting the number of entries we have for those values in each column. We can perform any summary operation, anything that's going to result in a numeric attribute, so we can perform something like the mean. Now, it will only compute the mean for numeric columns, and some of our columns are numeric.

So we're getting the mean of the passenger class, because that's a numeric column (so technically the mean of passenger class is two), and we're getting the mean of survived, because that is binary, one or zero. When we take the mean of something binary, we get an idea of the rough proportion of the people that did or did not survive. A lot of people in first class tended to survive in comparison with the other classes here.
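
Roughly what that looks like in code, again assuming seaborn's Titanic data and its column names ('pclass', 'survived', 'fare'):

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # 'survived' is 0/1, so its mean per class is roughly the survival rate of that class
    print(titanic.groupby('class')[['pclass', 'survived', 'fare']].mean())
    print(titanic.groupby('class')[['pclass', 'survived', 'fare']].std())   # any reduction works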

So any operation would work here; we can have, for example, the standard deviation across all our classes as well, for all of the numeric columns. So it's a good thing to be able to do. And what else might we want to do? Well, we might want to group by more than one column, grouping by multiple keys. So I'm grouping by class first, and then by whether or not people survived. And I could do a count of these values as well.

So what we can see here is that we've got a lot of first class people who survived, and then a lot of third class people who did not survive. And there's a smaller number of third class people who did survive in comparison with those in first class. We can realistically group by as many of these as we want.
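
A sketch of grouping by two keys at once, with the same assumptions about the dataset's column names:

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # Group by class first, then by whether people survived, and count the rows in each subgroup
    print(titanic.groupby(['class', 'survived']).size())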

So what we should notice is that, when we have a look at this table, it's highlighted appropriately as to which grouping each row is part of. We can add more and more of these as we want to. We've got individual segments for everything, and we can perform calculations over the various axes in this data frame.

So I'm going to call this grouped_df. If I want to access elements of it, it's generally going to be best to use .loc. I want every column for the first class people who didn't survive and who embarked at C, which I believe is Cherbourg in France. To do that, I've got to work with something called a multi-index object. All that means is that I've got to specify each individual path that I want to go down. And if I leave one of the levels off, then I will get, for example, all the first class people who died, broken down across the various ports at which they embarked, because the level I've left off is the one where I've grouped by where the people embarked.
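
A sketch of indexing the grouped result with .loc, assuming the grouping keys are 'class', 'survived' and 'embarked' as in seaborn's Titanic data:

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # Three grouping keys give a three-level MultiIndex on the rows
    grouped_df = titanic.groupby(['class', 'survived', 'embarked']).count()

    # One full path through the MultiIndex: first class, did not survive (0), embarked at 'C'
    print(grouped_df.loc[('First', 0, 'C')])

    # Leaving the last level off returns that slice for every port of embarkation
    print(grouped_df.loc[('First', 0)])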

I've taken a sort of subsection of the data. I have jumped the gun a little bit and gone more complicated than I normally would. So now I just want to show you the simple way of obtaining your group's data. You don't have to perform a summary operation. I could just group by something like class and then, once I've got this grouped object, use the get_group method and pass in the class I want.

So these were the first class people; changing the argument gives me all the second class people, and so on and so forth. So I can pull out individual groups. I could run a lambda, for example; a simple lambda would be x % 2 == 0, which checks whether the count of people in each of the boxes is even. But I could do something a bit more complicated, such as min-max scaling. That's an example of how I could do this using a one-line lambda function, where I take every value away from the minimum value in the column and then divide by the max minus the min for each column as well.
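
A sketch of get_group and the two lambda examples, with seaborn's column names assumed; here the min-max scaling is done per group with transform:

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    by_class = titanic.groupby('class')
    print(by_class.get_group('First'))                 # the raw rows of a single group

    # A simple lambda over a reduction: is the row count of each group even?
    print(by_class.size().apply(lambda n: n % 2 == 0))

    # Min-max scaling of some numeric columns within each group, as a one-line lambda
    scaled = titanic[['age', 'fare']].groupby(titanic['class']).transform(
        lambda col: (col - col.min()) / (col.max() - col.min()))
    print(scaled.head())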

So it's a form of scaling our values here, and I think I would actually want to do this without counting. Now, we can group by our own boolean conditions as well. I could group titanic by only those rows for which the fare is greater than 70, and then I can get the count of these, for example. And this gives me the total count of entries in each column, split into those whose fare was greater than 70 and those whose fare was not.
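
A sketch of grouping by a boolean condition, again with the seaborn column names assumed:

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # Group by a condition rather than by an existing categorical column
    expensive = titanic['fare'] > 70
    print(titanic.groupby(expensive).count())   # one group for True, one for False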

So this is me specifying my own condition which I want to group by, so I'm not just restricted to, say, categorical columns. Anything that evaluates as a boolean condition, I can group by. I can also map a function over a data frame and have ten groups based upon which bucket I want to put people in.
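
One plausible way to get ten bucket-based groups; the transcript doesn't show the exact function used, so the pd.cut bucketing below is an assumption:

    import pandas as pd
    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # Cut the fares into ten equal-width buckets, then group by the bucket each row falls in
    fare_bucket = pd.cut(titanic['fare'], bins=10)
    print(titanic.groupby(fare_bucket).size())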
