Arrays

Description

What is Pandas
Pandas is Python’s ETL package for structured data
Built on top of NumPy and designed to mimic the functionality of R data frames
Provides a convenient way to handle tabular data
Supports the core SQL-style operations, including group-by and join
Compatible with many other Data Science packages, including visualisation packages such as Matplotlib and Seaborn
Defines two main data types:
pandas.Series
pandas.DataFrame

Series
Generalised array: can be viewed as a table with a single column
It consists of two NumPy arrays:
Index array: stores the index of the elements
Values array: stores the values of the elements
Each element has a unique index (ID), held in the separate index array
If we reorder the Series, each index moves with its element, so an index always identifies the same element in the Series
Indices do not have to be sequential; they do not even have to be numbers
Think of indices as the primary keys for the rows of a single-column table (see the sketch below)
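A minimal sketch of these ideas; the values and patient labels here are invented for illustration:
import pandas as pd
# A Series with a custom, non-sequential index
heights = pd.Series([180, 165, 172], index=["patient 1", "patient 2", "patient 3"])
print(heights.index)           # the index array
print(heights.values)          # the underlying NumPy values array
print(heights["patient 2"])    # look up an element by its index (its "primary key")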

DataFrames
A pandas DataFrame represents a table; it contains:
Data in the form of rows and columns
Row IDs (the index array, i.e. the primary key)
Column names (the IDs of the columns)
A DataFrame is equivalent to a collection of Series, with each Series representing a column
The row indices by default start from 0 and increase by one for each subsequent row, but just as with a Series they can be changed to any collection of objects
Each row index uniquely identifies a particular row; if we reorder the rows, their indices go with them (see the sketch below)
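A minimal sketch, with made-up column names and row IDs:
import pandas as pd
# A DataFrame: rows and columns, plus row IDs and column names
people = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "age": [34, 29, 41]},
    index=["p1", "p2", "p3"],     # row IDs (the "primary key")
)
print(people.columns)    # column names
print(people.index)      # row index
print(people["age"])     # a single column is a Series
print(people.loc["p2"])  # a single row, selected by its index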

Group By
Groups are usually used together with reductions, for example (see the sketch after this list):
Counting number of rows in each group
my_dataframe.groupby(criteria).size()
Sum of every numerical column in each group
my_dataframe.groupby(criteria).sum()
Mean of every numerical column in each group
my_dataframe.groupby(criteria).mean()
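A small illustrative example; the column names and data are assumptions for the sketch:
import pandas as pd
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "amount": [10, 20, 30, 40],
})
print(sales.groupby("region").size())                  # number of rows in each group
print(sales.groupby("region").sum(numeric_only=True))  # sum of every numerical column per group
print(sales.groupby("region").mean(numeric_only=True)) # mean of every numerical column per group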

Join
Use DataFrame.merge() as a general method of joining two dataframes:
Also works with Series
By default it joins on the columns common to both DataFrames; pass left_index=True / right_index=True to join on their indices (the primary keys) instead (see the sketch below)
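A minimal sketch of a merge; the table and column names here are assumptions:
import pandas as pd
patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Ann", "Ben", "Cat"]})
visits = pd.DataFrame({"patient_id": [1, 1, 3], "visit_date": ["2021-01-04", "2021-02-11", "2021-03-02"]})
# Inner join on the shared patient_id column (the default behaviour of merge)
print(patients.merge(visits, on="patient_id"))
# To join on the indices (the "primary keys") instead:
print(patients.set_index("patient_id").merge(visits.set_index("patient_id"),
                                             left_index=True, right_index=True))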

Missing Values
Finding out the number of missing values in each column (a combined sketch follows this list)
my_dataframe.isna().sum()
Removing rows
my_dataframe.dropna(axis = 0)
Removing columns
my_dataframe.dropna(axis = 1)
Filling with a value
For all missing values: my_dataframe.fillna(replacement_value)
Different value for each column: my_dataframe.fillna({'NAME': 'UNKNOWN', 'AGE': 0})
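A combined sketch of these calls on a tiny made-up DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({"NAME": ["Ann", np.nan, "Cat"], "AGE": [34, 29, np.nan]})
print(df.isna().sum())                          # number of missing values in each column
print(df.dropna(axis=0))                        # drop rows that contain a missing value
print(df.dropna(axis=1))                        # drop columns that contain a missing value
print(df.fillna({"NAME": "UNKNOWN", "AGE": 0})) # a different fill value for each column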

Map, Replace, Apply
Map applies a mapping to every element of a Series (element-wise mapping of a whole DataFrame takes a function, via DataFrame.map, formerly applymap)
my_series.map({old1: new1, old2: new2, …})
my_series.map(function)
If we give map a dictionary, then any elements not in the keys will be mapped to numpy.nan
Replace applies a mapping only to those elements of the DataFrame (or Series) that are mentioned in the mapping
my_dataframe.replace({old1: new1, old2: new2, …})
Any elements not in the dictionary keys will not be changed (see the sketch below)
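A short sketch contrasting the two; the example values are invented:
import pandas as pd
colours = pd.Series(["red", "green", "blue"])
# map with a dictionary: values missing from the keys become NaN
print(colours.map({"red": "warm", "blue": "cool"}))      # "green" -> NaN
# replace: values missing from the keys are left unchanged
print(colours.replace({"red": "warm", "blue": "cool"}))  # "green" stays "green"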

Transcript

So the index in a data series is, in fact, something that we can dictate much like a dictionary. We can decide to index by anything we want really. We can index by date, time, objects, whatever we feel like. So looking at the last index we used, what I want to do is I want to call these people patients.

So what I really want is a list of strings that say patient one, patient two, patient three, patient four, and so on. To do this, I can use comprehensions. If I write 'patient' plus the string of i, for i in range one to ten, what this is going to give me, hopefully, is the list of strings that say patient: so patient one, patient two, patient three.

Now I can re-index my heights by calling DS heights. I can set a new index if I want to. So index equals patient. Now, if I run this and have a look at it, I have an index of patient one, patient two, patient three, et cetera, associated with the values within the collection.

If I just want to specify a single index, I can put in the string, say patient eight. I could also ask for all the values between patient four and patient eight, slicing between string indices. We're just creating these as keys, but there is an inherent ordering within my keys.
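A rough sketch of what is being typed here; the variable name and the height values are assumptions, not taken verbatim from the lecture:
import numpy as np
import pandas as pd
# Build the "patient N" labels with a comprehension
patients = ["patient " + str(i) for i in range(1, 10)]
ds_heights = pd.Series([123, 126, 204, np.nan, 150, np.nan, 180, np.nan, 165],
                       index=patients)
print(ds_heights["patient 3"])               # a single element, looked up by its string index
print(ds_heights["patient 4":"patient 8"])   # slicing between string indices (both ends included)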

Let's have a look at a few of the methods that we have available to us. A simple one is sorting values. It comes in useful when we want to do various tasks. With sort_values, I can pass in a parameter dictating which way I want to sort, and by default it will sort in ascending order.

If I say ascending is equal to false, then that will switch the order. Not a Numbers are not ordered, so this will just switch the order in which we are sorting: we're leaving the Not a Numbers by themselves, and we're sorting 204, 126, 123.

By default, ascending is equal to true, and this is just a quirk that you need to be aware of. Then we have value counts, and value counts are very useful. They tell us how many of each value we have. And by default, it's not going to consider a Not a Number as a value.

So I need to set dropna as false. Then we also get an idea of how many Not a Numbers we have; by default it doesn't consider them to be true values. You can use this to do something quickly, like plot with kind equals bar, and get a quick and easy bar plot of the count of each value that I have within my data series.
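A sketch of the calls being demonstrated; the data is made up, and the final line needs Matplotlib installed:
import numpy as np
import pandas as pd
heights = pd.Series([123.0, np.nan, 204.0, np.nan, 126.0, np.nan])
print(heights.sort_values(ascending=False))   # descending sort; NaNs are left unsorted at the end
print(heights.value_counts())                 # count of each value, NaNs ignored by default
print(heights.value_counts(dropna=False))     # include the NaN count as well
heights.value_counts(dropna=False).plot(kind="bar")   # quick bar plot via Pandas' Matplotlib integration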

So Pandas integrates with Matplotlib, which we haven't had a look at yet. We're going to look at that later. It integrates very well with Matplotlib, and you can plot quickly and quite easily. You can make it a lot prettier than this, but this is just an example plot here on the screen. Let's have a quick look at value counts.

So value counts are very good because they show the count of each value. So here the index is now made up of the values from the series before, and the value is now the count of that value in the series. You have seven Not a Numbers, you have one value of 123, and so on. If we want to add a value, a single value, to our data series, I can add in a new entry, in this case, for example, patient zero, and this is going to be equal to 100. This will add a new entry into my data series.

Good, so I've got patient zero. He gets stuck on the end by default, and I can change the ordering if I want. I can ask how much missing data there is, or which of my values are now missing, with isna. Isna gives me a data series of Booleans saying whether each value is or is not missing. A point of interest: if I put None into the data series as a new entry, it has defaulted to giving it a Not a Number. So I have Not a Number and I have None. None is Python's missing data value, whereas numpy.nan is NumPy's missing data value, but they are seen as equivalent within Pandas.
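A sketch of adding entries and checking for missing values; the labels and values are assumptions:
import numpy as np
import pandas as pd
heights = pd.Series([123.0, 126.0, np.nan],
                    index=["patient 1", "patient 2", "patient 3"])
heights["patient 0"] = 100     # a new entry is appended at the end by default
heights["patient x"] = None    # Python's None is stored as a missing value (NaN) here
print(heights.isna())          # Boolean Series: True wherever a value is missing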

So if I ask whether a value is NA, running the isna query (.isna), then both Not a Numbers and Nones are considered missing. So it doesn't matter if you have a NumPy missing value or a Python missing value; Pandas will pick up on both.

I could do value counts as well, on this True/False series. So this is giving me the count of those which are missing and those that aren't missing. So what is true plus true? It's two. So what we could do to our data series, to get an idea of the count of missing values, is sum up all the trues, because trues are missing values and falses are not missing values.

So we can add them up. We get nine when we add the trues together. True is one, so every time we encounter a true, we're adding one to our count, and this is a count of missing values. So heights.isna().sum(): you'll see this very often. This is just the total number of missing values.
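The idiom from this passage, as a sketch on made-up data:
import numpy as np
import pandas as pd
heights = pd.Series([123.0, np.nan, 204.0, np.nan, 126.0])
print(heights.isna().value_counts())   # how many values are missing (True) vs present (False)
print(heights.isna().sum())            # True counts as 1, so this is the total number of missing values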

About the Author

Delivers training and develops courseware across the Data Science curriculum, constantly updating and adapting to new trends and methods.