Working With Queries

Working With PANDAS
Difficulty
Intermediate
Duration
57m
Students
1043
Ratings
4.2/5
Description

Pandas is Python’s ETL package for structured data. It is built on top of NumPy and designed to mimic the functionality of R data frames. It provides a convenient way to handle tabular data and can perform many SQL-style operations, including group-by and join. Furthermore, it's compatible with many other Data Science packages, including visualization packages such as Matplotlib and Seaborn.

In this course, we are going to explore Pandas and show you how it can be used as a powerful data manipulation tool. We'll start off by looking at arrays, queries, and dataframes, and then we'll look specifically at the groupby function before rounding off the course by looking at how the merge and join methods can be used in Pandas.

If you have any feedback relating to this course, feel free to tell us about it at support@cloudacademy.com.

Learning Objectives

  • Understand the fundamentals of the Pandas library in Python and how it is used to handle data
  • Learn how to work with arrays, queries, and dataframes
  • Learn how to use the groupby, merge, and join methods in Pandas

Intended Audience

This course is intended for data engineers, data scientists, or anyone who wants to use the Pandas library in Python for handling data.

Pre-requisites

To get the most out of this course, you should already have a good working knowledge of Python and data visualization techniques.

Resources

The dataset(s) used in the course can be found in the following GitHub repo: https://github.com/cloudacademy/practical-data-science-python 

Transcript

So there are other ways I can specify transformations I may want to perform on a series. We can have a look at a few of those now. The simplest mapping is called replace and I am going to show you how replace works.

So, we run replace. I want to replace all of the np.nan values with zero, and I want to replace all of the 123s (that is, 123.0) with, let's say, 200, for example.

So I'll run this, and we should see that all of my missing data, including the NaN, has been turned to zero, and my value of 123 has been turned into the number 200. So what have I passed in as the object here? I've passed in a dictionary: the key is what replace is looking for, and the value is what it's going to replace it with. You could build up an enormous dictionary of transformations you want to perform on your data, or of mappings between one value and another, just run replace on it, and it will map all of the values. It's not dynamic; it just looks for specific values and changes them to specific other ones.
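As a minimal sketch of the replace call described above: the lecture's actual series is not shown in the transcript, so the `heights` values below are made up for illustration. Only the dictionary, mapping np.nan to 0 and 123.0 to 200, follows what is said.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the series used in the lecture is not shown.
heights = pd.Series([188.0, 123.0, np.nan, 172.0, np.nan])

# Each key is a value to look for; each value is what replaces it.
cleaned = heights.replace({np.nan: 0, 123.0: 200})

print(cleaned)
```

Note that replace returns a new series here; `heights` itself still holds the missing values unless the result is assigned back.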

We also have something called map. I can use map to perform tasks using functions that I have defined. A simple one would be round. Let's multiply all of my heights by 0.95, or a factor like it, so that I get some floating-point numbers in there, and I'm going to reassign the result here. So now I've got some floating-point numbers.

If I wanted to round these according to some criteria, I can define a function which will do that for me. I'll define my_round, which uses the built-in round function underneath. I pass in a value x, I run round on it to round x to two decimal places, for example, and then I return the result. So the way the function works is this: if I run my_round and pass in 1.11111999, what I get back is that value rounded to two decimal places, and that's simple enough.
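The function described here might look like the following sketch. The name `my_round` is a rendering of the spoken "my round", and the parameter name `x` is an assumption.

```python
def my_round(x):
    # Round a single value to two decimal places with the built-in round.
    return round(x, 2)

print(my_round(1.11111999))
```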

If I wanted to pass this function over my data series, that's when I would use map. I take my heights data series and map over it the function round, sorry, my_round. I don't need brackets after the function name, because I'm not calling it myself; I just pass a reference to the function, and map calls it for each value. My data series then has its values rounded appropriately. This is passing a function, element-wise, over a data series.
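A small sketch of mapping the function over a series, assuming a `my_round` helper like the one described and made-up height values (the real data is not shown in the transcript):

```python
import pandas as pd

def my_round(x):
    # Round a single value to two decimal places.
    return round(x, 2)

# Hypothetical heights, scaled by 0.95 to produce floating-point values.
heights = pd.Series([188.0, 157.2, 172.25]) * 0.95

# Pass the function itself (no brackets); map calls it once per element.
rounded = heights.map(my_round)

print(rounded)
```

Writing `heights.map(my_round())` instead would try to call `my_round` with no argument and fail before map ever ran.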

I could map something like abs, which would give the absolute value of each of these. I can do something silly here, like run print over it: this prints out each element, and then every element becomes None, because print returns nothing. Any function which takes a single argument can be mapped over the series in this way. Take np.sqrt: I can map that over it to take the square root of all my data.
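The three cases mentioned (abs, print, and np.sqrt) can be sketched as follows. The `values` series is invented for illustration so that the square roots come out cleanly.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the results are easy to read.
values = pd.Series([4.0, -9.0, 16.0])

# Built-in abs takes one argument, so it can be mapped directly.
absolute = values.map(abs)

# print returns nothing, so each element is printed and the
# resulting series holds no useful values.
printed = values.map(print)

# A NumPy function that accepts a single value works the same way.
roots = absolute.map(np.sqrt)

print(roots)
```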

Now, this is a good moment to have a look at lambdas. If you want to have a unique transformation, a simple one line bit of code that you want to run, you could use something like a lambda for that. What I mean by lambda is simply a function that I don't define to have a name. Lambdas are the nameless functions.

If I wanted to map using a lambda, I have one value going in, and what I want to do with that one value is round it to two decimal places. This line of code is equivalent to mapping the my_round function over the series; it's the exact same thing. So we have the keyword lambda, then the name we give to the variable going in, then what we want to do to that variable, which is what gets returned. These are generally small, one-line functions that we can just create, carry out, and then let vanish. We could make a lambda complicated if we liked, but that's not really in the spirit of lambda. It's good to know that we have lambdas available to us.
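To illustrate the equivalence claimed here, the sketch below maps a lambda and a named function over the same series and compares the results. The data and the `my_round` helper are assumptions for illustration.

```python
import pandas as pd

def my_round(x):
    # Named version: round a single value to two decimal places.
    return round(x, 2)

# Hypothetical values with more than two decimal places.
heights = pd.Series([178.6012, 149.3449, 163.6381])

# Anonymous version: lambda, the input name, then the expression to return.
with_lambda = heights.map(lambda x: round(x, 2))
with_named = heights.map(my_round)

# Both approaches should give the same series.
print(with_lambda.equals(with_named))
```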

About the Author

Delivering training and developing courseware for multiple aspects across Data Science curriculum, constantly updating and adapting to new trends and methods.