Merging and Joining

Course: Working With PANDAS
Part of the learning path: Practical Data Science with Python

Overview
Difficulty: Intermediate
Duration: 57m
Description

What is Pandas
Pandas is Python's ETL package for structured data
Built on top of NumPy, and designed to mimic the functionality of R dataframes
Provides a convenient way to handle tabular data
Supports the main SQL-style operations, including group-by and join
Compatible with many other data science packages, including visualisation packages such as Matplotlib and Seaborn
Defines two main data types:
pandas.Series
pandas.DataFrame

Series
A generalised array: it can be viewed as a table with a single column
It consists of two numpy arrays:
An index array, which stores the index (ID) of each element
A values array, which stores the values of the elements
Each element has a unique index, so an index always identifies the same element in the series
If we reorder the series, the indices move with their elements
Indices do not have to be sequential; they do not even have to be numbers
Think of the indices as the primary keys for the rows of a single-column table
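A minimal sketch of these points (the values and labels are illustrative):

```python
import pandas as pd

# A Series with a non-sequential, non-numeric index
s = pd.Series([170, 165, 180], index=["alice", "bob", "carol"])

# Sorting by value reorders the rows, but each index
# label stays attached to its element
s_sorted = s.sort_values()
print(s_sorted.index.tolist())  # ['bob', 'alice', 'carol']

# An index label always identifies the same element
print(s_sorted["alice"])  # 170
```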

DataFrames
A pandas DataFrame represents a table; it contains:
Data in the form of rows and columns
Row IDs (the index array, i.e. the primary key)
Column names (the IDs of the columns)
A DataFrame is equivalent to a collection of Series, with each Series representing a column
By default the row indices start from 0 and increase by one for each subsequent row, but just like in a Series they can be changed to any collection of objects
Each row index uniquely identifies a particular row; if we reorder the rows, their indices go with them
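A short sketch of these properties (column names are illustrative):

```python
import pandas as pd

# A DataFrame built from a dict of columns; each column is a Series
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 62, 91],
})

# Each column is a pandas.Series sharing the DataFrame's index
print(type(df["score"]).__name__)  # Series

# Reordering the rows keeps each row's index label with it
df_sorted = df.sort_values("score", ascending=False)
print(df_sorted.index.tolist())    # [2, 0, 1]
```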

Group By
Group-bys are usually used together with reductions:
Counting the number of rows in each group
my_dataframe.groupby(criteria).size()
Sum of every numerical column in each group
my_dataframe.groupby(criteria).sum()
Mean of every numerical column in each group
my_dataframe.groupby(criteria).mean()
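The three reductions above can be sketched like this (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Paris", "London"],
    "sales": [10, 20, 30, 40, 50],
})

counts = df.groupby("city").size()        # rows per group
totals = df.groupby("city")["sales"].sum()
means = df.groupby("city")["sales"].mean()

print(counts["London"])  # 3
print(totals["Paris"])   # 60
print(means["London"])   # 30.0
```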

Join
Use DataFrame.merge() as a general method of joining two dataframes:
It also works with Series
By default it joins on the columns the two dataframes have in common; the on, left_on, and right_on arguments control the join keys
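A minimal sketch, using the contacts and exam results frames described in the transcript below (the data itself is illustrative):

```python
import pandas as pd

contacts = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "phone": ["555-0101", "555-0102"],
})
exam_results = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 62, 91],
})

# Inner join on the shared 'name' column (merge defaults to how='inner')
merged = exam_results.merge(contacts, on="name")
print(len(merged))   # 2 -- Carol has no contact details

# An outer join keeps unmatched rows, filling the gaps with NaN
outer = exam_results.merge(contacts, on="name", how="outer")
print(len(outer))    # 3
```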

Missing Values
Finding the number of missing values in each column:
my_dataframe.isna().sum()
Removing rows with missing values:
my_dataframe.dropna(axis=0)
Removing columns with missing values:
my_dataframe.dropna(axis=1)
Filling with a value:
For all missing values: my_dataframe.fillna(replacement_value)
A different value for each column: my_dataframe.fillna({'NAME': 'UNKNOWN', 'AGE': 0})
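The methods above in a small, self-contained sketch (columns follow the NAME/AGE example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "NAME": ["Alice", np.nan, "Carol"],
    "AGE": [30, 25, np.nan],
})

# Count missing values per column
print(df.isna().sum().to_dict())   # {'NAME': 1, 'AGE': 1}

# Drop rows containing any missing value
print(len(df.dropna(axis=0)))      # 1

# Fill with a different value for each column
filled = df.fillna({"NAME": "UNKNOWN", "AGE": 0})
print(filled.loc[1, "NAME"])       # UNKNOWN
```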

Map, Replace, Apply
Map applies a mapping to every element (on a DataFrame, map takes a function; on a Series it can also take a dictionary):
my_series.map({old1: new1, old2: new2, …})
my_dataframe.map(function)
If we provide map with a dictionary, any elements not in its keys are mapped to numpy.nan
Replace applies a mapping only to those elements that are mentioned in the mapping:
my_dataframe.replace({old1: new1, old2: new2, …})
Any elements not in the dictionary keys are left unchanged
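The difference between map and replace in a small sketch (the values are illustrative):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "bird"])

# Series.map with a dict: unmapped elements become NaN
mapped = s.map({"cat": "feline", "dog": "canine"})
print(mapped.isna().sum())   # 1 -- 'bird' was not in the mapping

# replace only touches elements mentioned in the mapping
replaced = s.replace({"cat": "feline", "dog": "canine"})
print(replaced.tolist())     # ['feline', 'canine', 'bird']
```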

Transcript

Merging and joining, it's probably best to just have a look at what we're doing for this on screen. So I'm getting my contacts, I've got names and telephone numbers that I'm just going to put in the DataFrame by passing in our dictionary here, then I have my exam results. They're the same ones that are generated above.

So if I have a look at contacts, contacts looks like this. It's just a small DataFrame here. And then exam results. Exam results looks like this. So we should see that we've got some sort of duplication between the names in our various columns. And this is something that we can use to combine our tables.

So we do this using the method merge. If we want to combine by column, there is a join method, which you'll use to join a column on the index of another DataFrame. But nine times out of 10, you're going to be using merge to combine them by column. And its syntax, I think is actually quite straightforward. It's quite simple.

So what we have to do is name one of the tables, exam results, say that we want to merge, and name the other table that we want to join to. So I'm going to do exam_results.merge with contacts. And I can then specify an argument, on, which I'm going to set equal to name. And there we are, joining DataFrames.

So we can do an outer join. It defaults to an inner join, but we can do outer joins, we can do left joins, we can do right joins as well. We've got inner, outer, left, right, and so on. So this is using merge. If we have different names for the columns that we want to join over, then we can specify specifically what the left part of the join is going to be on, and then what the right part is going to be on. right_on is going to be name as well in this case.
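A sketch of the two variants described here, left_on/right_on and how (the column names are illustrative):

```python
import pandas as pd

exam_results = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 62]})
contacts = pd.DataFrame({"full_name": ["Alice", "Carol"],
                         "phone": ["555-0101", "555-0103"]})

# Same join, but the key column has a different name on each side
merged = exam_results.merge(contacts, left_on="name", right_on="full_name")
print(merged["phone"].tolist())   # ['555-0101'] -- only Alice matches

# how= controls the join type: 'inner' (default), 'outer', 'left', 'right'
left = exam_results.merge(contacts, left_on="name", right_on="full_name",
                          how="left")
print(len(left))                  # 2 -- Bob kept, with NaN contact details
```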

So we can give different names for the columns that we want to join over, and the results are the same; we're still performing the same join operation. So it's up to us how we want to join. The difference with join is that it works on the index: you need a DataFrame whose index contains the values of a column, and join will then join a column of data on that index.

So it does the same thing, really; it's just joining on the index numbers here, as opposed to the values within a column. And if you don't want an inner join, you need to say that it's an outer or left or right, what have you. Now we'll have a look at SQL; specifically, an SQL interaction with pandas. For this to work, we need to import sqlite3.
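The join-on-index case described above, as a small sketch (the data is illustrative):

```python
import pandas as pd

exam_results = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 62]})

# contacts is indexed by name, so its index holds the join key
contacts = pd.DataFrame({"phone": ["555-0101", "555-0102"]},
                        index=["Alice", "Bob"])

# join matches a column (via on=) against the other frame's index
joined = exam_results.join(contacts, on="name")
print(joined["phone"].tolist())   # ['555-0101', '555-0102']
```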

Now, there are various other database connection modules available; sqlite3 just conveniently comes bundled with Python. It's a no-setup database, but pretty much every database interaction engine is set up in the same way. So what we've done is import sqlite3, a zero-configuration SQL library, good for interacting with quick, easy, simple local SQL databases.

What we have to do is set up a connection: we create a variable which is a connection object. You give it the name of the database, and there are various parameters you can specify to tailor your connection, but this simply tells pandas and Python where the database is. The connection object has a method called cursor, and you use this to set up a cursor object.

Now, a cursor object is simply the means through which you interact with the database. In fact, when we're working with pandas, we don't need the cursor, but the cursor is how you pass queries to and from the database when working with core Python. So we'll just have a quick example of a core Python interaction.

So I've run this just now. If I wanted to execute some query on my database, then I would call cur, which is the name of the cursor I defined up here, and call .execute, passing in a multiline string containing my SQL query. Running this query gives me an iterator, which is a consumable collection: every time I pull an element out, that element vanishes from it. I can use the method fetchone, for example. fetchone gives me the first row that matched the query, which here is asking for every row from the customer table. If I keep calling fetchone, I will continue to get different customers out until I've exhausted this resource and don't have any customers left. I can fetch more than one row at a time, but this gives us an idea of how the consumable aspect of it works.
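The cursor workflow above, sketched with an in-memory database standing in for the course's example file (the customer table and its rows are illustrative):

```python
import sqlite3

# In-memory database: no file needed
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A small customer table to query
cur.execute("CREATE TABLE customer (name TEXT, city TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [("Alice", "London"), ("Bob", "Paris")])

# execute returns a consumable result; each fetchone pops one row
cur.execute("SELECT * FROM customer")
print(cur.fetchone())   # ('Alice', 'London')
print(cur.fetchone())   # ('Bob', 'Paris')
print(cur.fetchone())   # None -- the results are exhausted
```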

So we tend to interact with rows in plain Python as tuples. Pandas makes everything easy for us; that's its job. And we don't actually need the cursor. All we need is the query we want to execute: we pass that into the pandas.read_sql_query method, give it the query, and then simply point pandas at the database connection we want to obtain the results from. And if we run this here, what we've ended up with is a DataFrame containing all the information about these customers, pulled straight into a table.

I've called head here, just to get the first few rows out; I could get the entire table stored in this DataFrame immediately. So it's nice and easy. Generally, when you're working with SQL databases, you set up your configuration and end up with a connection object that has all the information stored in it. All you have to do is give that object to pandas, and it will pass the query up, obtain the results, and put them in a table for you; then you can perform whatever analytical tasks you want on it.
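The pandas side of the same workflow, again with an illustrative in-memory database:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("Alice", "London"), ("Bob", "Paris"), ("Carol", "London")])

# No cursor needed: pandas runs the query and builds the DataFrame
df = pd.read_sql_query("SELECT * FROM customer", conn)
print(df.shape)                      # (3, 2)
print(df.head(2)["name"].tolist())   # ['Alice', 'Bob']
```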

So this will work for any sort of query we might want to think of. All we have to do if we want to select different things is specify a different query: say, the contact name column and the address column from customer, where city is equal to London. This is going to give me the contacts and addresses for those customers whose city is London. We're literally just typing SQL into a multiline string here.

So there are no limits on the complexity of the query you want to pass, as long as the results can be parsed into a table afterwards. As long as it's tabular, then it's all good, and you'd be hard pressed to write an SQL query that won't give its result as a table. We can also select product name, and unit price * units in stock as assets, so we can alias values, give them nicknames, from product where product name is like 'C%'. The percent sign matches zero or more characters after the C, so we're looking for names beginning with C, and we're ordering by assets.

As we can see, we've got everything that begins with a C. What I'd recommend when you work with an SQL database is to deal with as much complexity as you can in the SQL query, then just pull down the final results and manipulate them within pandas; pandas' analytical capabilities are much greater than those of SQL. Take a big join: like we say, we can select all of these columns here, selecting o.Id as order Id from order as o, et cetera, et cetera. This is a nice complicated query, but again, you're pulling it all down into a table.
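The alias-and-filter query run through pandas, as a sketch; the product table here is illustrative, loosely following the transcript's example:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product "
             "(product_name TEXT, unit_price REAL, units_in_stock INTEGER)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("Chai", 18.0, 39), ("Chang", 19.0, 17),
                  ("Aniseed Syrup", 10.0, 13)])

qry = """
SELECT product_name,
       unit_price * units_in_stock AS assets
FROM product
WHERE product_name LIKE 'C%'
ORDER BY assets
"""
df = pd.read_sql_query(qry, conn)
print(df["product_name"].tolist())   # ['Chang', 'Chai'] -- ordered by assets
```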

So in reality, you could build your query within whatever database desktop app you have, get an idea of what you want, and then, once you've figured it out, put it into a pandas DataFrame. I would say it's probably not very efficient to try to build an SQL query in a string in Python. It's much better to build it within whatever database tool you have, and then just copy and paste it in here, and it will evaluate.

When I say it's not very efficient, what I mean is that pandas is designed to manipulate data; if you use pandas purely as a data selection tool, then you're kind of missing the point of pandas. So now let's look at some coding. I've got this string, qry, given by this guy here. I could make this an f-string, where the date is going to be the result of pulling out some specific date. So I'm going to pass in date, given by the variable date, which I'm going to generate prior to this: date is equal to the string of this one here.

So I've got the date being the result of this one here. Let's just see what happens when we pass in qry. Okay, what this has done: we could have obtained the date by some computation, or we could have asked for whatever today's date is. I then defined an f-string, and f-strings are strings which, upon instantiation, will fill in certain parameters, parameters that you've left holes for when they're created. So this string will go looking for a variable called date to fill in this portion, and we've pulled out everything for the date that we have here. If we have a quick look at qry, you can see that what we've got is the query, and it has substituted in that value for us.
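A sketch of building the query with an f-string; the orders table and the dates are illustrative. A parameterised variant is shown as well, since f-strings should only be used with values you trust:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-01-01", 10.0), ("2024-01-02", 20.0),
                  ("2024-01-01", 5.0)])

# The date is computed first, then substituted into the f-string
date = "2024-01-01"
qry = f"""
SELECT * FROM orders
WHERE order_date = '{date}'
"""
df = pd.read_sql_query(qry, conn)
print(len(df))   # 2

# For untrusted input, prefer a parameterised query over an f-string
safe = pd.read_sql_query("SELECT * FROM orders WHERE order_date = ?",
                         conn, params=(date,))
print(len(safe))  # 2
```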


About the Author

Delivering training and developing courseware for multiple aspects across Data Science curriculum, constantly updating and adapting to new trends and methods.