Merging and Joining


Working With PANDAS
PREVIEW12m 29s

The course is part of this learning path

Start course

Pandas is Python’s ETL package for structured data and is built on top of NumPy, designed to mimic the functionality of R data frames. It provides a convenient way to handle tabular data and can perform all SQL functionalities, including group-by and join. Furthermore, it's compatible with many other Data Science packages, including visualization packages such as Matplotlib and Seaborn.

In this course, we are going to explore Pandas and show you how it can be used as a powerful data manipulation tool. We'll start off by looking at arrays, queries, and dataframes, and then we'll look specifically at the groupby function before rounding off the course by looking at how the merge and join methods can be used in Pandas.

If you have any feedback relating to this course, feel free to tell us about it at

Learning Objectives

  • Understand the fundamentals of the Pandas library in Python and how it is used to handle data
  • Learn how to work with arrays, queries, and dataframes
  • Learn how to use the groupby, merge, and join methods in Pandas

Intended Audience

This Course is intended for data engineers, data scientists, or anyone who wants to use the Pandas library in Python for handling data.


To get the most out of this course, you should already have a good working knowledge of Python and data visualization techniques.


The dataset(s) used in the course can be found in the following GitHub repo: 


Merging and joining, it's probably best to just have a look at what we're doing for this on screen. So I'm getting my contacts, I've got names and telephone numbers that I'm just going to put in the DataFrame by passing in our dictionary here, then I have my exam results. They're the same ones that are generated above.

So if I have a look at contacts, contacts looks like this. It's just a small DataFrame here. And then exam results. Exam results looks like this. So we should see that we've got some sort of duplication between the names in our various columns. And this is something that we can use to combine our tables.

So we do this using the method merge. If we want to combine by column, there is a join method, which you'll use to join a column on the index of another DataFrame. But nine times out of 10, you're going to be using merge to combine them by column. And its syntax, I think is actually quite straightforward. It's quite simple.

So what we have to do is we have to name one of the tables exam results, we have to say we want to merge, we name the other table that we want to join to. So I'm going to do exam_results.merge with contacts. And I can then specify an argument on, which I'm going to set equal to now. And there we are joining DataFrames.

So we can do an outer join. It defaults to an inner join. But we can do outer joins, we can do left joins, we can do right joins as well. We've got inner, outer, left, right, so on and so forth. So this is using merge. If we have different names for columns that we want to join over, then we can specify specifically, what the left part of the joint is going to be on. And then what the right part is going to be on. Right_on is going to be name as well as in this case.

So we can give different names for columns that I want to join over. And the results are the same, we're still performing the same joint operation. So it's up to us how we want to join. So the difference is that with the bold numbers on the left hand side, you have to have a DataFrame that has an index, which contains the values of a column. And what that will do is it will join a column of data on the index.

So almost whatever is in bold on the left hand side there, but it does the same thing actually. So it's just joining on the numbers here, as opposed to the values within a column. So if you don't want it inner you need to say that it's an outer or left or right, what have you. Now we'll have a look at SQL, I'm specifically going to have a look at an SQL interaction with pandas. What we would need for this to work is to import SQLite.

Now there are various other database connection modules available. SQLite just conveniently comes bundled with Python. It is a no setup database. But pretty much every database interaction engine will be set up in the same way. So what we've done is we're importing SQLite zero configuration SQL library, good for interacting with just quick, easy, simple local SQL databases.

What we have to do is we have to set up a connection and we create a variable, which is a connection object. So you give a name to the database. And then you can have all these various parameters that you can specifically specify and tailor your connection. But this simply tells pandas and Python where the database is. And the database object has a method called cursor. And you use this to set up a cursor object.

Now what a cursor object is, is it's simply the means through which you interact with the database. And in fact, when we're working with pandas, we don't need the cursor. But the cursor is how you pass queries to and from pandas in working with core Python. So we'll just have a quick example of a core parent Python interaction.

So I've run this just now, if I wanted to execute some query on my database, then I would call cur, which is the name of my cursor that I find up here. And then I call dot execute. And then all I'm doing is I'm passing in a multiline string containing my SQL query. I run this query in iterator which means a consumable collection. Every time I pull an element out, that element vanishes from it. I can use the method of fetch one, for example. Fetch one gives me the first element that has matched the query. So the first thing that matches the query, which is giving me everything. Every row from the customer table, if I fetch one, then I get this information. If I keep fetching one, I will continue to get different customers out, until I've exhausted this resource. I don't have any customers left. I can fetch more than one item at a time. But this gives us an example or an idea of how consumable aspect of it works.

So we tend to interact with elements in normal Python as tuples. Pandas makes everything easy for us. Its job is to make everything easy for us. And we don't actually need this cursor. All we need to have is the query that we want to execute. We pass that into the pandas.readsqlquery method. We give it the query, and then we simply point pandas to the database we want to connect to, and the one we want to obtain the statement from. And if we run this here, what we've ended up with is a DataFrame containing all the information about these customers, I just pulled it into a table.

So I've done ahead here, just to get the first few elements out. I could get the entire database stored in this table immediately. So it's nice and easy. So generally, when you're working with SQL databases, you're kind of set up your configuration, you end up with an object that has all the information stored in it. And all you have to do is give that object to pandas, and it will pass a query up, obtain the results, put it in a table for you, and then you can perform whatever analytical tasks you want on it.

So this will work for any sort of query that we might want to think of. But all we have to do if we want to select different things is specified different argument, like if I want to set contact name, contact name column, the address column from customer, where city is equal to London. Now this is going to give me all of my contacts and addresses for those customers whose city is equal to London. We're literally just typing SQL into a multiline string here.

So there are no limits as to the complexity of the query that you want to pass. As long as the results can be passed into a table after. As long as it's tabular, then it's all good. But you'd be hard pressed to generate an SQL query that won't give a result in a table. We have the ability to work with selecting product name, unit price * units in stock as assets. So we can alias values, give them nicknames, from product where product name is like C percentage. So we're looking for zero, one or more letter Cs within our name here, and we're ordering by assets.

As we can see, we've got everything that begins with a C. What do you recommend when you work with an SQL database? You pass the query, the more complexity that you can deal with the SQL query, the better it is, you would just want to have the final results and manipulate within pandas. Pandas is analytical capabilities are much greater than that of SQL to a big join. Like we say, we can select all of these guys here. Selecting o.Id as order Id from order as o, et cetera, et cetera, et cetera. This is a nice complicated query. But again, you're pulling it all down.

So in reality, you could build your query within whatever database desktop app you have, and get an idea of what you want. And then once you figure that out, chuck it into a panda's DataFrame. I would say it's probably not very efficient to try and build an SQL query in a string in Python. It's much better to build it within whatever database you have. And then just copy and paste into here, and it will evaluate.

So when I say it's not very efficient, what I mean is that pandas is designed to manipulate data. If you use pandas purely as a data selection tool, then it's kind of missing the point of pandas. So now let's look at some coding. I've got this string qry is going to be given by this guy here, I could make this an F string where the data is going to be the result of pulling out some specific dates. So I'm going to pass in date is going to be given by the variable date, which I'm going to generate prior to this. Date is equal to the string is equal to the string of this one here.

So I've got the date being the results of this one here. So query, let's just see what happens when we pass in qry. Okay, so what this has done is well, we could have obtained data by some computation, or we could have asked for whatever today's date is, I then defined an F string, and F strings are strings, which upon instantiation will fill in certain parameters. Certain parameters that you've sort of left holes for when they're created. So this string will go looking for a variable called date to fill in this portion, and we've pulled out everything for the date that we have here. So if we have a quick look at qry, you can see that what we've got is the query here and it has substituted in that value for us.


About the Author

Delivering training and developing courseware for multiple aspects across Data Science curriculum, constantly updating and adapting to new trends and methods.