Data Wrangling with Pandas
In this course, we are going to explore techniques for advanced data exploration and analysis using Python. We will focus on the Pandas library, using real-life scenarios to help you better understand how data can be processed with Pandas.
In particular, we will explore tidy datasets, multi-indexes and their impact on real datasets, and how to concatenate and merge different Pandas objects, with a focus on DataFrames. We’ll also look at how to transform a DataFrame and how to plot results with Pandas.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Learn what a tidy dataset is and how to tidy data
- Merge and concatenate tidy data
- Learn about multi-level indexing and how to use it
- Transform datasets using advanced data reshape operations with Pandas
- Plot results with Pandas
This course is intended for data scientists, data engineers, or anyone looking to perform data exploration and analysis with Pandas.
To get the most out of this course, you should have some knowledge of Python and the Pandas library. If you want to brush up on Pandas, we recommend taking our Working with Pandas course.
The GitHub repo for this course can be found here.
Welcome back. In this lecture, we are going to investigate two important `pandas` functions that allow you to join two (or more) tidy data sources into one. Let us consider three different `pandas.DataFrame` objects, each of them describing the June 2020 time series of a stock quoted on the NASDAQ-100 index.
In particular, we are going to take into account the stocks of Amazon, Google, and Microsoft, denoted by `AMZN`, `GOOGL`, and `MSFT`, respectively, just because, as cloud fans, we are very familiar with those, aren’t we? We would like to answer two questions: 1. How can we join the three data sources into just one? 2. Can we obtain a single data structure in which each statistical unit is identified by a unique identifier shared by each of the sources? We will use a Jupyter notebook now, so pause this video, open a new notebook, and I’ll see you there!
Ok, I’m assuming you have opened an environment that looks like the one you see here. Be sure you have put all the data in the right folder; for instance, I suggest you create a `data` folder in the same path where your Jupyter notebook is stored. So let us get started!
We first import `pandas` with the standard convention, `import pandas as pd`. We then read the aforementioned data objects as follows: we use the pandas `read_csv()` function, which requires the file path pointing to the data we want to read into memory. This snippet stores, in the `google` variable, the dataset from the google June 2020 CSV file. We set the `index_col` argument to `0`, meaning that we set the first column of the CSV as the index of our dataset. We repeat the same logic for `microsoft` and `amazon` as well.
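As a sketch, the reading step looks like the following. The file names under `data/` are assumptions (adjust them to the CSVs from the course repo); to keep the snippet self-contained, it first writes a tiny CSV with illustrative numbers, not real quotes, and then reads it back the same way:

```python
import pandas as pd

# Assumed file names -- adjust to the CSVs in your data/ folder:
# google = pd.read_csv("data/google_june_2020.csv", index_col=0)
# microsoft = pd.read_csv("data/microsoft_june_2020.csv", index_col=0)
# amazon = pd.read_csv("data/amazon_june_2020.csv", index_col=0)

# Self-contained demo: write a tiny CSV, then read it back the same way.
sample = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02"],
    "Close": [1431.82, 1439.22],      # illustrative values, not real quotes
    "Volume": [1217100, 1278100],
    "Symbol": ["GOOGL", "GOOGL"],
})
sample.to_csv("google_demo.csv")                      # index saved as first column
google = pd.read_csv("google_demo.csv", index_col=0)  # first column used as index
print(google.shape)  # (2, 4)
```

Setting `index_col=0` tells pandas to reuse the saved first column as the row index instead of creating a fresh default one.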
Let us inspect the general structure of each `pandas.DataFrame` by calling the `info()` method on it. `info()` is very useful, since it provides quantitative information, such as the number of entries and the number of columns in the dataframe, as well as qualitative information, such as the types of the columns and the memory used to store the dataframe. For instance, we call `google.info()`.
The output shows the memory used to store the dataframe and the types of the columns we are considering; the same information is repeated in the table, which shows the names of the columns, their dtypes, and the number of non-null observations. It will be the same for `microsoft` and `amazon`.
What we see is that the data sources are consistent: all of them share the same structure in terms of columns and types, as well as the same number of entries, namely 22 for each data source. In other words, we have three different data sources that are consistent with respect to the features characterizing the dataframe. We can inspect, say, the first two rows of the `google` dataset using the method `.head()`: for instance, `google.head(2)` will show the first two rows of the google dataset.
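A minimal sketch of these inspection calls, using a small stand-in DataFrame in place of the real CSV data (the values are illustrative):

```python
import pandas as pd

# Stand-in for the google dataset (illustrative values, not real quotes)
google = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02", "2020-06-03"],
    "Close": [1431.82, 1439.22, 1436.38],
    "Volume": [1217100, 1278100, 1256200],
    "Symbol": ["GOOGL", "GOOGL", "GOOGL"],
})

google.info()          # entries, column names, dtypes, non-null counts, memory
print(google.head(2))  # first two rows
```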
So, let us try to answer the first question we asked: how can we join the three data sources into just one? To answer it, let us first understand the exact task. We would like to create a dataframe made of four columns (that is, Date, Close, Volume, and Symbol) that contains all the observations of the three data objects under consideration.
Put in different terms, we would like to concatenate the three datasets into one. How can we do that? Concatenation of `pandas.DataFrame` structures can be done in different ways. One possible solution is to employ the pandas `append()` method on (at least) two data sources: `append()` simply appends the rows of a DataFrame to the end of the caller (that is, basically, another DataFrame), returning a new object. (Note that `append()` was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions, use the `concat()` function discussed later in this lecture.)
So, for example, I can initialize an empty dataframe using the `pd.DataFrame()` constructor, and then populate it using the `append()` method, which can take a list of pandas dataframes: in our case, `google`, `microsoft`, and `amazon`. This snippet creates a single DataFrame, namely `all_df`, that is the concatenation of the three aforementioned DataFrames. Indeed, by calling the `info()` method on the new dataframe, we see that we now have 66 entries sharing the same structure, and obviously, as the number of entries increases, the memory used increases as well.
Note that, unlike the `append()` method of Python lists, the pandas `append()` does not modify the original object; it creates a new object with the combined data. That’s why we assign the result of the `append()` call to the `all_df` variable.
A potential problem we have to take into account is that this operation simply appends DataFrame objects, each of which keeps its own index. Put in other terms, appending rows of different data sources creates a single data source, but the resulting index is just the original indexes stacked one after another. This means that index `0` is repeated as many times as there are dataframes being concatenated. Therefore, it is better practice to reset the index of the new dataframe after a concatenation.
To do that, we call the `reset_index()` method on the new dataframe. Please note that here we are dropping the index we want to replace, since it is mere noise coming from the concatenation. In practice, we take the `all_df` object and apply `reset_index()`; since we are not interested in keeping the original index, we drop it by setting the argument `drop=True`, and we replace the `all_df` object we have in memory with the new one.
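In code, the index reset looks like this (a minimal sketch with two stand-in frames):

```python
import pandas as pd

all_df = pd.concat([
    pd.DataFrame({"Symbol": ["GOOGL", "GOOGL"]}),
    pd.DataFrame({"Symbol": ["MSFT", "MSFT"]}),
])
print(list(all_df.index))  # [0, 1, 0, 1] -- repeated index after concatenation

# drop=True discards the old index instead of keeping it as a new column
all_df = all_df.reset_index(drop=True)
print(list(all_df.index))  # [0, 1, 2, 3]
```

Without `drop=True`, the old index would be preserved as an extra `index` column, which is exactly the noise we want to avoid here.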
Alternatively, we can achieve the same result more efficiently using the pandas `concat()` function, which concatenates pandas objects along a particular axis.
Remember that `axis=0` in pandas means rows, whereas `axis=1` means columns. Since we want to concatenate rows belonging to different dataframes into one structure, we are going to set `axis=0` in the function. Please note that this is the default, which means we can omit it in the function call.
In practice, we define `all_df` to be the result of concatenating, via the `pd.concat()` function, a list of pandas objects, namely `google`, `microsoft`, and `amazon`. As I said, we don’t have to specify the argument `axis=0` in the function call, since this is the default. We also reset the index after concatenation and print the first five rows of the new dataframe. So, as before, we call `reset_index()` on the `all_df` object, dropping the old index and replacing the object we have in memory with the new one. Then we print the first five rows of `all_df` with `head()`.
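Putting the `concat()` approach together with stand-in frames (illustrative values). As a side note, `pd.concat()` also accepts `ignore_index=True`, which replaces the separate `reset_index()` call:

```python
import pandas as pd

google = pd.DataFrame({"Close": [1431.82], "Symbol": ["GOOGL"]})
microsoft = pd.DataFrame({"Close": [182.83], "Symbol": ["MSFT"]})
amazon = pd.DataFrame({"Close": [2471.04], "Symbol": ["AMZN"]})

all_df = pd.concat([google, microsoft, amazon])  # axis=0 (rows) is the default
all_df = all_df.reset_index(drop=True)
print(all_df.head())

# One-step alternative: let concat build a fresh 0..n-1 index directly
same_df = pd.concat([google, microsoft, amazon], ignore_index=True)
print(same_df.equals(all_df))  # True
```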
Here we go: we have a dataframe with 66 rows (easily checked with the `shape` attribute) and 4 columns. Please note that for more information on the `concat()` and `append()` functions, you should check the Merge, Join, and Concatenate section of the pandas documentation, which we will link to in the transcript of this lecture. Ok, that’s all for this lecture. See you in the next one.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.