This course will guide you through the main techniques used to visualize data with the Matplotlib Python library.
In this course, we will explore the main functionalities of Matplotlib: we will look at how to customize Matplotlib objects, how to use various plotting techniques, and finally, we will focus on how to communicate results.
If you have any feedback related to this course, feel free to contact us at support@cloudacademy.com.
Learning Objectives
- Learn the fundamentals of Python's Matplotlib library and its main features
- Customize objects in Matplotlib
- Create multiple plots in Matplotlib
- Customize plots in Matplotlib (annotations, labels, linestyles, colors, etc.)
- Understand the different plot types available
Intended Audience
- Data scientists
- Anyone looking to create plots and visualize data in Matplotlib
Prerequisites
To get the most out of this course, you should already be familiar with using Python, for which you can take our Introduction to Python learning path. Knowledge of Python's Pandas library would also be beneficial and you might want to take our courses Working with Pandas and Data Wrangling with Pandas before embarking on this Matplotlib course.
Resources
The data used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/data-visualization-with-python-using-matplotlib
Welcome back. In the last lecture, we produced a plot that showed the USA's GDP per capita series. But suppose we want to compare two GDP per capita series, say USA versus China. How would we do that? Well, Matplotlib allows us to perform this task quite easily in several ways and the objective of this lecture is to explore all of them.
So first, let's import pandas as pd and matplotlib.pyplot as plt as shown here. We then import the gapminder dataset using the pandas read_csv function and we filter the data frame by the country column, selecting all the observations related to the USA, to get the df_usa data frame.
Now, we create a data frame called df_china which is basically the original dataset filtered by the country column by selecting all the observations for the country China. So we put the country China, like so and we also reset the index as follows. We can create a subplot object containing two series in different ways.
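As a rough sketch of this filtering step, the snippet below builds a tiny stand-in data frame instead of reading the course CSV. The column names "country", "year" and "gdp" are assumptions, the numbers are made-up placeholders rather than real gapminder values, and drop=True on reset_index is one possible choice that the lecture does not confirm.

```python
import pandas as pd

# Stand-in for the gapminder data. In the lecture this comes from pd.read_csv;
# the column names and the values here are illustrative assumptions only.
df = pd.DataFrame({
    "country": ["USA", "China", "USA", "China", "Italy"],
    "year": [2002, 2002, 2007, 2007, 2007],
    "gdp": [39000.0, 3100.0, 43000.0, 5000.0, 28500.0],
})

# Keep only the rows for one country, then renumber the rows from zero.
# drop=True discards the old index instead of keeping it as a column.
df_usa = df[df["country"] == "USA"].reset_index(drop=True)
df_china = df[df["country"] == "China"].reset_index(drop=True)

print(df_china)
```

Resetting the index means each filtered data frame starts again from row zero, which keeps positional access consistent across the two countries.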
For instance, the easiest way is to call the plot method several times on the same axes object so that each single artist is shown in the figure, and we do that as follows. We create two objects called fig and ax using the plt.subplots function and we call the plot method on the axes object, passing the year and GDP columns of the df_usa data frame as positional arguments. We also set the linestyle to a solid line, the color to blue and, finally, a label that identifies the artist with the name USA.
We apply the same logic for df_china: we just change the label to China and we make the line red instead. We then customize the plot by adding descriptive elements, so we set the xlabel on the axes object to Year and the ylabel to GDP per capita. And we'll also add a legend. So we call the legend method on the axes object and we give it a title, which we set to Country.
Finally, we call the plot method to show the output. So can you spot a potential problem here? Well, if you look at the raw data, the magnitude of the GDP columns is different between the two countries. So plotting these series in this way does not explain the real difference in GDP per capita between the two countries.
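A minimal sketch of the single-axes plot described above might look like the following. The two small data frames are made-up placeholders standing in for df_usa and df_china, and the column names "year" and "gdp" are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: illustrative values only, not the real gapminder figures.
years = [1992, 1997, 2002, 2007]
df_usa = pd.DataFrame({"year": years, "gdp": [32000.0, 35800.0, 39100.0, 43000.0]})
df_china = pd.DataFrame({"year": years, "gdp": [1700.0, 2300.0, 3100.0, 5000.0]})

fig, ax = plt.subplots()

# Each call to plot adds one line (artist) to the same axes.
ax.plot(df_usa["year"], df_usa["gdp"], linestyle="-", color="blue", label="USA")
ax.plot(df_china["year"], df_china["gdp"], linestyle="-", color="red", label="China")

ax.set_xlabel("Year")
ax.set_ylabel("GDP per capita")

# The legend picks up the label given to each plot call.
legend = ax.legend(title="Country")
```

Calling plt.show() afterwards displays the figure. Because both lines share one y-axis, the smaller series is squeezed towards the bottom of the plot.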
So what I mean by that is that the series for China seems to be flat, or rather shrunk towards zero, especially when compared to the USA one, which actually isn't true. Indeed, when we look at the raw data, it seems China has undergone an incredible surge in the first decade of the new millennium but that is not shown here in our plot. So how can we fix this problem?
Well, there are at least two ways. First of all, we can create two different plots in just one figure using the ncols and nrows arguments of the subplots method. So, let us just copy and paste the previous snippet here and we are going to plot the two series separately so that the China one is plotted below the USA one.
Now this is achieved by specifying nrows equal to two and ncols equal to one in the subplots function. The subplots function now generates two different axes inside the figure and each of them has a specific index. So to plot each single artist on a specific axis, we need to specify the index of the axis as follows.
So the USA series is going to be plotted on the first axis, denoted with index zero, whereas the series related to China goes on the second axis, denoted with index one. We now enrich the plot by adding xlabels and ylabels but we have to be careful now. Here we need to specify where to add the specific labels so we will apply the method to the indexed axes objects as follows.
So the ylabel here is associated with the first axis object and we are gonna have another ylabel related to the China series and so we apply set_ylabel to the second axis. Please note that we only need to apply set_xlabel to a single axis, basically the last one we want to output, and so we apply set_xlabel to that.
For the moment, we are not going to plot a legend. And we can also improve the readability of the plot by setting the figsize argument, so let's put here 12 and 10 and we can also apply the grid method so we put True in there and we also add a title on the first axis object so here we can put the string that we want to show as title.
So let's put the evolution of GDP per capita, USA versus China. And we can also reduce the size of the ylabel like this. To be consistent, let's apply the grid to the ax zero object as well and we will do the same for the second ax object that is ax one. So that looks much better now.
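The stacked layout walked through above could be sketched roughly as follows. The data frames hold made-up placeholder values, the column names "year" and "gdp" are assumptions, and the fontsize of 10 used to shrink the ylabels is an illustrative choice since the lecture does not state the value.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: illustrative values only, not the real gapminder figures.
years = [1992, 1997, 2002, 2007]
df_usa = pd.DataFrame({"year": years, "gdp": [32000.0, 35800.0, 39100.0, 43000.0]})
df_china = pd.DataFrame({"year": years, "gdp": [1700.0, 2300.0, 3100.0, 5000.0]})

# Two rows, one column: ax is an array holding two axes objects.
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))

# Index 0 is the top axes, index 1 is the bottom one.
ax[0].plot(df_usa["year"], df_usa["gdp"], linestyle="-", color="blue", label="USA")
ax[1].plot(df_china["year"], df_china["gdp"], linestyle="-", color="red", label="China")

# Each axes gets its own ylabel; the xlabel goes on the bottom axes only.
ax[0].set_ylabel("GDP per capita (USA)", fontsize=10)
ax[1].set_ylabel("GDP per capita (China)", fontsize=10)
ax[1].set_xlabel("Year")

ax[0].set_title("Evolution of GDP per capita: USA vs. China")
ax[0].grid(True)
ax[1].grid(True)
```

Each axes now scales its y-axis to its own series, so the shape of the China series is no longer hidden by the larger USA values.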
Now we can spot a few patterns that were not visible in the first plot. However, there might be situations in which we wish to plot the two series in the same plot. In this case, the measurements are different in magnitude and plotting the respective series in the same plot might be a little bit confusing. So a possible solution is to create two separate y-axis scales using the twinx method and then calling two plots with different colors in the same figure but on two different axes. So how can we do that?
Well, first we create the subplot object, like so and we specify the figsize. We then add the artist to the axis object as we have done before. We then create a new axis by employing the twinx method on the existing axis ax. This means the two plots share the same x-axis but the y-axes are separate. So we then add to this new axis the second artist as follows.
So firstly, we basically create the second axis by applying the twinx method to the axis object and then we apply the plot method to the new axis for the df_china series. So I am just copying and pasting from the previous line to save time. And we also assign a label to the new y-axis, which belongs to the new axis object, that is ax2, and we give it the label GDP per capita for China. And the result is as follows.
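A minimal sketch of the twinx approach, under the same assumptions as before (made-up placeholder values and assumed column names "year" and "gdp"), might look like this.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: illustrative values only, not the real gapminder figures.
years = [1992, 1997, 2002, 2007]
df_usa = pd.DataFrame({"year": years, "gdp": [32000.0, 35800.0, 39100.0, 43000.0]})
df_china = pd.DataFrame({"year": years, "gdp": [1700.0, 2300.0, 3100.0, 5000.0]})

fig, ax = plt.subplots(figsize=(12, 10))

# First artist on the original axes (y-axis on the left).
ax.plot(df_usa["year"], df_usa["gdp"], linestyle="-", color="blue", label="USA")
ax.set_xlabel("Year")
ax.set_ylabel("GDP per capita (USA)")

# twinx creates a second axes that shares the x-axis
# but has its own y-axis, drawn on the right.
ax2 = ax.twinx()
ax2.plot(df_china["year"], df_china["gdp"], linestyle="-", color="red", label="China")
ax2.set_ylabel("GDP per capita (China)")
```

The figure now holds two axes occupying the same area, each with an independent y-scale, so both series fill the vertical space.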
So what can we improve in this plot? Well, it is not clear which series is related to USA and which is related to China. So, again, we have at least two options: we can either add a legend or customize the y-axis ticks and labels with an appropriate color. So let us start by adding a legend. With twin axes, this is not straightforward: we need to manually create the legend using the get_legend_handles_labels method on the axes object.
So we are going to create two objects called lines and labels by applying the get_legend_handles_labels method. Basically, the lines are the ones characterizing the artists whereas the labels are the ones we specify manually when we call the plot. And we repeat the same step for the second axis and we denote the results with lines2 and labels2. And we finally create a legend object by applying it to the second axis as follows and then we plot the result and it looks a lot better, right?
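The combined legend could be sketched along these lines. The variable names for the returned handles and labels are a choice made here, the title "Country" mirrors the earlier single-axes plot rather than anything stated for this step, and the data values are made-up placeholders with assumed column names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: illustrative values only, not the real gapminder figures.
years = [1992, 1997, 2002, 2007]
df_usa = pd.DataFrame({"year": years, "gdp": [32000.0, 35800.0, 39100.0, 43000.0]})
df_china = pd.DataFrame({"year": years, "gdp": [1700.0, 2300.0, 3100.0, 5000.0]})

fig, ax = plt.subplots(figsize=(12, 10))
ax.plot(df_usa["year"], df_usa["gdp"], color="blue", label="USA")
ax2 = ax.twinx()
ax2.plot(df_china["year"], df_china["gdp"], color="red", label="China")

# Each axes only knows about its own artists, so collect
# the handles and labels from both axes separately.
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()

# Build one legend from the concatenated lists and attach it to the second axes.
legend = ax2.legend(lines + lines2, labels + labels2, title="Country")
```

Calling ax2.legend() with no arguments would list only the China line, which is why the handles from both axes are concatenated first.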
Alternatively, instead of a legend, we can customize the two y-axes ticks and values with the color used to identify the lines. This is easily achievable using the tick_params method applied to the axes object. Now tick_params requires two arguments, the axis to which the parameters are applied, in this case the y-axis denoted by the string y and a series of arguments that controls the color of the ticks.
So basically what we are going to do now is replicate the previous plot. On the first axes object we call tick_params with the color equal to blue, and we repeat the same step for the second axis, changing the color to red. We can also control the colors of the corresponding y labels, so when we set the y labels, we can specify the color red for China, so we put the color equal to r, and for the US we can put color equal to blue.
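As a hedged sketch of this color-coding step: the snippet below uses the colors keyword of tick_params, which sets both the tick marks and the tick labels. The exact keyword used in the lecture is not clear from the transcript, so this is an assumption, as are the placeholder data values and the column names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: illustrative values only, not the real gapminder figures.
years = [1992, 1997, 2002, 2007]
df_usa = pd.DataFrame({"year": years, "gdp": [32000.0, 35800.0, 39100.0, 43000.0]})
df_china = pd.DataFrame({"year": years, "gdp": [1700.0, 2300.0, 3100.0, 5000.0]})

fig, ax = plt.subplots(figsize=(12, 10))
ax.plot(df_usa["year"], df_usa["gdp"], color="blue")
ax2 = ax.twinx()
ax2.plot(df_china["year"], df_china["gdp"], color="red")

# Color the y-axis ticks and tick labels to match each line.
# colors= affects both ticks and labels (an assumed choice, see above).
ax.tick_params(axis="y", colors="blue")
ax2.tick_params(axis="y", colors="red")

# Color the y-axis labels the same way; "r" is Matplotlib shorthand for red.
ax.set_xlabel("Year")
ax.set_ylabel("GDP per capita (USA)", color="blue")
ax2.set_ylabel("GDP per capita (China)", color="r")
```

With the left axis in blue and the right axis in red, a reader can match each line to its scale without a legend.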
So in this lecture, we have covered multiple plots and we've understood when it is useful to use the twinx method. In the next lecture, we are going to add another important skill to our visualization toolkit which is the annotation of text in a plot. So when you are ready continue on to the next lecture and I will see you there.
Lectures
Course Introduction - Introduction to Matplotlib - Customization in Matplotlib - Annotating Text with Matplotlib - Advanced Customization in Matplotlib - Different Plot Types in Matplotlib - Conclusion
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.