This course guides you through the main techniques used to visualize data with the Matplotlib Python library.
In this course, we will explore the main functionalities of Matplotlib: we will look at how to customize Matplotlib objects, how to use various plotting techniques, and finally, we will focus on how to communicate results.
If you have any feedback related to this course, feel free to contact us at support@cloudacademy.com.
Learning Objectives
- Learn the fundamentals of Python's Matplotlib library and its main features
- Customize objects in Matplotlib
- Create multiple plots in Matplotlib
- Customize plots in Matplotlib (annotations, labels, line styles, colors, etc.)
- Understand the different plot types available
Intended Audience
- Data scientists
- Anyone looking to create plots and visualize data in Matplotlib
Prerequisites
To get the most out of this course, you should already be familiar with using Python, for which you can take our Introduction to Python learning path. Knowledge of Python's Pandas library would also be beneficial and you might want to take our courses Working with Pandas and Data Wrangling with Pandas before embarking on this Matplotlib course.
Resources
The data used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/data-visualization-with-python-using-matplotlib
Welcome back. We now move on to the fundamental concepts of data visualization with Python. Python is a great language with many libraries dedicated to data science, and some of them belong in the requirements of almost any data science project. One of them is Matplotlib. Matplotlib was one of the first data visualization packages created for Python, and its main advantage is the flexibility it offers in customizing plots.
Before diving into the main commands of Matplotlib it is important to give you a little bit of context and history about this package. Matplotlib was originally developed by John Hunter in 2002 as a patch to IPython to enable interactive MATLAB-style plots from the IPython command line. It became a separate library in 2003, and nowadays it's the default reference for visualization in Python.
In early 2017, Matplotlib had a major release, version 2.0, which gave the package a graphical restyling and better style configurations, such as the one inspired by ggplot, an R library for data visualization. That release also improved the flexibility of plot customization, making Matplotlib a cornerstone of graphical tools in Python. Indeed, several visualization tools, such as Seaborn and the plotting methods of pandas, are built on top of Matplotlib's API.
But more importantly, if you wish to customize a plot produced with those libraries, you still need to know Matplotlib's syntax. Matplotlib has two interfaces: first, a convenient MATLAB-style state-based interface, which characterized the first version, and second, a more powerful object-oriented interface.
In this course, we will mainly focus on the object-oriented interface, which has become the de facto standard way to use Matplotlib, since it gives users better control over more sophisticated plots. This interface makes it convenient to create common layouts of plots, including the enclosing figure object, in a single call. However, note that both interfaces are provided through the pyplot submodule. So now I encourage you to open a Jupyter notebook in your local environment, like the one you see here on my screen.
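To make the difference between the two interfaces concrete, here is a minimal sketch that draws the same line both ways. The x and y values are made up for illustration and are not part of the course data.

```python
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
x = [1, 2, 3]
y = [1, 4, 9]

# State-based (MATLAB-style) interface: pyplot keeps track of the
# "current" figure and axes, and every call acts on them implicitly.
plt.figure()
plt.plot(x, y)
plt.title("State-based interface")
state_ax = plt.gca()  # the axes pyplot has been drawing on

# Object-oriented interface: we hold explicit references to the
# figure and the axes, and call methods on those objects.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-oriented interface")
```

With the object-oriented form, it stays clear which axes each call modifies, which matters once a figure holds more than one plot.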
First, let's import the Matplotlib submodule pyplot with the standard alias plt, as follows. Under the object-oriented interface, creating a figure in Matplotlib is very easy. We just need to call plt.subplots, which by default returns two objects, the figure and the axes. The figure object is a sort of container that holds everything belonging to a picture: the labels, the title, the legend, and even the plot itself.
The axes object is the part of the plot that contains the actual data. In simple terms, it is the canvas on which we will draw our data. If we run this command, a figure with empty axes is produced, since no data has been plotted yet. In a Python script, remember to call the plt.show function at the end to display the resulting plot.
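The steps just described can be sketched as follows. The imports of Figure and Axes are only there to check what plt.subplots returns; they are not needed for plotting.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

# With no arguments, plt.subplots returns one figure and one axes.
fig, ax = plt.subplots()

print(isinstance(fig, Figure))  # the container for the whole picture
print(isinstance(ax, Axes))     # the area where data will be drawn
print(ax.has_data())            # nothing has been plotted yet

# In a script, call plt.show() at the end to display the figure.
```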
In a Jupyter notebook, you can avoid this call by simply adding the magic command %matplotlib inline to the importing cell. This will lead to static images of your plot embedded in the notebook. Using the command %matplotlib notebook instead will lead to interactive plots embedded within the notebook. For our purposes, let's use the static inline mode. The subplots function has many arguments that we will touch on in this course. For the moment, let's just focus on two important ones, nrows and ncols, which control the number of rows and columns in the subplot grid.
By default, both arguments are set to one, which means that the figure object will be dedicated to a single plot. However, if we modify those parameters with, say, nrows equal to two and ncols equal to one, this will produce two axes objects in a single figure. Okay, now that we have understood the framework we are going to use, that is, the Matplotlib object-oriented interface, we can get our hands dirty and start exploring the power of Matplotlib.
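As a small sketch of the nrows and ncols arguments, the call below builds a grid with two rows and one column. The names top_ax and bottom_ax are illustrative choices, not names used in the course.

```python
import matplotlib.pyplot as plt

# Two rows and one column: one figure holding two axes, stacked vertically.
fig, axes = plt.subplots(nrows=2, ncols=1)

# With more than one subplot, the second return value is a NumPy array of axes.
print(axes.shape)  # (2,)

# Unpack the array to work with each axes separately.
top_ax, bottom_ax = axes
```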
So, let's now do a little bit of exploratory data analysis (EDA) to understand the structure of the dataset. First, let's import pandas as pd and read the data using the pandas function read_csv. In particular, we are going to pick the gapminder dataset that we introduced in lecture one. You'll find the data in the GitHub repository that accompanies this course. The filename is gapminder.csv, and we store the data in a dataframe object called df. We call the info method to get high-level information about the dataset, and the result is as follows.
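The loading step can be sketched as below. So that the example runs without the course files, it reads a tiny in-memory stand-in instead of gapminder.csv. The column names (country, region, year, gdp) and all the values are placeholders assumed for illustration; the real file has eight columns whose names may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for gapminder.csv: placeholder column names and
# made-up values, with one missing gdp entry on the third row.
csv_text = """country,region,year,gdp
Italy,Europe,1964,1.0
United States,America,1964,2.0
Italy,Europe,1965,
United States,America,1965,4.0
"""

# With the real data this would be: df = pd.read_csv("gapminder.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Column names, non-null counts, and dtypes at a glance.
df.info()
```

The non-null counts reported by info are what reveal missing values: a column with fewer non-null entries than rows has gaps.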
We see that a few values are missing for some observations. We also have two out of eight variables that are not numeric, namely the region and the country. We then call the describe method to get summary statistics for the quantitative variables, as follows. Adding data to a figure is done by calling the plot method on the axes object. It plots one variable against another as lines or markers. Let's pick the time series for the United States by filtering the dataframe as follows. We assign to the object df_usa the original dataset filtered by country, where the country is the United States.
In particular, we also reset the index and drop the original one, so that the rows, each corresponding to a different year, are numbered from zero. We also display the first five rows of the dataframe. Thanks to the shape attribute, we see that we have 50 observations, and we want to understand how GDP per capita has evolved over time. We will use two columns: the gross domestic product per capita, which is measured in each observed year, and the calendar year, stored in the column year.
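The summary and filtering steps above might look like the following sketch. It again uses a small made-up stand-in for the gapminder data, so the column names (country, region, year, gdp) and the values are assumptions, and the resulting shape is not the 50 rows of the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for gapminder.csv with made-up values.
csv_text = """country,region,year,gdp
Italy,Europe,1964,1.0
United States,America,1964,2.0
Italy,Europe,1965,3.0
United States,America,1965,4.0
United States,America,1966,5.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics for the numeric columns.
print(df.describe())

# Keep only the rows for one country, then renumber the rows from zero,
# discarding the original index.
df_usa = df[df["country"] == "United States"].reset_index(drop=True)

print(df_usa.head())  # first five rows (here there are only three)
print(df_usa.shape)   # (3, 4)
```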
So how can we visualize it? Well, by using the aforementioned plot method on the axes object. Please note that when calling the plot method, we must pay attention to the order in which the columns are specified: first the variable we wish to have on the x-axis, and second the one on the y-axis. The following snippet produces the desired output. We call the plot method on the axes object, and we pass the year column as the x values and the GDP column as the y values. The insight we get is that the trend is now much clearer than when reading the raw data in the dataframe.
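A minimal sketch of that plotting call is shown below. The small df_usa dataframe is built by hand with made-up values, and the column name gdp is a placeholder for whatever the GDP per capita column is called in the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for the filtered dataframe (made-up values).
df_usa = pd.DataFrame({"year": [1964, 1965, 1966], "gdp": [2.0, 4.0, 5.0]})

fig, ax = plt.subplots()

# Order matters: x values first, then y values.
ax.plot(df_usa["year"], df_usa["gdp"])
```

Swapping the two arguments would still draw a line, but with the year on the vertical axis, which is rarely what a time series plot needs.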
Just a quick remark. Technically speaking, in Matplotlib everything that is shown in the figure is called an artist. Therefore, the axes object and everything that is inside it is technically an artist. So you'll hear me using the term artist at various points throughout this course. Another important argument of the subplots method is figsize. This argument allows users to change the figure size of the plot.
The value for the figsize parameter should be passed as a pair of numbers, for example a tuple or a list, where the first value corresponds to the width and the second to the height of the figure, both in inches. If we don't specify anything in the call, as in this case, figsize takes the default value specified in the matplotlibrc configuration file. Don't worry too much about this file at this stage of your learning process. We'll dive into that later.
So what is the default value for this parameter? In practice, to get the default value, we call the following snippet. We'll come back to this syntax later, so don't worry too much about it for now. In the notebook shown here, it is equal to six and four inches for width and height, respectively; note that Matplotlib's own default since version 2.0 is 6.4 by 4.8 inches, and a notebook backend can override it. So, for instance, we can set figsize equal to 10 and eight. And now the plot is larger and easier to read.
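One way to look up the default and then override it is sketched below. Reading plt.rcParams is an assumption about what the snippet on screen does; the printed default depends on your Matplotlib version and environment, so no value is asserted for it here.

```python
import matplotlib.pyplot as plt

# Default figure size (width, height) in inches, as currently configured.
default_figsize = plt.rcParams["figure.figsize"]
print(default_figsize)

# Override the default for a single figure: 10 inches wide, 8 inches high.
fig, ax = plt.subplots(figsize=(10, 8))
print(fig.get_size_inches())
```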
Another argument worth mentioning is dpi, or dots per inch, which controls the resolution of the plot. Its default was 80 in older versions of Matplotlib and has been 100 since version 2.0. This argument is extremely useful if you want to express the output size in pixels. Indeed, a figure with a figsize of width by height has a horizontal size in pixels, px, equal to width times dpi, and a vertical size in pixels, py, equal to height times dpi. So the following snippet produces an output of 640 by 480 pixels.
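The pixel arithmetic can be checked with a short sketch. The figsize of 8 by 6 inches and the dpi of 80 are assumed values chosen because they give 640 by 480 pixels; the snippet used in the lecture is not shown in the transcript and may use different numbers.

```python
import matplotlib.pyplot as plt

# Assumed values: 8 x 6 inches at 80 dots per inch.
width_in, height_in, dpi = 8, 6, 80

fig, ax = plt.subplots(figsize=(width_in, height_in), dpi=dpi)

# Size in pixels = size in inches * dots per inch.
px = width_in * dpi
py = height_in * dpi
print(px, py)  # 640 480
```

The Matplotlib 2.0+ defaults of 6.4 by 4.8 inches at 100 dpi lead to the same 640 by 480 pixels.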
Okay, so looking at this Matplotlib figure, what is the main issue here? Well, this plot is not ready to be shared. Indeed, the end user has no way of knowing that we are plotting GDP per capita. So we need to customize the plot. In the next lecture, we're going to go through different techniques to improve the readability of the plot. So I'll see you there.
Lectures
- Course Introduction
- Customization in Matplotlib
- Multiple Plots in Matplotlib
- Annotating Text with Matplotlib
- Advanced Customization in Matplotlib
- Different Plot Types in Matplotlib
- Conclusion
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.