Data Visualization using Matplotlib
This course will guide you through the main techniques used to visualize data with the Matplotlib Python library.
In this course, we will explore the main functionalities of Matplotlib: we will look at how to customize Matplotlib objects, how to use various plotting techniques, and finally, we will focus on how to communicate results.
If you have any feedback related to this course, feel free to contact us at firstname.lastname@example.org.
- Learn the fundamentals of Python's Matplotlib library and its main features
- Customize objects in Matplotlib
- Create multiple plots in Matplotlib
- Customize plots in Matplotlib (annotations, labels, linestyles, colors, etc.)
- Understand the different plot types available
- Data scientists
- Anyone looking to create plots and visualize data in Matplotlib
To get the most out of this course, you should already be familiar with using Python, for which you can take our Introduction to Python learning path. Knowledge of Python's Pandas library would also be beneficial and you might want to take our courses Working with Pandas and Data Wrangling with Pandas before embarking on this Matplotlib course.
The data used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/data-visualization-with-python-using-matplotlib
Welcome back. In this lecture, we are going to explore a very important technique that is used to boost the readability of your plot. More specifically, we are going to add annotations to the axes object. Annotations are small pieces of text that are used to draw attention to a particular part of a plot.
So now let us import pandas as pd and matplotlib.pyplot as plt. We read the gapminder dataset, and we retain only the records for the country China, as shown here. We're going to use the time series of births per woman in China, also called the total fertility rate. In our dataset, it is stored in the column fertility. The TFR, or total fertility rate, is used in demography to indicate the average number of children that would be born to a woman over her lifetime if she were to experience the exact current age-specific fertility rates throughout her lifetime.
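As a minimal sketch of the filtering step just described: the snippet below builds a tiny stand-in DataFrame instead of reading the real gapminder file, so the numbers are placeholders only. The column names year and fertility come from the lecture; the column name country and the second country label are assumptions made for illustration.

```python
import pandas as pd

# Placeholder rows standing in for the gapminder file (illustrative values only).
# With the real data you would read the file from the course repository instead.
df = pd.DataFrame(
    {
        "country": ["China", "China", "Other", "Other"],  # "country" is an assumed column name
        "year": [1969, 1979, 1969, 1979],
        "fertility": [6.0, 3.0, 4.0, 3.5],
    }
)

# Keep only the records for China.
df_china = df[df["country"] == "China"]
print(df_china[["year", "fertility"]])
```

The boolean mask keeps the original row index, which is fine for plotting year against fertility.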
Before diving into the technicalities, let's create a wrapper that draws the artist onto the axes object. For simplicity, I will paste here the method I have created for you called getting series plot. This method wraps everything we need to plot the desired output, so that we do not need to write redundant code again later. It takes several arguments.
So x is the series object we wish to plot on the x-axis, and y is the one for the y-axis. Then come the axes object, the xlabel, and the ylabel. Xticks_grid is set to zero by default; if it's greater than zero, a custom grid is produced. Color is the color we wish to apply to our series. Plot_label is None by default; if passed, it's the label that identifies the series. Title is None by default, but if passed, the title is shown. And finally there is the marker argument, which is None by default; if passed, markers are shown in the plot.
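The lecture's own wrapper is not reproduced in this transcript, so the following is only a sketch of what such a helper could look like. The function name plot_series, the default color, and the way xticks_grid is interpreted (one x tick every xticks_grid units) are assumptions; the argument list follows the description above. The three data points are rough values taken from the narration, not the real series.

```python
import matplotlib.pyplot as plt


def plot_series(x, y, ax, xlabel, ylabel, xticks_grid=0, color="tab:blue",
                plot_label=None, title=None, marker=None):
    """Hypothetical stand-in for the lecture's wrapper: draw one series on `ax`."""
    ax.plot(x, y, color=color, label=plot_label, marker=marker)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if xticks_grid > 0:
        # Assumption: a positive value means "one x tick every `xticks_grid` units".
        ax.set_xticks(list(range(int(min(x)), int(max(x)) + 1, xticks_grid)))
    if plot_label is not None:
        ax.legend()  # show the series label only when one was passed
    if title is not None:
        ax.set_title(title)
    return ax


# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, ax = plt.subplots(figsize=(12, 10))
ax = plot_series(years, tfr, ax, "Year", "TFR", xticks_grid=6, color="red",
                 plot_label="China",
                 title="Evolution of the total fertility rate in China")
```

Returning the axes object lets the caller keep customizing the plot, for example by adding annotations afterwards.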
So let's create a figure and an axes object with the plt.subplots method, and we also specify the figsize as 12 by 10. We now call the custom method and assign the result to the object named axes. In particular, we require x to be the column year of the DataFrame df_china, and y to be the column fertility of df_china. Then we specify the axes, the xlabel as Year, and the ylabel as TFR. We then set xticks_grid equal to six, the color equal to red, and the plot label equal to the string China. And finally, the title.
In our case, the title is the evolution of the total fertility rate in China. Finally, we display the figure with plt.show and we get this output. Now we can obviously improve the readability of the plot by calling the grid function on the axes and setting it to True. And now we have the grid lines. When presenting this chart, you might want to focus attention on a particular aspect of the data. Let's say that we'd like to show people why the TFR dropped significantly at the beginning of the 1970s.
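The same figure can be sketched with plain Matplotlib calls, without the wrapper. The data points below are rough stand-ins taken from the narration rather than the real gapminder series; the figure size, labels, color, title, and the grid call follow the lecture.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red", label="China")
axes.set_xlabel("Year")
axes.set_ylabel("TFR")
axes.set_title("Evolution of the total fertility rate in China")
axes.legend()
axes.grid(True)  # switch the grid lines on
```

Calling plt.show() afterwards displays the figure in an interactive session.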
So as a bit of background info, the Chinese government introduced the two-child policy at that time which imposed a limit of two children per family. The fertility rate at the end of the '60s was approximately 6 children per woman. But by the end of the '70s, this rate fell to approximately 3 children per woman. And at that time, the government imposed a one-child policy over the whole country. This policy was abolished in October 2015.
So we'd like to draw attention to the aforementioned events by drawing an arrow that points to the interesting part of the plot, and adding a text box to explain the event. Let's focus on the date on which the Chinese government introduced the two-child policy. We use the annotate method on the axes object. So we call axes.annotate and we pass a string containing the text we wish to show in the plot. Two-children policy, that's the text we want. Then we pass the argument xy, which is a pair of coordinates in the plot.
So in our case, we want to draw attention to the event that happened in 1970 with the y value equal to two. So let's have a look. Now, this doesn't look very good, does it? The text looks like it is related to the trough in 1980, but in reality we want to focus on the 1970s. So what can we do here? Perhaps we can move it somewhere else. So to do that, the annotate method takes an option called xytext and that selects the xy position of the text.
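A minimal sketch of this first attempt, where only the text and xy are passed, so the text is drawn at the annotated point itself. The data points and the axis limits are assumptions: the points are rough values from the narration, and the limits are set by hand so that the annotated coordinates fall inside the axes, as they would with the full series.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red")
axes.set_xlim(1960, 1990)  # assumed limits, chosen so the annotation is inside the axes
axes.set_ylim(0, 7)

# With no xytext, the text is placed at the xy point.
note = axes.annotate("Two-children policy", xy=(1970, 2))
```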
So basically, we go back to the annotate call above and we pass the xytext argument, set here to 1964 and two. In practice, xytext is where the text box is drawn, while xy remains the point the annotation refers to. That looks much better now, but the problem is that it is not clear which point of the plot the text is related to.
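The distinction between xy and xytext can be sketched as follows. The coordinates are the ones mentioned in the lecture; the data points and axis limits are stand-in assumptions, as the real series is not reproduced here.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red")
axes.set_xlim(1960, 1990)  # assumed limits, chosen so the annotation is inside the axes
axes.set_ylim(0, 7)

# xy is the point being annotated; xytext is where the text itself is drawn.
note = axes.annotate("Two-children policy", xy=(1970, 2), xytext=(1964, 2))
```

Both coordinate pairs are interpreted in data coordinates by default.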
So now let's add an arrow that connects the text and the data point we want to highlight. The argument that controls this is arrowprops, which stands for arrow properties. It takes a dictionary that defines the properties of the arrow we would like to draw. If we pass an empty dictionary, a default arrow is displayed. This doesn't look correct yet: the arrow should point at the curve in 1970, so I need to change the y value in xy to 5.5.
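A sketch of the default arrow, with the annotated point moved to 5.5 as described above. Passing an empty dictionary is enough for Matplotlib to draw an arrow from the text to the xy point. The data points and axis limits are stand-in assumptions.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red")
axes.set_xlim(1960, 1990)  # assumed limits, chosen so the annotation is inside the axes
axes.set_ylim(0, 7)

# An empty arrowprops dictionary requests an arrow with default settings.
note = axes.annotate(
    "Two-children policy",
    xy=(1970, 5.5),     # the point on the curve the arrow points to
    xytext=(1964, 2),   # where the text is drawn
    arrowprops={},
)
```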
So now that looks much better, but obviously we can customize it by filling the dictionary with fields such as arrowstyle, color, and linestyle. Let's say we wish to draw an arrow that has the following properties: an arrowstyle that gives a transparent arrow, a color equal to green, and a linestyle that is dashed. So it's now obvious we are pointing to the year 1970.
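A sketch of a customized arrow. The color and the dashed linestyle follow the lecture; the exact arrowstyle used on screen is not stated in the transcript, so the "->" value below is an assumption (an open, unfilled arrow head). The data points and axis limits are stand-in assumptions as well.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red")
axes.set_xlim(1960, 1990)  # assumed limits, chosen so the annotation is inside the axes
axes.set_ylim(0, 7)

note = axes.annotate(
    "Two-children policy",
    xy=(1970, 5.5),
    xytext=(1964, 2),
    arrowprops={
        "arrowstyle": "->",     # assumed style: open arrow head
        "color": "green",
        "linestyle": "dashed",
    },
)
```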
Now we can also enrich the plot by adding the historical information about the introduction of the one-child policy in China, which happened in 1979. To do that, I'm just going to copy and paste the annotate call we have just used. I'm going to put in the one-child policy text, and I also change the point at which this occurs, so the x value is 1979. The xytext is 1982 and four. Let's also change the color, like so, and there we have it. In fact, we can improve the readability here, because the point is not at 2.5 but rather at something like 2.8. And there, now it looks much better.
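Putting both annotations together could look like the sketch below. The coordinates follow the lecture; the arrowstyle and the second arrow's color are assumptions, since the transcript does not say which values were chosen. The data points and axis limits are stand-in assumptions.

```python
import matplotlib.pyplot as plt

# Rough stand-in points taken from the narration, not the real gapminder series.
years = [1968, 1970, 1979]
tfr = [6.0, 5.5, 2.8]

fig, axes = plt.subplots(figsize=(12, 10))
axes.plot(years, tfr, color="red")
axes.set_xlim(1960, 1990)  # assumed limits, chosen so the annotations are inside the axes
axes.set_ylim(0, 7)

# First event: the two-child policy.
axes.annotate(
    "Two-children policy",
    xy=(1970, 5.5),
    xytext=(1964, 2),
    arrowprops={"arrowstyle": "->", "color": "green", "linestyle": "dashed"},
)

# Second event: the one-child policy, with a different (assumed) arrow color.
axes.annotate(
    "One-child policy",
    xy=(1979, 2.8),
    xytext=(1982, 4),
    arrowprops={"arrowstyle": "->", "color": "blue", "linestyle": "dashed"},
)
```

Each annotate call is independent, so further events can be added the same way.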
Okay, so you've now arrived at the end of this lecture. You've learned an important tool that will definitely improve the way in which you present your results. In the next lecture, we are going to cover a more sophisticated topic related to settings and styles in Matplotlib. So I'll see you there.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.