Data Visualization using Matplotlib
This course will guide you through the main techniques used to visualize data with the Matplotlib Python library.
In this course, we will explore the main functionalities of Matplotlib: we will look at how to customize Matplotlib objects, how to use various plotting techniques, and finally, we will focus on how to communicate results.
If you have any feedback related to this course, feel free to contact us at firstname.lastname@example.org.
- Learn the fundamentals of Python's Matplotlib library and its main features
- Customize objects in Matplotlib
- Create multiple plots in Matplotlib
- Customize plots in Matplotlib (annotations, labels, linestyles, colors, etc.)
- Understand the different plot types available
- Data scientists
- Anyone looking to create plots and visualize data in Matplotlib
To get the most out of this course, you should already be familiar with using Python, for which you can take our Introduction to Python learning path. Knowledge of Python's Pandas library would also be beneficial and you might want to take our courses Working with Pandas and Data Wrangling with Pandas before embarking on this Matplotlib course.
The data used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/data-visualization-with-python-using-matplotlib
Welcome back. In this lecture, we're going to explore different visualization methods that come with the pyplot submodule available in Matplotlib. So far, we've focused on the plot method, which is applied to the axes object. However, there are many other plot types available in Matplotlib. So it's worth having a look at them to understand their pros and cons.
So let's get going. The first family of plots that I'd like to investigate is bar charts. We're going to use a new dataset here, which consists of the FIFA20 football players stats. This data is available in the GitHub repo for this course. Now we perform the usual imports and we define the df object that reads the fifa20 players stats.csv file, and we print the first five rows.
Now we get a warning, but don't worry too much: it relates to a single column of the dataset, namely the column with index 74, and we can simply leave that column out. When reading the file, we use the usecols argument and say we're only interested in the columns with indexes from zero up to, but not including, 74. Okay. So let's print the shape of this data. We have quite a big dataset here, as you can see.
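As a rough sketch of the usecols idea, the snippet below reads a tiny in-memory CSV instead of the course file. The data and the column names (name, country, overall, notes) are made up for illustration; only pd.read_csv and its usecols argument come from pandas itself.

```python
import io

import pandas as pd

# Tiny stand-in for the course CSV (hypothetical data, not the FIFA20 file).
# The last column plays the role of the problematic column we want to skip.
csv_text = """name,country,overall,notes
Player A,England,80,x
Player B,Spain,75,
Player C,Italy,78,7
"""

# usecols takes the positions of the columns to keep:
# list(range(0, 3)) keeps positions 0, 1 and 2 and leaves out position 3.
df = pd.read_csv(io.StringIO(csv_text), usecols=list(range(0, 3)))

print(df.head())  # first rows of the dataframe
print(df.shape)   # (number of rows, number of columns)
```

With the real file, the same pattern would apply with the range running up to 74.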
Now, we might be interested, for instance, in understanding the distribution of nationalities among the players. Which country has the highest number of football players? To answer that, we use the groupby method applied to the country column. We take the dataframe and retain just two columns, the country and the name of the player, apply a groupby on country to that dataframe, and then count the occurrences. We then assign this object to count country.
Now, a simple inspection of the first five rows shows us the result. And we can give the column a better name, say number of players. So we set columns equal to number of players. And we can also sort the values. So what we do is create another object called count country sorted, by applying the sort values function to the count country dataframe. We sort by the column number of players, and we sort the countries from the highest to the lowest number of players.
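The grouping, renaming and sorting steps could look roughly like this. The miniature dataframe and the snake_case names (count_country, number_of_players) are assumptions made for the sketch; the lecture applies the same calls to the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the players table; column names are assumptions.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "country": ["England", "Spain", "England", "Italy", "England", "Spain"],
})

# Keep two columns, group by country, and count the players in each group.
count_country = df[["country", "name"]].groupby("country").count()

# Give the single count column a more descriptive name.
count_country.columns = ["number_of_players"]

# Sort from the highest to the lowest number of players.
count_country_sorted = count_country.sort_values(
    by="number_of_players", ascending=False
)

# head(10) keeps at most the first ten rows of the sorted table.
print(count_country_sorted.head(10))
```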
So ascending is equal to false. A simple inspection of the first 10 rows shows the top 10 countries by number of players in FIFA20. So let's assign this to the object top 10. To visualize this set of data, we use the bar method on the axes object. Each row of our dataset is represented as a solid bar whose height is the number in that row, in this case the count of players by nationality.
So we create the figure and axes objects with plt.subplots and a figsize of 10 and eight. And we apply the bar method to the axes, passing the top 10 index and the number of players column, with color equal to green. We then set the Y label to the number of players by country, or better yet, for each country. And finally we call plt.show.
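A minimal sketch of the bar chart described above, using three made-up rows in place of the real top 10. The Agg backend line is only there so the snippet also runs without a display; in an interactive session you would drop it and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the top 10 table (index = country).
top_10 = pd.DataFrame(
    {"number_of_players": [3, 2, 1]},
    index=["England", "Spain", "Italy"],
)

fig, ax = plt.subplots(figsize=(10, 8))

# One bar per row: x is the country, height is the player count.
bars = ax.bar(top_10.index, top_10["number_of_players"], color="green")
ax.set_ylabel("Number of players for each country")
```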
Now here you can see that the X labels are fine. That is, they're not overlapping each other. But what if we had a hundred countries? Let's change top 10 to top 100. Now you can see that the X labels are completely unreadable. Luckily, we can take care of this with the set_xticklabels method of the axes, which takes top 100.index, and we also set the rotation argument to 90. Ah, sorry, there's an error here: I typed an extra S in the method name. Okay, so maybe 100 is too many, so let's put it to 50. And now you can see it's much better.
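One way to rotate the labels is sketched below with 50 made-up countries. Unlike the lecture, this version places the bars at numeric positions and fixes the tick positions with set_xticks before calling set_xticklabels, which keeps the labels and ticks aligned; that choice is an assumption of this sketch, not something shown in the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts for 50 made-up countries (not the real FIFA20 figures).
labels = [f"Country {i}" for i in range(50)]
top_50 = pd.DataFrame(
    {"number_of_players": list(range(50, 0, -1))}, index=labels
)

fig, ax = plt.subplots(figsize=(10, 8))
positions = list(range(len(top_50)))
ax.bar(positions, top_50["number_of_players"], color="green")

# Fix the tick positions first, then attach one rotated label per tick.
ax.set_xticks(positions)
tick_labels = ax.set_xticklabels(list(top_50.index), rotation=90)
ax.set_ylabel("Number of players for each country")
```

An alternative that leaves the label text alone is ax.tick_params(axis="x", labelrotation=90).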
Now the next family of plots I want to look at with you is histograms. Now histograms are useful for showing the distribution of values within a variable, say the overall skills profile among all players. One question that we might ask is, is the overall distribution centered around the mean, so that is, can we say that the distribution follows a standard normal distribution?
In statistics, a standard normal, or Gaussian, random variable X is one with a mean of zero and a standard deviation equal to one, described by the standard normal probability density function, with x ranging over the real numbers. In general, this is a good benchmark for checking whether the distribution of a variable is centered around its mean. So let's create a function that standardizes the raw data, that is, rescales it to have a mean of zero and a standard deviation of one, as follows. Note that this rescaling does not change the shape of the distribution; it only makes it comparable with the standard Gaussian.
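For reference, the formula itself does not appear in the transcript. The textbook form of the standard normal density, together with the standardization applied to each observation x (with sample mean \mu and sample standard deviation \sigma), is:

```latex
f(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right), \qquad x \in \mathbb{R}

z = \frac{x - \mu}{\sigma}
```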
We basically subtract the mean from each observation and divide by the standard deviation. We also remove the null values. So we create a function called normalized data, which ingests a pandas dataframe, say df. Inside it, we create an object called normalized df: we take df in position overall, minus the mean overall skill, computed by applying the mean function to df in position overall. We then divide that quantity by the standard deviation of the overall skill. We then filter out all the null values, using the isna method on normalized df to identify them. We also specify that the resulting series is going to be a pandas dataframe. And then finally we return normalized df.
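A sketch of what such a function might look like is given below. The column name overall and the use of a negated isna mask to drop nulls are assumptions; the lecture's exact code is not in the transcript.

```python
import numpy as np
import pandas as pd


def normalized_data(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the 'overall' column: zero mean, unit standard deviation."""
    overall = df["overall"]
    # mean() and std() skip missing values by default.
    normalized = (overall - overall.mean()) / overall.std()
    # isna() flags the nulls; negating the mask keeps the valid rows.
    normalized = normalized[~normalized.isna()]
    return pd.DataFrame(normalized)


# Hypothetical data with one missing value.
df = pd.DataFrame({"overall": [60, 70, np.nan, 80, 90]})
normalized_df = normalized_data(df)
print(normalized_df)
```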
So we now plot the histogram of the normalized overall skill data. Let's first create a normalized dataframe by calling the normalized data function we just created. Basically, we apply it to df and assign the result to the variable normalized df. We create figure and axes objects with a figsize of 10 and eight. And then we apply the hist function with the argument normalized df in position overall. And we assign a label to it.
So here we'll call it overall skill. And we set the X label. So it's equal to the normalized overall skill. And we set the Y label as the number of observations. We then call plt.show. So this is the normalized distribution of the overall skill observed in all the players recorded in FIFA20. So this looks pretty normal, doesn't it? But we can drill down for more details. And so a natural question now arises, how does Matplotlib decide to divide the data up into different bars? Well, this is controlled by the bins argument.
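To see the effect of the bins argument, here is a minimal sketch on synthetic data: 5,000 draws from a standard normal stand in for the normalized overall skill. The sample is made up; only the hist call and its bins argument mirror the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the normalized overall skill (not the real data).
rng = np.random.default_rng(0)
normalized_df = pd.DataFrame({"overall": rng.standard_normal(5000)})

fig, ax = plt.subplots(figsize=(10, 8))

# bins controls how many equal-width intervals the data is split into.
counts, edges, _ = ax.hist(
    normalized_df["overall"], bins=70, label="Overall skill"
)
ax.set_xlabel("Normalized overall skill")
ax.set_ylabel("Number of observations")
```

hist returns the per-bin counts and the bin edges, so 70 bins give 70 counts and 71 edges.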
By default, bins is 10, but we can set it to, say, 70. This way the Gaussian shape is more emphasized. We can go further and ask the following question: does the overall skill vary across countries? So let's pick three countries, for example, England, Spain and Italy. What is their distribution with respect to the overall skill? For simplicity, let's consider three different dataframes as follows.
So I define the England object as the dataframe filtered to the rows where country is equal to England. And we drop the old index. And now we do the same for Italy and for Spain as well. After that, we call normalized data on each of them. So we define England norm as normalized data applied to the England dataset. And as before, we repeat the same logic for Italy and Spain.
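The per-country filtering could be sketched as follows. The data is made up, and reading "drop the index" as reset_index(drop=True) is an assumption; normalized_data is redefined here so the snippet stands on its own.

```python
import pandas as pd


def normalized_data(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the 'overall' column and drop missing values."""
    overall = df["overall"]
    normalized = (overall - overall.mean()) / overall.std()
    return pd.DataFrame(normalized[~normalized.isna()])


# Hypothetical players; column names are assumptions.
df = pd.DataFrame({
    "country": ["England", "Spain", "Italy", "England",
                "Italy", "Spain", "England", "Italy"],
    "overall": [70, 60, 65, 80, 85, 75, 90, 72],
})

# Boolean filter per country, then discard the old row labels.
england = df[df["country"] == "England"].reset_index(drop=True)
italy = df[df["country"] == "Italy"].reset_index(drop=True)
spain = df[df["country"] == "Spain"].reset_index(drop=True)

# Each country is standardized against its own mean and standard deviation.
england_norm = normalized_data(england)
italy_norm = normalized_data(italy)
spain_norm = normalized_data(spain)
```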
Now we plot the three histograms on the same axes object. So I'm going to paste the above snippet here, but instead of plotting normalized df, we plot England, Italy and Spain. And we also need to change the labels. So here it's going to be England, and the same for Italy and Spain. And we can also add a legend as well. Like so.
So looking at this, it's not really very clear what's going on, right? So it's worth drawing three distinct histograms on three distinct axes, by setting nrows equal to three and ncols equal to one in the plt.subplots call. Each histogram is then drawn on its own axes, so we need to index the axes object correctly. And we're not going to use any legend for the moment. And then we also use the index when setting the Y label and the X label.
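A sketch of the three stacked histograms is shown below on synthetic samples. The lecture indexes the axes by hand; this version loops over them with enumerate, which is simply a shorter way of writing the same indexing.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three normalized samples (not the real data).
rng = np.random.default_rng(0)
samples = {
    "England": rng.standard_normal(500),
    "Italy": rng.standard_normal(300),
    "Spain": rng.standard_normal(300),
}

# nrows=3, ncols=1 returns an array with one axes object per row.
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(10, 8))

for i, (country, values) in enumerate(samples.items()):
    ax[i].hist(values, label=country)
    ax[i].set_xlabel(f"Normalized overall skill ({country})")
    ax[i].set_ylabel("Number of observations")
```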
Okay, so now looking at this we can infer that England looks quite normal, but for Spain and Italy, the distributions do not look normal. However, note that we can also deal with the overlapping on a single axes. What that means is we can eliminate the occlusion by changing the type of histogram that is used. This is controlled by the argument histtype, which is equal to bar by default. So instead of indexing the axes, we set nrows equal to one, and we set the histtype of the hist call to step. This will show thin lines instead of solid bars so that we avoid occlusion of data.
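The step variant might look like the following, again on synthetic samples that stand in for the three countries. Only the histtype argument and the legend call are the point here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three normalized samples (not the real data).
rng = np.random.default_rng(1)
samples = {
    "England": rng.standard_normal(500),
    "Italy": rng.standard_normal(300),
    "Spain": rng.standard_normal(300),
}

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 8))

# histtype="step" draws only the outline, so the histograms do not hide
# each other when they share the same axes.
for country, values in samples.items():
    ax.hist(values, histtype="step", label=country)

ax.set_xlabel("Normalized overall skill")
ax.set_ylabel("Number of observations")
ax.legend()
```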
So now we repeat the same for Italy and Spain. And we plot the legend again for readability. So here, the graph is now a lot more understandable. Even though the distributions are plotted all together in the same plot, we can still obtain some information from this. So this concludes the lecture on the different plot types in Matplotlib. When you are ready, continue on to the next lecture where we will wrap up the concepts we have covered in this course.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.