In this Course, we cover Python visualization libraries and tools, focusing particularly on Matplotlib and the Seaborn plotting library. You will learn how to use these to visualize your data using Python in a clear and effective way. We will go into depth particularly on Seaborn, and you'll learn about the different plots available, including regression plots, pairplots, and heat maps.
If you have any feedback relating to this Course, feel free to let us know at firstname.lastname@example.org.
- Use Matplotlib to create plots to represent data, and format the plots
- Add information to plots such as labels, titles, legends, etc.
- Get acquainted with the Seaborn plotting library
- Learn how to plot data using Seaborn in a variety of different plots
This Course is intended for data scientists, data engineers, or anybody interested in learning how to use Python tools to visualize data.
To get the most out of this course, you should be familiar with the basics of programming: variables, scope, functions.
The dataset(s) used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/practical-data-science-python
So we're going to look at Seaborn now. Seaborn is a very nice plotting library, and it allows us to make very professional-looking plots very easily. To start off with, I'm going to set a few things: I'm going to set the style to whitegrid, and I'm also going to run despine, because I think that makes things look a little bit nicer. I'm also updating the default parameters, and we'll see exactly what this is going to do in a minute.
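As a rough sketch of that setup step: the lecture does not show which default parameters are updated, so the figure size and font size below are assumptions chosen only for illustration. Note that despine acts on axes that already exist, so this sketch calls it after drawing something.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# White background with grid lines for every plot created from here on.
sns.set_style("whitegrid")

# Assumed overrides of the matplotlib defaults (the actual values are not given in the lecture).
plt.rcParams.update({"figure.figsize": (10, 6), "font.size": 12})

# Draw a placeholder line so there is an axes object to work with.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Remove the top and right borders (spines) from the axes.
sns.despine(ax=ax)
```

Calling sns.despine() with no arguments works on the current figure, so in a notebook it is usually placed in the same cell as the plot it should tidy up.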
So to start off, I'm just going to copy some data from my data frame. I'm pulling out a sub data frame by specifying a list of columns that I would like, and then I'm calling copy, which just takes a copy of the data and puts it in a new data frame for me. As we can see, I now have a data frame containing just the age, gender, height and weight columns, plus the categorical variable final judgment.
So we're going to visualize this in a few ways. I am in fact going to drop my missing values as well, just to make life easier for me, as Seaborn isn't a fan of missing values. And this will just give us an idea of how the visualization library works. Okay, so we've dropped our NAs; let's have a look at a few plots we can generate.
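The column selection, copy, and dropping of missing values might look something like the following. The real survey data is not reproduced here, so the sketch builds a small synthetic frame; the name survey, the exact column spellings, and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the course data frame (hypothetical names and values).
rng = np.random.default_rng(0)
n = 200
survey = pd.DataFrame({
    "Age": rng.integers(15, 31, size=n).astype(float),
    "Gender": rng.choice(["female", "male"], size=n),
    "Height": rng.normal(173, 10, size=n),
    "Weight": rng.normal(66, 13, size=n),
    "Final judgement": rng.integers(1, 6, size=n),
    "Music": rng.integers(1, 6, size=n),  # an unrelated column we do not want
})
survey.loc[[3, 17, 42], "Height"] = np.nan  # plant a few missing values

# Select the columns of interest and take an independent copy.
cols = ["Age", "Gender", "Height", "Weight", "Final judgement"]
df = survey[cols].copy()

# Drop every row that has a missing value in any of the selected columns.
df = df.dropna()
print(df.shape)
```

Taking a copy means later changes to df do not touch the original frame, and pandas does not warn about modifying a slice.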
So let's start off with a distplot, which I call with sns.distplot. The purpose of a distplot is to show us a distribution, formatted nicely. So I'm going to copy and paste my figure code in here, because this will make the image large enough for it to be visible on screen. And then I'm going to call sns.distplot, and I'm just going to run that on a column of my data.
So I'm going to run that on height, and we should see we get a really nice-looking distribution curve. We get a histogram with what is specifically called the kernel density estimate curve overlaid on top; it's an estimate of the probability distribution.
So now we can see we have a histogram with our bins, and the shape of the histogram is how we can read off the distribution. We've got the distribution curve, which describes that roughly. When we have an axis that looks like this, with lots of little numbers, the area under the curve has to add up to one. So what we have are the relative frequencies of the data falling into each of these bins. You might remember this from our statistics lecture.
So we have the relative frequency on the Y axis. And we can obviously pull out any column. These plots are designed to work for a single column of data: all you're doing is visualizing a single column and getting out the probability distribution and the actual histogram for it. There are a few other parameters that we can specify.
We can specify how many bins we want to have, and we can specify whether we want to have the histogram or the kernel density estimate curve, so there are lots of things here that we can specify. So what's another plot we have at our disposal for observing distributions of data, for investigating them and picking up on things like outliers? Well, as we've already seen in previous lectures, we can use the boxplot. I can just pass it my data for this one, so data is equal to df. There we go: we've got the comparative boxplots. Now, these shouldn't necessarily be on the same axis, but let's have a look at what I can do.
I could specify the particular columns that I want to visualize the breakdown of the data for. This doesn't look too great right now, but if I wanted to look specifically at age by itself, I can say, "Okay, I want to have a look at age." So this is a boxplot of the ages, a zoomed-in boxplot. And I could say, "Okay, what if I wanted to split ages of people by gender?" Well, then I can pull out the gender column as well.
So now I've got two comparative boxplots next to each other and I can visualize these. What if I wanted to then break my plot apart by how many people believe in the final judgment? Then I could specify the hue parameter to be given by final judgment. And then what I have is a nice boxplot that shows me the breakdown between males and females when it comes to their beliefs in the final judgment. I've got an automatically generated legend showing me relative belief in the final judgment.
So what I've done is I've passed in my entire data set here. Seaborn has my data frame, and then I ask it to plot various columns in various places, so I can just give the column names as string values. Now, we also have something called notching, so here we can look at notched boxplots. These give us an idea of the confidence interval around the median and things like that. So there we have one line of code, and there you go.
So with boxplots, one of our axes needs to be numeric. Well, it helps for it to be numeric; it generally should be. We can then specify a categorical variable to split by, and we've chosen gender in this case. Then if we have another category that we want to break up our data by, we can specify that as hue.
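Putting the boxplot pieces together, a sketch under stated assumptions could look like this: the data is synthetic, and the column names "Age", "Gender" and "Final judgement" are hypothetical spellings of the columns mentioned in the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the survey data (hypothetical names and values).
rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(15, 31, size=n),
    "Gender": rng.choice(["female", "male"], size=n),
    "Final judgement": rng.integers(1, 6, size=n),
})

plt.figure(figsize=(10, 6))

# x: the categorical variable to split by, y: the numeric variable,
# hue: a second category to break each box down further.
# notch=True is passed through to matplotlib and draws a notch around each median.
ax = sns.boxplot(data=df, x="Gender", y="Age", hue="Final judgement", notch=True)
```

Dropping hue gives the two side-by-side boxes by gender, and passing only y="Age" gives the single zoomed-in boxplot of ages.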
So you've got to go X, then Y, then hue, to give it an order for how it splits up its data. Seaborn is just taking care of all of the minutiae that you usually have to specify whenever you generate a plot. And as we can see, we can interact with it using matplotlib. So I can generate a figure and overlay this onto my figure. And then, if I gave this a name, I could update different aspects of the plot. So it's still just matplotlib; it's what we call a wrapper around matplotlib.
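To illustrate the wrapper idea, here is a small sketch that draws a Seaborn boxplot onto a matplotlib axes and then adjusts it with ordinary matplotlib calls. The data and the title and label text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the survey data (hypothetical names and values).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Age": rng.integers(15, 31, size=200),
    "Gender": rng.choice(["female", "male"], size=200),
})

# Create the figure and axes with matplotlib, then tell Seaborn to draw on them.
fig, ax = plt.subplots(figsize=(10, 6))
returned = sns.boxplot(data=df, x="Gender", y="Age", ax=ax)

# Seaborn hands back the same matplotlib axes, so the usual methods apply.
ax.set_title("Age by gender")
ax.set_ylabel("Age (years)")
ax.set_ylim(10, 35)
```

Because the returned object is a plain matplotlib axes, anything you already know about titles, labels, limits and saving figures carries over unchanged.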
So now let's have a look at something called violin plots. If I want to create a violin plot, I call sns.violinplot, and I pass in my data. In reality, I can pass in all of this in pretty much exactly the same way; I just have to specify one or two parameters differently. So I want my X to be final judgment this time, I want my Y to be age, and then I want hue to be given by gender. And then I'm going to say split equals True.
So how can we interpret this plot? We have said we want final judgment on the X axis, so we've taken our data and split it up by final judgment. That's why I've got final judgment one, two, three, four, five; these are essentially groups of people. Then Y is age, so what I'm visualizing is a continuous variable on the Y axis. For this plot specifically, I've chosen to go with hue by gender, because according to this data, gender is binary. So then we split the plot by gender, because it's binary here.
So if I didn't specify split, then what I would have is lots of plots next to one another. But if you go with split equals true, then I'm sticking them along the same sort of axis. So it's better for comparison.
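A sketch of the split violin plot described here, again on synthetic data with hypothetical column names. The hue variable has exactly two levels, which is what makes a split violin sensible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the survey data (hypothetical names and values).
rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(15, 31, size=n),
    "Gender": rng.choice(["female", "male"], size=n),
    "Final judgement": rng.integers(1, 6, size=n),
})

plt.figure(figsize=(10, 6))

# One violin per final-judgement group; split=True draws each gender
# on one half of the same violin instead of as two separate violins.
ax = sns.violinplot(
    data=df, x="Final judgement", y="Age", hue="Gender", split=True
)
```

Setting split=False with the same arguments gives the side-by-side version, which is handy for seeing what the split is actually merging.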
So this library simplifies things massively, because you can stick your data in and then just pull out various columns to visualize them against one another. It's not like matplotlib, where this would take such a long time to do. With Seaborn, you just say, "Okay, put this on the X axis, put this on the Y axis, color it by this," and then you see what it looks like. And then you try splitting it, and so on and so forth. So hopefully you can see the usefulness of Seaborn from this.