Seaborn - Regression Plots, PairPlots and Heat Maps
In this course we cover Python Visualization Libraries and Tools
So, now we're going to have a look at the final few aspects of Seaborn. So, first of all, we want to look at regression plots. Now, regression plots will draw a line of best fit for us, and they will also shade in a confidence interval within our data. You can use this to initially decide whether we have a trend that we can visualize within our data. The purpose of this is for generating graphics, not for machine learning.
So, we'll take a look at this now. So, Seaborn dot regplot. I'll go with weight versus height from my data frame, and I just want to show the command here. What we have is a linear regression line drawn through an existing scatter plot. This is visualizing initially what trend we have. So, here, we'll set logx equal to True. Let's see how that goes.
Okay, great, so there you go, there's a logarithmic curve. So, with logx, the fit is y against the logarithm of x, and this is the best-fit logarithmic curve for the data. Arguably, it's probably better than the linear one when we're having a look at the outlier that we have there. Right, so, we have our logarithmic curve here, and it's somewhat limited in that it's only going to estimate that one shape. Other curves are available: regplot also has options for polynomial fits (order), lowess smoothing (lowess), robust regression (robust) and, for binary outcomes, logistic regression (logistic).
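As a minimal sketch of the two regplot calls described above, assuming a small synthetic data frame in place of the course data (the column names Height and Weight and the generated values are illustrative assumptions, not the original dataset):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the course data; names and values are assumptions.
rng = np.random.default_rng(0)
height = rng.uniform(150, 200, size=100)
weight = 0.9 * height - 90 + rng.normal(0, 8, size=100)
df = pd.DataFrame({"Height": height, "Weight": weight})

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Default: scatter plot, straight line of best fit, shaded confidence band.
sns.regplot(x="Height", y="Weight", data=df, ax=ax_lin)
ax_lin.set_title("linear fit")

# logx=True fits y against log(x); x has to be strictly positive.
sns.regplot(x="Height", y="Weight", data=df, logx=True, ax=ax_log)
ax_log.set_title("logx=True")

fig.tight_layout()
```

In a notebook you would drop the Agg line and let the figure display inline. Note that the logistic option is a different model from logx: it expects a binary y variable.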
So, this in reality is a scatter plot with a line of best fit. We can set as many of the parameters as we did before. We have x and y, we have x_bins if we want to put things into bins, and we have lots, and lots, and lots of other parameters. It all works roughly the same as before.
So, now I'm going to finish talking about correlation. What we'll have a look at now is some of the convenience plotting functions. They're very good for data exploration. So, we have things called pairplots. So, what a pairplot does is it will take our data, and it will plot every column against every other column.
So, we end up with a matrix of every column versus every other column. The diagonal by default is going to be a probability distribution. And then, by default, off the diagonal, we're going to get scatter plots. Everything against every other thing. So, it's very good for understanding, in this case, the separability of our data. If I didn't color by gender, then we would have a slightly more mundane looking plot. It would have histograms down the diagonal and then, again, scatter plots off the diagonal. So, it comes into its own when you're adding more parameters that you want to color by. So, here, we have hue set to Gender, and I don't think we want to do anything else other than that.
Now, we do have control over aspects of this plot. So, I'm going to end up with a PairGrid here. So, let's have a look at how this works, see if we can do this. So, we have a PairGrid, it's a specific type of plot. This is not technically a matplotlib object. This is a specific Seaborn object. We have a look at the grids of this.
So, I've assigned this to a variable. I'm hoping that instead of having to define a new plotting function, I can simply specify what I want my diagonal to be given by. So, instead of having a distribution now, let's see if off-diagonal does something a little bit different. If not, then I'll just generate another plot, and then we'll look at that.
So, hopefully, this is going to evaluate. It's taking longer, which is a good sign. So, you can choose what type of plot you want to have on your diagonal, and also off your diagonal. So, I see that it's done. So, it's overlaying these curves, and a KDE plot is an estimate of the probability distribution. What we're getting is our default diagonal with a distribution, and off the diagonal the scatter plots. And then, what's happening is we're overlaying on top of this a kernel density estimate formed from the data.
So, this one here exemplifies it quite well. What we have is a two-dimensional probability curve in terms of how likely something is to be blue down here, and then we have a two-dimensional probability curve of how likely we are to be male in this section here. But because obviously we're visualizing in two dimensions, we're essentially looking down on a probability curve.
So, we're getting contours because, for us to look side on, it would have to be 3D, a three-dimensional thing, which would be very tough to do. So, the contours mark the most likely portions of where our data are going to fall. So, we're specifying the number of levels, that is, the number of contour lines we potentially want to draw on our data. And this is overlaying them onto our scatter plot initially. If we wanted to define a PairGrid for this, so a PairGrid is explicitly defining what makes up one of these, I'm just getting rid of the final judgment column and then saying I want to color by gender.
So, I'm setting up in a very similar way, but then, after I set up the grid, I'm then mapping the plots over it. So, this is generating the grid, and then I'm putting extra curves on top of it, but we're utilizing quite complicated things here. So, this is what it would look like without the scatter plot underneath. So, we just get an idea of where the bulk of our data are going to be.
So, generally, whenever anyone first starts playing around with Seaborn, the first thing they do is the pairplot, 'cause it means they don't have to think about plotting individual scatter plots and things like that for their data. It just shows you everything plotted against everything else immediately.
Now, obviously the computational time that this takes increases, I'm going to say, quadratically as the number of columns in your dataset increases, because every column gets plotted against every other column. So, we have four columns, so that means we've got 16 plots here. As you keep adding columns, you're going to end up with more and more panels being drawn.
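As a quick check of how the grid grows: with n numeric columns the pairplot has n times n panels, so four columns give 16. The helper name panel_count is made up for this illustration:

```python
def panel_count(n_columns: int) -> int:
    """Number of panels in a pairplot of n_columns numeric columns."""
    return n_columns * n_columns


# Doubling the number of columns quadruples the number of panels.
for n_columns in (2, 4, 8, 16):
    print(n_columns, "columns ->", panel_count(n_columns), "panels")
```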
So, in reality, the default version of this is just pairplot. You give it some data, say what you want to split it out by, and see if you can separate out your data. I could do hue by final judgment. So, now, what you can see is that we've got it colored by final judgment score. So, it seems that pretty much everyone is the same, but you get the gist from this.
So, now, the final kind of plot I want to look at is heat maps. So, we generally tend to use a Seaborn heat map to visualize correlation between variables. So, I'm taking this subsection of the dataset, horror, thriller, comedy, romantic, sci-fi. This is just a subsection of my data again. I'm then having a look at the top of my data. And there we go, that's what my data looks like.
Now, pandas has a very helpful function for generating correlations. You simply call the corr method on your data frame, df.corr(). What this is going to do is generate a data frame containing the correlation between every variable and every other variable. So, when we talk about correlation, we talk about the relationship between two variables. So, we can say, "Yes, this is correlated with that." But correlation tends to be very specific. It has a very specific meaning. It is a description of the strength of the linear relationship between two variables.
Now, two variables can be related in a complicated way, but correlation is only going to tell you the strength of the linear relationship. So, I think of it as, if you plot your data, it describes the straight lineyness of the data. If your data are bang on a straight line, then you have highly correlated data. If your data are spread out a bit from a central straight line, your correlation is going to be lower. If there's no relationship whatsoever, then you'll have minimal correlation. But it's specifically a description of the strength of the linear relationship. So, a correlation ranges between minus one and one. Negative correlation, a negative linear relationship. Positive correlation, positive relationship.
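To make the "straight lineyness" point concrete, here is a small sketch using NumPy's corrcoef on made-up data. A perfect straight line gives a correlation of about plus or minus one, while a parabola, which is a perfectly good relationship but not a linear one, gives a correlation of about zero:

```python
import numpy as np

# Made-up data: x is symmetric around zero.
x = np.linspace(-1, 1, 201)
rng = np.random.default_rng(4)

r_line = np.corrcoef(x, 2 * x + 1)[0, 1]      # points exactly on a rising line
r_negative = np.corrcoef(x, -3 * x)[0, 1]     # points exactly on a falling line
r_noisy = np.corrcoef(x, 2 * x + rng.normal(0, 1, x.size))[0, 1]  # spread around a line
r_parabola = np.corrcoef(x, x ** 2)[0, 1]     # related, but not linearly

print(f"straight line:  {r_line:.3f}")
print(f"negative line:  {r_negative:.3f}")
print(f"noisy line:     {r_noisy:.3f}")
print(f"parabola:       {r_parabola:.3f}")
```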
So, we've managed to generate this correlation data frame. So, as we can see, I've taken these columns, and then every column is just correlated with every other column. What we do notice about the data frame is that the diagonal is all ones, because everything should be perfectly correlated with itself.
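A sketch of building such a correlation data frame with pandas, assuming synthetic scores in place of the course data (the column names follow the genres mentioned above; the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in scores; the values are assumptions.
rng = np.random.default_rng(5)
n = 200
horror = rng.normal(3, 1, n)
comedy = rng.normal(3, 1, n)
films = pd.DataFrame({
    "Horror": horror,
    "Thriller": horror + rng.normal(0, 0.5, n),     # tracks Horror
    "Comedy": comedy,
    "Romantic": comedy + rng.normal(0, 0.8, n),     # tracks Comedy
    "Sci-fi": rng.normal(3, 1, n),                  # unrelated to the rest
})

# Pairwise correlation of every column with every other column.
corr = films.corr()
print(corr.round(2))
```

The result is a square, symmetric data frame with ones down the diagonal.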
So, if you plot something against itself, you'll get a perfect straight line plot. So, we can generate heat maps using the correlation data frame. So, sns.heatmap, and I'm going to pass in my correlation. Now, for some reason, by default, Seaborn chooses a color scheme like this, which isn't very readable, but at least it still gives you an idea of how it works.
So, we can change the color scheme, and then all these things, obviously, ourselves if we want to. But what this visualizes is cream is highly correlated, and then purple is not very correlated. So, it's not the easiest color scheme to use, so, in fact, we could change it to a slightly better color scheme, which is going to be coolwarm.
So, I'm just going to pass this in as a cmap. It's going to be the color map given by coolwarm, which is one of the matplotlib color maps that Seaborn can use. And this is a more direct version. Red is hot, meaning very correlated. Blue is cold, meaning not very correlated at all. So, we can specify as well that center is equal to zero. So, what center does is it centers the color map around a value, so that now, instead of zero being blue, very cold, zero is sort of nothing. Red is hot, and then if we had a highly negatively correlated variable, it would be a deep blue color. So, this is a handy thing to be able to generate.
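The heat map call can be sketched as follows, assuming a synthetic correlation data frame (the genre column names follow the text, the values are made up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in scores; the values are assumptions.
rng = np.random.default_rng(6)
n = 200
horror = rng.normal(3, 1, n)
films = pd.DataFrame({
    "Horror": horror,
    "Thriller": horror + rng.normal(0, 0.5, n),
    "Comedy": -horror + rng.normal(0, 1.0, n),  # negatively related to Horror
    "Romantic": rng.normal(3, 1, n),
    "Sci-fi": rng.normal(3, 1, n),
})
corr = films.corr()

# coolwarm runs from blue through a neutral tone to red;
# center=0 pins the neutral tone to a correlation of zero.
ax = sns.heatmap(corr, cmap="coolwarm", center=0)
```

With center=0, positive correlations shade toward red and negative ones toward blue.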
If you just want to have a sort of triangular version, you can use a mask. What it does is it generates an array of trues and falses of the size of your data. So, it allows you to generate something without the unnecessary central diagonal and the duplicated half, and it does that using an upper triangular form. So, picture it, print out what everything looks like, and you get the idea. So, you can get rid of the diagonal, and only keep the interesting things.
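One common way to build such a mask is NumPy's triu function, sketched here on a synthetic correlation data frame. The use of np.triu is an assumption about what "triangular upper form" refers to, and the data is made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in scores; the values are assumptions.
rng = np.random.default_rng(7)
films = pd.DataFrame(
    rng.normal(3, 1, size=(200, 5)),
    columns=["Horror", "Thriller", "Comedy", "Romantic", "Sci-fi"],
)
corr = films.corr()

# Boolean array the same shape as corr: True on and above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# Cells where the mask is True are left blank, so only the lower
# triangle (each pair of columns exactly once) is drawn.
ax = sns.heatmap(corr, mask=mask, cmap="coolwarm", center=0)
```

Passing k=1 to np.triu would keep the diagonal visible and hide only the duplicated upper half.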
About the Author
Delivering training and developing courseware for multiple aspects across Data Science curriculum, constantly updating and adapting to new trends and methods.