In this Course, we cover Python visualization libraries and tools, focusing particularly on Matplotlib and the Seaborn plotting library. You will learn how to use these to visualize your data using Python in a clear and effective way. We will go into depth particularly on Seaborn, and you'll learn about the different plots available, including regression plots, pairplots, and heat maps.
If you have any feedback relating to this Course, feel free to let us know at email@example.com.
- Use Matplotlib to create plots to represent data, and format the plots
- Add information to plots such as labels, titles, legends, etc.
- Get acquainted with the Seaborn plotting library
- Learn how to plot data using Seaborn in a variety of different plots
This Course is intended for data scientists, data engineers, or anybody interested in learning how to use Python tools to visualize data.
To get the most out of this course, you should be familiar with the basics of programming: variables, scope, functions.
The dataset(s) used in this course can be found in the following GitHub repository: https://github.com/cloudacademy/practical-data-science-python
We're now going to look at visualization in Python, specifically matplotlib. The way that we build figures with matplotlib is we start off with a blank canvas called a figure. Onto a figure, we lay as many sets of axes as we want. When I say axes, I mean literally just axes: things that have scales on them. And onto those axes, we add what get called artists: lines, bars, squiggles, dots, whatever we like. We then layer these three aspects together to actually get our graphic. We're going to start off looking at the object-oriented API. There's also another mode, or way of interacting with matplotlib, which is literally MATLAB-like, where you just call plt.plot, set the labels, that sort of thing, and you can work with it as if it were MATLAB.
So we're going to go through the object-oriented way in detail, and then we'll have a look at what the MATLAB way looks like as well. The first thing we want to do is import matplotlib. It's a bit of a mouthful, matplotlib.pyplot, but if you start hitting tab, you tend to get what you want. So import matplotlib.pyplot as plt. Once we've imported that, we'll start building up our figure. Then we're going to pull in some data and build a nice scatter plot. And then we'll have a look at Seaborn, which is a fantastic wrapper around matplotlib that takes the fuss out of creating graphics for us, but we still need to know how to interact with the underlying objects.
So I'm going to generate some data. I'm going to generate some x data using NumPy: np.arange from zero to two times np.pi in steps of 0.01. All I'm doing here is generating data. What I end up with is an array containing the numbers from zero up to two pi, going up by 0.01 each time.
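As a minimal sketch of the import and the data generation described above (the two print calls are only there for inspection and are not part of the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt

# x values starting at 0 and going up in steps of 0.01.
# np.arange stops before the end value, so 2*pi itself is not included.
x = np.arange(0, 2 * np.pi, 0.01)

print(x[:5])   # the first few values
print(len(x))  # how many points were generated
```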
So we're not going to generate any y data right now. If we wanted to, we could create some y data by just passing x into a function and seeing what the transformation gives us. But all we want right now is the x data. So what we want to do is generate a figure, the blank canvas upon which we overlay everything, and I'm going to call it fig. fig is going to be given by plt.figure, and I'm just going to pass in a single parameter, which is called figsize. I'm going to set figsize equal to 15 by 8, which means that my graphic is going to be larger than the default. The numbers here are in inches.
So we have a 15 by 8 inch figure. Now we want to create a set of axes on that figure. I'll call it ax1, and ax1 is going to be given by fig.add_subplot. I'm going to put in three magic numbers: one, one, and one. I'm going to explain my magic numbers in a minute. And then I'm going to pass in a title, which is going to be sine of x. Again, this is optional, but I'm going to plot a graph of sine of x, a nice wavy graphic. The last thing I have to do is take my axes and plot my data on them.
So my x data is going to be given by x, and my y data is going to be given by the sine function of x. This is a graph of x versus the sine of x. And we should get a somewhat familiar function coming out here. I think we should hopefully have seen this before, it's just a wavy function. So in terms of this, the overall structure is the figure. The axes are obviously the axes. And then what we call the artist is the line that has been overlaid on the axes. Now, what these three numbers define is the layout of my figure in terms of the number of sets of axes I want to have.
So I'm saying this is going to be a one by one grid of axes, and the set of axes I'm adding right now is the first in that one by one grid. There's obviously only going to be one, so we don't need to worry about it for now. Nobody ever understands this the first time I say it, so what we're going to do is add another set of axes to this, and then add more and more and more, until eventually you get the picture.
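Putting the figure, the axes, and the artist together, the steps described so far might look like this. This is a sketch of the object-oriented calls named in the lecture, not a copy of the instructor's notebook cell:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

# The figure: a blank canvas, 15 by 8 inches
fig = plt.figure(figsize=(15, 8))

# The axes: the first (and only) set in a 1 by 1 grid, with an optional title
ax1 = fig.add_subplot(1, 1, 1, title="sine of x")

# The artist: a line of x versus sin(x) drawn on those axes
ax1.plot(x, np.sin(x))
```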
So if I wanted to add another set of axes, I'm going to put a two as the first number in both calls, because I've now got a two by one grid. This is going to be the first location, and this is going to be the second location. I'm going to add a graph of cosine of x. To do that, what I'm going to do is change the function to np.cos of x. And what we see is that I now have two sets of axes on my figure, because my figure has two rows and one column: row number one, row number two. So the first two numbers are rows and columns. The number after that is pointing to which set of axes I should be plotting on.
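A sketch of the two by one layout just described, with the sine graph in position one and the cosine graph in position two (the variable name ax2 is an assumption, the lecture does not name the second set of axes):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

fig = plt.figure(figsize=(15, 8))

# 2 rows, 1 column, position 1: the top set of axes
ax1 = fig.add_subplot(2, 1, 1, title="sine of x")
ax1.plot(x, np.sin(x))

# 2 rows, 1 column, position 2: the bottom set of axes
ax2 = fig.add_subplot(2, 1, 2, title="cosine of x")
ax2.plot(x, np.cos(x))
```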
So the first two numbers literally just outline the dimensions of my grid of subplots. The final number is the one that's actually saying the first one, the second one, and so on. When we have a grid, it goes from top left to bottom right. So it will be top left, top middle, top right, middle left, middle middle, middle right, etc. Now, what I'm going to do is just keep adding more sets of axes to my graphic, to hammer home the point that I'm trying to make here. There we go, good. As we can see, because we've got two rows and two columns, we now have four figures, well, four sets of axes. The first one is my sine of x graph. The second one is my cosine of x. The third one is my tan of x, and the fourth one is cosine of x. So you've got all these different graphs here. So obviously, we can add as many axes as we like to our figure. But we may also want to overlay multiple artists onto the same set of axes within a figure.
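The two by two grid can be sketched with a loop, which makes the top-left to bottom-right numbering explicit. The loop and the list names are assumptions for brevity; in the lecture the four add_subplot calls are written out one by one:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

fig = plt.figure(figsize=(15, 8))

# Titles and functions in the order given in the lecture
graphs = [
    ("sine of x", np.sin),
    ("cosine of x", np.cos),
    ("tan of x", np.tan),
    ("cosine of x", np.cos),
]

axes = []
# Positions run 1..4: top left, top right, bottom left, bottom right
for position, (title, func) in enumerate(graphs, start=1):
    ax = fig.add_subplot(2, 2, position, title=title)
    ax.plot(x, func(x))
    axes.append(ax)
```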
Now to do that, we would just have to call plot multiple times, for example. So I'm going to take this code here and move it into another cell. Something to always bear in mind: if you're generating a figure, all the code to generate that figure should be in the same cell in the Jupyter notebook. Jupyter doesn't handle it well when you have your figure plotting code spread across different cells. So just keep all the code in the same cell. We're going to take the first figure we generated, of sine of x, and overlay cosine of x onto that. To do that, I can just take the cosine plotting code, but point it at the original set of axes that I was working with instead of a new one. And we can see that we've just overlaid cosine as well as sine.
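A sketch of overlaying two artists on one set of axes by calling plot twice, all in one cell. The title text here is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="sine and cosine of x")

# Two calls to plot on the same axes give two lines on the same graph
ax1.plot(x, np.sin(x))
ax1.plot(x, np.cos(x))
```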
Now I can add things like a label, so label is equal to sine. With a label in place, let's go ahead and put plt.legend. Actually, no, we can put in here ax1.legend. There we go. Right, we can have a legend, and we can specify things like loc is equal to seven. These are enumerated positions that we can put our legend in. Essentially, if I pass in a label for what I'm plotting when I plot, matplotlib will remember that.
So then when I ask for a legend, it'll know how to build it. There are also some terrible legend locations that you can choose. There's one that puts it right in the middle of the graphic, for example. Now we want to load in some data, so we can start plotting actual values.
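The labels and legend described above can be sketched as follows. In matplotlib's numbered location codes, 7 corresponds to "center right" and 10 to "center"; the cosine label is an assumption, since the lecture only names the sine one:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2 * np.pi, 0.01)

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="sine and cosine of x")

# matplotlib remembers the label passed in with each plot call
ax1.plot(x, np.sin(x), label="sine")
ax1.plot(x, np.cos(x), label="cosine")

# loc=7 places the legend at the centre right of the axes
legend = ax1.legend(loc=7)
```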
If we copy this code here, we should be able to pull in the responses CSV. All I'm going to do is loop through the names of the columns in my data frame, printing them out. There are a lot of columns, and that's why I've done this. If we run this code and we don't get any errors, then everything's fine. So this is the data that I'm going to be using to plot until we get to Seaborn. What this contains is data from a survey of a Slovakian business class about various attitudes and things about the students in the class.
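A sketch of the loading step. The transcript does not give the file path, so this example reads a tiny in-memory stand-in instead of the real file; the column names are assumptions and the rows are made-up placeholders, not values from the survey. With the real dataset you would pass the path of the responses CSV to pd.read_csv instead:

```python
import io
import pandas as pd

# Hypothetical stand-in for the responses CSV (placeholder rows, assumed column names)
csv_text = """Age,Height,Weight,Gender,Final judgement
20,163,48,female,1
19,176,67,male,3
22,185,85,male,5
20,170,59,,2
"""

df_resp = pd.read_csv(io.StringIO(csv_text))

# Loop through the column names and print them out
for column in df_resp.columns:
    print(column)
```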
So a lot of the data are just categorical. But we have columns like ages and weights and things like that, that we can plot. We also have some interesting things about whether people believe in the final judgment or not, so things like that as well.
So what are we going to do? We're going to plot a scatter plot just to get an overview of what our heights and weights look like in this data frame. So I'm going to take out the weights and the heights from my data frame. I'm going to set x to be given by my df_resp, which means data frame of responses. I'm just going to pull out the weight, and then for my y, I'm going to pull out the height.
To see the relationship between height and weight, I'm going to need to generate a figure like I did before. So I'm actually just going to copy the figure generation code that I have from before, but change the title, obviously. So I'm going to throw this in. I'm going to have a figure of this size, and I want to add a subplot. I've got one, one, and one as my values, and the title is going to be something like height versus weight. So that will do for now.
So I'm just generating this figure. When we don't actually overlay any data onto it, we get an idea of what a blank canvas looks like. But I want to add the data. So what method am I going to have to call? Plot, so plot x versus y. And then I get this figure here. Plot automatically generates a line plot, which is not what we want. We're actually looking for the method called scatter, and that will generate a scatter plot for us. Good.
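A sketch of the scatter plot, using a small made-up data frame in place of the survey data so that it runs on its own. The column names "Weight" and "Height" are assumptions about the real file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows standing in for the survey data (not real responses)
df_resp = pd.DataFrame({
    "Height": [163, 176, 185, 170, 158],
    "Weight": [48, 67, 85, 59, 52],
})

x = df_resp["Weight"]
y = df_resp["Height"]

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="height versus weight")

# scatter draws one marker per row; plot would join the points with a line
ax1.scatter(x, y)
```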
So we can notice from my data that there appears to be a positive association between height and weight, which is what you would imagine. If I want to update things on my axes, this is where the object-oriented API gets its name. In computer programming, when you're working with objects, you tend to define what are called getter and setter methods. What these do is get you a value from an object, or change the value contained within an object. With matplotlib, we can interact with figures and axes and things like that by calling get and set methods. So we can specify what we want our x label to be.
So for example, for us, it's going to be weight, because that's what we put on x. And we can similarly set our y label, which sets the text drawn on the y axis, to height. So now we have our labeled axes of height versus weight. Every other aspect that we would like to change is available in the same way, and we can specify what we would like to change, always using our setters and getters. We can update the frequency with which we label our x axis. These labels get called ticks. So by setting x ticks, I can specify a more detailed array of values that I want the x axis to show. It's defaulted to a tick every 20, and I can make it give me one every 10 by simply specifying np.arange of numbers between 20 and 170 in steps of 10. Then plt.show. If you put plt.show at the end of your cell, then you won't get any strange output. And as we can see, we've increased the frequency with which we are labeling our axes. We've not actually changed where the data are, just the frequency of the ticks.
Now, it's entirely up to us how many we want to have. If I say between 20 and 70 matplotlib will do exactly what I tell it to do, it will give me between 20 and 70 in steps of 10. So it will just stop. And we can do the exact same thing for the y axis. So it is just copying the syntax to the y axis. We'll go between 50 and 220 in steps of five, so we'll get a very frequent axis here, if we want to.
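The setter calls for labels and ticks can be sketched like this, again on made-up placeholder rows with assumed column names. The labels follow the earlier assignment of weight to x and height to y:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows standing in for the survey data (not real responses)
df_resp = pd.DataFrame({
    "Height": [163, 176, 185, 170, 158],
    "Weight": [48, 67, 85, 59, 52],
})

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="height versus weight")
ax1.scatter(df_resp["Weight"], df_resp["Height"])

# Setter methods change values held by the axes object
ax1.set_xlabel("weight")
ax1.set_ylabel("height")

# A tick every 10 on x and every 5 on y, over the ranges given in the lecture
ax1.set_xticks(np.arange(20, 170, 10))
ax1.set_yticks(np.arange(50, 220, 5))

plt.show()
```

The matching getters, such as ax1.get_xlabel() and ax1.get_xticks(), read the same values back from the axes.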
Now we can decide where we want the limits of our axes to be as well, by setting the x limits and y limits. I'm just going to show this in brief, because in reality this is just going to be us lying: we're going to zoom in on one portion of our data and ignore the outliers. It would be nice if we could just zoom in on the good bits and work with only the good bits. All I've done here is specify a portion of the data that I want to have a look at, but because this is cheating, I'm going to comment it out. We're using set_xlim and set_ylim for this. With matplotlib, we can specify all sorts of different values within our scatter plot as well.
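A sketch of the axis limit setters. The lecture does not state which limits were used, so the numbers below are purely hypothetical, as are the placeholder rows and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows standing in for the survey data (not real responses)
df_resp = pd.DataFrame({
    "Height": [163, 176, 185, 170, 158],
    "Weight": [48, 67, 85, 59, 52],
})

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="height versus weight")
ax1.scatter(df_resp["Weight"], df_resp["Height"])

# Hypothetical limits: points outside this window are simply not shown
ax1.set_xlim(40, 120)
ax1.set_ylim(140, 210)
```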
If you have a look at the documentation, there are lots and lots of parameters that we can specify. We have things like s and c. So s dictates the size of each data point, for example from a column you would like to size your data by, and c dictates how you want to color your data. The marker is also something you can update. So you can set marker equal to an asterisk, which will make every marker a star.
So if we put this in here, marker is equal to asterisk, then everything is a star. So here we've got little star data points. And we can change the size of our data points using s, giving it a column, for example df_resp of final judgment. So our data points are now sized by how much people believe in the final judgment.
Now, because all these numbers are quite small, if I multiply these by a scale of 10, or 50, we should be able to see that we've got different sized data points given different values for how much they believe in the final judgment. I'll go down to 30 or something like that, so we can actually see better. So we're displaying an extra dimension here.
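A sketch of the marker and size parameters. The column name "Final judgement" and the placeholder rows are assumptions; the scale factor of 30 is the one settled on in the lecture:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows standing in for the survey data (not real responses)
df_resp = pd.DataFrame({
    "Height": [163, 176, 185, 170, 158],
    "Weight": [48, 67, 85, 59, 52],
    "Final judgement": [1, 3, 5, 2, 4],
})

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="height versus weight")

# marker="*" draws stars; s sets each marker's size from the column,
# scaled up by 30 so the differences are visible
ax1.scatter(
    df_resp["Weight"],
    df_resp["Height"],
    marker="*",
    s=df_resp["Final judgement"] * 30,
)
```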
We can do things like color by gender. Coloring by gender is easy enough. Getting a legend for the coloration scheme that we decided to go for is somewhat more challenging. So I'm going to move everything onto new lines, just so that we can get an idea of what's going on. One thing we could do is call our plotting method twice, specifying a different marker each time we call it. So this should run. The marker style has now gone back to being circular, and I'm coloring by gender here. All I've done is a replace method call. Instead of defining a dictionary of replacements, I've just passed in the list of values I know are in that column, and then the list of values I want to change them to. So it's essentially a dictionary: male becomes zero, female becomes one, and not a number becomes two.
So here you can see the color scheme that is the default. If we want to change it, we might set cmap equal to summer, and that changes the colors. See, it's a bit more summery. There are an awful lot of color maps for matplotlib, more than you can think of, and you can define your own as well. We won't cover that, but you can define your own color maps, for example in the company colors if you're working for a company, or anything you want. You just have to define your own cmap with very specific colors. So this is one way of creating a scatter plot colored by gender.
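A sketch of coloring by gender with a named color map. Note one swap: the lecture uses a replace call with two lists, whereas this example uses map with a dictionary followed by fillna, which gives the same male 0, female 1, missing 2 coding. The column names and placeholder rows are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows standing in for the survey data (not real responses)
df_resp = pd.DataFrame({
    "Height": [163, 176, 185, 170, 158],
    "Weight": [48, 67, 85, 59, 52],
    "Gender": ["female", "male", "male", None, "female"],
})

# Turn the categories into numbers: male 0, female 1, missing 2
gender_codes = df_resp["Gender"].map({"male": 0, "female": 1}).fillna(2)

fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 1, 1, title="height versus weight")

# c supplies one number per point; cmap decides which colors those numbers map to
points = ax1.scatter(
    df_resp["Weight"],
    df_resp["Height"],
    c=gender_codes,
    cmap="summer",
)
```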