Visual Data Exploration with Matplotlib
Learn the ways in which data comes in many forms and formats with the second course in the Data and Machine Learning series.
Traditionally, machine learning has worked really well with structured data but has been less effective at solving problems with unstructured data. Deep learning works very well with both structured and unstructured data, and it has had successes in fields like translation, image classification, and many others. In this course you will learn to explain the reasons deep learning is so popular. You will also learn about the different types and formats of data, and we'll look at the key libraries that allow us to explore and organize data.
This course is made up of 8 lectures, accompanied by 5 engaging exercises along with their solutions. This course is part of the Data and Machine Learning learning paths from Cloud Academy.
- Learn and understand the functions of machine learning when confronted with structured and unstructured data
- Be able to explain the importance of deep learning
- We recommend completing the Introduction to Data and Machine Learning course before starting.
Hey guys, welcome back. In this video, we will see how to explore data visually using a library called Matplotlib. Let's first generate some data. Do not worry too much about understanding these lines; they just generate a bunch of fake data. We stack all this data into a single array and then import it into a data frame, giving it some column names. Basically, what we've done is create a data frame with four columns called data1, data2, data3 and data4. This is not really important, but feel free to dig deeper into how the random function works if you're curious.
The first plot we mentioned in the lecture is the line plot. This is the default plot in pandas, so if we call df.plot it generates a line plot. This is pretty cool already. We can also plot the data in the data frame using Matplotlib's plt.plot function and passing it the data frame. In this case, we have to set the legend by hand by calling plt.legend in the same cell. Notice that it generates the exact same plot. Colors are chosen automatically, and you can change the color scheme if you want.
The next plot we've met is the scatter plot. This can be generated by choosing the style to be '.', for example, which draws the dots without lines. Or we can call df.plot with kind='scatter', but in this case we have to say which columns we want for the X values and the Y values. This is useful, for example, if we want to check the correlation between two of the columns in our data. Let's see: here we're passing data1 and data2, which are the blue and the orange data. If we plot them on a scatter, what we can see is that data2 spreads over a larger range of values than data1, but the two are not really correlated in any way.
The next plot is the histogram. Notice how the interface is consistent. We always call df.plot and then we set the kind of plot to be a histogram. In this case, we also have access to some additional parameters, such as the number of bins, which is 50.
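The data generation, line plot and scatter plot described above can be sketched roughly as follows. The transcript does not show the code that generates the fake data, so the distributions, the seed and the sample size here are assumptions chosen only for illustration; the column names data1 to data4 and the plotting calls follow the narration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed stand-in data: the video's generating code is not in the transcript.
rng = np.random.default_rng(0)
n = 1000
data1 = rng.normal(0, 0.1, n)
data2 = rng.normal(1, 0.4, n) + np.linspace(0, 1, n)
data3 = 2 + rng.random(n) * np.linspace(1, 5, n)
data4 = rng.normal(3, 0.2, n) + 0.3 * np.sin(np.linspace(0, 20, n))

# Stack the four 1-D arrays into one (n, 4) array and wrap it in a DataFrame.
data = np.vstack([data1, data2, data3, data4]).transpose()
df = pd.DataFrame(data, columns=["data1", "data2", "data3", "data4"])

# Line plot: the pandas default kind.
ax_line = df.plot(title="Line plot")

# Same plot through Matplotlib directly; the legend has to be set by hand.
plt.figure()
lines = plt.plot(df)
plt.legend(list(df.columns))
plt.title("Line plot")

# Scatter-like plot: keep the line plot but draw dots only.
ax_dots = df.plot(style=".", title="Dots without lines")

# True scatter plot of one column against another.
ax_scatter = df.plot(kind="scatter", x="data1", y="data2",
                     title="data1 vs data2")
```

In a notebook the figures render inline; in a plain script, a final plt.show() call would display them.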
We also set the title of the plot and alpha. Alpha is used to control transparency. You will immediately see why that's useful: since the histograms overlap, we want to make them a little bit transparent so we can see what's behind. Finally, we set the figure size, and I've probably set it too large. So what I'll do is remove the figure size and re-plot this, and yeah, this looks much better in the video.
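As a rough sketch of the histogram call described above: the 50 bins, the title and the alpha parameter come from the narration, while the stand-in data and the alpha value of 0.6 are assumptions made here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed stand-in data with four overlapping distributions.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "data1": rng.normal(0, 0.1, n),
    "data2": rng.normal(1, 0.4, n),
    "data3": rng.normal(2, 1.0, n),
    "data4": rng.normal(3, 0.2, n),
})

# Same df.plot interface, with kind="hist".
# alpha < 1 makes the bars semi-transparent so overlapping histograms
# stay visible; figsize=(w, h) could be passed as well to resize the figure.
ax_hist = df.plot(kind="hist", bins=50, title="Histogram", alpha=0.6)
```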
Okay, the cumulative distribution is the histogram of the data summed up to a certain value. So, we set normed=True (called density=True in current Matplotlib releases) and cumulative=True. We also remove the figure size here because we don't need it in the video. And what we see is the plot of the cumulative distribution that we've explained in the class.
Finally, we've introduced the box plot, which is another way of looking at a distribution by looking at the outliers and the typical interval as well as the position of the peak. This plot is very useful to compare different distributions. For example, we see how narrow the data1 distribution is with respect to, for example, the distribution of data3.
With Matplotlib, we can also create subplots. Subplots are created this way. First, we set a figure to be a grid of subplots. Here we give the number of rows and the number of columns. In this case, I had set a large figure size but, given that we are in the video, I will change it to a small figure size. Then, we assign each of the plots to the corresponding axis, so this is going to be the first plot on the grid, this is going to be first row, second column, then second row, first column, and second row, second column. Notice that we are changing the style or the kind in each plot, so when we plot this, we will see four different plots with four different styles. I notice here that the layout I've chosen makes the titles overlap with the axes a little bit. I can change that by using the plt.tight_layout function and re-plotting, and this makes sure that there is some spacing between the plots so it's nice to read.
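The cumulative histogram, box plot and 2x2 grid of subplots can be sketched like this. The data, bin counts, alpha and figure size are assumptions for illustration. One more assumption worth flagging: the sketch passes density=True, because the normed keyword mentioned in the video has been removed from recent Matplotlib releases.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed stand-in data; data1 is deliberately narrow and data3 wide.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "data1": rng.normal(0, 0.1, n),
    "data2": rng.normal(1, 0.4, n),
    "data3": rng.normal(2, 1.0, n),
    "data4": rng.normal(3, 0.2, n),
})

# Cumulative distribution: a normalized histogram summed bin by bin.
ax_cum = df.plot(kind="hist", bins=100, density=True, cumulative=True,
                 alpha=0.4, title="Cumulative distributions")

# Box plot: compares the spread of the four columns side by side.
ax_box = df.plot(kind="box", title="Boxplot")

# Grid of subplots: 2 rows, 2 columns, one pandas plot per axis.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))
df.plot(ax=axes[0][0], title="Line plot")                        # row 1, col 1
df.plot(ax=axes[0][1], style="o", title="Scatter plot")          # row 1, col 2
df.plot(ax=axes[1][0], kind="hist", bins=50, title="Histogram")  # row 2, col 1
df.plot(ax=axes[1][1], kind="box", title="Boxplot")              # row 2, col 2

# Add spacing so titles do not overlap the axes of neighbouring plots.
plt.tight_layout()
```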
Okay, this is pretty cool already. Let's look at some other plots. The pie chart is useful, as you've seen, when you want to indicate the fraction of some data. So let's generate the pie counts series, which only has two values: the counts of how many points in the column data1 are greater than 0.1 and how many are not. We have a lot of data that is not greater than 0.1, and a few points that are. Then, we generate the pie counts plot using kind='pie' and a certain figure size; in this case I set it to be a square. We explode one of the wedges, and you'll see what that means. We assign labels, and we set the percentages, the shadow and the start angle, which is 90 degrees. See here: this is an exploded wedge, with a little bit of shadow, these are the two labels and these are the automatic percentages. Just for comparison, let's draw a plot where none of these is set, so let's comment out the explosion, the labels, the auto percentage, the shadow and the start angle. If I plot this, I obtain a simpler pie chart that starts from this angle instead of this angle, and where we only have the labels but no percentages and no explosion. I hope you agree that my initial plot looks much better.
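A minimal sketch of the pie chart described above. The options (explode, labels, autopct, shadow, startangle) follow the narration; the stand-in data, the label strings, the explode offsets and the figure size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed stand-in for the data1 column.
rng = np.random.default_rng(0)
data1 = pd.Series(rng.normal(0, 0.1, 1000), name="data1")

# Two counts: how many values are greater than 0.1 and how many are not.
piecounts = (data1 > 0.1).value_counts()

# Build the labels from the index so they always match the wedge order.
labels = ["> 0.1" if flag else "<= 0.1" for flag in piecounts.index]

ax_pie = piecounts.plot(
    kind="pie",
    figsize=(5, 5),       # square figure so the pie is not squashed
    explode=[0, 0.15],    # pull the second wedge away from the centre
    labels=labels,
    autopct="%1.1f%%",    # print the percentage on each wedge
    shadow=True,
    startangle=90,        # start drawing from the top of the circle
)
```

Commenting out the explode, labels, autopct, shadow and startangle arguments reproduces the plainer chart used for comparison in the video.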
Yeah. Okay, the hexbin plot is a plot that is useful to show two-dimensional distributions. We can use it to plot this data that I've just generated, which is made of two coordinates, X and Y, that both change over time. If I plot the hexbin of X and Y, what I see is that it's kind of like a histogram where the bin height is given by the color and the data is binned in both X and Y. You can see that our data is clustered around two positions, both in X and in Y.
So in this video, I've shown you some of the capabilities of Matplotlib. Matplotlib is a great library and it's very powerful. We've just scratched the surface in this video, but I hope I've made you curious enough to go and look at the examples in the gallery and the documentation, because it's really, really useful.
Before I go, I want to teach you another little trick of the IPython notebook. If you're inside any function call, for example here, where the parenthesis is, you can hit the Shift and Tab keys together to show the documentation of the function. In this case, it's showing us the first few lines of the documentation of the plot function. If I hit the plus here, it shows me all of the documentation of the plotting function. This is really useful when you don't remember exactly how the function should be written. What I can do, for example, is start from df and hit Tab once, and it will show me all the methods, so we'll look for the plot method. Oh, cool, I found plot, and then I type the parenthesis and I don't remember what to do. But I can always hit Shift+Tab twice and get access to the documentation. So for example, I can see, okay, these are the kinds of plots that are available: line plot, bar plot, horizontal bar plot, histogram. For example, let's see what happens if I use the plot type kde. So I set kind='kde' and see what happens. Cool, so we have the distributions of X and Y, smoothed using a kernel density function.
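The hexbin plot can be sketched as follows. The transcript does not show how the X and Y data were generated, so the two Gaussian clusters, their centres, the gridsize and the colormap below are assumptions made only to reproduce the "two clusters" picture described in the video.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed stand-in data: two 2-D clusters stacked one after the other.
rng = np.random.default_rng(2)
cluster_a = rng.normal(loc=(0, 0), scale=2, size=(1000, 2))
cluster_b = rng.normal(loc=(9, 9), scale=3, size=(2000, 2))
xy = pd.DataFrame(np.vstack([cluster_a, cluster_b]), columns=["x", "y"])

# Hexbin: a 2-D histogram where the colour of each hexagon encodes the count.
ax_hex = xy.plot(kind="hexbin", x="x", y="y", gridsize=30, cmap="viridis")
ax_hex.set_title("Hexbin plot")

# The kernel density plot shown at the end of the video would be:
# xy.plot(kind="kde")   # needs SciPy installed alongside pandas
```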
I think it's pretty cool that we can access the documentation of a function using the shift+tab method. Thank you for watching and see you in the next video.
About the Author
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.