Learn about the many forms and formats that data comes in with the second course in the Data and Machine Learning series.
Traditionally, machine learning has worked really well with structured data but has not been as effective at solving problems with unstructured data. Deep learning works very well with both structured and unstructured data, and it has had successes in fields like translation, image classification, and many others. In this course, you will learn how to explain the reasons deep learning is so popular. You will also learn about the many different data types and their formats, and we'll look at the vital libraries that allow us to explore and organize data.
This course is made up of 8 lectures, accompanied by 5 engaging exercises along with their solutions. This course is part of the Data and Machine Learning learning paths from Cloud Academy.
- Learn and understand the functions of machine learning when confronted with structured and unstructured data
- Be able to explain the importance of deep learning
- We recommend completing the Introduction to Data and Machine Learning course before starting.
Hello and welcome to this video on visual exploration. In this video, we will talk about visual exploration and about matplotlib, the Python library for visualizing data. After an initial look at the properties of a tabular dataset using data exploration, it is often useful to dig a little deeper using a few visualizations. In fact, looking at a graph, we may spot a trend, a particular repeating pattern, or a correlation. Our visual brain is a very good pattern recognizer, so it only makes sense to take advantage of it when possible. You can represent data in many ways depending on the type of data and what you are interested in. For example, if the data comes from an ordered series of consecutive events, such as measurements of the outside temperature or the times a user interacted with a website, it makes sense to represent it with a line plot. A line plot displays each value in a sequence, connected with a line, so it makes it easy to spot trends like growth or seasonal up-and-down patterns. On the other hand, take a case where you have a dataset on a population of individuals.
And for each of them, you know their height and their weight. If you were to plot the height and the weight with a line plot, all you would see is noise, because there is no sequence in your dataset. Instead, you can let yourself be guided by your intuition that taller people will, on average, also be heavier, and decide to plot weight and height together as two features in a scatter plot. In this plot, each point has two coordinates: the x coordinate corresponds to the height feature and the y coordinate corresponds to the weight feature. As you can see when we represent the data in this plot, it is very evident that there is a relationship between weight and height, and that on average, taller people are also heavier. Sometimes, we are not interested in correlations, but in how frequently values occur in the data.
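Before moving on to frequencies, the line plot and scatter plot cases described so far can be sketched with matplotlib. This is only a minimal illustration, not the code used in the video: the temperature series, the heights and weights, and every coefficient below are synthetic values made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Ordered data (synthetic): two years of daily temperatures with a seasonal cycle.
days = np.arange(730)
temperature = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, size=days.size)

# Unordered data (synthetic): one row per individual, in no particular order.
# The slope and offset linking weight to height are arbitrary assumptions.
height = rng.normal(68, 3, size=200)                      # inches
weight = 5 * height - 180 + rng.normal(0, 12, size=200)   # pounds

fig, (ax_series, ax_line, ax_scatter) = plt.subplots(1, 3, figsize=(15, 4))

# Line plot of an ordered series: the seasonal pattern is easy to spot.
ax_series.plot(days, temperature)
ax_series.set_title("Ordered series: line plot")
ax_series.set_xlabel("Day")
ax_series.set_ylabel("Temperature")

# Line plot of unordered rows: consecutive rows are unrelated, so it looks like noise.
ax_line.plot(height, label="height (in)")
ax_line.plot(weight, label="weight (lb)")
ax_line.set_title("Unordered rows: line plot")
ax_line.set_xlabel("Row index")
ax_line.legend()

# Scatter plot: one point per individual, x = height, y = weight.
ax_scatter.scatter(height, weight, s=10)
ax_scatter.set_title("Unordered rows: scatter plot")
ax_scatter.set_xlabel("Height (in)")
ax_scatter.set_ylabel("Weight (lb)")

fig.tight_layout()

# A numeric companion to what the scatter plot shows visually.
correlation = np.corrcoef(height, weight)[0, 1]
print(f"height/weight correlation: {correlation:.2f}")
```

Call `plt.show()` in an interactive session, or `fig.savefig(...)` with a file name of your choice, to view the three panels side by side.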
For example, still using the population of individuals, you may decide to divide the range of heights into buckets and ask how many people fall into each bucket. This is what's called a histogram, and it represents the statistical distribution of our data. A histogram can look like a bell curve, like in this case, where we've represented the frequency of the heights for the male and female subpopulations in our data. But it can also look like other statistical distributions, say exponential, or even take weird shapes. So a histogram is useful to inquire about the distribution of your data. Keep in mind that at this point we've completely lost any ordering information, because we've aggregated the data by counting.
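The bucketing described above can be sketched with `Axes.hist`. This is a minimal example under stated assumptions: the male and female heights are synthetic samples drawn from normal distributions with made-up means and spreads, and the 2-inch bucket width is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic heights in inches; the means and standard deviations are assumptions.
male = rng.normal(69, 3, size=500)
female = rng.normal(64, 3, size=500)

# Bucket edges every 2 inches, shared by both groups so the bars are comparable.
bins = np.arange(48, 90, 2)

fig, ax = plt.subplots()
# hist() counts how many values fall into each bucket and draws one bar per bucket.
male_counts, edges, _ = ax.hist(male, bins=bins, alpha=0.5, label="male")
female_counts, _, _ = ax.hist(female, bins=bins, alpha=0.5, label="female")

ax.set_xlabel("Height (in)")
ax.set_ylabel("Number of people")
ax.legend()
```

The returned `male_counts` and `female_counts` arrays hold only the per-bucket totals, which is the sense in which the ordering of the original rows is lost.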
A close relative of the histogram is the cumulative distribution. The cumulative distribution is useful to answer questions like: what fraction of our data falls under a certain value? For example, by looking at this plot, can you tell me what fraction of males falls under the 70-inch mark? As you may guess, it's roughly 70%. If you draw a vertical line at 70 inches, it will cross the male cumulative distribution at around 0.7. There are many other, richer plots that you may be interested in. For example, the box plot, if you want to compare different distributions in a very compact way by showing their aggregate statistics. Or pie charts, if you want to represent fractions of a total. We will encounter these and other plots in the exercises, and I encourage you to have a look at the additional resources provided in order to get some inspiration on visualizing data. In conclusion, remember that the choice of visualization is strongly tied to the kind of data and the kind of question you're asking. So use the appropriate plot, and make sure you know how to use matplotlib and pandas, which have very powerful visualization tools. Thank you for watching and see you in the next video.
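As a companion to the cumulative distribution discussed in this video, here is a minimal sketch of how such a curve can be computed and read. The heights are synthetic and the helper name `empirical_cdf` is hypothetical, so the fraction printed here will not necessarily match the roughly 70% read off the plot in the video.

```python
import numpy as np
import matplotlib.pyplot as plt


def empirical_cdf(values):
    """Return sorted values and, for each, the fraction of data at or below it."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


rng = np.random.default_rng(0)
# Synthetic heights in inches; the means and standard deviations are assumptions.
male = rng.normal(69, 3, size=500)
female = rng.normal(64, 3, size=500)

x_m, y_m = empirical_cdf(male)
x_f, y_f = empirical_cdf(female)

fig, ax = plt.subplots()
ax.step(x_m, y_m, where="post", label="male")
ax.step(x_f, y_f, where="post", label="female")

# A vertical line at the threshold: where it crosses a curve is the fraction below it.
threshold = 70
ax.axvline(threshold, linestyle="--", color="gray")
ax.set_xlabel("Height (in)")
ax.set_ylabel("Fraction of people at or below height")
ax.legend()

# The same answer computed directly from the data.
fraction_below = np.mean(male < threshold)
print(f"fraction of males under {threshold} in: {fraction_below:.2f}")
```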
About the Author
I am a Data Science consultant and trainer. With Catalit, I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends, I train people in machine learning, deep learning, and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI, and graduated from the Singularity University summer program in 2011.