Exploration with Pandas
Learn the ways in which data comes in many forms and formats with the second course in the Data and Machine Learning series.
Traditionally, machine learning has worked really well with structured data but is not as efficient at solving problems with unstructured data. Deep learning works very well with both structured and unstructured data, and it has had successes in fields like translation, image classification, and many others. Learn how to explain the reasons deep learning is so popular. Learn about the many different data types and formats, and we'll analyze the vital libraries that allow us to explore and organize data.
This course is made up of 8 lectures, accompanied by 5 engaging exercises along with their solutions. This course is part of the Data and Machine Learning learning paths from Cloud Academy.
- Learn and understand the functions of machine learning when confronted with structured and unstructured data
- Be able to explain the importance of deep learning
- It is recommended that you complete the Introduction to Data and Machine Learning course before starting.
The Github repo for this course, including code and datasets, can be found here.
Hello and welcome to this video on data exploration. In this video we will talk about data exploration and about pandas, the library that we will use to explore data. When you're building a predictive model, it's often helpful to get some quick facts about your data set. By doing so you may spot some very evident problems or low-hanging fruit that you may want to address first. This initial phase is called data exploration, and it consists of a series of questions that you may want to ask. Here are some of the things you may want to check. For example, how big is your data set? How many features do you have? Are any of the records corrupted, or do you have missing features or missing data? Are the features numbers, or are they categories? What is the data type of each feature? How is each feature distributed? What does its histogram look like? Are the features correlated? And so on.
Python has a library that allows you to address all these questions very easily, and it's called pandas. Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools. It can load data from a multitude of sources. For example, here you see its loader functions. It can read from CSV, from Excel, from SQL databases, and from JSON, and it can also import from a variety of proprietary formats like Stata, SAS and so on.
So here is an example of some data I have loaded for you. Let's see how it works. First of all, we need to import pandas in our import line. We also import Matplotlib and NumPy like the previous time. Then we will use the read_csv function to read the Titanic training data set that we have stored in the data folder. Let's execute this cell. This cell stores the content of the Titanic training data set in a variable called df. We've called it df because the type of this variable is DataFrame. As you can see, it's a pandas.core.frame.DataFrame. Since it's a data frame, it exposes some useful methods.
For example, the head method. The head method shows the first five lines of a data frame. One thing I should mention is that you can think of a data frame as an Excel spreadsheet for Python.
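The loading step and the head method described above can be sketched roughly as follows. This is only an illustration: the course reads the Titanic CSV from its data folder, while here a few sample rows are typed inline (the variable `csv_text` and its contents are stand-ins, not the course file) so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the course's Titanic CSV file: a few illustrative rows typed
# inline so the example runs without the data folder.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley",female,38
3,1,3,"Heikkinen, Miss. Laina",female,26
4,1,1,"Futrelle, Mrs. Jacques Heath",female,35
5,0,3,"Allen, Mr. William Henry",male,35
6,0,3,"Moran, Mr. James",male,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# The loaded object is a pandas DataFrame.
print(type(df))

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.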
A data frame is an object with columns, column names, and rows that are indexed by an index. As you can see, our data is a set of passengers from the Titanic ship. For each passenger, which is in a row, we have the passenger ID, whether the passenger survived or died, and a bunch of other attributes like the class, the name, the sex, the age and so on. Let's see what we can do with this data. By asking for info on the data frame, we get a summary of all the features contained in the data frame, which are the columns, and also how many non-null values we have in each column. You can see that we don't know the age of a few passengers and we don't know the cabin of many passengers. We also get some useful information about the type of the data contained in each column. For example, the variable Parch, which stands for parents and children, is an integer because we cannot have half a parent or a third of a child. Whereas, for example, age is a floating point variable because one can be three and a half years old. When you see the word object here, it means that the variable is a string. If we use the df.describe method, we get some summary statistics about the numerical columns. So we can, for example, see that the mean age is almost 30 years, but the standard deviation is about half of that, about 15 years. We also get the quartiles, the minimum and the maximum of each column.
Pandas allows us to index our data in various ways. For example, we can retrieve a record by its ordinal position. So if we want the fourth record (remember we start counting from zero), we do df.iloc[3], where iloc stands for integer location, and this will give us all the data about the fourth passenger in the table. Let's check that it's true. This person is called Futrelle, Mrs. Jacques Heath, and if I check line number three, there is Mrs. Futrelle Jacques. Correct. Also, we can retrieve data by label using the loc indexer, specifying the row or the rows and the columns by name.
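As a rough sketch of the calls mentioned so far (info, describe, iloc and loc), assuming a tiny hand-typed data frame in place of the full Titanic table. The rows and values are illustrative only, so the statistics printed here will not match the ones quoted in the video.

```python
import pandas as pd

# Small illustrative stand-in for the Titanic data frame.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath",
        "Allen, Mr. William Henry",
        "Moran, Mr. James",
    ],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, float("nan")],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282",
               "113803", "373450", "330877"],
})

# Column names, non-null counts and data types.
df.info()

# Summary statistics for the numerical columns only.
summary = df.describe()
print(summary)

# Fourth record by ordinal position (counting starts at zero).
fourth = df.iloc[3]
print(fourth["Name"])

# Label-based selection: rows 0 through 4 (inclusive) of the Ticket column.
tickets_loc = df.loc[0:4, "Ticket"]

# The same five values, obtained from the head of the column.
tickets_head = df["Ticket"].head()

# Several columns at once, by passing a list of column names.
name_and_age = df[["Name", "Age"]].head()
print(name_and_age)
```

Note that a label slice with loc includes both endpoints, which is why `0:4` returns five rows.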
In this case we have the column Ticket, and we retrieve the first five rows of just the Ticket column. Notice that we can do the same thing by asking for the head of the Ticket column. These are exactly equivalent. Finally, we can ask for multiple columns by providing a list of the column names in square brackets. Here I use head to retrieve only five rows.
Pandas allows us to perform selections with different conditions. For example, we can select the passengers whose age is greater than 70 by first defining a condition over age and then passing that to the square bracket selector. So here, if I run this cell, I get the list of the five passengers whose age is greater than 70. Notice that the condition, column Age greater than 70, generates a long list of True and False values, as long as the data set. This is why, if we pass this to the square bracket selector, we only retain the records in the data frame where this expression is true. We can also perform a query to obtain the same result. Query takes a string as input and returns the resulting data frame as output. Conditions can be combined with Boolean operators. So, for example, here we are asking for all the records where the condition of age equal to 11 is true and the condition of siblings and spouses equal to five is true. As you can see, there is only one person who has five siblings and is aged 11. In this case, instead, we are asking for all the records where the age is 11 or the number of siblings and spouses is five. Notice that this OR is not exclusive, so both can be true or only one can be true.
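The selections described above might look something like this. It is a minimal sketch on made-up rows (the names and values are hypothetical), using the Titanic column names Age and SibSp.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic data set.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [80.0, 11.0, 11.0, 71.0, 30.0],
    "SibSp": [0, 5, 1, 0, 5],
})

# The condition alone is a boolean Series, as long as the data frame.
is_over_70 = df["Age"] > 70
print(is_over_70)

# Passing it to the square brackets keeps only the rows where it is True.
over_70 = df[is_over_70]

# The same selection written as a query string.
over_70_query = df.query("Age > 70")

# AND: both conditions must hold. Parentheses are required around each one.
both = df[(df["Age"] == 11) & (df["SibSp"] == 5)]

# OR (inclusive): at least one of the conditions must hold.
either = df[(df["Age"] == 11) | (df["SibSp"] == 5)]

# The same combinations using query.
both_query = df.query("Age == 11 and SibSp == 5")
either_query = df.query("Age == 11 or SibSp == 5")

print(both)
print(either)
```

The element-wise operators are `&` and `|` rather than Python's `and` and `or`, which do not work on a whole column; inside a query string, `and` and `or` are accepted.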
Finally, we can express the same type of Boolean conditions using the query method. The unique method gives us what we expect: the unique values in a certain column. So if I ask for the unique ports of embarkment, I see that there are four possible values: the three ports of embarkment, and null values, because for some people we don't know the port of embarkment. We can sort the data frame by column and also decide the order of sorting, whether we want it ascending or descending. For example, in this case we are asking to sort the data frame by age in reverse order, with the oldest people first. Notice that I've also retrieved just the first five lines using the head command. Pandas also allows you to perform aggregations and group-by operations like you would do in SQL. It can also reshuffle data into pivot tables like you do in a spreadsheet. This makes it very powerful for data exploration and also for simple feature engineering, so I strongly recommend that you take a look at its wonderful documentation if you've never used it.
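A short sketch of unique, sorting, group-by and pivot tables, again on a few made-up rows with Titanic-style column names. The specific aggregations shown (mean survival by sex and by class) are just one possible choice for illustration, not necessarily what the video runs.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic data set.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Age": [22.0, 38.0, 26.0, 54.0, 35.0, 2.0],
    "Survived": [0, 1, 0, 0, 1, 1],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# Unique values in a column; a missing value counts as one of them.
ports = df["Embarked"].unique()
print(ports)

# Sort by age in descending order and keep the first five rows.
oldest = df.sort_values("Age", ascending=False).head()
print(oldest)

# Group-by aggregation, similar to GROUP BY in SQL.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# Pivot table: one row per sex, one column per class, mean survival in cells.
survival_table = df.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(survival_table)
```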
With pandas you can also calculate correlations between features, and this makes it easier to spot redundant information or uninformative columns. For example, you can see that the passenger ID column carries almost zero correlation with the survival of the person on the Titanic, which makes sense since the passenger ID is essentially a random number assigned to the passenger. Whereas, for example, other features such as their gender or the class they were traveling in carry much higher positive or negative correlation with survival. So in conclusion, to explore data you need to be asking questions: questions that may lead you to discover low-hanging fruit or evident problems in your data. Thank you for watching and see you in the next video.
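The correlation check can be sketched as below. This uses a toy table, so the numbers will not match the real Titanic values. Since correlation needs numeric columns, the sketch encodes sex as 0 or 1 in a helper column named `IsFemale`; that column and its encoding are assumptions made for this example.

```python
import pandas as pd

# Toy rows; only the column names follow the Titanic data set.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
})

# Hypothetical helper column: encode sex numerically so it can be correlated.
df["IsFemale"] = (df["Sex"] == "female").astype(int)

# Pairwise Pearson correlations between the numeric columns.
corr = df[["PassengerId", "Survived", "Pclass", "IsFemale"]].corr()

# Correlation of every feature with survival.
survived_corr = corr["Survived"]
print(survived_corr)
```

Reading down the `Survived` column of the correlation matrix gives one number per feature, which is a quick way to see which columns carry little information about the target.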
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.