Learn about the many forms and formats that data comes in with the second course in the Data and Machine Learning series.
Traditionally, machine learning has worked really well with structured data but has been less effective at solving problems with unstructured data. Deep learning works very well with both structured and unstructured data, and it has had successes in fields like translation, image classification, and many others. Learn how to explain the reasons deep learning is so popular. We'll look at the many different data types and their formats, and we'll analyze the vital libraries that allow us to explore and organize data.
This course is made up of 8 lectures, accompanied by 5 engaging exercises along with their solutions. This course is part of the Data and Machine Learning learning paths from Cloud Academy.
- Learn and understand the functions of machine learning when confronted with structured and unstructured data
- Be able to explain the importance of deep learning
- It is recommended that you complete the Introduction to Data and Machine Learning course before starting.
The GitHub repo for this course, including code and datasets, can be found here.
Hello and welcome back. In this video, we're going to talk about tabular data. First, we are going to introduce what tabular data is. Then we're going to talk about where you can find it. And then we will talk about features and feature engineering and how deep learning is useful for feature engineering. So let's start with tabular data. Tabular data is the simplest data you can feed to a machine learning model. It's called tabular because it can be represented in a table with rows and columns. You can think of Excel spreadsheets, but also very common are files called CSV, which stands for comma-separated values, or TSV, which is tab-separated values. There are many other formats which have a tabular nature. Essentially, it's a file where data is organized in rows and columns. You also find tables in databases, or when you have a collection of files that are interconnected through keys that relate them to one another.
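To make the CSV and TSV formats concrete, here is a minimal sketch using Python's standard `csv` module. The sample data and the `read_table` helper are made up for illustration; they are not taken from the course repo.

```python
import csv
import io

# Hypothetical sample data, inlined so the sketch runs without any files.
CSV_TEXT = "user_id,browser,visits\nclient_1,Firefox,12\nclient_2,Chrome,3\n"
# The same table with tabs instead of commas as the separator.
TSV_TEXT = CSV_TEXT.replace(",", "\t")


def read_table(text, delimiter):
    """Parse delimited text into a list of records (one dict per row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_records = read_table(CSV_TEXT, ",")
tsv_records = read_table(TSV_TEXT, "\t")

# Each row becomes one record keyed by the column names in the header.
print(csv_records[0])
# Both formats describe the same table, so the parsed records match.
print(csv_records == tsv_records)
```

Note that the `csv` module returns every value as a string, so a column like `visits` still has to be converted to a number before it can be used as a feature.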
So that's also tabular data: data is arranged in tables, and some of the columns of one table relate to columns of another table, but that's still called tabular data. So, let's use an example to define some common vocabulary that will be used throughout the course. A row in a table corresponds to a data point. It's often referred to as a record. A record is a list of attributes that a data point has. Often these are numbers, but sometimes they are categories, and they could be expressed in text. These attributes that describe a record are also called features. In machine learning, a feature is an individual measurable property of something that is being observed. In other words, features are the properties we are using to categorize our data. These features could be directly measurable. For example, think of the number of times a user visited your website, or the browser that they used, or the time at which they visited your website, and so on. Or features could also be inferred from other features. In this case, they are called calculated or engineered features. For example, that's when you calculate the average time between two user visits from the history of all the visits of the same user. So the process of calculating new features goes under the name of feature engineering. We also need to consider that there are two different types of data. Data could be continuous or discrete. Discrete means the data can only assume certain values, distinct from one another. This is called categorical data. Continuous means that the data can take any value on the real axis, for example. So here are a few examples of what categorical features and continuous features could look like.
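Going back to the calculated feature mentioned above, the average time between visits, here is a minimal sketch of how it could be derived from a visit history. The timestamps and the function name `average_hours_between_visits` are hypothetical, chosen only to illustrate the idea.

```python
from datetime import datetime

# Hypothetical visit history for a single user (the directly measured data).
visits = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 3, 9, 0),
    datetime(2024, 1, 4, 21, 0),
]


def average_hours_between_visits(timestamps):
    """Engineered feature: mean gap in hours between consecutive visits."""
    if len(timestamps) < 2:
        return None  # not enough history to compute a gap
    ordered = sorted(timestamps)
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps) / len(gaps)


record = {
    # Measured feature: counted directly from the raw data.
    "n_visits": len(visits),
    # Calculated feature: inferred from the visit timestamps.
    "avg_hours_between_visits": average_hours_between_visits(visits),
}
print(record)
```

With these sample timestamps the gaps are 48 and 36 hours, so the engineered feature comes out at 42.0 hours.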
Categorical features could be something like eye color, or which courses you've taken at university, or gender, or a binary feature, for example whether an email is spam or not. Continuous features are things like the height or the weight of an object, or the speed of something. In general, anything you can measure. Another important thing to keep in mind is that not all features are equally informative. Some features may be completely irrelevant for what we are trying to do. For example, if you're trying to predict how likely a user is to buy your product, their first and last name will probably have no predictive power. Or, in the example you're seeing, if you're trying to predict whether a user is going to pay off their credit or not, the user ID, client one or client two, will probably have no predictive power. On the other hand, previous purchases in the case of a user, or previous insolvency, may carry a lot of information in terms of the predictive power that they have. Traditionally, a lot of emphasis has been placed on feature engineering and feature selection. These were considered, and are still considered, very important steps in building a machine learning model.
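The distinction between categorical and continuous features, and the idea of leaving out an uninformative column such as the user ID, can be sketched as follows. The records are invented, and one-hot encoding is used here only as one common way to turn categories into numbers; the lecture itself does not prescribe a specific encoding.

```python
# Hypothetical records mixing an identifier, a categorical and a continuous feature.
records = [
    {"user_id": "client_1", "eye_color": "brown", "height_cm": 172.5},
    {"user_id": "client_2", "eye_color": "blue", "height_cm": 180.0},
    {"user_id": "client_3", "eye_color": "brown", "height_cm": 165.2},
]

CATEGORICAL = ["eye_color"]  # discrete: a fixed set of possible values
CONTINUOUS = ["height_cm"]   # continuous: any real number
# "user_id" is left out on purpose: it identifies a row but is assumed
# to carry no predictive information.


def one_hot(value, categories):
    """Encode one categorical value as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


def to_feature_rows(rows):
    """Turn records into purely numeric feature vectors."""
    # Collect the distinct values of each categorical column, in sorted order.
    levels = {col: sorted({row[col] for row in rows}) for col in CATEGORICAL}
    feature_rows = []
    for row in rows:
        vector = [float(row[col]) for col in CONTINUOUS]
        for col in CATEGORICAL:
            vector.extend(one_hot(row[col], levels[col]))
        feature_rows.append(vector)
    return feature_rows


features = to_feature_rows(records)
print(features)
```

Each output row holds the height followed by one indicator per eye color (blue, then brown), so the first record becomes `[172.5, 0, 1]`.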
Deep learning, however, brought a revolution in that, because it solves the problem by automatically figuring out the important features and building higher-order combinations of simpler features deeper in the network. So for example, as you will learn, a network that recognizes objects in an image will build more and more complex hierarchical representations as you go deeper in the architecture. This is another reason why deep learning is so popular. It automates the complicated process of engineering features that work for the problem you have to solve. So in conclusion, in this video, we talked about tabular data and how it's organized in columns and rows. We talked about discrete and continuous features. We talked about measured and calculated, or engineered, features. And we explained how deep learning is actually feature learning: it is very good at figuring out complex features that will help you solve a problem. I hope you enjoyed this video, see you in the next one.
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program of 2011.