Unstructured Data
1h 5m

Learn the ways in which data comes in many forms and formats with the second course in the Data and Machine Learning series.

Traditionally, machine learning has worked really well with structured data but is not as efficient at solving problems with unstructured data. Deep learning works very well with both structured and unstructured data, and it has had successes in fields like translation, image classification, and many others. Learn how to explain the reasons deep learning is so popular. You'll also learn about the many different data types and their formats, and we'll look at the vital libraries that allow us to explore and organize data.

This course is made up of 8 lectures, accompanied by 5 engaging exercises along with their solutions. This course is part of the Data and Machine Learning learning paths from Cloud Academy.

 Learning Objectives

  • Learn and understand the functions of machine learning when confronted with structured and unstructured data
  • Be able to explain the importance of deep learning



The Github repo for this course, including code and datasets, can be found here.


Hello and welcome to this video on Unstructured Data. In this video, you will learn to recognize what unstructured data is and how you can deal with it. As we mentioned earlier, you could be dealing with images, sound, text, or even more exotic data types, for example, movies, protein molecular structures, video games and many others. So the question is: how do you go from rich data types like images and sound to a table whose columns are features with numerical values? This process is called Feature Extraction, and the beauty of Deep Learning is that it can handle most of these data types and learn optimal ways to represent them for the task you're trying to solve. Let's take images, for example. An image is represented in a computer as a table of pixels. For each pixel, we have the values of red, green and blue. So what our image really is, is a three-dimensional table, where you have rows, columns and three color channel values.
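As a minimal sketch of the representation just described, the snippet below builds a tiny 2x2 RGB image as nested Python lists. The pixel values and the `shape` helper are made up for illustration; in practice a numerical library such as NumPy would usually hold this as a single array.

```python
# A tiny 2x2 RGB image: rows -> columns -> [red, green, blue].
# The pixel values are invented for illustration.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def shape(img):
    """Return (rows, columns, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))

rows, cols, channels = shape(image)

# Index by row, then column, then channel (0 = red, 1 = green, 2 = blue).
green_of_top_right = image[0][1][1]
```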

As we shall see later, this is also called a tensor. One way to deal with it is to unroll the image into a very long sequence of numbers. You would walk along each of the three dimensions, and at that point your image would again be a list of numbers, where each feature represents the value of a channel in a particular pixel. Now if our images all have the same size, we can arrange them in a list and we are back to a tabular representation, where each row, each record, is an image and each column is the value of a channel in a particular pixel of that image. However, by doing so we have lost most of the useful information in the image. Specifically, we have lost the fact that each object is represented in the image by pixels that are correlated with one another. In other words, a single pixel in an image doesn't carry very much information. Most of the information is contained in the fact that nearby pixels are correlated. As we shall see, there is a particular configuration of neural nets called convolutional neural nets that is great at dealing with this type of input. We'll learn about this later in the course. Now let's take sound. When sound is digitally recorded, it's a long series of numbers.
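Returning to the image example for a moment, the unrolling step can be sketched as follows. This is only an illustration under stated assumptions: the two images are hypothetical, and `unroll` is a helper name chosen here, not something from the course material.

```python
def unroll(img):
    """Flatten a rows x columns x channels image into one list of numbers."""
    return [value for row in img for pixel in row for value in pixel]

# Two hypothetical 2x2 RGB images of the same size.
image_a = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
image_b = [[[0, 0, 0], [9, 9, 9]], [[5, 5, 5], [1, 1, 1]]]

# Each row of the table is one image; each column is one channel of one pixel.
table = [unroll(image_a), unroll(image_b)]
```

Note how the top-left pixel and the pixel directly below it, which are neighbors in the image, end up six columns apart in the table (columns 0 to 2 versus 6 to 8). That is the loss of spatial structure the transcript refers to.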

And this is what a sound looks like if we represent it over time. The X axis is the time and the Y axis is the amplitude of the sound. So to represent sound data like this, we could still imagine having a table where each row is one particular sound and the columns are the values of the samples of that sound. However, we would encounter problems when we have to deal with sounds of different durations, because they wouldn't have the same number of columns, and it would also be hard to decide when to start. So what's the beginning of a certain sound? This is the problem of synchronization. Also, sound information is usually carried in modulations of frequency, so the raw form may not be the best representation. There are better ways that we will learn about later in the course. For example, using a frequency representation together with a neural network is a great way to encode sound for tasks like music recognition or speech to text. Text documents pose similar challenges. If each data point is a document, we need to find a good representation for it if we want to build a model that identifies it.
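To make the sound discussion concrete, here is a rough sketch of two ideas: zero-padding sounds of different durations so they share a column count, and moving from raw samples to frequencies. Both are assumptions chosen for illustration, not the method the course goes on to teach. The signals are hypothetical, and the naive discrete Fourier transform is written out in plain Python only for clarity.

```python
import cmath
import math

def pad_to_length(samples, length):
    """Zero-pad (or truncate) a list of samples to a fixed length."""
    return (samples + [0.0] * length)[:length]

def magnitude_spectrum(samples):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..n//2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

# Hypothetical signals: a sine wave completing 2 cycles in 16 samples,
# and a much shorter sound of only 3 samples.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
short = [0.5, -0.5, 0.5]

# After padding, both sounds have 16 columns and fit in one table.
table = [pad_to_length(tone, 16), pad_to_length(short, 16)]

# In the frequency view, the tone shows up as a single dominant bin.
spectrum = magnitude_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

Padding fixes the column count but does nothing about synchronization: the same sound shifted by a few samples would produce a very different row.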

There are many techniques to build features from text. For example, you could use parts of speech or word frequency, but you could also use embeddings, as we shall see later. In general, Deep Learning is a great technique for extracting features from text. So in conclusion, unstructured data is anything that does not come in a tabular format, for example, images, sound and text. To deal with it, you will need to extract some features. The good news is that Deep Learning is a great tool to do that. Thank you for watching, and see you in the next video.
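As an illustration of the word-frequency approach mentioned in the video, the sketch below turns two short documents into a table of word counts. The documents are hypothetical, and splitting on whitespace is a simplifying assumption; real text usually needs more careful tokenization.

```python
from collections import Counter

# Hypothetical documents, invented for illustration.
documents = ["the cat sat on the mat", "the dog chased the cat"]

def word_counts(text):
    """Count lowercase, whitespace-separated words in a document."""
    return Counter(text.lower().split())

# One column per distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# Each row is a document; each column is how often that word occurs in it.
table = [[word_counts(doc)[word] for word in vocabulary] for doc in documents]
```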

About the Author

I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.