When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this in-depth introduction to the principles and practice of Amazon Machine Learning, so we could fully satisfy the demand for ML guidance within AWS.
James has the subject thoroughly covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our course on acquiring data for use with Amazon Machine Learning. In this course, we'll talk about how much data you need in order to work with Amazon Machine Learning. Is more always better? And how much is enough? We'll talk about what labeled data is, what its components are, and what we use it for. We'll go over the data formatting basics required by Amazon ML, and finally, we'll wrap up with a demo where we'll acquire some freely available data, clean it up, and send it to Amazon in a format that's compatible with Amazon Machine Learning.
Let's dive right in by talking about how much data you need. The easy answer is it depends. But that answer isn't very helpful, so let's talk about data requirements for Machine Learning. To quote Amazon, "Machine Learning problems start with data, preferably lots of data, for which you already know the target answer." That part in the middle about lots of data can be intimidating, but there's no need to worry about the size of your data. While bigger is generally better, a smaller data set can have some advantages when you're just starting out. So take the data that you have and see what you can make of it.
To make something of it, you train and test a model. Let's take a look at what's involved. Training and testing a model generally involves these five steps: loading the data set; splitting your data into training and testing subsets; creating an ML model using the training set; testing the ML model against the testing set; and finally, evaluating your model's performance. We'll examine these steps in detail as we progress through the course, but you can already see that in order to know whether you have enough data, you just need to try building a model.
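The five steps above can be sketched in plain Python. Everything here is a hypothetical stand-in for illustration: the data set is synthetic, the "model" simply predicts the majority label from the training set (not anything like Amazon ML's actual algorithm), and the 70/30 split ratio is an assumption.

```python
import random

# 1. Load the data set: each observation is (features, label).
#    Here, a synthetic rule: people aged 30+ bought a ticket.
data = [({"age": a, "zip": "90210"}, a >= 30) for a in range(100)]

# 2. Split into training and testing subsets (assumed 70/30).
random.seed(42)
random.shuffle(data)
cut = int(len(data) * 0.7)
train, test = data[:cut], data[cut:]

# 3. "Train" a trivial stand-in model: remember the most common label.
labels = [label for _, label in train]
model = max(set(labels), key=labels.count)

# 4. Test the model against the testing set.
correct = sum(1 for _, label in test if label == model)

# 5. Evaluate performance.
accuracy = correct / len(test)
print(f"accuracy: {accuracy:.2f}")
```

A majority-label baseline like this is obviously too crude for real use, but the shape of the workflow is the same one Amazon ML automates in steps 2 through 4.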
If the model performs well, you have enough data; otherwise, you may need to add more. Fortunately, Amazon ML performs the middle steps for you: splitting the data into training and testing subsets, creating the ML model using the training set, and testing the ML model against the testing set. So, as a developer, you only need to worry about loading the data set and evaluating the test results. We'll cover the evaluation process in more detail later. For now, we'll remain focused on acquiring the data.
So what if you go through the process of creating a model and it doesn't perform well? In that case, you can try adding more data, but don't assume that adding more data will automatically result in a better model. If the data is skewed in some way (for example, the data predominantly describes men, but your customers include both men and women), then simply adding more data may not result in a better model. We often find that ML and big data are discussed in the same circles; the same people who are interested in ML are often interested in big data. Hopefully, by now, you understand that you don't need big data in order to create a model; you only need enough data.
In fact, you want as little data as you can get away with. The smallest data set will be the most efficient to train and evaluate, so you should train your model with the smallest amount of data that works, but no less. If the smallest amount of data that works happens to be big data, then that is the amount you must provide; in many cases, however, it isn't. Now that we have a rough idea of how much data we might need (it could be a little, a lot, or a huge amount), let's talk about what kind of data we need. You may be wondering what labeled data is. We encounter a lot of jargon when working with Amazon ML; some of these terms might be new to you, and some are just different names for the same concept.
So let's make sure we understand this very important ML term: labeled data. Sometimes you'll hear the term labeled examples, which means the same thing. Both refer to data you provide to a supervised machine learning algorithm so that it can learn from it. Labeled data is different from unlabeled data because labeled data contains within it the correct answer to a question. So if the question is something like, "Will this person buy a movie ticket?" then the label is the part of the data that explicitly says yes or no. And we know this answer for sure because labeled data is usually historical data.
Because labeled data usually consists of examples of behavior we have seen in the past, these examples are often referred to as observations. Again, these are just two terms for the same concept.
Besides a label, which is the correct answer to the question, labeled data contains variables, which the ML model can use to identify the correct answer. These variables describe the observation and are often referred to as features. Continuing our example question, "Will this person buy a movie ticket?" variables might include age, gender, profession, ZIP code, or various other pieces of data.
Since the subject we are observing, in this case, is a person, the features will be ways in which people differ from each other, but we can expand our concept of the subject under observation beyond just the person. If we treat the observation as an interaction, we can include features like, "Was this person offered a coupon?" or "What time of day is the movie showing?" Later, once we've created the machine learning model, we can use the model with unlabeled data. Unlabeled data is data where we do not know the correct answer to the question; the model provides the answer for us. So remember that the machine learning algorithm and the machine learning model are two different things.
The algorithm builds a model from labeled data and the model analyzes unlabeled data and provides a label. And for one final piece of jargon, remember that providing a label is just another way to say make a prediction.
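To make the jargon concrete, here is a small sketch of labeled versus unlabeled observations. The field names (`age`, `zip`, `offered_coupon`, `will_buy`) and the coupon-based rule are hypothetical; the "model" is a hand-written stand-in for one a learning algorithm would build.

```python
# Labeled data: each observation carries its features AND the correct
# answer (the "will_buy" label), because it is historical data.
labeled = [
    {"age": 25, "zip": "10001", "offered_coupon": True,  "will_buy": True},
    {"age": 52, "zip": "94105", "offered_coupon": False, "will_buy": False},
    {"age": 31, "zip": "60601", "offered_coupon": True,  "will_buy": True},
]

# Unlabeled data: features only; the answer is unknown.
unlabeled = {"age": 27, "zip": "73301", "offered_coupon": True}

# A toy model standing in for one an ML algorithm would learn:
# predict a purchase whenever a coupon was offered.
def predict(observation):
    return observation["offered_coupon"]

# Providing a label for unlabeled data is "making a prediction".
print(predict(unlabeled))
```

The division of labor matches the text: the algorithm's job is to produce `predict` from `labeled`; the model's job is to supply the missing `will_buy` value for `unlabeled`.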
Now that we know that labeled data is composed of variables and a label, how do we tell Amazon ML what our variables are, and what our label is? In other words, what format does Amazon expect us to provide our data in? Data comes in all shapes and sizes, but Amazon Machine Learning only accepts CSV files. If you are unfamiliar with the term CSV, it is short for comma-separated values. A CSV file is like a table in a database; it has rows and columns. Some CSV files use the first row as headers to describe the contents of the columns. For Amazon ML, each row is an example observation and almost every column is a variable that describes the observation.
The only column that is not a variable is the column that contains the label. While many people casually use the term CSV to refer to any character-delimited format, Amazon is very specific about the requirements: the separator must be a comma, not any other character such as a semicolon or a tab. You might acquire data from a university or government agency that is advertised as a CSV file but in reality uses another character as a separator. So when preparing a new data set, make sure to check the separator and all other formatting requirements before trying to use it as a machine learning data source. Likewise, many people confuse spreadsheet files with CSV files because they often open in the same application. You can still use spreadsheets as data sources as long as you convert them to CSVs first, and you can use data sources with the wrong separator as long as you change the separator first. In our upcoming demo, we'll take a closer look at the formatting requirements and go through the process of acquiring, cleaning up, and uploading labeled data.
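If a data source arrives with the wrong separator, Python's standard `csv` module can rewrite it as a true comma-separated file. This sketch uses in-memory `StringIO` buffers and made-up sample data so it is self-contained; with a real file you would open the input and output paths instead.

```python
import csv
import io

# Hypothetical source data that was advertised as "CSV" but actually
# uses semicolons as the separator.
semicolon_data = "age;gender;will_buy\n25;F;yes\n52;M;no\n"

src = io.StringIO(semicolon_data)  # stands in for the downloaded file
dst = io.StringIO()                # stands in for the cleaned-up file

reader = csv.reader(src, delimiter=";")
writer = csv.writer(dst)           # csv.writer defaults to a comma
for row in reader:
    writer.writerow(row)

print(dst.getvalue())
```

Using `csv.reader`/`csv.writer` rather than a naive `str.replace(";", ",")` matters because the module correctly handles quoted fields that might themselves contain separator characters.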
James is happiest when creating or fixing code, and he strives to stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.