When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our course on feature processing in Amazon Machine Learning. In this lecture, we'll cover the concept of feature processing, how we can use feature processing, and which feature processing functions are available with Amazon ML. Finally, we'll wrap up this lecture with a demo that will use the banking promotion data source to create and evaluate a machine learning model.
So what is feature processing? After learning about your data with data insights, you can use feature processing to make the data more meaningful. The best data for machine learning is generalizable. This means that we want data that represents general information about a class of observations and is not overly unique to a specific observation. For example, my thumbprint is data that is overly specific to me as a human, but the fact that I have an opposable thumb might be useful for distinguishing me as a primate rather than a gerbil.
Let's look at some common examples of ways to improve data through feature processing. Time stamps are a great example of data that is too specific to an observation. Even if the model can learn something from the exact time stamp, it doesn't matter. That time is already in the past, and making predictions is about the future. However, a time stamp contains a wealth of more general data points, which we can extract through feature processing. These data points include the hour of the day, day of the week, month of the year, and more. And these data points are useful. 12 noon will happen again in the future, we hope, and the model may be able to learn whether our customers are more or less receptive at lunchtime. All we need to do is transform the input data into this feature and let the machine learning algorithm do the rest.
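As a rough illustration of the time stamp idea, here is a minimal Python sketch of the kind of pre-processing you might run before creating a data source. The function name and the ISO 8601 input format are assumptions made for this example; they are not part of Amazon ML.

```python
from datetime import datetime

def timestamp_features(ts):
    """Break an ISO 8601 time stamp into general, repeatable features."""
    dt = datetime.fromisoformat(ts)
    return {
        "hour_of_day": dt.hour,        # 0-23
        "day_of_week": dt.weekday(),   # Monday is 0, Sunday is 6
        "month_of_year": dt.month,     # 1-12
    }

# A lunchtime observation: the exact moment never repeats, but noon does.
features = timestamp_features("2016-03-14T12:05:00")
```

The original time stamp column would then be dropped, and these three columns would take its place in the input data.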
Another example is missing or invalid data. Sometimes your data is incomplete. For example, this might happen if you've been collecting data over a long period of time and the data that you captured has evolved over time. Let's take one completely made-up example. Let's imagine that there was an e-commerce website that started by selling books. Since they only sold books, there was no concept of a department ID in their transaction records. Later this totally imaginary e-commerce website decided to sell other products and added the department ID to the transaction record. So in a scenario like this, it makes perfect sense for the former bookseller to add the book department ID to records where the ID is missing. Of course, we have to be very careful when changing data. We have to really understand what we are doing when we develop a strategy for dealing with missing or invalid data.
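The bookseller scenario might be sketched like this in Python. The field name `department_id` and the value used for the book department are hypothetical, chosen only for illustration, and this sort of fix would happen while preparing your data, before the data source is created.

```python
BOOK_DEPARTMENT_ID = 1  # hypothetical ID for the book department

def fill_missing_department(records, default_id=BOOK_DEPARTMENT_ID):
    """Return copies of the records with a missing department ID filled in."""
    filled = []
    for record in records:
        row = dict(record)  # copy so the original data stays untouched
        if row.get("department_id") is None:
            row["department_id"] = default_id
        filled.append(row)
    return filled

transactions = [
    {"order": "A-1", "department_id": None},  # old record, ID never captured
    {"order": "A-2"},                         # old record, field absent
    {"order": "B-7", "department_id": 4},     # newer record, left alone
]
cleaned = fill_missing_department(transactions)
```

This default is only safe because we know the old records could only have been book sales. Without that kind of domain knowledge, filling in a value can mislead the model.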
Next, let's consider Cartesian products. We might have data about the population density of the United States. We can use this information to determine which areas are rural and which are suburban and which are urban. These categories might actually provide insight into the classes of people who live in these areas. On the other hand, the observed behavior in these areas might not strictly depend on population density.
We might wonder if statewide culture influences behavior more, in which case knowing that the observed behavior occurred in a southeastern state versus a northwestern state might provide more insight. We could provide both pieces of information as variables to the machine learning algorithm, and that's fine, but we might also wonder if urbanites in one state not only behave differently from urbanites in another state, but also behave differently from rural people in their own state. A Cartesian product can produce a new feature that combines these two pieces of information about state and density. For example, we can create features like urban Oregon, suburban Oregon, and rural Oregon, and similar features for every other state.
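To make the state and density example concrete, here is a small Python sketch of a Cartesian product over two categorical variables. The function name and the underscore-joined labels are assumptions for this example, not the format Amazon ML produces internally.

```python
from itertools import product

def cartesian_feature(state, density):
    """Combine two categorical values into a single combined category."""
    return f"{density}_{state}"

states = ["Oregon", "Georgia"]
densities = ["urban", "suburban", "rural"]

# Every state paired with every density class.
combined = [cartesian_feature(s, d) for s, d in product(states, densities)]
```

Note that the number of combined categories is the number of states times the number of density classes, so this feature can grow quickly when both inputs have many distinct values.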
Another example is binning. Some numeric variables have a roughly linear relationship with the target. Price is a classic example. If your target is, "Will this person purchase this book?" then lowering the price generally increases sales and raising it generally lowers sales. This type of value should be treated as a numeric value when building an ML model.
Not all numbers have this type of relationship with the target. For example, age does not have a linear relationship with the book purchasing decision. As a group, 48-year-olds do not generally buy incrementally greater amounts than 47-year-olds. However, it may be true that 40 to 50-year-olds buy in significant quantities, while 20 to 30-year-olds do not. When this is true, we can use binning to convert a numeric value to a categorical value. In many cases, Amazon Machine Learning will decide to bin variables based on its statistical analysis of the data sources and include the transformation in the recipe it provides. Later on, I'll show you how to view this recipe and change it if you like.
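Here is one way the age example could look in Python. This sketch uses fixed, hand-picked bin edges to keep the idea easy to follow; Amazon ML's own transformation is quantile binning, which chooses the edges from the data. The edges and labels below are assumptions for illustration only.

```python
from bisect import bisect_right

AGE_EDGES = [20, 30, 40, 50, 60]  # hypothetical bin boundaries

def age_bin(age):
    """Convert a numeric age into a categorical age-range label."""
    i = bisect_right(AGE_EDGES, age)
    if i == 0:
        return "under_20"
    if i == len(AGE_EDGES):
        return "60_plus"
    return f"{AGE_EDGES[i - 1]}_to_{AGE_EDGES[i] - 1}"

# 47 and 48 are different numbers but land in the same category.
labels = [age_bin(a) for a in (25, 47, 48)]
```

After binning, the model no longer sees 48 as "one more than 47"; it only sees which group the customer belongs to.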
It's also worth remembering that you may have domain-specific features that give you ways to combine variables that don't apply to another domain. An easy example of this is volume. If you have length, width, and height, you can create the volume feature by multiplying them together.
Text data has similar problems to the original time stamp example. Although there is a wealth of valuable information in the text, most text strings aren't general enough for the machine learning algorithm to work with.
One solution is to split the string up into words, and Amazon ML will do this automatically for text data. However, when splitting strings into words, we lose contextual data. Words near each other may have additional meaning. Consider the two sentences "Don't click on suspicious links" and "Click here to get rich." Both have the word "click" in them, but they mean exactly opposite things. An n-gram transformation extracts features from text while attempting to preserve some context by keeping nearby words together. N-grams can vary in length. Simply splitting strings into words creates unigrams, keeping two words together creates bigrams, and keeping three words together creates trigrams. Put four words together and you have a four-gram, and so on. Amazon ML supports n-grams up to 10 words. Those are just some examples of feature processing you can do. Some of those examples you can do right in Amazon ML, and other examples you will need to do while preparing your data source.
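The n-gram idea can be sketched in a few lines of Python. This is a simplified illustration, not Amazon ML's implementation: it splits on whitespace, lowercases the words, and returns only n-grams of exactly the requested size. The function name is an assumption for this example.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

warning = "Don't click on suspicious links"
lure = "Click here to get rich"

unigrams = ngrams(warning, 1)        # single words, context lost
warning_bigrams = ngrams(warning, 2) # pairs keep "don't" next to "click"
lure_bigrams = ngrams(lure, 2)       # pairs keep "click" next to "here"
```

With unigrams, both sentences share the feature "click". With bigrams, the first sentence yields "don't click" and the second yields "click here", which gives the model a chance to tell them apart.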
Amazon ML can apply the following feature transformations to a data source after it has been created but before the ML model is created: n-grams of size one to 10, orthogonal sparse bigrams, lowercase transformation, punctuation removal, quantile binning, numeric normalization, and Cartesian product. If these transformations do not meet your needs, then you can, of course, perform any custom transformation you desire using the programming language of your choice during the pre-processing phase, before the data source is created.
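As an example of what one of these transformations does, here is a rough Python sketch of numeric normalization, assuming the common definition of rescaling values to a mean of zero and a variance of one. The function name is hypothetical, and this is a stand-in for the concept rather than Amazon ML's own code.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale numbers so they have mean 0 and variance 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # Every value is identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

balances = [10, 20, 30]
scaled = normalize(balances)
```

Normalizing puts variables measured on very different scales, such as an account balance and a number of contacts, on comparable footing for the learning algorithm.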
In our next demo, we'll use the banking promotion data source to create an ML model. We'll examine the default recipe provided by Amazon ML, make a small change to it, and then build and evaluate the model.
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.