Working with Data Sources
When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our course on analyzing data with Amazon Machine Learning. In this lecture, we'll talk about ways to analyze data and what we can expect when creating a datasource in Amazon ML. Finally, we'll wrap up this section by creating our first datasource using the banking promotion data set from our previous lectures.
When working with data, we typically use visualizations or descriptive statistics to get some understanding of what the data means. Visualizations include graphs and charts, while descriptive statistics are numbers that summarize data, like an average or a median. Later, after we've learned a bit about our data, we can use that knowledge to help make a better ML model. This might be as simple as removing variables that have little correlation with the target value, or it can involve combining or transforming variables into new values that have more general predictive power. For example, we could take a continuous variable, like age, and transform it into a binned value, such as 18-35 year-olds or people over 50.
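As a rough illustration of the binning idea, here is a minimal Python sketch. The bracket boundaries and labels are assumptions made up for this example; they are not taken from the banking data set or from Amazon ML.

```python
def age_bin(age):
    """Map a numeric age to a coarse age bracket (hypothetical brackets)."""
    if age < 18:
        return "under-18"
    if age <= 35:
        return "18-35"
    if age <= 50:
        return "36-50"
    return "over-50"


ages = [17, 22, 35, 41, 50, 63]
binned = [age_bin(a) for a in ages]
print(binned)
```

The binned column can then be treated as a categorical variable instead of a numeric one.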
Amazon ML includes a feature called "Data Insights" which will compute statistics and create visualizations for us. We can use these visualizations and statistics to evaluate our data before we create an ML model. The type of statistics and visualizations will vary depending on the type of input data.
For numeric variables, Amazon will produce a distribution histogram. We can look at a histogram shape to get an idea of how the data is distributed. We might have a classic bell curve or the data might be skewed in one direction or another, or it could even have several peaks. Histograms are a standard visualization and the higher the bar, the more values fall into that range. Depending on the variable, you might expect any of these distributions, so there is no universally correct distribution. However, if you do expect a certain distribution, this is your chance to check it.
In addition to histogram visualization, Amazon will show us some descriptive statistics for numeric variables. These include min-max ranges, the mean and the median. These values will not be shown as a plot in Amazon ML, but I've included a visualization here for illustration purposes. Like the distribution histogram, you can use these values as a sanity check. Do they match up with your expectations? For example, if negative numbers are invalid, say for an age variable, then a range which includes negative numbers should alert us to a problem with the data.
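If you want to run the same kind of sanity check locally before uploading, something like the following sketch would work. It uses only the Python standard library; the sample ages and the bin width are invented for illustration, and this is not how Amazon ML computes its own statistics.

```python
import statistics
from collections import Counter


def numeric_summary(values, bin_width):
    """Return min, max, mean, median and fixed-width histogram counts."""
    # Each value is assigned to the bin that starts at the nearest
    # lower multiple of bin_width.
    histogram = Counter((v // bin_width) * bin_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "histogram": dict(sorted(histogram.items())),
    }


ages = [23, 31, 35, 38, 42, 47, 58, -4]  # -4 is a deliberately bad value
summary = numeric_summary(ages, bin_width=10)
print(summary)
```

Here the negative minimum is the signal that something is wrong with the age column.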
For binary data, we also have distribution histograms. But since binary data can only take two values, these diagrams are a little less interesting. Although I've picked one that is basically even, you might see the values skewed towards positive or negative. You should never see a bell curve, though, because there just aren't enough different categories.
Again, there is no universally correct distribution. You have to ask yourself if the distribution makes sense in your scenario. The descriptive statistics for binary data might also be useful. But they, like the histogram, might also induce a yawn. The percentage-of-true statistic should correspond to the distribution histogram, so there's no news there. The invalid value statistic will tell you how many observations had some value other than zero or one.
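The two binary statistics are easy to reproduce by hand. The sketch below assumes the column arrives as strings and that only "0" and "1" count as valid, as described in this lecture; the sample values are made up.

```python
def binary_summary(values):
    """Return (percent of valid values that are true, count of invalid values)."""
    valid = [v for v in values if v in ("0", "1")]
    invalid_count = len(values) - len(valid)
    percent_true = 100.0 * valid.count("1") / len(valid) if valid else 0.0
    return percent_true, invalid_count


raw = ["1", "0", "0", "1", "yes", "0", "N/A", "1"]
percent_true, invalid_count = binary_summary(raw)
print(percent_true, invalid_count)
```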
For categorical data, Amazon chose a histogram which illustrates the count by value for that category. As you can see here, the histogram is sorted in descending order. So you shouldn't expect to see too many patterns in terms of bell curves or skewness. It doesn't mean a whole lot except that the value on the left was the most common value.
Amazon also provides descriptive statistics for categorical data. In this case, Amazon will report the number of unique values encountered, the most frequent categories, and also the least frequent. The least frequent categories may be interesting because, as you saw in the histogram, they might have been hidden in the rightmost column, which was just labeled "other."
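A comparable summary for a categorical column can be sketched with `collections.Counter`. The `jobs` values and the `top_n` parameter are hypothetical and only serve the example.

```python
from collections import Counter


def categorical_summary(values, top_n=2):
    """Return the unique-value count plus the most and least frequent categories."""
    counts = Counter(values)
    ranked = counts.most_common()  # sorted by count, descending
    return {
        "unique": len(counts),
        "most_frequent": ranked[:top_n],
        "least_frequent": ranked[-top_n:],
    }


jobs = ["admin", "technician", "admin", "retired",
        "admin", "technician", "student"]
job_summary = categorical_summary(jobs)
print(job_summary)
```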
Text data is the only type of data where Amazon provides only descriptive statistics and no histogram. Given the large number of words you might expect to find in text data, this makes sense. As you can see in this example, we have 8,750 unique words. Charting these words would result in a large number of categories to plot. We can still use descriptive statistics to get a feel for how many words are found in our data overall, the range of words found in each example, and the range of word lengths. The first few prominent words are shown in the summary table and we can click on them to expand the list and see more data. Again, you may or may not know what to expect from this data when you start. If you do have an idea what your data looks like, you can use these stats to double check your assumptions. Otherwise, if you're just getting to know your data, these insights provide your first clues about the shape of the data.
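The text statistics can be approximated in a few lines as well. This sketch assumes simple whitespace tokenization and lowercasing, which may differ from how Amazon ML splits text; the sample notes are invented.

```python
def text_summary(documents):
    """Summarize word counts and word lengths across a list of text examples."""
    tokenized = [doc.lower().split() for doc in documents]
    all_words = [word for words in tokenized for word in words]
    words_per_doc = [len(words) for words in tokenized]
    word_lengths = [len(word) for word in all_words]
    return {
        "total_words": len(all_words),
        "unique_words": len(set(all_words)),
        "words_per_example": (min(words_per_doc), max(words_per_doc)),
        "word_length": (min(word_lengths), max(word_lengths)),
    }


notes = [
    "call the customer again",
    "customer declined",
    "no answer call later today",
]
notes_summary = text_summary(notes)
print(notes_summary)
```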
Finally, for all data types, Amazon computes a correlation to a target value, which gives you an idea of the impact that the variable has on the target or the prediction that you're looking for. For the most part, you don't need to do anything with this data. You can still include variables that have a low correlation, as long as your model continues to perform well. But if your model does not perform well, you may want to remove some variables from consideration or process them into new features that will perform better. We'll cover this in a later lecture.
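To get a feel for what a correlation-to-target number means, here is a small sketch that computes a Pearson correlation coefficient by hand. The lecture does not say which correlation measure Amazon ML uses, so Pearson is an assumption chosen for illustration, and the columns `call_duration` and `row_id` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


target = [0, 0, 1, 1, 1, 0]                    # 1 = accepted the promotion
call_duration = [60, 90, 400, 350, 500, 120]   # tracks the target closely
row_id = [1, 2, 3, 4, 5, 6]                    # mostly noise

print(pearson(call_duration, target))
print(pearson(row_id, target))
```

A variable like `row_id` would show up with a low correlation; whether to drop it is a decision best left until after the model has been evaluated.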
Now that you have an idea of what you will see when you create an ML datasource, how do you interpret this data? There's no hard and fast rule. You can't always know when looking at the stats whether or not the data seems right, especially if you're just beginning to explore the meaning of the data you've gathered. On the other hand, if you have worked with the data for a while, and you do know something about it, these stats can serve as a sanity check. Do you expect an equal distribution of men and women in your data? If so, you know you have a problem if your stats show an imbalance toward one gender or the other.
Pay attention to missing values. These can damage the quality of your end model. If you can add the missing value, or provide a reasonable default, then go back and do so. When Amazon ML encounters an observation with missing data, that observation is rejected and not used to train the model. Remember that if 10,000 observations in a data source are rejected, then the entire data source will be rejected. So even if your data is not rejected entirely, be aware that when you have missing data you might not be getting the full benefit of the data you collected. It makes sense to go back to the preparation stage and clean up your data by adding new observations or filling in sensible defaults if possible, then try again. The same can be said for invalid values. Remember that binary data must be submitted as zero for false and one for true. Other values like yes, no, or N/A are not valid and the model won't be able to use them to learn.
Numeric data is similar; it must be a number. N/A or some other placeholder value cannot be processed as numeric data. Again, you have two choices: add more data with valid values, or determine how to clean up or transform the values during preparation so that the data is valid.
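A cleanup pass during preparation might look roughly like the sketch below. The column names `age` and `subscribed`, the yes/no mapping, and the default age are all assumptions for illustration; pick defaults that make sense for your own data.

```python
# Hypothetical mapping from raw text to the 0/1 encoding described above.
BINARY_MAP = {"yes": "1", "no": "0", "1": "1", "0": "0"}


def clean_row(row, default_age):
    """Return a cleaned copy of the row, or None if it cannot be repaired."""
    subscribed = BINARY_MAP.get(row["subscribed"].strip().lower())
    if subscribed is None:
        return None  # e.g. "N/A": there is no sensible default for the target
    try:
        age = float(row["age"].strip())
    except ValueError:
        age = default_age  # placeholder text such as "N/A" or an empty string
    return {"age": age, "subscribed": subscribed}


rows = [
    {"age": "34", "subscribed": "yes"},
    {"age": "N/A", "subscribed": "0"},
    {"age": "51", "subscribed": "N/A"},
]
candidates = (clean_row(r, default_age=40.0) for r in rows)
cleaned = [c for c in candidates if c is not None]
print(cleaned)
```

Rows that cannot be repaired are dropped here rather than uploaded, so they never count against the data source as rejected observations.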
We spoke briefly about variable-to-target correlation. Variables with low correlation might be considered noise and could damage your model's predictive power. However, it's really too early at this stage to make that determination. We can learn more about the data by building and evaluating a model, and then deciding whether we need to reconsider the included variables after we see how the model performs. In our next demo, we'll create a datasource from the banking promotion data, which we prepared in a previous lecture.
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.