When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our lecture on improving machine learning models in Amazon ML. In this lecture, we'll cover the model improvement process, model fitness considerations, and options for improving accuracy. As we have discussed in previous lectures, we often need to build and evaluate a model before we even know if we've collected enough data or the right data. So it should come as no surprise that the first model we build may not perform as well as we would like. Creating the model will be an iterative process. On each iteration, you can make a change to improve performance, and every iteration follows the familiar pattern which we have been discussing.
After evaluating your model, you have some options if it doesn't perform as well as you would like. You can increase the number of observations by collecting more data. You can add more variables to existing observations, possibly by creating them with feature processing. Finally, you can tune the model parameters in the advanced model creation settings. Once you have made one or more of these changes, you repeat the process by building a new model and evaluating it once more.
But how do we know what we might need to change? There are several things to choose from, and your best option will depend on exactly why your model is not performing well. Because Amazon ML models are all linear models, there are three basic outcomes. An underfitting model predicts values poorly, even when predicting values for data it has already seen. This means the model is unable to capture the relationship between the input examples and the target values.
A balanced model does a good job of predicting values, but don't expect it to be perfect. A good model predicts a pattern, not exact values. An overfitting model appears to predict values very accurately for data it has already seen, but actually does a poor job of predicting values for new data. This means the model has simply memorized the answers for familiar data. Each of these outcomes will influence your options for improving accuracy.
Underfitting can be caused by a model that is too simple. We can make the model more flexible by adding more variables to the input data. This might involve including more Cartesian products, or it might involve tweaking the feature processing parameters, such as increasing n-gram sizes. Finally, we can decrease the amount of model regularization used in the advanced model settings. When the model is overfitting, it is actually too flexible, so we should take the opposite approach.
To correct overfitting, remove some features, discard some Cartesian products or numeric bins, decrease your n-gram sizes, and finally increase the amount of model regularization in the advanced model settings. Of course, we always come back to one of our original questions: how much data do we need? If we end up with a poorly performing model, then increasing the number of examples may improve performance. We could also go into the advanced model settings and increase the number of passes taken over the data while the model is learning. Whatever we do, remember, it's an iterative process.
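The feature-processing levers mentioned here (n-gram sizes, numeric bins, and Cartesian products) are set in the Amazon ML recipe. The following is a minimal sketch of how a recipe could be made richer or simpler between iterations. The variable names `review_text`, `age`, and `country` and the helper `build_recipe` are hypothetical, and the specific sizes are illustrative assumptions, not recommendations.

```python
import json


def build_recipe(ngram_size, bin_count, include_cartesian):
    """Return an Amazon ML recipe (as a dict) with adjustable complexity.

    The variables review_text, age, and country are hypothetical examples.
    """
    outputs = [
        # Larger n-gram sizes produce more text features.
        f"ngram(lowercase(no_punct(review_text)),{ngram_size})",
        # More bins give a linear model more room to fit a numeric variable.
        f"quantile_bin(age,{bin_count})",
        "country",
    ]
    if include_cartesian:
        # A Cartesian product lets the model learn an interaction
        # between two variables.
        outputs.append(f"cartesian(country,quantile_bin(age,{bin_count}))")
    return {"groups": {}, "assignments": {}, "outputs": outputs}


# A richer recipe to try when the model underfits.
richer = build_recipe(ngram_size=3, bin_count=50, include_cartesian=True)
# A simpler recipe to try when the model overfits.
simpler = build_recipe(ngram_size=1, bin_count=10, include_cartesian=False)

print(json.dumps(richer, indent=2))
```

The resulting JSON text is what you would supply as the recipe when creating the next model.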
Your first, second, or even third model may not perform as well as you would like. Depending on its actual performance, you can make adjustments to your data or your model settings.
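The model settings discussed in this lecture, regularization and the number of passes, correspond to training parameters that can also be set through the API. As a rough sketch, the helper below (a hypothetical function, not part of any SDK) builds the parameter map using the `sgd.*` parameter names from the Amazon ML API. The numeric values are illustrative assumptions only.

```python
def training_parameters(max_passes, regularization_type=None, amount=None):
    """Build a training parameter map for an Amazon ML model.

    Amazon ML expects every value in the map to be a string.
    L1 and L2 regularization are treated as mutually exclusive here.
    """
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    params = {"sgd.maxPasses": str(max_passes)}
    if regularization_type is not None:
        if amount is None or amount < 0:
            raise ValueError("a non-negative amount is required")
        keys = {
            "L1": "sgd.l1RegularizationAmount",
            "L2": "sgd.l2RegularizationAmount",
        }
        if regularization_type not in keys:
            raise ValueError("regularization_type must be 'L1' or 'L2'")
        params[keys[regularization_type]] = repr(float(amount))
    return params


# Underfitting: less regularization and more passes over the data.
underfit_fix = training_parameters(20, "L2", 1e-8)
# Overfitting: more regularization.
overfit_fix = training_parameters(10, "L2", 1e-4)

print(underfit_fix)
print(overfit_fix)
```

With boto3, a map like this would be passed as the `Parameters` argument of the `machinelearning` client's `create_ml_model` call, alongside the model ID, model type, and training datasource ID.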
As always, the true test is how the model performs in evaluation.
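One way to turn evaluation results into a decision is to compare an error metric on data the model has already seen with the same metric on held-out evaluation data. The sketch below assumes you have both numbers (for example, RMSE, where lower is better). The thresholds `target_error` and `max_gap` are hypothetical and would need to be chosen for your own problem.

```python
def diagnose_fit(train_error, eval_error, target_error, max_gap):
    """Label a model as underfitting, overfitting, or balanced."""
    # Poor results even on familiar data suggest the model is too simple.
    if train_error > target_error:
        return "underfitting"
    # Good results on familiar data but much worse results on new data
    # suggest the model has memorized its training examples.
    if eval_error - train_error > max_gap:
        return "overfitting"
    return "balanced"


# Options for each outcome, as described in this lecture.
NEXT_STEPS = {
    "underfitting": [
        "add variables or Cartesian products",
        "increase n-gram sizes",
        "decrease regularization",
    ],
    "overfitting": [
        "remove features, Cartesian products, or numeric bins",
        "decrease n-gram sizes",
        "increase regularization",
    ],
    "balanced": ["keep the model, or collect more data to refine it"],
}

outcome = diagnose_fit(
    train_error=0.05, eval_error=0.30, target_error=0.20, max_gap=0.05
)
print(outcome, NEXT_STEPS[outcome])
```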
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.