When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our lecture on understanding prediction in Amazon Machine Learning. In this lecture we'll cover the two different types of prediction results that you can get from Amazon ML: batch prediction results and real-time prediction results.
Amazon provides batch prediction results as a CSV file. This file is placed into an S3 bucket which you own. The columns in the CSV will vary by model type, but the first row will always contain column headers.
Every row of the input file will produce one row in the output file. The predictions will appear in the same order in the output as they did in the input. If an error prevents the model from creating your prediction for a particular input row, then the corresponding output row will be blank in order to preserve the order.
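Because order is preserved and failed rows are left blank, you can pair each output row with the input row that produced it. The sketch below is a minimal illustration of that idea, not Amazon's tooling: the sample CSV text, the column headers, and the `cust-N` identifiers are all made up. It uses `csv.reader` rather than `csv.DictReader`, because `DictReader` silently skips blank lines and would shift every later prediction out of position.

```python
import csv
import io

# Hypothetical batch output; the values are invented for illustration.
# The third input row failed, so its output row is blank.
OUTPUT_CSV = "bestAnswer,score\n1,0.91\n0,0.12\n\n1,0.67\n"
INPUT_IDS = ["cust-1", "cust-2", "cust-3", "cust-4"]  # hypothetical row identifiers


def pair_predictions(input_ids, output_text):
    """Pair each input identifier with its output row, or None if the row is blank."""
    reader = csv.reader(io.StringIO(output_text))
    header = next(reader)  # the first row holds the column headers
    paired = []
    for row_id, row in zip(input_ids, reader):
        # csv.reader yields an empty list for a blank line
        paired.append((row_id, dict(zip(header, row)) if row else None))
    return paired


pairs = pair_predictions(INPUT_IDS, OUTPUT_CSV)
for row_id, prediction in pairs:
    print(row_id, prediction)
```

Rows paired with `None` are the ones to inspect or resubmit.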
For binary batches the results will contain two columns, best answer and score. The score is the probability that an observation is in the membership class.
The best answer column contains the predicted label. For binary models the predicted label will be one whenever the score is greater than the cutoff score. Otherwise the label will be zero.
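The cutoff rule for binary models can be sketched in a few lines. The function name `label_from_score` and the default cutoff of 0.5 are assumptions made for this illustration; in practice you would use the cutoff configured for your own model.

```python
def label_from_score(score: float, cutoff: float = 0.5) -> int:
    """Return 1 when the score is greater than the cutoff, otherwise 0."""
    # The comparison is strict: a score equal to the cutoff is labeled 0.
    return 1 if score > cutoff else 0


scores = [0.91, 0.50, 0.12]  # made-up scores
default_labels = [label_from_score(s) for s in scores]
strict_labels = [label_from_score(s, cutoff=0.95) for s in scores]
print(default_labels, strict_labels)
```

Raising the cutoff turns fewer observations into positive predictions, which is why the same scores can yield different best answers under different cutoffs.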
For multi-class batches, there will be many columns. Specifically, there will be one column for each class in the training data, and one column called best answer where AWS will place the predicted label.
Each remaining column will be labeled with a class name and will contain the score for that class.
As discussed in previous lectures, each class is treated in turn as a binary classification problem where that class is the membership or true class. Then the probability that the observation is a member of each class is computed and the class with the highest probability is chosen as the best answer. The column for each class will show the probability that the observation is in that particular class.
Regression model result files are extremely simple. There is one column labeled score. The column contains the raw numeric prediction for each observation in the input data.
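For multi-class results, choosing the best answer amounts to picking the class with the highest score. The sketch below recomputes it from the per-class columns as a sanity check. The sample CSV text, the class names, and the exact header spelling `bestAnswer` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical multi-class output; class names and scores are invented.
SAMPLE = (
    "bestAnswer,blue,green,red\n"
    "green,0.25,0.65,0.10\n"
    "red,0.05,0.15,0.80\n"
)


def class_scores(row):
    """Return the per-class scores of one output row as floats."""
    return {name: float(value) for name, value in row.items() if name != "bestAnswer"}


def best_answer(scores):
    """Return the class whose score is highest."""
    return max(scores, key=scores.get)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
recomputed = [best_answer(class_scores(row)) for row in rows]
print(recomputed)
```

If the recomputed labels ever disagree with the best answer column, the file was probably parsed with the wrong headers.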
The second type of prediction supported by Amazon is real-time prediction. Amazon can provide low latency predictions for interactive applications. When you want to use the low latency option, you provide Amazon with a single input observation and wait for the response. You request a real-time prediction by sending unlabeled data to an API endpoint in the form of JSON data. Likewise, Amazon responds with JSON data containing the prediction.
Amazon tries to reply quickly to your request. How fast is this low latency option? Amazon says that the Amazon ML system is designed to respond to most online prediction requests within 100 milliseconds. As with batch prediction, the real-time response will vary based on the model type. The response will always contain a field which indicates the type of model that generated the result so that you can process it appropriately.
JSON responses for binary models will contain a field called predicted label, just as the batch results contain the best answer column. The raw score can be found in the predicted scores map, which for a binary model will have only one entry.
The key will be the label and the value will be the score. For multi-class models, the predicted class can be found in the predicted label field. Scores for the predicted class and all other classes can be found in the predicted scores map. The map key will be the class, and the map value is the probability that the observation is in that class.
For regression models, the response will contain only the predicted value field, which holds the regression score, along with the default metadata about the model.
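Handling a real-time response therefore comes down to checking the model type and reading the matching fields. The sketch below works on hand-written sample JSON and makes no network call. The field names (`Prediction`, `predictedLabel`, `predictedScores`, `predictedValue`, `details`, `PredictiveModelType`) follow my understanding of the Predict response shape and should be treated as assumptions to verify against the API reference. The function name `summarize_prediction` is hypothetical.

```python
import json


def summarize_prediction(response_json: str):
    """Return (model_type, answer, score) from a real-time prediction response."""
    prediction = json.loads(response_json)["Prediction"]
    model_type = prediction["details"]["PredictiveModelType"]
    if model_type == "REGRESSION":
        # Regression responses carry only the numeric prediction.
        return model_type, prediction["predictedValue"], None
    # Binary and multi-class responses carry a label plus a map of scores.
    label = prediction["predictedLabel"]
    return model_type, label, prediction["predictedScores"][label]


# Hand-written sample responses; all values are invented.
binary = json.dumps({"Prediction": {
    "predictedLabel": "1",
    "predictedScores": {"1": 0.91},
    "details": {"PredictiveModelType": "BINARY"}}})
multiclass = json.dumps({"Prediction": {
    "predictedLabel": "green",
    "predictedScores": {"blue": 0.25, "green": 0.65, "red": 0.10},
    "details": {"PredictiveModelType": "MULTICLASS"}}})
regression = json.dumps({"Prediction": {
    "predictedValue": 42.5,
    "details": {"PredictiveModelType": "REGRESSION"}}})

results = [summarize_prediction(r) for r in (binary, multiclass, regression)]
print(results)
```

Branching on the model type first keeps the caller from looking for a predicted label in a regression response, where it does not exist.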
In our next demo we'll return to our banking promotion and our model and use it to make predictions.
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.