
Model Evaluation



When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.

If you've got a real-world need to apply predictive analysis to large data sources - for fraud detection or customer churn analysis, perhaps - then this course has everything you'll need to know to get you going.

James has got the subject completely covered:
  • What exactly machine learning can do
  • Why and when you should use it
  • Working with data sources
  • Manipulating data within Amazon Machine Learning to ensure a successful model
  • Working with machine learning models
  • Generating accurate predictions

Welcome to our course on evaluating Amazon ML models. In this course, we'll cover what it means to evaluate an ML model and the evaluation metrics for each type of model available in Amazon ML. Finally, we'll return to the Amazon ML dashboard and walk through ML insights for the model we created in the previous lecture. So what does it mean to evaluate an ML model? We've talked about this a little before, but let's review.

The purpose of creating an ML model is to create a system that can make good predictions when provided with data it hasn't already seen. This is the difference between memorizing, which any database can do, and generalizing. As discussed in previous lectures, we can hold out a percentage of the data set where we already know the answer, and then train the model using the rest of the data. We know the correct answer, or ground truth, for both sets of data, but the model only knows the ground truth for the training set. Once the model is trained, we can use it to make predictions against the held-out data, and then compare the predictions with the known answers. Amazon does this for us, but it's our job to evaluate the metrics which Amazon ML produces. Those metrics vary for each model type.
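The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Amazon ML's own implementation: the function name `holdout_split`, the 30% hold-out fraction, and the toy rows are all assumptions made for the example.

```python
import random

def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle labeled rows and split them into (training, evaluation) lists.

    The fraction and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # The first n_holdout rows are held out; the model never trains on them.
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy labeled data: (feature, ground truth) pairs.
rows = [(x, x % 2 == 0) for x in range(10)]
training, evaluation = holdout_split(rows)
```

The model would be trained on `training` only, and its predictions on `evaluation` would then be compared with the ground truth stored in those rows.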

So what kind of metrics does Amazon provide for each type of model? For binary classification, the model outputs a prediction score. This score is translated into a true or false label based on a confidence threshold. The score itself indicates the certainty that an observation belongs in the true class. True can be defined any way you like. For example, when evaluating whether or not an email message is spam, you could set up your model to say true when the message is spam, or you could set it up to say true when the message is a legitimate email. It's up to you. The predictions fall into four groups: true positive, true negative, false positive and false negative. Metrics for binary classification all relate to these groups.
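As a rough sketch of how scores become labels and labels fall into the four groups, consider the following. The function name `confusion_counts`, the 0.5 threshold, and the sample scores are assumptions chosen for illustration.

```python
def confusion_counts(scores, truths, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for score, truth in zip(scores, truths):
        predicted = score >= threshold  # score is turned into a true/false label
        if predicted and truth:
            counts["tp"] += 1
        elif predicted and not truth:
            counts["fp"] += 1
        elif not predicted and truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical scores with their ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
truths = [True, True, False, True, False, False]
counts = confusion_counts(scores, truths)
print(counts)  # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```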

Accuracy measures the fraction of correct predictions: all true positives and all true negatives divided by the total number of predictions. Precision is the ratio of true positives to all predicted positives, that is, true positives plus false positives. Recall measures how many of the actual positives the model predicts as positive, that is, true positives divided by true positives plus false negatives. F1 is the harmonic mean of precision and recall.
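These four metrics follow directly from the four prediction groups. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are assumptions for illustration.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the four group counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 2 true positives, 2 true negatives, 1 of each error type.
metrics = binary_metrics(tp=2, tn=2, fp=1, fn=1)
```

With these counts every metric comes out to two thirds, because the model is right four times out of six and makes one error of each kind.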

Another metric, Area Under the Curve (AUC), measures the ability of a model to predict a higher score for positive examples than for negative examples. It is independent of the threshold, so AUC doesn't involve comparing false predictions to true predictions the way that other measures do. Depending on your need, you can adjust the threshold to minimize the least desirable outcome. If it is extremely important that predicted positives really are positive, then the threshold should be adjusted to increase precision. In the example shown here, we would move the threshold to the right in order to minimize the size of the shaded red and yellow area. On the other hand, if capturing all positives is more important, even if it means capturing some false positives, then the threshold can be adjusted to increase recall at the expense of precision. In this example, we would move the threshold to the left to minimize the size of the gray and red area.
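One common way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way and also shows how moving the threshold trades precision against recall. The function names and the sample scores are assumptions for illustration, not Amazon ML's implementation.

```python
def auc(scores, truths):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. No threshold is involved.
    """
    positives = [s for s, t in zip(scores, truths) if t]
    negatives = [s for s, t in zip(scores, truths) if not t]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def precision_recall_at(scores, truths, threshold):
    """Precision and recall when scores >= threshold are labeled true."""
    tp = sum(1 for s, t in zip(scores, truths) if s >= threshold and t)
    fp = sum(1 for s, t in zip(scores, truths) if s >= threshold and not t)
    fn = sum(1 for s, t in zip(scores, truths) if s < threshold and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical scores with their ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
truths = [True, True, False, True, False, False]

area = auc(scores, truths)                             # 8 of 9 pairs are ordered correctly
strict = precision_recall_at(scores, truths, 0.7)      # higher threshold favors precision
lenient = precision_recall_at(scores, truths, 0.35)    # lower threshold favors recall
```

On this toy data the higher threshold gives perfect precision but misses one positive, while the lower threshold captures every positive at the cost of one false positive. The AUC value is the same in both cases.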

In binary classification, the model only needs to pick between two labels for the target: true for membership and false for other. It makes this choice by generating a score for the observation. The score indicates the probability that an observation is in the membership class, and then a threshold, as we saw, is used to divide the scores into true and false predictions. In multiclass classification, there are more than two classes, but the model still treats each class separately and generates a score indicating the probability that the observation is in a particular class. The important difference is that the threshold is not always required. Each class has an associated probability indicating whether the observation is in the class. The model can then predict the label based on the class with the highest probability.
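Picking the class with the highest probability can be sketched as follows. The function name `predict_label`, the optional `min_probability` parameter, and the genre probabilities are assumptions made for this example.

```python
def predict_label(class_probabilities, min_probability=None):
    """Return the class with the highest probability.

    If min_probability is given and even the best class falls below it,
    return None to signal that the prediction was rejected.
    """
    label, probability = max(class_probabilities.items(), key=lambda item: item[1])
    if min_probability is not None and probability < min_probability:
        return None
    return label

# Hypothetical per-class probabilities for two novels.
confident = {"Romance": 0.7, "Thriller": 0.2, "Adventure": 0.1}
uncertain = {"Romance": 0.4, "Thriller": 0.35, "Adventure": 0.25}

print(predict_label(confident))                       # Romance
print(predict_label(uncertain))                       # Romance
print(predict_label(uncertain, min_probability=0.5))  # None
```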

You might still choose to set a threshold if you want to reject predictions with universally low probability. The underlying metrics for multiclass classification are the same as for binary classification. Each class is, in turn, treated as if it were the true class, and all other classes are considered false. The metrics are then averaged over all classes. Amazon Machine Learning uses the average F1 measure to evaluate the success rate of the classifier. Remember, from our previous section on binary classification, that F1 is the harmonic mean of precision and recall. Amazon presents multiclass classifier metrics as a confusion matrix. This is a table which shows each class and the percentage of correct and incorrect predictions. Dark blue indicates a high number of correct predictions while dark red indicates a high level of incorrect predictions.
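The one-class-at-a-time averaging described above is often called a macro average. The sketch below is a minimal version of it; the function name `macro_f1` and the genre labels are assumptions for illustration, and the details of Amazon ML's own calculation may differ.

```python
def macro_f1(truths, predictions):
    """Average the per-class F1, treating each class in turn as the true class."""
    labels = sorted(set(truths) | set(predictions))
    f1_scores = []
    for label in labels:
        pairs = list(zip(truths, predictions))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical ground truth and predictions for six novels.
truths = ["Romance", "Romance", "Thriller", "Thriller", "Adventure", "Adventure"]
predictions = ["Romance", "Romance", "Romance", "Thriller", "Romance", "Adventure"]
average_f1 = macro_f1(truths, predictions)
```

In this toy data each class happens to reach an F1 of two thirds, so the average is two thirds as well.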

We can read the table row by row or column by column. Reading by row, we see that this model is very good at predicting romance novels because the Romance Romance cell is dark blue. The pale color in the Romance Thriller cell and Romance Adventure cell indicates that very few romance novels are misclassified into these categories. If we read column by column, we see that the model tends to overpredict romance novels. The orange color in the Thriller Romance cell indicates that some thriller novels are miscategorized as romance. Likewise, many adventure novels are miscategorized as romance. A perfect model would display a dark blue line of diagonal cells.
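A confusion matrix like the one discussed here can be built by counting (true class, predicted class) pairs. In this sketch, rows are the true class and columns are the predicted class, matching how the table is read above. The function name `confusion_matrix` and the sample data are assumptions for illustration.

```python
from collections import Counter

def confusion_matrix(truths, predictions):
    """Return matrix[true_label][predicted_label] as a fraction of each row."""
    labels = sorted(set(truths) | set(predictions))
    pair_counts = Counter(zip(truths, predictions))
    row_totals = Counter(truths)
    matrix = {}
    for true_label in labels:
        total = row_totals[true_label]
        matrix[true_label] = {
            predicted: (pair_counts[(true_label, predicted)] / total if total else 0.0)
            for predicted in labels
        }
    return matrix

# Hypothetical ground truth and predictions for six novels.
truths = ["Romance", "Romance", "Thriller", "Thriller", "Adventure", "Adventure"]
predictions = ["Romance", "Romance", "Romance", "Thriller", "Romance", "Adventure"]
matrix = confusion_matrix(truths, predictions)

for true_label, row in matrix.items():
    print(true_label, row)
```

Reading the Romance row shows every romance novel classified correctly, while reading the Romance column shows that half of the thriller and adventure novels in this toy data were also labeled romance, which is the overprediction pattern described above.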

For a regression problem, the model needs to pick a numeric value as the target. The metrics in this case are the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE). These metrics are meant to measure the distance between the model-predicted value and the ground truth value. Amazon ML uses the RMSE metric to evaluate the predictive accuracy of the model. The histogram for the regression model shows the residuals. A residual is the difference between the ground truth value and the predicted value. It represents the amount of over- or underestimation. When the residual histogram is centered on zero and displays a bell-shaped curve, it means that there is no systematic error in the predictions.
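The regression metrics can be sketched with the standard formulas. The function name `regression_metrics` and the sample values are assumptions for illustration; the MAPE calculation here also assumes that no ground truth value is zero.

```python
import math

def regression_metrics(truths, predictions):
    """Return residuals, their mean, RMSE and MAPE (as a percentage)."""
    # A residual is the ground truth value minus the predicted value.
    residuals = [t - p for t, p in zip(truths, predictions)]
    n = len(residuals)
    mean_residual = sum(residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    # MAPE divides by the ground truth, so truths must be nonzero.
    mape = 100 * sum(abs(r / t) for r, t in zip(residuals, truths)) / n
    return {"residuals": residuals, "mean_residual": mean_residual,
            "rmse": rmse, "mape": mape}

# Hypothetical ground truth values and model predictions.
truths = [100.0, 200.0, 400.0]
predictions = [110.0, 190.0, 400.0]
result = regression_metrics(truths, predictions)
```

A mean residual close to zero, as in this toy data, is consistent with a histogram centered on zero, although the shape of the histogram still needs to be checked separately.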

If the histogram is not centered on zero or is not bell-shaped, then there's a systematic error in the model, and adding more variables might help the model capture the real pattern in the data. Note that we're not talking about adding more observations with the same variables.

Instead, we are talking about adding more variables in order to create richer data for the model to learn from. In our next demo, we'll walk through the ML model insights for the banking promotion binary classifier that we built in the previous section.

About the Author
James Counts
Software Developer

James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.

James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.

He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.