Working with Data Sources
Data Manipulation Within Amazon Machine Learning
Working with Machine Learning Models
When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this in-depth introduction to the principles and practice of Amazon Machine Learning, so we could fully satisfy the demand for ML guidance within AWS.
James has the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Now that we have an ML model, we can use it to make predictions. Amazon Machine Learning offers two ways to make predictions: batch (bulk) predictions and online predictions. We'll start by making a batch prediction.
So I log back into the console and access the Amazon Machine Learning dashboard. I'll open our model by clicking on it, and to create a batch prediction I'll click here to generate batch predictions.
The first step is to specify the data we would like to use for prediction. Normally we would use unlabeled data for making predictions, but in this case our entire banking promotion data set is a historical archive. We don't have access to new data.
We used 70% of the data to train the model, so I've taken the remaining 30% and removed the labels to convert it back into unlabeled data. I've uploaded this data to S3, but I have not yet turned it into a data source. So I'll choose the second option here, which indicates that I need to create a data source. I'll provide a name, "banking sample" sounds fine, and indicate its location in S3. I'm still using our CA model demo bucket, and the sample data is in the samples CSV. The first row still contains column names, so I'll click "Yes" here, and then click "Verify". Again, you may get a prompt at this point indicating that you need to grant Machine Learning access to the bucket, but I've already done that in the past. After the data is validated (basically a check that the schema matches and that it's valid CSV), you can click "Continue".
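The console steps above can also be performed through the Amazon Machine Learning API. Below is a rough sketch of the request parameters the `CreateDataSourceFromS3` operation expects; the data source ID, bucket, file key, and schema attributes are hypothetical placeholders, not the exact values from this walkthrough.

```python
import json

# Hypothetical schema matching the console choices: CSV data, first row
# contains column names, no row ID, no target (the data is unlabeled).
schema = {
    "version": "1.0",
    "rowId": None,
    "targetAttributeName": None,
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "job", "attributeType": "CATEGORICAL"},
        # ...the remaining columns would follow the training schema...
    ],
}

# Hypothetical request parameters for CreateDataSourceFromS3.
params = {
    "DataSourceId": "ds-banking-sample",                  # made-up ID
    "DataSourceName": "banking sample",
    "DataSpec": {
        "DataLocationS3": "s3://ca-ml-demo/samples.csv",  # made-up bucket/key
        "DataSchema": json.dumps(schema),
    },
    "ComputeStatistics": False,  # statistics are only needed for training data
}

# With boto3, this dict would be passed as:
#   boto3.client("machinelearning").create_data_source_from_s3(**params)
print(params["DataSourceName"])
```

This mirrors what the console does behind the scenes when you choose "create a data source" and answer "Yes" to the column-names prompt.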
Next, Amazon ML would like to know where to put the batch results. The output from the batch prediction process will be a CSV file containing the predictions, and AWS needs an S3 location for that file.
So I'll just put that back in the same bucket, and AWS will automatically create a sub-folder within the bucket to put the files into.
Now we can click "Review", and of course on the review screen we can look at the choices we made and go back and change any of them if we need to. In this case we've chosen a model, an input location, and an output location, and there aren't many settings beyond that. So we'll click "Finish", and the batch prediction is created with an initial status of "Pending". As we should know by now, it's going to take a few minutes for the Amazon Machine Learning system to pick up this job and run the prediction. So I'll pause here and wait for it to finish.
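For completeness, here is a sketch of the `CreateBatchPrediction` request that corresponds to clicking "Finish". All IDs and the S3 URI are hypothetical placeholders; the point is just to show the handful of settings involved, a model, an input data source, and an output location.

```python
# Hypothetical parameters for the Amazon ML CreateBatchPrediction operation.
batch_params = {
    "BatchPredictionId": "bp-banking-sample",           # made-up ID
    "BatchPredictionName": "banking sample batch",
    "MLModelId": "ml-banking-model",                    # made-up model ID
    "BatchPredictionDataSourceId": "ds-banking-sample", # made-up data source ID
    # AWS creates a sub-folder under this prefix for the result files;
    # the output URI must end with a trailing slash.
    "OutputUri": "s3://ca-ml-demo/",                    # made-up bucket
}

# With boto3:
#   boto3.client("machinelearning").create_batch_prediction(**batch_params)
# The job then starts out PENDING, exactly as seen in the console.
print(sorted(batch_params))
```

Notice how little configuration there is: the review screen in the console shows essentially these same five values.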
Back in the AWS Machine Learning dashboard we can see that our batch prediction has completed. So let's go ahead and click on its name to view its summary information, and there's not a lot to see here. We can download the log if we need to, and we can also use the data source ID to navigate to the data source we used, or the ML model ID to navigate to the ML model.
There is an S3 URL for the output, but we can't click on it. So I'll open the S3 console in another tab up here, and if I refresh this we should see a new folder, and we do. The batch predictions are placed into this sub-folder, and they come with a manifest, which we can view, as well as a result file. I'll download both.
The result is just an archive file, and we can look in our download folder to see what we have. We'll start with the manifest, which we can open in a text editor. We can see it's simply a pointer from the input to the output. So let's go ahead and look at the output file. All right. Click on it.
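The manifest itself is a small JSON document that maps each input file to the result file that holds its predictions. The snippet below parses a made-up manifest with hypothetical S3 paths, just to show its shape.

```python
import json

# A made-up manifest for illustration: one input file mapped to one
# gzip-compressed result file (the paths are hypothetical).
manifest_text = """
{"s3://ca-ml-demo/samples.csv":
 "s3://ca-ml-demo/batch-prediction/result/bp-banking-sample-samples.csv.gz"}
"""

manifest = json.loads(manifest_text)
for input_uri, output_uri in manifest.items():
    # Each entry points from an input location to its prediction output.
    print(input_uri, "->", output_uri)
```

If a batch job covered several input files, the manifest would simply contain one such entry per file.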
Choose "Open" to unpack it. Although the unpacked file has no extension, it is actually a CSV file, so I'll go ahead and rename it for our convenience, and say "yes" when asked whether I want to add the .csv extension. Opening it in our text editor, we can see that it is indeed a CSV file: for each record in the input batch we now have two columns, the best-answer column and the score.
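The archive is just gzip compression, so you don't need the desktop unpack-and-rename dance; a few lines of Python can read the result file directly. The tiny CSV content below simulates the downloaded file, with header names matching the two columns described above.

```python
import gzip

# Simulate the downloaded result: a gzip-compressed CSV with a best-answer
# column and a score column (sample rows are made up for illustration).
sample_csv = b"bestAnswer,score\n0,2.815070e-02\n1,7.965384e-01\n"
compressed = gzip.compress(sample_csv)   # stand-in for the downloaded archive

# In practice you would open the real file instead:
#   with gzip.open("bp-banking-sample-samples.csv.gz", "rt") as f: text = f.read()
text = gzip.decompress(compressed).decode("utf-8")
print(text.splitlines()[0])   # the header row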
Since this is binary classification, the best answer is always a zero or a one, which is the label the model chose, and the score is the raw score the model computed. The scores are written in scientific notation, and each ranges between zero and one. For answers with a sufficiently high score, like the fourth one, which sits above the 50% cutoff threshold, the model predicts one; for scores below 0.5, the prediction is zero.
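The relationship between score and best answer is easy to check in code: scientific notation parses directly with `float()`, and comparing against the 0.5 cutoff reproduces the label. The rows below are made-up (best answer, score) pairs in the style of the output file.

```python
# Made-up prediction rows: (bestAnswer, score) pairs as they appear in the CSV.
rows = [
    ("0", "2.815070e-02"),
    ("0", "1.335472e-01"),
    ("0", "4.599696e-01"),
    ("1", "8.173083e-01"),   # above the 50% cutoff, so the label is 1
]

for best_answer, raw_score in rows:
    score = float(raw_score)                  # scientific notation parses fine
    predicted = "1" if score >= 0.5 else "0"  # apply the default cutoff
    assert predicted == best_answer           # label and score agree
    print(best_answer, f"{score:.4f}")
```

Amazon ML also lets you move this cutoff when a different trade-off between false positives and false negatives suits your problem better; at a lower threshold, more rows would be labeled one.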
Now remember that this is a binary classification model, so these are binary classification results; the results for other types of models will look slightly different.
But we covered that back in an earlier lecture. Finally, you may be wondering how to relate these answers back to actual records. For our banking promotion, we want to know which customers are likely to say yes to our offer so we can contact them, but there's no such relationship here. If you remember back when we created our original data sources, we noted that there was no row ID; these were anonymous records, without anything like a customer number in the original data. In a real-life scenario you would have one, and you would designate it as the row ID. The row ID would then be included in the output CSV file, so that you could relate these predictions back to your actual customers and take action.
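Here is a small sketch of that last step, assuming a row ID had been designated so the output carries it alongside each prediction. The column name and customer IDs below are hypothetical, purely to illustrate filtering the output down to the customers worth contacting.

```python
import csv
import io

# Hypothetical batch output that includes a designated row ID column
# ("customerId") echoed next to each prediction.
output_csv = io.StringIO(
    "customerId,bestAnswer,score\n"
    "C-1001,0,1.335472e-01\n"
    "C-1002,1,8.173083e-01\n"
    "C-1003,0,4.599696e-01\n"
)

# Keep only the customers the model predicts will say yes to the offer.
to_contact = [
    row["customerId"]
    for row in csv.DictReader(output_csv)
    if row["bestAnswer"] == "1"
]
print(to_contact)
```

With the row ID in place, the predictions join straight back to your customer records, and the contact list falls out of a simple filter.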
So that's it for batch predictions. In our next segment we'll take a look at making online, or on-demand, predictions.
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open-source project, and is the author of the C++ and Perl ports of that library.