When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
So we've already looked at making batch predictions with our Banking Promotion Model. But there are situations where you don't want to make predictions all at once. You just want to do them on demand.
Amazon calls those "Online Predictions". In order to do that, we're going to need to enable Online Predictions in our model.
And then I'll show you how to make Online Predictions from a Python project. These are actually REST API calls, so you could use any language that can interact with a web service, but we'll stick with Python in our example.
So I'll click on "Banking Promotion Model". Previously we clicked on "Generate Batch Predictions" in order to load a data set and make our predictions, but in this case we'll go down and enable real-time predictions. Amazon will need to enable the API for that, and that is what we want to do. So we'll click "Confirm". It'll take a few moments to do that, but by the time we're done working on our code, it should be ready to go.
So I'll open PyCharm, where I've already created a new project called "Predictions" to hold the code that we'll use to access the API. In the data folder, I've placed a JSON file. This JSON file corresponds to our sample CSV file, which I uploaded in the section on batch predictions. It's the same slice of data that I previously held out for evaluation.
Except in this case I've saved it as JSON, just to make it a little bit easier to consume. And just like with the batch predictions, I've removed the label, converting it back to unlabeled data so that we can simulate the process of labeling unlabeled data.
So let's get started by creating a Python file in our project. We'll right-click on the project folder and choose New, Python File. I will call this "prediction.py". This creates our Python file for us, and I'll delete this little author note.
Besides creating our Python file, the first thing we need to do is add a reference to the boto3 Python package. In PyCharm we do that by going to Preferences, clicking on the Project Preferences here, and going to the Interpreter section. This is the list of packages that we have available right now. So I'll click the plus sign to add a new package, and I'll search for boto3. And there it is. So I'll just click "Install Package", and PyCharm will use pip to bring down boto3 and all of its dependencies. It lets us know when it completes that task, so we can close this out. We see these are all of the new packages that we added: boto3 and the things boto3 needs. So I'll click OK.
And now we can import session from boto3 and get to work on creating a connection to Amazon Machine Learning. The first thing we want to do is construct a new session object. To create this session, we need to provide boto3 with our AWS access key ID and an AWS secret access key for a user that has permission to do the things we need to do in machine learning. For the user in this example, I've simply granted access to all machine learning permissions in the AWS console. And in order to keep this information secret and off of the lecture video, I've created a module called "config" with two functions, which provides these two values to us without having to reveal them in the video.
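The contents of config.py are not shown in the video. A minimal sketch of such a module might look like the following, assuming the credentials are kept in environment variables. The function names `access_key_id` and `secret_access_key` and the use of environment variables are assumptions for illustration, not the author's actual code.

```python
# config.py (hypothetical sketch; the original module is not shown)
import os


def access_key_id():
    """Return the AWS access key ID stored in the environment."""
    return os.environ["AWS_ACCESS_KEY_ID"]


def secret_access_key():
    """Return the AWS secret access key stored in the environment."""
    return os.environ["AWS_SECRET_ACCESS_KEY"]
```

Keeping the values behind two small functions means prediction.py never contains the secrets themselves, which is the property the transcript is after.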
So I'm going to pause a moment and add that module to our project. Okay, and we're back. You see that I've added the config.py module to our project, which provides the two values we need. The only thing we need to do to use this module is add an import statement at the top of this file.
So these two lines will create a session, configured with the user that I set up in AWS, and then create a machine learning client against that session. As we noted, Amazon Machine Learning is currently only available in one region, us-east-1, and for the service we just provide the name "machinelearning". We'll use this ml client to access the rest of the functions we need in Amazon Machine Learning.
The next thing we'll want to do is read our data file so that we can access some unlabeled data and send it up to Amazon. So I've added an import statement up top for the json package, so that we can read our JSON file. And then I've added some code to read the JSON file itself, opening it as a readable file from the data directory, and then reading all the data into a string called "data". Then I use the json.loads function to load that data string and convert it into Python data objects. So basically I'm telling the json package: here's some JSON data, deserialize it, and give me back Python data structures. Which in this case is going to be a list of dictionaries.
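The two steps described here, creating the client and loading the data file, can be sketched roughly as follows. The function names `create_ml_client` and `load_customers` are hypothetical, and the credentials are passed in as plain arguments rather than read from the config module, since that module's interface is not shown.

```python
import json


def create_ml_client(key_id, secret_key):
    """Build a boto3 session and an Amazon ML client from explicit credentials."""
    # Imported inside the function so the JSON helper below can be used
    # even where boto3 is not installed.
    from boto3 import session

    aws = session.Session(
        aws_access_key_id=key_id,
        aws_secret_access_key=secret_key,
        region_name="us-east-1",
    )
    return aws.client("machinelearning")


def load_customers(path):
    """Read a JSON file holding an array of objects into a list of dicts."""
    with open(path, "r") as data_file:
        data = data_file.read()
    return json.loads(data)
```

A call such as `customers = load_customers("data/customers.json")` would then give the list of dictionaries; the file name here is a placeholder, as the video does not name the file.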
If we look at our JSON file, we have an array here. That's the outer data structure. And then on the second line we have our first JSON object, which Python is going to deserialize as a map, or dictionary. So this is our first one at index 0, then index 1, and so on. There are probably 11,000 or so of these objects here. So now we have a list of Python dictionaries, 11,000 of them, in our customers object.
So the next thing we'll want to do is get a reference to our model, so that we can send some data to it and get a prediction. We'll start by creating a try/except block. This will allow us to see any errors that might be generated while we're trying to access the Amazon ML API. To access the model in Amazon ML, we'll need a model ID, which is readily available in the dashboard. We can see it in the dashboard, and we can see it on the summary screen up here. Once we have that model ID, we can use our machinelearning client to get an instance of that model. And once we have an instance of that model, we can query it for the EndpointInfo. Specifically, we're interested in the URL of the endpoint that Amazon has created for us to use this model.
And now that we have the model and the endpoint, we can request our prediction. To request our prediction, we'll use the predict method on the machinelearning client and provide it with the model ID. The record is the unlabeled data that we want a prediction for, and we also provide the prediction endpoint that we got in the previous step.
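Put together, the lookup and the prediction request might look something like this sketch. `request_prediction` is a hypothetical helper name; `get_ml_model` and `predict` are the boto3 client calls described above. The conversion of values to strings is an assumption added here, because the `Record` parameter is a map of string names to string values and a JSON file may contain numbers.

```python
def request_prediction(ml, model_id, customer):
    """Look up the model's real-time endpoint and request one prediction."""
    try:
        model = ml.get_ml_model(MLModelId=model_id)
        endpoint = model["EndpointInfo"]["EndpointUrl"]
        # Assumption: coerce every value to a string, since Record maps
        # string names to string values.
        record = {name: str(value) for name, value in customer.items()}
        return ml.predict(
            MLModelId=model_id,
            Record=record,
            PredictEndpoint=endpoint,
        )
    except Exception as error:
        # Broad catch so any failure is visible while experimenting.
        print("Prediction request failed:", error)
        return None
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at the real service.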
Now remember, when we're making on-demand predictions, we want to send one customer at a time, so that's why I've specified 0 as the index, to get the first customer out of our list of customers. So when the call to ml.predict finishes, we should have a response object that contains any information that Amazon has sent back to us. If there was a problem making the prediction, then we would get an error instead, which would be handled in the exception handler. Once we have that response object, we can pull the prediction out of it. And once we have the prediction, we can pull out the predicted label, using prediction.get. So for this program, we'll just go ahead and print out the prediction information so that we can view it in PyCharm. But of course in a production system you can do whatever you need, such as displaying some information to the user, or putting the customer into a queue of people to reach out to with the promotion, and so on. But for now, we'll just print it out.
The prediction itself is a data structure of course. So I will use the json package to dump it out as json, so that we can read it. And then I've added a line that just prints out yes, if the predicted label is yes, otherwise no. So it just tells us what we think the customer's likely to say to our promotion based on the model that we've created. And now we're ready to give this a try by clicking the green play button. And it worked.
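As a rough sketch of that printing step: `summarize` is a hypothetical helper, the sample response contains placeholder values, and the encoding of "yes" as the label "1" is an assumption about how the model's target was defined.

```python
import json


def summarize(response):
    """Return the prediction as pretty-printed JSON plus a yes/no answer."""
    prediction = response.get("Prediction", {})
    label = prediction.get("predictedLabel")
    # Assumption: the positive class ("customer says yes") is encoded as "1".
    answer = "yes" if label == "1" else "no"
    return json.dumps(prediction, indent=2), answer


# Placeholder response shaped like the Prediction structure Amazon ML returns.
sample = {"Prediction": {"predictedLabel": "0", "predictedScores": {"0": 0.19}}}
text, answer = summarize(sample)
print(text)
print(answer)
```

Returning the two values instead of printing inside the function makes it simple to swap the print calls for whatever a production system needs to do with the answer.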
Amazon received our data, and sent us back a prediction. We can see the score was .19, and the predicted label is 0. So we do not think that this person is going to say yes. Our model predicts that our customer is likely to say no to our promotion. And Amazon also provides a few details about the model itself in this response. Now when we were looking at our batch prediction results, I recall that one of the early customers had a pretty high score. I think that was the fourth customer down, with index 3. So let's see if the model predicts the same thing using the online prediction. And it does. The fourth customer down still has a predicted score of .77, and our model predicts that they will say yes, because .77 is above the cutoff threshold that we configured for our model. So that's our simple Python program for getting on-demand predictions from Amazon ML.
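The relationship between the score and the label can be illustrated with a small sketch. The cutoff of 0.5 is a placeholder, since the value configured for this model is not stated here, and `label_for_score` is a hypothetical function that mirrors what the service does on its side rather than anything the client needs to compute.

```python
def label_for_score(score, threshold=0.5):
    """Map a binary model's score to a class label using a cutoff."""
    # Assumption: scores above the cutoff map to "1" (yes). How a score
    # exactly equal to the cutoff is treated is not covered here.
    return "1" if score > threshold else "0"


first_customer = label_for_score(0.19)
fourth_customer = label_for_score(0.77)
```

With this placeholder cutoff, the first customer falls below it and the fourth customer lands above it, matching the labels seen in the two responses.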
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.