When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning, so we could fully satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Once we've created a datasource, we can use it to create a model. There are several ways to do that, but we can do it right from the data insights page. So click on "Banking Promotion" and I'll use this drop-down here to select "Create an ML model." When we launch the model creation wizard from the data insights page, the data input is implied, so we don't need to do anything on the first page, and Amazon will automatically take us to the second page to choose our ML model settings. Based on the datasource settings, Amazon already knows that the target is "y" and that we will be building a "Binary" classification model. We can choose a name if we like, and I'll change it to "Banking Promotion Model." Next, we have a choice: we can let Amazon choose the rest of our settings for us by going with the "Default (Recommended)" option, but if we do that, we'll skip steps three through five and go straight to review. In many cases when starting out, this is what you will want to do.
However, I'm going to choose "Custom" because this will let me show you more of the options. There is no danger in choosing the custom option, either. Each page will still be populated with the defaults generated by Amazon, so if I don't change anything, then no harm is done by peeking under the covers. So I will go ahead and click "Custom" and then click "Continue" to go to the recipe page.
Recipes are Amazon's way of allowing us to perform feature processing on our data. Feature processing takes our original data and makes it more meaningful. For example, you might break up a timestamp to derive a variable that represents day-of-the-week, or month, or hour of the day. We can imagine that our banking data set was already processed this way in order to create the day variable. This variable probably started out as a timestamp but was converted into a day-of-the-month.
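To make that kind of pre-processing concrete, here is a small sketch of how you might expand a timestamp into several derived features before uploading your data. The function name and the timestamp format are just illustrative assumptions:

```python
from datetime import datetime

def expand_timestamp(ts):
    """Derive day-of-week, month, hour, and day-of-month features from a
    timestamp string (illustrative pre-processing you might run on your
    data before handing it to Amazon Machine Learning)."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": dt.strftime("%A"),
        "month": dt.month,
        "hour": dt.hour,
        "day_of_month": dt.day,
    }

print(expand_timestamp("2015-05-12 14:30:00"))
# -> {'day_of_week': 'Tuesday', 'month': 5, 'hour': 14, 'day_of_month': 12}
```

Each derived column then becomes an ordinary variable in the datasource, just like the day variable in the banking data set.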
And it's perfectly valid to do your feature transformation using Python or your language of choice, but Amazon also provides a set of common machine learning transformations that we can use to create recipes, and we can see that Amazon has created a default recipe for us on the right. The recipe is a JSON document with three parts: groups, assignments, and outputs. Groups allow you to give a name to a set of variables. Later, you can use the group name to specify one transform for the entire group, as has been done here with the numeric variables group "NUMERIC_VARS_QB_10." The transform is applied in the output section, and it just means that "quantile_bin" will be applied to every member of the group: duration, previous, and campaign. You don't actually have to transform your groups at all if you don't want to. You can just use them to create a common name for a set of variables and then use that name later in the output section, if that makes sense.
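A default recipe along these lines might look like the following sketch. The exact group name and variable list come from the walkthrough above; treat the precise syntax as an approximation of what the console generates:

```json
{
  "groups": {
    "NUMERIC_VARS_QB_10": "group('duration','previous','campaign')"
  },
  "assignments": {},
  "outputs": [
    "ALL_BINARY",
    "ALL_CATEGORICAL",
    "quantile_bin(NUMERIC_VARS_QB_10,10)"
  ]
}
```

The single `quantile_bin` entry in the outputs bins all three numeric variables at once, which is exactly the convenience that groups provide.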
Assignments are all about giving a name to a transformation that you can use later, or simply for increased readability, and although Amazon hasn't created any assignments for us, we could apply a transformation and give the result a name and then use it later in the output area.
Finally, the output area controls which variables will actually be used in the learning process. You have to have at least one variable in the output area, or else there's nothing for the machine learning algorithm to learn from. There are some default groups listed here in our default recipe, like "ALL_BINARY" and "ALL_CATEGORICAL." Transformations can be applied directly here in the output area, as we see with all these "quantile_bin" transformations. Again, you could also have done these transformations in the assignment area, given them a name, and then just used the name down here.
The items in the output array are the only items that will be used in the machine learning model, so you can use it to filter out data you don't want, or to create new features through transformations, or both. We'll go ahead and alter this default recipe by making a change of our own. I'm going to use an online JSON editor to make my changes, just to ensure that I don't introduce any accidental syntax errors while editing. Over on jsoneditoronline.org, I can paste text into the left side and use this arrow to move it over to the right.
Now that it's on the right, I can make changes using the editing tools and then move it back over to the left when I'm done. So I'll start by adding a cartesian product transformation to create a new feature, and I'll do that in the "assignments" section so that we can see how that works. I'll expand the assignments section and click this little box to access the "append" function. I'll need to give my assignment a name, which I'll call "education_job," and a value. For the value, I'll specify a cartesian as the transformation and then name the two variables that I would like to combine, which are education and job. When I'm done, I'll use the left arrow to move the updates back into the JSON. Next, I need to add the "education_job" assignment to the output list so that it will be considered when building the ML model. I'll expand the output list, use the menu to choose "append," and then enter the name "education_job" as a new value in the output list.
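After those edits, the recipe should look roughly like this sketch. Again, the surrounding group and output entries reflect the default recipe described above, and the exact syntax is an approximation:

```json
{
  "groups": {
    "NUMERIC_VARS_QB_10": "group('duration','previous','campaign')"
  },
  "assignments": {
    "education_job": "cartesian('education','job')"
  },
  "outputs": [
    "ALL_BINARY",
    "ALL_CATEGORICAL",
    "quantile_bin(NUMERIC_VARS_QB_10,10)",
    "education_job"
  ]
}
```

Note that the assignment alone does nothing: the new "education_job" feature only reaches the model because it also appears in the outputs list.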
Once again, I'll use the left-pointing arrow to update the JSON. When I'm done making changes, I'll copy the JSON and paste it back into the AWS console. I should have no syntax errors because I used the online JSON editor, and I can click the verify button to make sure that there are no other mistakes. This green message down here lets me know that my recipe is valid, so I can click "Continue" and go on to "Advanced settings." Although I have the opportunity to change these settings, I don't have to, so I'll keep the defaults and click "Continue."
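If you edit recipes by hand often, a small local check can catch the same class of mistakes before you paste into the console. This is just a sketch using the standard library; the recipe text and the `check_recipe` helper are my own illustrative names:

```python
import json

# Hypothetical recipe text, mirroring the edits made above.
recipe_text = """
{
  "groups": {"NUMERIC_VARS_QB_10": "group('duration','previous','campaign')"},
  "assignments": {"education_job": "cartesian('education','job')"},
  "outputs": ["ALL_BINARY", "ALL_CATEGORICAL",
              "quantile_bin(NUMERIC_VARS_QB_10,10)", "education_job"]
}
"""

def check_recipe(text):
    """Parse the recipe JSON and make sure the outputs are non-empty and
    every assignment is actually referenced in the outputs; returns the
    parsed dict or raises ValueError."""
    recipe = json.loads(text)  # raises on JSON syntax errors
    outputs = recipe.get("outputs", [])
    if not outputs:
        raise ValueError("outputs must contain at least one item")
    for name in recipe.get("assignments", {}):
        if not any(name in item for item in outputs):
            raise ValueError(f"assignment {name!r} is never used in outputs")
    return recipe

recipe = check_recipe(recipe_text)
print(sorted(recipe))  # -> ['assignments', 'groups', 'outputs']
```

This won't validate Amazon's transformation names, which is what the console's verify button is for, but it catches plain JSON typos and forgotten output entries.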
Next, Amazon will ask us if we want to perform an evaluation of the model. We definitely do. As discussed in previous lectures, building and testing the model is the only way to know whether we have enough data and whether we have the right data. So I'll answer "Yes" to this question, and I get a few more options when I do. I can change the name, so I'll change it to "Banking Promotion Model Evaluations." Next, I need to tell Amazon how much data to use when training the model. The default is probably what you want: use 70% of the data for training and hold out the remaining 30% for testing. If you've already split your data between training and evaluation sets, then you would want to choose the other option and use 100% of this datasource as your training set. In our case, however, we haven't predetermined our split, so we'll let Amazon do that for us.

I'll go ahead and click "Review" to take us to the last page of the wizard. Like all Amazon wizards, we have one last chance to review the settings we've chosen and to go back to any page and make changes. However, we like these settings the way they are, so we'll just scroll down to the bottom and click "Finish." Once we click "Finish," Amazon will take us to a model summary screen. As we can see, the model state is "Pending." In a few moments, Amazon will start processing our data and building the model. Once it's done building the model, it'll perform the evaluation against the test data, and when the evaluation is complete, we can come back and see how it performed.
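For completeness, the same 70/30 split and evaluation can be set up programmatically through the Amazon Machine Learning API (via boto3), which the console is driving under the hood. This is only a sketch: all IDs and names below are hypothetical, the datasource-creation step is elided, and it assumes AWS credentials are configured:

```python
import json

# DataRearrangement strings that reproduce the console's 70/30 split:
# one datasource sees the first 70% of rows, the other the remaining 30%.
TRAIN_SPLIT = json.dumps({"splitting": {"percentBegin": 0, "percentEnd": 70}})
TEST_SPLIT = json.dumps({"splitting": {"percentBegin": 70, "percentEnd": 100}})

def create_model_and_evaluation():
    """Sketch of the console steps via the Amazon ML API. All IDs and
    names are hypothetical; the two datasources would be created from
    the same data with the DataRearrangement strings above."""
    import boto3
    ml = boto3.client("machinelearning")
    ml.create_ml_model(
        MLModelId="banking-promotion-model",
        MLModelName="Banking Promotion Model",
        MLModelType="BINARY",  # matches the "Binary" classification target "y"
        TrainingDataSourceId="banking-promotion-train",  # the 70% datasource
    )
    ml.create_evaluation(
        EvaluationId="banking-promotion-eval",
        EvaluationName="Banking Promotion Model Evaluation",
        MLModelId="banking-promotion-model",
        EvaluationDataSourceId="banking-promotion-test",  # the 30% datasource
    )
```

Like the console, the API runs both steps asynchronously: the model and evaluation start out pending and you poll (or check the console) for completion.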
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.