Working with Data Sources
When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
In this demo, we'll log in to the AWS console and set up our first Amazon ML datasource. In a previous demo, we prepared a data file for use with Amazon ML and uploaded it to Amazon S3. Now we'll get that file ready for use within the Amazon ML system.
The Amazon Machine Learning web service is only available in Northern Virginia right now, so start by making sure that you are accessing the Northern Virginia region through the console. Next, find the Analytics section and pick Machine Learning. If you haven't used Amazon Machine Learning before, or if you have no data sets or models, you'll be presented with the welcome screen, so click on Getting Started. The standard setup option is a step-by-step guide for creating your first ML model.
Since I'll be walking you through your first ML model, we'll just go straight to the dashboard so we can choose where to go and when. So click View Dashboard. The Amazon Machine Learning dashboard is a pretty typical AWS dashboard. It allows us to manipulate existing objects, which we don't have yet, or create new objects.
By clicking on the blue Create New button, we can see what types of objects we can create. We have the choice of creating a datasource and ML model together, or a datasource and ML model separately. We can also create an evaluation or a batch prediction. For now, we'll start by creating a datasource on its own. Our two choices for input data are S3 or Redshift. In our previous demo, we put our data in an S3 bucket, so we'll leave that selected as S3, and we'll type in the name of that bucket, which was CA, for Cloud Academy, ML demo. And then the file name within there is suggested to us as Cleanup CSV. Since that's our only file, Amazon guessed right. Once we choose the file, we have a chance to click this Verify button and see what Amazon makes of it.
Now, the first thing that Amazon does is check that it has access to the file in the S3 bucket. Since I've already run through this before, I've already granted that permission, but if it's your first time accessing a file in a bucket, the console will ask you for permission to access the bucket, and you just wanna click Okay when that happens. After it has access to the file, it validates the file, basically by scanning through it and ensuring that it meets all the data format requirements.
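To make the permission step a bit more concrete, here is a minimal sketch of the kind of bucket policy that lets the Amazon ML service read a file. The demo never shows the policy itself, so the service principal name, the two actions, and the bucket name `example-bucket` are all assumptions here and should be checked against the AWS documentation before use.

```python
import json


def amazon_ml_read_policy(bucket, prefix=""):
    """Build a bucket policy dict granting Amazon ML read access.

    Assumption: the service principal and the two actions below reflect
    what the console sets up when you click Okay; the demo does not show it.
    """
    service = {"Service": "machinelearning.amazonaws.com"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read the objects (our CSV file) under the prefix.
                "Effect": "Allow",
                "Principal": service,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # List the bucket so the file can be located.
                "Effect": "Allow",
                "Principal": service,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# "example-bucket" is a placeholder, not the bucket used in the demo.
print(json.dumps(amazon_ml_read_policy("example-bucket"), indent=2))
```

This only builds and prints the policy document; nothing is sent to AWS.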
Now, we took care to make sure that we met all those requirements in a previous demo, so our validation, of course, succeeds here. If we hadn't made sure that our file met these validation rules, it would fail here and we'd have to go back to preparing our file again. Since our file is good, we can click Continue to move on to the schema.
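If you want to catch the most obvious formatting problem before uploading, a quick local check helps. The sketch below only verifies that every row has the same number of fields; it is not Amazon's validation logic, which the demo does not spell out, and the sample text is made up.

```python
import csv
import io


def check_csv_rows(text):
    """Return (row_count, problems) for CSV text.

    A problem is reported for every non-blank row whose field count
    differs from the first row's.
    """
    problems = []
    expected_width = None
    row_count = 0
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        row_count += 1
        if expected_width is None:
            expected_width = len(row)
        elif len(row) != expected_width:
            problems.append(
                f"line {reader.line_num}: expected {expected_width} "
                f"fields, found {len(row)}"
            )
    return row_count, problems


# Made-up sample: the last row is missing its target value.
sample = "age,job,y\n30,admin,0\n41,technician\n"
print(check_csv_rows(sample))
```

An empty problem list means the file at least has a consistent shape; it says nothing about the other format requirements.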
On the schema definition screen, we can see that Amazon has correctly guessed the data types for our input data file. The only problem I see here is that all the variable names are generic: Var 1, Var 2, and so on. There's no rule that says the first line of a CSV needs to have column names in it, so Amazon defaults to assuming that everything in the CSV is data.
In our case, the first line is column names, so we need to let Amazon know by clicking Yes here. When we do that, Amazon rescans the data and regenerates the schema. It comes up with basically the same information, except this time it has added the column names correctly. Now that our variables have proper names, we can examine each of them by navigating through these pages. When we're ready, we can click Continue to identify the target.
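To see why the header setting matters, here is a rough sketch of schema guessing. The type rules (all 0/1 means binary, all parseable as numbers means numeric, everything else categorical) and the placeholder names are simplifications of my own, not Amazon's actual inference, and the sample rows are made up.

```python
import csv
import io


def guess_type(values):
    """Guess a coarse type for one column of string values."""
    if set(values) <= {"0", "1"}:
        return "BINARY"
    try:
        for value in values:
            float(value)
        return "NUMERIC"
    except ValueError:
        return "CATEGORICAL"


def guess_schema(csv_text, first_line_is_header):
    """Return a list of (name, guessed_type) pairs, one per column."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if first_line_is_header:
        names, data = rows[0], rows[1:]
    else:
        # Placeholder names; every line is treated as data.
        names = [f"Var{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    columns = list(zip(*data))
    return [(name, guess_type(col)) for name, col in zip(names, columns)]


sample = "age,job,day,y\n30,admin,5,0\n41,technician,17,1\n"
print(guess_schema(sample, first_line_is_header=False))
print(guess_schema(sample, first_line_is_header=True))
```

Note that with these toy rules, treating the header as data drags every column to categorical, because the column name itself is not a number. Also note that `day` comes out numeric, which is worth a second look when the values are really calendar days.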
Now, the first thing we need to do before identifying a target is to let Amazon know that we do intend to use this data to create or evaluate an ML model. We do that by clicking Yes here. After we click Yes, we'll be shown all the variables that are in our original data file, and we need to indicate to Amazon which one of these is the target. Remember, target is just a synonym for label, and as we remember from our last demo, our label was called Y, so we need to navigate through here until we find the variable called Y, and then select its radio button. When we select it, Amazon lets us know that, because Y is a binary variable, we'll be using binary classification when we use this data set to create models. That's okay. That's what we expected, so we'll go ahead and click Continue.
Now, on this screen, the Row ID screen, we have the opportunity to create an optional row identifier. A row identifier is just an additional piece of data that helps us uniquely identify a certain row within the data set. So if you had something like an account ID, a person ID, a product ID, or a promo code in your data set, and you chose it as your row identifier, then Amazon would include that in the output file when it makes predictions. But in our case, we don't have anything like that.
Each row of our data is just features, and none of them is uniquely identified by something like an account number. So instead of saying Yes here, we'll choose No. This step is optional, so go ahead and click Review. In the last step, we can review the choices we made. We see some basic information about the incoming file, the schema, and so on. Now, I'm gonna make sure that we picked the correct target value, Y, and that we're fine with the type of classification. If all this is good, which it is, we can click Finish and commit to these settings.
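The choices we just reviewed (column names, types, target, no row ID) can also be written down as a schema document. The sketch below builds one and wraps it in request parameters shaped like those of the boto3 `create_data_source_from_s3` call, as I understand that API. The field names follow my recollection of the Amazon ML schema format, and the datasource ID and S3 URL are hypothetical placeholders, so treat all of it as an illustration rather than a verified request.

```python
import json


def build_schema(attributes, target, row_id=None):
    """Build a schema dict from (name, type) pairs.

    Assumption: key names follow the Amazon ML data schema format.
    """
    schema = {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,  # we clicked Yes for column names
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attributes
        ],
    }
    if row_id is not None:  # optional; we chose No in the demo
        schema["rowId"] = row_id
    return schema


def build_request(schema, s3_url):
    """Build keyword arguments for a datasource creation call."""
    return {
        "DataSourceId": "ds-banking-demo",  # hypothetical ID
        "DataSourceName": "banking promotion",
        "DataSpec": {
            "DataLocationS3": s3_url,
            "DataSchema": json.dumps(schema),
        },
        # Ask for statistics so the datasource can be used for training.
        "ComputeStatistics": True,
    }


schema = build_schema(
    [("age", "NUMERIC"), ("job", "CATEGORICAL"), ("y", "BINARY")],
    target="y",
)
# Placeholder URL, not the bucket and file from the demo.
request = build_request(schema, "s3://example-bucket/example.csv")
print(json.dumps(request, indent=2))
```

Nothing here talks to AWS. With boto3 installed and credentials configured, the dict would be passed as `client.create_data_source_from_s3(**request)` on a `machinelearning` client.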
When we click Finish, we're taken to the data summary screen for the datasource. The datasource kind of exists now, in the sense that Amazon knows about it, but its status is pending. Amazon has not yet processed it, so there's not a whole lot to see here yet. It can take a while for Amazon to process our file, and processing at this stage does not involve creating the ML model.
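Waiting for the pending status to change is easy to script. The sketch below is a generic polling loop; the status strings are the ones I believe the Amazon ML API reports, and the status source is passed in as a function so the loop can be tried without an AWS account.

```python
import time


def wait_for_datasource(fetch_status, poll_seconds=30, max_polls=120,
                        sleep=time.sleep):
    """Poll until the datasource reaches a final status and return it.

    fetch_status is any zero-argument function returning a status string.
    Assumption: COMPLETED, FAILED and DELETED are the final states.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in ("COMPLETED", "FAILED", "DELETED"):
            return status
        sleep(poll_seconds)
    raise TimeoutError("datasource did not finish processing in time")


# Simulated statuses stand in for real API responses in this demo.
fake_statuses = iter(["PENDING", "INPROGRESS", "COMPLETED"])
result = wait_for_datasource(
    lambda: next(fake_statuses),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)
```

With boto3, `fetch_status` could be something like `lambda: client.get_data_source(DataSourceId=my_id)["Status"]`, where `my_id` is whatever ID you gave the datasource.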
Amazon is actually computing descriptive statistics regarding our data and determining a level of correlation between the data variables and the target. Now, some of this information will be usable to Amazon later when it's building the Amazon Machine Learning model, but right now it's just basically getting a handle on the data itself, and this can take a while, so I'll go ahead and pause here and come back when it's done creating the datasource.
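For a feel of what this processing step produces, here is a small sketch that computes a few descriptive statistics and a correlation between one numeric variable and a 0/1 target. It uses the Pearson coefficient, which is my choice for illustration; the demo does not say which correlation measure Amazon uses. The ages and labels are made up.

```python
from statistics import mean, median


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence has no variation.
    """
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


# Made-up data: ages and whether each person accepted the promotion.
ages = [25, 32, 41, 47, 58, 63]
accepted = [0, 0, 0, 1, 1, 1]
print(describe(ages))
print(round(pearson(ages, accepted), 3))
```

A value near 1 or -1 suggests the variable moves with the target; a value near 0 suggests it carries little linear signal on its own.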
Okay, now that we're back, we can go ahead and click on our datasources here and get an overview of the dashboard, and we can see that our datasource is finished. If we click on it to return to the data summary screen, we can see that once it has reached the completed status, Amazon provides something that they call data insights. The data insights give us basic information like the datasource name, which we can change to something that makes a little bit more sense, like banking promotion, and then click the check mark to save it. That's nice, but what else can we do here?
Well, down here in the processing information section we can see the number of records that we were able to import, and we would see how many failed to process if any didn't comply. That's zero because our data was clean. To me, the most interesting thing on this page is this little target visualization. This little icon actually gives us a true visualization of our target, which you'll remember is that Y variable, the label that indicates whether or not the customer was interested in the banking promotion being offered.
And this little bar chart here is an actual visualization of our data, not a generic icon. If we click on it, we can see the same visualization in more detail, and we can see that this is the binary attribute Y and that the majority of customers were not interested in the banking promotion. If you remember, in our previous demo we turned everyone who said no into a zero and everyone who said yes into a one, so the great majority of people were not interested in this banking promotion. And we can see similar types of data about other variables as well.
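The encoding and the bar chart are simple enough to reproduce locally. Here is a minimal sketch that maps yes/no answers to 1/0 and reports the share of each class; the answers are made up and the function names are my own.

```python
from collections import Counter


def encode_target(answers):
    """Map 'no' to 0 and 'yes' to 1, ignoring case and stray spaces."""
    mapping = {"no": 0, "yes": 1}
    return [mapping[answer.strip().lower()] for answer in answers]


def target_distribution(labels):
    """Return the fraction of rows with label 0 and with label 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in (0, 1)}


# Made-up answers, skewed toward "no" like the demo data.
answers = ["no", "no", "yes", "no", "No "]
labels = encode_target(answers)
shares = target_distribution(labels)
for label in (0, 1):
    # Crude text bar chart: one '#' per 5 percentage points.
    print(label, "#" * round(shares[label] * 20))
```

A heavily lopsided target like this is worth knowing about before training, since a model can look accurate just by always predicting the majority class.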
Y was our only binary variable, so that's all there is to look at there, but we can also look at categorical variables, and we should have several of those: contact (how the contact was made), type of education, and so on. For each of these categorical variables, we can see its correlation to the target. We can see that whatever this poutcome was, it was the most correlated as far as categories go. We also get other descriptive statistics, like number of unique values, most frequent, and least frequent. And again over here, actual visualizations for each one of these categories.
So we can go ahead and click on Job, and in the same way, we can see what kinds of jobs there were and how many people in the data set had each type of job. We'll do the same thing for numeric. We should have a few of those: age, balance, day of the month, and so on. For these, again, we have our distributions. There are a few more descriptive statistics here, such as measures of central tendency, which make more sense for numerics than for categories.
We can see that the range of days is actually 1 through 31, so it's probably better suited to being a categorical attribute than a numeric one. That means we guessed wrong back at the beginning when we just let Amazon automatically choose the schema. We should have changed that, and we can go back and do that if we need to. For now, we'll just keep exploring this data as it is. Again, we see the correlation to the target here, and we can evaluate it and make sure that it makes sense.
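Going back to change the type amounts to editing one entry in the schema. The sketch below does that on a schema dict; the key names are my assumption about the Amazon ML schema format, and the attribute name `day` is taken from how the demo describes the variable.

```python
import copy


def set_attribute_type(schema, name, new_type):
    """Return a copy of the schema with one attribute's type changed.

    Raises KeyError if the schema has no attribute with that name.
    """
    updated = copy.deepcopy(schema)
    for attribute in updated["attributes"]:
        if attribute["attributeName"] == name:
            attribute["attributeType"] = new_type
            return updated
    raise KeyError(f"no attribute named {name!r} in schema")


# A cut-down schema with the automatically guessed types.
guessed = {
    "attributes": [
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "day", "attributeType": "NUMERIC"},
    ]
}
fixed = set_attribute_type(guessed, "day", "CATEGORICAL")
print(fixed["attributes"])
```

Working on a copy keeps the originally guessed schema around, which is handy if you want to compare models built both ways.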
Finally, we'd have the same type of descriptive statistics for text attributes, but in this case, our data didn't come with any text attributes. So the data insights page provides you with visualizations like these charts, and other information like the descriptive statistics, that you can use as a simple sanity check for your data set. For example, we saw that the day of the month variable was categorized as numeric. You probably wanna go back and change it to categorical. You might also wanna check the distributions to see that they make sense to you.
So when we're looking at something like age, maybe we're expecting a bell curve, but we might see that it's skewed one way or another. In our case, we see that we get a pretty decent curve shape, but maybe we expected that peak to be at 40, and it's not, and we know that it should be at 40 because, in general, our customers group around that age. So we would know from looking at this that the data is skewed, and that we need to remove some of the younger people from the data set or add some more data until we get the distribution we're expecting. We did see that there was an issue with the day of the month variable, so I'll go back and fix that offline. When we come back, we'll see what the next step is for working with our datasource.
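The same sanity check on age can be done before uploading. Here is a minimal sketch that bins ages into decades and reports where the peak falls, so it can be compared with where we expected it. The ages and the 10-year bin width are made up for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


def peak_bin(values, bin_width):
    """Return the lower edge of the most populated bin."""
    counts = histogram(values, bin_width)
    return max(counts, key=counts.get)


# Made-up ages that lean younger than an expected peak in the 40s.
ages = [23, 27, 31, 33, 34, 36, 38, 42, 45, 51, 58]
for lower_edge, count in histogram(ages, 10).items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")

expected_peak = 40
if peak_bin(ages, 10) != expected_peak:
    print("peak is not where we expected; the sample may be skewed")
```

If the peak is off, that is a cue to revisit how the sample was collected before building a model on it.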
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.