What is Amazon Machine Learning and how does it work

“Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology.”


  • I’ve published a new hands-on lab on Cloud Academy! You can give it a try for free and start practicing with Amazon Machine Learning on a real AWS environment.
  • Cloud Academy has now released a full course on Amazon Machine Learning that covers everything from basic principles to a practical demo where both batch and real-time predictions are generated.
  • In October, I published a post on Amazon Mechanical Turk: help for building your Machine Learning datasets.

After using AWS Machine Learning for a few hours I can definitely agree with this definition, although I still feel that too many developers have no idea what they could use machine learning for, as they lack the mathematical background to really grasp its concepts.

Here I would like to share my personal experience with this amazing technology, introduce some of the most important, and sometimes misleading, concepts of machine learning, and give this new AWS service a try with an open dataset in order to train and use a real-world AWS Machine Learning model.

Luckily, AWS has done a great job in creating documentation that makes it easy for anyone to understand what machine learning is, when it can be used, and what you need to build a useful model.

You should check out the official AWS tutorial and its ready-to-use dataset. However, even if you follow through each step and produce a promising model, you may still feel like you’re not yet ready to create your own.

AWS Machine Learning tips

In my personal experience, the most crucial and time-consuming part of the job is defining the problem and building a meaningful dataset, which actually means:

  1. Making sure you know what you are going to classify/predict
  2. Collecting as much data as you can about the context without making too many assumptions on what is and isn’t relevant

The first point may seem trivial, but it turns out that not every problem can be solved with machine learning, even AWS Machine Learning. Therefore, you will need to understand whether your scenario fits or not.

The second point is important as well, since you will often discover unpredictable correlations between your input data (or “features”) and the target value (i.e. the column you are trying to classify or predict).

You might decide to discard some input features in advance and somehow, inadvertently, decrease your model’s accuracy. On the other hand, deciding to keep the wrong column might expose your model to overfitting during the training and therefore weaken your new predictions.

For example, let’s say that you are trying to predict whether your registered users will pay for your product or not, and you include their “gender” field in the model. If your current dataset mostly contains data about male users, since very few females have signed up, you might end up with an always-negative prediction for every new female user, even though it’s not actually the case.

In a few words, overfitting simply means creating a model that is too specific to your current dataset and will not behave well with new data. Naturally, this is something you want to avoid.

That’s why you should always plan an evaluation phase where you split your dataset into two segments. The first one will be used to train the model. Then, you can test the model against the second segment of data and see how it behaves. A good model will be able to correctly predict new values. And that’s the magic we want!
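As a minimal sketch of such an evaluation split, assuming plain Python lists: the function name `split_dataset`, the 70/30 ratio, and the fixed seed are illustrative choices, not something Amazon ML requires.

```python
import random


def split_dataset(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into (training, evaluation)."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-in for real labeled rows
train, evaluation = split_dataset(records)
print(len(train), len(evaluation))
```

Shuffling before cutting matters whenever the rows are ordered (for example by signup date), because otherwise the evaluation segment would not resemble the training segment.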

Here are a few use cases, ranging from typical start-up requirements to more advanced scenarios:

  • Predict whether a given user will become a paying customer based on her activities during the first day/week/month.
  • Detect spammers, fake users, or bots in your system based on website activity records.
  • Classify a song genre (rock, blues, metal, etc), based only on signal-level features.
  • Recognize a character from a plain image (also known as OCR).
  • Detect, based on accelerometer and gyroscope signals, whether a mobile device is standing still, moving (upstairs or downstairs), or how it is positioned (vertically or horizontally), etc.

All of these problems share a common assumption: You need to predict something “completely unknown” at runtime, but you have enough Ground Truth data (i.e. labeled records) and computing power to let Machine Learning solve the problem for you.

What AWS Machine Learning will do for your Organization

If you have ever built a classification model yourself, you know that you should carefully choose your model type based on your specific use case. AWS Machine Learning is quite powerful in this regard because it automatically trains and tests a lot of complex models, tuned with different parameters, so that the best one is chosen for the final evaluation. In fact, this will generally be the model you would otherwise arrive at by hand through trial and error.

Of course, AWS Machine Learning will also handle all of your input normalization, dataset splitting, and model evaluation work. In fact, as long as you provide a valid data source, AWS Machine Learning can solve most of your low-level problems.

AWS ML Datasource Statistics

Even before training and evaluating your model, you can analyze your data source to better understand the often hidden correlations within your data. Indeed, in the Datasource attributes section, you can find the values distribution of all your columns, and clearly see which of them contribute more in defining your target value.

For example, you will probably generate a completely useless model if there is no correlation at all between features and target. So, before spending your time and money in the actual model training and evaluation task, you always want to have a look at these statistics. The whole process can take up to 30 minutes, even without considering the S3 upload!
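To make "correlation between features and target" concrete, here is a rough sketch that scores each column against the target with the Pearson coefficient. Pearson correlation is an assumption made for illustration (the console does not need this code, and the exact statistic it reports is not stated here); the column names and values are invented toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Toy data: one column that tracks the target and one that does not.
target = [0, 0, 1, 1]
columns = {
    "informative": [0.1, 0.2, 0.8, 0.9],
    "noise": [0.5, 0.1, 0.4, 0.2],
}
for name, values in columns.items():
    print(name, round(pearson(values, target), 3))
```

A score near zero only rules out a linear relationship, so a low value is a warning sign rather than proof that a column is useless.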

A real use case: Human Activity Recognition

There are plenty of open datasets for machine learning provided by public institutions such as University of California, MLdata.org, and deeplearning.net. One of them might be a good starting point for your use case. You will need to adapt their input format to the kind of simple csv file AWS Machine Learning expects and understand how the input features have been computed so that you can actually use the model with your own online data to obtain predictions.

I found a very interesting dataset for HAR (Human Activity Recognition) based on smartphone sensors data. It is freely available here. This dataset contains more than 10,000 records, each defined by 560 features and one manually labeled target column, which can take one of the following values:

  • 1 = walking
  • 2 = walking upstairs
  • 3 = walking downstairs
  • 4 = sitting
  • 5 = standing
  • 6 = lying down

The 560 feature columns are the input data of our model and represent the time and frequency domain variables obtained from the accelerometer and gyroscope signals. Since the target column can take more than two values, this is known as a multi-class problem (rather than a simpler binary problem). You can find more information about these values in the downloadable .zip file.

I can’t even imagine how mobile applications may actually use this kind of data. Perhaps they want to gain insight into app usage. Or perhaps they can track daily activity patterns in order to integrate them with your daily fitness report.

Preparing the Datasource csv file

This dataset was not formatted the way AWS Machine Learning expected. The typical notation in the ML field indicates the input matrix as X and the output labels as Y. Also, the usual 70/30 dataset split has already been performed by the dataset authors (you will find four files in total), but in our case, AWS Machine Learning will do all of that for us, so we want to upload the whole set as one single csv file.

HAR raw dataset

I coded a tiny python script to convert the four matrix-formatted files into a single comma-separated file. This will be the input data of our Datasource.

Luckily, as a Datasource input, you can either use a single S3 file or a set of files with the same schema, so I decided to split my large file (~90MB) into smaller files of 1,000 records each and uploaded them to S3. Note that I couldn’t have uploaded the raw dataset files directly, as their schemas are not consistent with each other and they are not comma-separated.
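The conversion described above could look roughly like the following sketch. It is not the original script: the file names (`X_train.txt`, `y_train.txt`, `X_test.txt`, `y_test.txt`), the folder layout, and the output naming scheme `har_000.csv` are assumptions, as are all function names.

```python
import csv
import os


def merge_rows(x_lines, y_lines):
    """Yield one CSV-ready row per record: the features followed by the label."""
    for x_line, y_line in zip(x_lines, y_lines):
        features = x_line.split()  # raw values are whitespace-separated
        yield features + [y_line.strip()]


def write_chunks(rows, out_dir, chunk_size=1000):
    """Write rows into numbered CSV files of at most chunk_size records each."""
    os.makedirs(out_dir, exist_ok=True)
    paths, chunk = [], []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            paths.append(_flush(chunk, out_dir, len(paths)))
            chunk = []
    if chunk:  # last, possibly shorter, file
        paths.append(_flush(chunk, out_dir, len(paths)))
    return paths


def _flush(chunk, out_dir, index):
    """Write one chunk to disk and return its path."""
    path = os.path.join(out_dir, "har_%03d.csv" % index)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(chunk)
    return path


def convert(dataset_dir, out_dir):
    """Merge the train and test parts (assumed file names) into CSV chunks."""
    def rows():
        for part in ("train", "test"):
            x_path = os.path.join(dataset_dir, part, "X_%s.txt" % part)
            y_path = os.path.join(dataset_dir, part, "y_%s.txt" % part)
            with open(x_path) as xs, open(y_path) as ys:
                yield from merge_rows(xs, ys)
    return write_chunks(rows(), out_dir)
```

Calling `convert("path/to/dataset", "out")` would then produce the comma-separated files to upload; every file shares the same column order, which is what lets a single Datasource read them all.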

AWS Machine Learning - Datasource creation

Data manipulation and massaging is a typical step of your pre-training phase. The data might come from your database or your analytics data warehouse. You will need to format and normalize it, and sometimes create complex features (via aggregation or composition) to improve your results.

Training and evaluating the model

The process of creating your datasource, training, and evaluating your model is fairly painless with AWS Machine Learning, as long as your input data is well formatted. Don’t worry: If it’s not, you will receive an error during the Datasource creation. Everything will automatically stop if more than 10K invalid records are detected.

In our case, everything should go pretty smoothly, even if kind of slowly, until the model is created and the first evaluation is available. At this point, it will show your model’s F1 score. The F1 score is an evaluation metric (0 to 1) that takes into account both precision and recall. With the given dataset, I got a score of 0.92, which is pretty good.
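For intuition about what the F1 score combines, here is a small sketch in plain Python. It assumes the multi-class score is the unweighted (macro) average of the per-class F1 scores; the function names and the toy labels are illustrative only.

```python
def f1_for_class(actual, predicted, label):
    """F1 score of one class: the harmonic mean of its precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(actual, predicted):
    """Unweighted average of the per-class F1 scores."""
    labels = sorted(set(actual))
    return sum(f1_for_class(actual, predicted, l) for l in labels) / len(labels)


actual = [1, 1, 2, 2]
predicted = [1, 2, 2, 2]
print(round(macro_f1(actual, predicted), 3))
```

Because each class contributes equally to the average, a rare class that is badly predicted drags the score down as much as a frequent one.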

AWS Machine Learning - Model Evaluation

In addition to the score, you are also shown an evaluation matrix. The evaluation matrix is a graphical representation of your model behavior with the testing set. If your model works fine, you should find a diagonal pattern, meaning that the records belonging to the N class have been correctly classified as N, at least most of the time.
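The same kind of matrix can be computed by hand from actual and predicted labels. The sketch below uses invented toy labels (not the HAR evaluation results) and hypothetical function names; rows are the actual classes and columns the predicted ones.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Count how often each actual class (row) got each predicted class (column)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def row_percentages(matrix):
    """Convert each row of counts into percentages of that actual class."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


labels = ["sitting", "standing"]
actual = ["sitting", "sitting", "standing", "standing", "standing", "standing"]
predicted = ["sitting", "sitting", "standing", "standing", "standing", "sitting"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, row_percentages(matrix)):
    print(label, row)
```

Large values on the diagonal mean correct classifications; any sizeable off-diagonal cell points at a pair of classes the model confuses.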

In a general scenario, some classes are easier to guess than others. In our case, the classes 1, 2, and 3 (the three walking classes) are pretty similar to each other, as well as 4 and 5 (sitting and standing). From the table, I can see that the most difficult class to guess is 5 (standing): in 13% of cases it will be classified as 4 (sitting). On the other hand, the class 2 (walking upstairs) is easy to guess (with almost 95% precision), but it might be wrongly classified as 1, 3, 4, or 5.

Based on this data, you might decide to enrich your dataset. For example, we could add more “standing” records to help the model distinguish it from “sitting.” Or we might even find out that our data is wrong or biased by some experimental assumptions, in which case we’ll need to come up with new ideas or solutions to improve the data (e.g. “how tall was the chair used to record sitting positions based on the average person’s height?”).

How to use the model in your code

Now we have our machine learning model up and running and we want to use it on a real-world app. In this specific case, we would need to sit down and study how those 560 input features have been computed, code the same into our mobile app, and then call our AWS Machine Learning model to obtain an online prediction for the given record.

In order to simplify this demo, let’s assume that we have already computed the feature vector, we’re using Python on our server, and we have installed the well-known boto3 library.

All we need to obtain a new prediction is the Model ID and its Prediction Endpoint. First of all, we need to enable the ML model for online predictions. We simply click on “Enable real time predictions” and wait for the endpoint to be available, which we will retrieve via API.

AWS Machine Learning - Enable Online Prediction

Note: Since I had more than 500 input columns I didn’t really take the time to name all of them, but when you have only a few input features it would definitely be a good idea. That way, you’ll avoid having to deal with meaningless input names such as “Var001”, “Var002.” In my python script below, I am reading the features record from a local file and generating names based on the column index (you can find the full commented code and the record.csv file here).

The record is passed to our model and evaluated in real time (synchronously). You will obtain a Prediction object.
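A minimal sketch of such a call with boto3 might look like this. It is an illustration rather than the original script: the helper names, the `Var001`-style naming of unnamed columns, and the placeholder model ID are assumptions. The client is passed in as a parameter so the helpers can be exercised without AWS credentials.

```python
def build_record(csv_line):
    """Map a comma-separated feature line to {"Var001": "...", ...}.

    The generated names assume the columns were left unnamed in the Datasource.
    Values stay strings, as the Record parameter is a string-to-string map.
    """
    values = csv_line.strip().split(",")
    return {"Var%03d" % (i + 1): value for i, value in enumerate(values)}


def predict_activity(client, model_id, record):
    """Look up the model's real-time endpoint, then request one prediction."""
    model = client.get_ml_model(MLModelId=model_id)
    endpoint = model["EndpointInfo"]["EndpointUrl"]
    response = client.predict(
        MLModelId=model_id,
        Record=record,
        PredictEndpoint=endpoint,
    )
    return response["Prediction"]


# Typical usage (requires AWS credentials and an enabled endpoint):
#   import boto3
#   client = boto3.client("machinelearning")
#   with open("record.csv") as handle:
#       record = build_record(handle.readline())
#   prediction = predict_activity(client, "ml-EXAMPLE", record)  # hypothetical ID
```

Retrieving the endpoint URL once and caching it would save one API call per prediction when many records are scored in a row.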

The field we need is predictedLabel. It represents the classified result based on our input. Note that we also get a probability measure for each class. In this case, the predicted class has a greater than 99% probability of being correct. These statistics could be incredibly useful even when the given prediction is not reliable enough. In those cases, we might inspect the other classes’ degree of reliability and decide to switch to our own prediction, based on the context assumptions.
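One way to act on those per-class probabilities is sketched below, assuming the Prediction object is a dictionary with `predictedLabel` and `predictedScores` keys. The 0.9 threshold, the fallback value, and the function name are arbitrary illustrative choices.

```python
def choose_label(prediction, min_confidence=0.9, fallback=None):
    """Return (label, confidence), using the fallback when confidence is too low."""
    scores = prediction.get("predictedScores", {})
    label = prediction.get("predictedLabel")
    confidence = scores.get(label, 0.0)
    if confidence >= min_confidence:
        return label, confidence
    # Not reliable enough: let the caller's own context-based guess win.
    return fallback, confidence


confident = {"predictedLabel": "1", "predictedScores": {"1": 0.99, "2": 0.01}}
uncertain = {"predictedLabel": "4", "predictedScores": {"4": 0.55, "5": 0.45}}
print(choose_label(confident))
print(choose_label(uncertain, fallback="5"))
```

The fallback could come from anything the application already knows, such as the previous prediction for the same device.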

An alternative solution could be storing a larger set of records (e.g. one minute’s worth of our sensors’ signals) and calling the create_batch_prediction method. This API resource expects a set of observations as input and will asynchronously generate one prediction for each record into a given S3 bucket.
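As a hedged sketch of that batch call with boto3: the observations are referenced through a Datasource that already points at the records in S3, so the function below takes a Datasource ID rather than raw rows. The helper name, the generated batch name, and all IDs shown in the usage comment are hypothetical.

```python
def start_batch_prediction(client, model_id, datasource_id, output_uri, batch_id):
    """Ask Amazon ML to score every record of a Datasource asynchronously."""
    response = client.create_batch_prediction(
        BatchPredictionId=batch_id,
        BatchPredictionName="HAR batch " + batch_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_uri,  # S3 location where the results will be written
    )
    return response["BatchPredictionId"]


# Typical usage (requires AWS credentials; all IDs below are placeholders):
#   import boto3
#   client = boto3.client("machinelearning")
#   start_batch_prediction(
#       client, "ml-EXAMPLE", "ds-EXAMPLE", "s3://my-bucket/predictions/", "bp-001"
#   )
```

Since the call returns before the predictions exist, the application has to check the job status later or watch the output location for the result files.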

That’s it! You can use the predicted value to provide real-time feedback on the device, or store it and use it to generate your insights, etc. I am not focusing on the specific software architecture, as this will work just as well using a wide range of alternative profiles as long as you use the mentioned API correctly.

Is your Team ready for AWS Machine Learning?

While there are a million use cases with datasets unique to a variety of specific contexts, AWS Machine Learning manages the process so that you can focus on your data, without wasting time trying tons of models and dealing with boring math.

I am personally very curious to see how this service will evolve and how AWS users will exploit its features.

Moreover, I would like to see how AWS will handle another typical need: Keeping the model updated with new observations. At the moment this is quite painful, as you would need to upload a brand new source to S3 and go through the whole training/testing process every time, ending up with N models, N evaluations, and N*3 data sources on your AWS Machine Learning dashboard.

Have you found any useful datasets? How are you going to use AWS Machine Learning? Let us know what you think and how we can help you improve your models.

In the meantime, I’d like to invite you to Cloud Academy to try our new hands-on Lab for free. You will be able to practice with the above-mentioned concepts and explore Amazon ML on the real AWS Console.


  • JeffWeakley

    I signed up for Cloud Academy trial and went through your tutorial (from the blog update). Very interested in the idea behind AWS Machine Learning. But I got stuck on the 4th part where you had to set up the S3, maybe it’s just me, but I found it confusing. Never got it to work, so I just moved on and read the content. Granted, I wasn’t able to do what I wanted, but I’m not clear on what models it ran and how to interpret the output.

    • Hi Jeff, I am sorry to hear that you got stuck during the Datasource creation.

      I didn’t focus much on explaining how the AWS Console and S3 work or how you would normally upload your dataset, as I already uploaded the needed files into a publicly accessible S3 bucket for you.

      Can you please help me find out what went wrong? Have you been shown any error message during the source validation?

      • JeffWeakley

        Thanks Alex. I’m in your tutorial trying it again. So I chose the N.Virginia region, but now I don’t know what to do with s3://amazon-machine-learning/HAR/. I assumed maybe I should click on S3, but when I do, it opens another tab with very little on it except… “You don’t have permission to use the Amazon S3 Console”. So I don’t know how to proceed and don’t see any hints to move forward with actually using the system.

        • Sorry about the misleading instructions, you don’t have to use S3 at all.
          You want to select the Amazon Machine Learning service and launch the Data source and Model creation process.

          I’m going to add a clear screenshot for this step asap. Let me know if you manage to create your Data source, using my S3 files as input.

  • David H

    One thing that I was looking for is some more insights about how individual variables influence the target.

    Seems to me that the AWS Machine Learning is great as a “black box” prediction engine, but does not actually give insights on individual variables – unless there is something I missed.

    • Hi David, yes I think you actually missed it.

      Let me quote myself:

      Even before training and evaluating your model, you can analyze your data source to better understand the often hidden correlations within your data. Indeed, in the Datasource attributes section, you can find the values distribution of all your columns, and clearly see which of them contribute more in defining your target value.

      As I mentioned, you can find the values distribution for all your variables and their individual correlation with your target variable as well.

      Interestingly, you can do this before training a model, so that you won’t waste your time training a useless model, in case no correlation appears at all.

  • Hanan Shteingart

    I would like to comment that variables can be uncorrelated with the target, yet informative. Think of z=x*y where x=+-1 and y=+-1 with p(x=1)=1/2 and p(y=1)=1/2 and x, y are independent. In this case, z has zero correlation with both x and y, yet it can be perfectly predicted from x, y.

    • Thank you for the observation, @hananshteingart.
      In this case, a linear prediction model would never fit since your target is a nonlinear combination of independent variables. Indeed, AmazonML doesn’t generate an accurate model at all, given the dataset you suggested: I quickly gave it a try, hoping that it would somehow fit nonlinear models as well.

      In order to work around this scenario – whenever there seems to be no correlation between target and inputs – I would suggest augmenting your dataset with some nonlinear combinations of the input variables (e.g. x^2, y^2 and xy). In this particular case, the input variable xy would precisely match our target and the model would become a trivial prediction formula, with 100% precision.

      In the general cases of multi-class models and regression models though, the situation you described is highly unlikely and AmazonML will behave pretty well. It would be great if AmazonML could detect these cases and automatically use nonlinear models or augment your dataset.

    • Alex Ingerman

      Hi Hanan –

      My name is Alex, I am the product manager for Amazon Machine Learning. I wanted to chime in here to mention that Alex Casalboni is exactly right: the way to deal with non-linearly correlated input variables in Amazon ML (and other linear learning algorithms) is by creating new variables that model the feature interactions.

      Amazon ML has a couple of built-in data transformers that make creation of non-linear “derived” variables easy. There currently isn’t one for multiplying two numbers together (I will keep this scenario in mind!). However, you can:

      1. Divide numerical values into bins, and learn a value for each bin, with the Quantile Binning transformation.
      2. Capture interactions between categorical variables with the Cartesian Product transformation.
      3. Capture values for individual tokens (words or groups of words) in text variables, using the N-Gram and Orthogonal Sparse Bigram transformations.

      You can find more information and example use for all of these data transformers in the Amazon ML developer guide: http://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html

      Finally, thank you for the feedback about using non-linear ML models. It is something that we certainly think about and evaluate for future releases, although I don’t have any specific plans to share at this time.


  • according to the comments, seems like the emphasized point “trains and tests a lot of complex models” is not really true.

    • Hi @zh012, you are right. The interpretation of “complex” in that sentence was quite subjective. In practice, at least for now, they will be only linear models, trained and tuned with varying parameters, and AmazonML will choose the best one.
      This behaviour is completely transparent to the developer and might eventually improve over time, involving non-linear and more complex models.