Introduction to Amazon Machine Learning
A common question in the medical field is:
Is it possible to distinguish one class of samples from another, based on some set of measurements?
Research investigating this and related medical questions has spurred innovation in medicine and in the application of statistical methods and machine learning for decades. In this post, we’ll address how to answer these questions using the highly available, scalable, and easy-to-use cloud computing services included in Amazon Web Services (AWS).
We’ll start by guiding you through using Amazon Machine Learning to classify medical tumor samples as benign or malignant. Then, we’ll explore other machine learning services and how they could be used to investigate medical questions.
This section investigates the medical question:
Is it possible to distinguish which breast mass samples are malignant given measurements from digital images of their cells?
The same question was asked by Dr. William Wolberg at the University of Wisconsin Hospital who created the data that is used in this section. During clinical trials, he extracted breast mass using a fine needle and took a variety of measurements of cell nuclei from magnified images similar to the following:
The diagnosis made by a pathologist is also recorded in the data.
The research aimed to determine whether a computer could predict the same diagnosis faster and less expensively. You can experiment with this yourself in a new hands-on lab on Cloud Academy, Diagnose Cancer with an Amazon Machine Learning Classifier.
The raw data is hosted by the University of California, Irvine Machine Learning Repository. However, this guide uses a version that includes descriptive headings for each measurement field and a 0/1 encoding for the diagnosis.
Some examples of measurements taken of each nucleus include the radius, perimeter, area, smoothness, and concavity.
Because the images contain multiple cells, the data file records the mean, standard error, and maximum of each group of measurements.
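The aggregation can be sketched in a few lines of Python (a hypothetical illustration; the radius values and the `summarize` helper are assumptions, not part of the actual data pipeline):

```python
import math

def summarize(values):
    """Summarize per-nucleus measurements the way the data file does:
    mean, standard error, and maximum of the group."""
    n = len(values)
    mean = sum(values) / n
    # Standard error of the mean: sample standard deviation divided by sqrt(n)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std_error = math.sqrt(variance) / math.sqrt(n)
    return {"mean": mean, "se": std_error, "max": max(values)}

# Hypothetical radius measurements for the cells in one image
radii = [14.2, 15.1, 13.8, 16.0, 14.9]
print(summarize(radii))
```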
The diagnosis for a sample can be one of two classes: benign or malignant. In machine learning, this is referred to as binary classification. In AWS, you can use the Amazon Machine Learning (ML) service to build and train a predictive model for binary classification. To use Amazon ML, you typically perform four steps: create a data source, create a model, evaluate the model, and use the model to make predictions.
If the evaluation results aren’t satisfactory, you might adjust some settings in the model. The default settings usually provide good results and are a good place to start. The remainder of this section will briefly cover each of the four steps for the medical question at hand.
Amazon ML accesses data through data sources. A data source can reference data in Amazon Simple Storage Service (S3), the Amazon Redshift data warehouse service, or Amazon Relational Database Service (RDS). All of these services are HIPAA eligible. For this experiment, S3 is the appropriate choice since the data exists in a file. All you need to do is upload the file to an S3 bucket where you have permission to create bucket policies:
You need permission to create bucket policies because Amazon ML creates a bucket policy for itself on your behalf to use the data.
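The policy Amazon ML creates grants the service read access to the uploaded data. It looks roughly like the following (an illustrative sketch only; the bucket name is a placeholder, and the statement details are assumptions rather than the exact policy Amazon ML generates):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-medical-data-bucket",
        "arn:aws:s3:::example-medical-data-bucket/*"
      ]
    }
  ]
}
```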
When you access Amazon ML for the first time, you can launch a Standard setup wizard that steps you through creating a data source, model, and evaluation:
Following the setup wizard to create the data source, you specify the S3 location of the uploaded file, verify the schema that Amazon ML infers from the data, and select the diagnosis column as the target attribute.
In addition to referencing actual data, a data source gathers statistics about the data that are useful for building a model. Amazon’s elastically scalable compute infrastructure begins analyzing the data using multiple servers as soon as the data source is created. After processing completes, you can inspect the statistics for all the attributes in the data source. The medical data primarily contains numerical attributes (the measurements taken from images). Correlation to target, range, mean, median, maximum, minimum, and distributions are calculated for numerical attributes:
The distributions can be visualized using different bin widths automatically chosen by Amazon ML:
The Amazon ML wizard automatically starts creating a model based on the data source you created. The data source target attribute determines the type of model that will be created:
Because the diagnosis column is binary, the created model will be a binary classification model.
In the wizard, use the Default training and evaluation settings as a starting point for your model:
The default settings will automatically generate a recipe for transforming the data source columns into features of the model. Examples of transformations include normalizing numerical columns to zero mean and unit standard deviation, or grouping ranges of values into bins instead of using the column values directly. The statistics that Amazon ML gathers for the data source are used to automatically determine which recipe is most likely to work well. The default settings also split the data into training and test sets for evaluation.
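One of the transformations mentioned above, normalizing a numeric column to zero mean and unit standard deviation, can be sketched in a few lines of Python (a conceptual illustration, not the actual recipe code Amazon ML generates; the area values are hypothetical):

```python
import math

def normalize(column):
    """Scale values to zero mean and a standard deviation of one."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

areas = [500.0, 750.0, 1000.0]  # hypothetical area measurements
print(normalize(areas))          # roughly [-1.22, 0.0, 1.22]
```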
You can use custom settings for model training and evaluation. This is useful when you want more control over the model. For example, this could be helpful if the default settings don’t produce satisfactory results or if you know that a certain transformation or training parameter will work well. Custom settings allow you to control several training parameters including regularization type and amount to prevent overfitting the model to the training data. You can also specify an entirely separate data source for evaluation. The ability to specify separate data sources for validation allows you to perform more sophisticated validation schemes like k-fold cross-validation.
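As a sketch of the idea behind k-fold cross-validation (plain Python; the records are stand-in values): the data is split into k folds, and each fold takes a turn as the evaluation set while the model trains on the rest.

```python
def k_fold_splits(records, k):
    """Yield (train, test) pairs, each fold serving once as the test set."""
    folds = [records[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

samples = list(range(10))  # stand-ins for labeled records
for train, test in k_fold_splits(samples, 5):
    print(len(train), len(test))  # 8 2 on each of the 5 iterations
```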
Completing the wizard kicks off model training. Once training is complete, the evaluation begins automatically, using the independent test set data that was held out during training. The model generates predictions for the unseen test data, and those predictions are compared to the actual diagnosis values.
Once a model evaluation finishes, you can explore how the model performs in the evaluations view. Overall model performance is measured by the Area Under the Curve (AUC), a single metric that summarizes accuracy across all possible threshold settings. By combining the measurements in the data, the model in this experiment achieves exceptionally good performance:
In most cases, an AUC around 0.7 or above is considered good.
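AUC has a useful interpretation: it is the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one. A minimal sketch (plain Python; the labels and scores are hypothetical, not taken from the actual evaluation):

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# 1 = malignant, 0 = benign; scores are hypothetical model outputs
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.3, 0.1]
print(auc(labels, scores))  # 5/6 = 0.8333...
```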
AUC can’t show you the entire performance picture. The binary classification model outputs a number between zero and one, where one is interpreted as malignant and zero as benign. For values in between, the interpretation depends on a threshold: any score above the threshold is predicted as malignant, and anything below it as benign. The default threshold is 0.5. Because AUC summarizes performance across all thresholds, a high AUC alone doesn’t tell you how the model will behave at the specific threshold you deploy.
The model evaluation includes a chart that is helpful for understanding the issue:
The threshold impacts the number of false positives (predicting malignant when the actual diagnosis is benign) and false negatives (predicting benign when the actual diagnosis is malignant). In general, a higher threshold reduces the number of false positives, while a lower threshold reduces the number of false negatives. In the medical domain, the cost of failing to diagnose a malignant tumor is greater than the cost of diagnosing a benign tumor as malignant, and this can influence the threshold setting. Because Amazon ML computes and visualizes these values for whatever threshold you choose, the results are easy to understand and hard to misinterpret.
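The trade-off can be made concrete with a few lines of Python (the scores and diagnoses below are hypothetical, not taken from the actual evaluation):

```python
def confusion_counts(labels, scores, threshold):
    """Count false positives and false negatives at a given threshold."""
    fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
    fn = sum(1 for l, s in zip(labels, scores) if s < threshold and l == 1)
    return fp, fn

labels = [1, 1, 1, 0, 0, 0]            # 1 = malignant, 0 = benign
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]
for t in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(labels, scores, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

Raising the threshold from 0.3 to 0.7 in this toy example trades false positives for false negatives, which is exactly the trade-off the evaluation chart visualizes.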
With a trained model in Amazon ML, you can make predictions in real time as well as in batches. For each type of prediction, you can use the Amazon ML console in your browser or an application programming interface (API) to programmatically make predictions. As an example, you can use the Try real-time predictions menu option in the Amazon ML console to make individual real-time predictions from inside your browser. From there, you can Paste a record to enter measurement values for each of the model attributes and have the model make a prediction:
You can use the following measurement sample:
The predictedLabel of 0 corresponds to a benign prediction. You can also see the predictedScores field, which shows the raw score computed before the threshold is applied. Scores closer to zero or one indicate more confident predictions.
This concludes the guide on binary classification in AWS. You have seen how Amazon Machine Learning could have been used to easily perform the pioneering research behind a now-common diagnostic procedure. However, there is much more to machine learning in AWS than binary classification in the Amazon Machine Learning service. The following section describes some of the services that can be used for medical research.
AWS provides a secure platform for research involving large amounts of data and collaborators across multiple sites. The elastic compute infrastructure allows you to scale to meet time constraints regardless of the size of the workload. Automated compliance services also enforce rules and policies required for handling sensitive data. In addition to these benefits for medical research, this section highlights a few AWS services that are useful for machine learning in the medical domain.
As was briefly mentioned in the section above, multi-class classification and multi-variable linear regression are other types of models available in Amazon ML. Multi-class classification can play an important role when considering a broader research question. For example, given a set of symptoms, what is the most likely diagnosis? Regression models are useful for understanding the impact of multiple factors on an independent variable. This could be used to understand how age, weight, alcohol consumption, and amount of exercise influence blood pressure.
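The regression idea can be sketched with ordinary least squares (hypothetical synthetic data; numpy's lstsq stands in for Amazon ML's regression model):

```python
import numpy as np

# Hypothetical factors: age, weight (kg), weekly exercise hours
X = np.array([
    [35, 70.0, 3.0],
    [50, 85.0, 1.0],
    [42, 78.0, 2.0],
    [60, 90.0, 0.5],
    [28, 65.0, 4.0],
])
# Synthetic systolic blood pressure generated from known coefficients
true_coef = np.array([0.5, 0.4, -2.0])
y = X @ true_coef + 90.0  # 90 is the baseline (intercept)

# Fit intercept + coefficients with ordinary least squares
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # recovers roughly [90.0, 0.5, 0.4, -2.0]
```

Because the synthetic data is noiseless, the fit recovers the generating coefficients; real measurements would only be approximated.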
A hands-on lab in Cloud Academy walks you through the multi-class classification capabilities of Amazon ML. The lab uses inertial sensor measurements from a subject’s mobile phone to recognize the subject’s activity from a set of six possible activities. A lab to demonstrate the regression model capabilities in Amazon ML is currently in the works, so keep an eye on the latest AWS labs on Cloud Academy.
Amazon maintains a set of Amazon Machine Images (AMIs) that allow you to launch servers, referred to as instances in AWS parlance, that have all of the most common machine learning frameworks pre-installed and ready to use. These AMIs are called Amazon Deep Learning AMIs.
Some of the frameworks included in the AMIs are Apache MXNet, the TensorFlow framework open-sourced by Google, and Caffe2, open-sourced by Facebook. The Amazon Deep Learning AMIs and the included frameworks are the subject of an entire course on Cloud Academy that will be released this month. A post on the AWS blog covers the story of a medical startup using AWS and the Amazon Deep Learning AMI to boost early cancer detection rates by automatically processing CT scan imagery with advanced computer vision algorithms.
Amazon has a wide variety of instance types to choose from to hit the right price point for your application. Some instance types have graphics processing units (GPUs) attached to greatly reduce the time required to train deep learning models. You will be able to experience the benefits of GPUs for machine learning first hand in a new hands-on lab on Cloud Academy. This lab will be available soon as part of our new learning path on Machine Learning in AWS (see more below). In addition to launching individual instances, Amazon provides a template to automatically provision a distributed computing cluster of deep learning instances:
This can save you a lot of manual setup when a single instance isn’t enough for your machine learning needs.
SageMaker fills a gap between Amazon ML and the Deep Learning AMIs. While Amazon ML is extremely easy to use, it supports only a limited set of model types. On the other hand, the Deep Learning AMIs give you complete freedom in modeling and training but require infrastructure maintenance and deep data science expertise. SageMaker provides almost as much flexibility as the Deep Learning AMIs in an easy-to-use, fully managed service.
SageMaker provides a fully managed service for machine learning, from data exploration to model hosting. SageMaker can access data in S3 and includes many common built-in machine learning algorithms. You also have the flexibility to fully customize algorithms using the Apache MXNet or TensorFlow frameworks. Once you build your model, you can train it with a single click; SageMaker automatically scales up the compute infrastructure to handle up to petabytes of data. The following infographic summarizes how SageMaker works:
While not a cloud service per se, a notable mention in this list is the AWS Machine Learning Research Awards program. Machine learning research conducted at a university can be eligible for awards that include funding, AWS credits, access to AWS training resources, and an invitation to present your work at a seminar at AWS headquarters!
AWS has a variety of tools available for medical research and applications. This post focused on some of the machine learning services that are available: Amazon Machine Learning, the Deep Learning AMIs, and Amazon SageMaker.
If this post has sparked a machine learning research question in your mind, you can consider applying for an AWS Machine Learning Research Award. For a comprehensive look at the entire machine learning ecosystem in AWS, watch the Machine Learning on AWS Learning Path on Cloud Academy.