Skip to main content

Amazon Mechanical Turk: Help for Building Your Machine Learning Datasets

How to use Mechanical Turk in combination with Amazon ML for dataset labelling

Whether you build your own machine learning models in the Cloud or using complex mathematical tools, one of the most expensive and time consuming part of building your model is likely to be generating a high-quality dataset.

Sometimes you already have a large amount of historical data and a precise ground truth knowledge about each data point, in which case your dataset is already labelled and all you need to do is clean, normalize, sub-sample, analyze, and train a model, and then iterate until you achieve a good evaluation.

But more often, all you have is a big bucket of raw unlabelled data and the process of manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc), usually at a very high cost. Other contexts, for example in the case of multimedia annotations, are way harder to handle, and it turns out that crowdsourcing might be a great way to cut down both costs and time.

What is Amazon Mechanical Turk?

Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.

The platform provides useful tools to accurately describe your task, specify consensus rules, and the amount you will spend for each item. Roughly, considering a $0.30 reward for each task and only one submission for each item, you could label a 1,000 record-dataset for as little as $300 (plus fees) in a few hours. This might just be cheap, fast, and accurate enough.

In case your task is particularly tough, you can raise the number of submissions to two and eventually lower the reward to $0.20, resulting in a total cost of $400, and so on until you find the best trade off between quality and cost. As a general rule, one well-rewarded task usually brings more quality than two cheap ones.

A real-world labelling example

Let’s consider a simple use case. Suppose you want to understand whether your website users have uploaded a good-looking profile picture or something else (i.e. an abstract avatar, a landscape, a group picture, etc). This might make sense if your website is a hiring platform, or some kind of app where mutual trust and real human interactions are important elements. Of course there are plenty of “as-a-service” solutions out there that might also help you for this kind of project, but this is just an artificial example.

First of all, you’ll have to sign up on the official MTurk website and create a new project. The platform provides a useful set of preconfigured tasks. In our case we can select “Categorization“.

Mechanical Turk Project Start a New Project
Then we need to create a list of possible categories, optionally containing sub-categories. For our classification problem we will be totally OK with a binary classifier (i.e. “good profile picture” or “bad profile picture”), but since we are paying for the task we’d better retrieve as much data as possible. Therefore I defined a short list of categories so that we will have the flexibility of choosing which “good or bad” category afterwards.
Mechanical Turk Possible Categories
The next step is to describe your task and, optionally, provide additional information (like real examples or doubtful cases), so your workers will know what each category should include or exclude. The “general instructions” section is very important as well, as it should attract high-quality workers and accurately define the context, but preferably without being too verbose.
Mechanical Turk Instructions & Criteria
At the end of the task configuration phase, you can either upload a CSV file or use the Mechanical Turk API to provide the items to classify. In the case of images, you can only provide a public URL that will be served to a worker along with your task description and any additional fields you set as visible. I uploaded a simple CSV file with 3 rows, each one containing only a UserID (hidden) and an ImageURL.

Finally, you are shown a checkout preview where you can choose how much each single task will cost and how many times it should be processed to find consensus.
Mechanical Turk Cost Estimation
As soon as you confirm these options and proceed with the payment, your tasks will start being served until each record of your dataset is classified.

How to build a model from Mechanical Turk results

Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have achieved any consensus, so could either improve your task instructions or, if the remaining dataset is big and statistically distributed enough to generate a useful model, simply discard them.

Our next step will be to upload our labelled dataset into Amazon Machine Learning, create a DataSource, and go through the model training and evaluation phases.

But how do you classify images on Amazon Machine Learning?

Unfortunately, AmazonML doesn’t yet provide any high-level classification tools for multimedia objects like images, audio, or video. Hopefully they will add this kind of functionality soon, but until then you will have to take care of everything related to the features extraction process. Of course you can’t just give AmazonML a public URL or a binary string, so you will need to add some complexity to your dataset.

Generally speaking, each multimedia classification problem might need different features depending on which kind of classification you are trying to achieve (i.e. is color important? maybe shapes are more relevant?). In our case, I would say that both color and shape matter and we may decide to include features such as image dimensions, predominant colors, corners, and edges histograms.

Luckily, you don’t have to implement or know all these features, as many helpful languages, libraries and APIs like NumPy, MatLab and Rare, are available to automatically extract useful (arrays of) numerical features. As soon as you have a real dataset full of features, Amazon Machine Learning will take care of the rest.

The tricky part you should keep in mind is that the very same features extraction logic will have to be executed before each classification request and for each image, since your AmazonML model has been trained that way and will expect the same features at runtime. My suggestion would be to either implement the feature extraction functionality in the same language of your webapp (i.e. Python) or design it as a WebService/API, so that any component of your stack will be able to call it without worrying too much about the complex technology behind it.

Conclusions

Besides the complexity of multimedia classification, which will hopefully be addressed by AWS soon, I think that Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.

Other solutions could involve unsupervised learning techniques, such as clustering and neural networks, which are pretty good at identifying patterns and structures in unlabelled data. However for most tasks, they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade off between cost, complexity, and speed.

Browse Cloud Academy’s library for all machine learning training material.

Avatar

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.

Related Posts

Avatar
John Chell
— June 13, 2019

AWS Certified Solutions Architect Associate: A Study Guide

The AWS Solutions Architect - Associate Certification (or Sol Arch Associate for short) offers some clear benefits: Increases marketability to employers Provides solid credentials in a growing industry (with projected growth of as much as 70 percent in five years) Market anal...

Read more
  • AWS
  • AWS Certifications
Chris Gambino and Joe Niemiec
Chris Gambino and Joe Niemiec
— June 11, 2019

Moving Data to S3 with Apache NiFi

Moving data to the cloud is one of the cornerstones of any cloud migration. Apache NiFi is an open source tool that enables you to easily move and process data using a graphical user interface (GUI).  In this blog post, we will examine a simple way to move data to the cloud using NiFi c...

Read more
  • AWS
  • S3
Avatar
Chandan Patra
— June 11, 2019

Amazon DynamoDB: 10 Things You Should Know

Amazon DynamoDB is a managed NoSQL service with strong consistency and predictable performance that shields users from the complexities of manual setup.Whether or not you've actually used a NoSQL data store yourself, it's probably a good idea to make sure you fully understand the key ...

Read more
  • AWS
  • DynamoDB
Avatar
Andrew Larkin
— June 6, 2019

The 11 AWS Certifications: Which is Right for You and Your Team?

As companies increasingly shift workloads to the public cloud, cloud computing has moved from a nice-to-have to a core competency in the enterprise. This shift requires a new set of skills to design, deploy, and manage applications in cloud computing.As the market leader and most ma...

Read more
  • AWS
  • AWS Certifications
Sam Ghardashem
Sam Ghardashem
— May 15, 2019

Aviatrix Integration of a NextGen Firewall in AWS Transit Gateway

Learn how Aviatrix’s intelligent orchestration and control eliminates unwanted tradeoffs encountered when deploying Palo Alto Networks VM-Series Firewalls with AWS Transit Gateway.Deploying any next generation firewall in a public cloud environment is challenging, not because of the f...

Read more
  • AWS
Joe Nemer
Joe Nemer
— May 3, 2019

AWS Config Best Practices for Compliance

Use AWS Config the Right Way for Successful ComplianceIt’s well-known that AWS Config is a powerful service for monitoring all changes across your resources. As AWS Config has constantly evolved and improved over the years, it has transformed into a true powerhouse for monitoring your...

Read more
  • AWS
  • Compliance
Avatar
Francesca Vigliani
— April 30, 2019

Cloud Academy is Coming to the AWS Summits in Atlanta, London, and Chicago

Cloud Academy is a proud sponsor of the 2019 AWS Summits in Atlanta, London, and Chicago. We hope you plan to attend these free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. These events are all about learning. You can learn how t...

Read more
  • AWS
  • AWS Summits
Paul Hortop
Paul Hortop
— April 2, 2019

How to Monitor Your AWS Infrastructure

The AWS cloud platform has made it easier than ever to be flexible, efficient, and cost-effective. However, monitoring your AWS infrastructure is the key to getting all of these benefits. Realizing these benefits requires that you follow AWS best practices which constantly change as AWS...

Read more
  • AWS
  • Monitoring
Joe Nemer
Joe Nemer
— April 1, 2019

AWS EC2 Instance Types Explained

Amazon Web Services’ resource offerings are constantly changing, and staying on top of their evolution can be a challenge. Elastic Cloud Compute (EC2) instances are one of their core resource offerings, and they form the backbone of most cloud deployments. EC2 instances provide you with...

Read more
  • AWS
  • EC2
Avatar
Nitheesh Poojary
— March 26, 2019

How DNS Works – the Domain Name System (Part One)

Before migrating domains to Amazon's Route53, we should first make sure we properly understand how DNS worksWhile we'll get to AWS's Route53 Domain Name System (DNS) service in the second part of this series, I thought it would be helpful to first make sure that we properly understand...

Read more
  • AWS
Avatar
Stuart Scott
— March 14, 2019

Multiple AWS Account Management using AWS Organizations

As businesses expand their footprint on AWS and utilize more services to build and deploy their applications, it becomes apparent that multiple AWS accounts are required to manage the environment and infrastructure.  A multi-account strategy is beneficial for a number of reasons as ...

Read more
  • AWS
  • Identity Access Management
Avatar
Sanket Dangi
— February 11, 2019

WaitCondition Controls the Pace of AWS CloudFormation Templates

AWS's WaitCondition can be used with CloudFormation templates to ensure required resources are running.As you may already be aware, AWS CloudFormation is used for infrastructure automation by allowing you to write JSON templates to automatically install, configure, and bootstrap your ...

Read more
  • AWS
  • CloudFormation