Amazon Mechanical Turk: Help for Building Your Machine Learning Datasets

How to use Mechanical Turk in combination with Amazon ML for dataset labelling

Whether you build your own machine learning models in the Cloud or using complex mathematical tools, one of the most expensive and time consuming part of building your model is likely to be generating a high-quality dataset.

Sometimes you already have a large amount of historical data and a precise ground truth knowledge about each data point, in which case your dataset is already labelled and all you need to do is clean, normalize, sub-sample, analyze, and train a model, and then iterate until you achieve a good evaluation.

But more often, all you have is a big bucket of raw unlabelled data and the process of manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc), usually at a very high cost. Other contexts, for example in the case of multimedia annotations, are way harder to handle, and it turns out that crowdsourcing might be a great way to cut down both costs and time.

What is Amazon Mechanical Turk?

Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.

The platform provides useful tools to accurately describe your task, specify consensus rules, and the amount you will spend for each item. Roughly, considering a $0.30 reward for each task and only one submission for each item, you could label a 1,000 record-dataset for as little as $300 (plus fees) in a few hours. This might just be cheap, fast, and accurate enough.

In case your task is particularly tough, you can raise the number of submissions to two and eventually lower the reward to $0.20, resulting in a total cost of $400, and so on until you find the best trade off between quality and cost. As a general rule, one well-rewarded task usually brings more quality than two cheap ones.

A real-world labelling example

Let’s consider a simple use case. Suppose you want to understand whether your website users have uploaded a good-looking profile picture or something else (i.e. an abstract avatar, a landscape, a group picture, etc). This might make sense if your website is a hiring platform, or some kind of app where mutual trust and real human interactions are important elements. Of course there are plenty of “as-a-service” solutions out there that might also help you for this kind of project, but this is just an artificial example.

First of all, you’ll have to sign up on the official MTurk website and create a new project. The platform provides a useful set of preconfigured tasks. In our case we can select “Categorization“.

Mechanical Turk Project Start a New Project
Then we need to create a list of possible categories, optionally containing sub-categories. For our classification problem we will be totally OK with a binary classifier (i.e. “good profile picture” or “bad profile picture”), but since we are paying for the task we’d better retrieve as much data as possible. Therefore I defined a short list of categories so that we will have the flexibility of choosing which “good or bad” category afterwards.
Mechanical Turk Possible Categories
The next step is to describe your task and, optionally, provide additional information (like real examples or doubtful cases), so your workers will know what each category should include or exclude. The “general instructions” section is very important as well, as it should attract high-quality workers and accurately define the context, but preferably without being too verbose.
Mechanical Turk Instructions & Criteria
At the end of the task configuration phase, you can either upload a CSV file or use the Mechanical Turk API to provide the items to classify. In the case of images, you can only provide a public URL that will be served to a worker along with your task description and any additional fields you set as visible. I uploaded a simple CSV file with 3 rows, each one containing only a UserID (hidden) and an ImageURL.

Finally, you are shown a checkout preview where you can choose how much each single task will cost and how many times it should be processed to find consensus.
Mechanical Turk Cost Estimation
As soon as you confirm these options and proceed with the payment, your tasks will start being served until each record of your dataset is classified.

How to build a model from Mechanical Turk results

Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have achieved any consensus, so could either improve your task instructions or, if the remaining dataset is big and statistically distributed enough to generate a useful model, simply discard them.

Our next step will be to upload our labelled dataset into Amazon Machine Learning, create a DataSource, and go through the model training and evaluation phases.

But how do you classify images on Amazon Machine Learning?

Unfortunately, AmazonML doesn’t yet provide any high-level classification tools for multimedia objects like images, audio, or video. Hopefully they will add this kind of functionality soon, but until then you will have to take care of everything related to the features extraction process. Of course you can’t just give AmazonML a public URL or a binary string, so you will need to add some complexity to your dataset.

Generally speaking, each multimedia classification problem might need different features depending on which kind of classification you are trying to achieve (i.e. is color important? maybe shapes are more relevant?). In our case, I would say that both color and shape matter and we may decide to include features such as image dimensions, predominant colors, corners, and edges histograms.

Luckily, you don’t have to implement or know all these features, as many helpful languages, libraries and APIs like NumPy, MatLab and Rare, are available to automatically extract useful (arrays of) numerical features. As soon as you have a real dataset full of features, Amazon Machine Learning will take care of the rest.

The tricky part you should keep in mind is that the very same features extraction logic will have to be executed before each classification request and for each image, since your AmazonML model has been trained that way and will expect the same features at runtime. My suggestion would be to either implement the feature extraction functionality in the same language of your webapp (i.e. Python) or design it as a WebService/API, so that any component of your stack will be able to call it without worrying too much about the complex technology behind it.

Conclusions

Besides the complexity of multimedia classification, which will hopefully be addressed by AWS soon, I think that Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.

Other solutions could involve unsupervised learning techniques, such as clustering and neural networks, which are pretty good at identifying patterns and structures in unlabelled data. However for most tasks, they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade off between cost, complexity, and speed.

Browse Cloud Academy’s library for all machine learning training material.

Avatar

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.


Related Posts

Joe Nemer
Joe Nemer
— October 14, 2020

New Content: AWS Data Analytics – Specialty Certification, Azure AI-900 Certification, Plus New Learning Paths, Courses, Labs, and More

This month our Content Team released two big certification Learning Paths: the AWS Certified Data Analytics - Speciality, and the Azure AI Fundamentals AI-900. In total, we released four new Learning Paths, 16 courses, 24 assessments, and 11 labs.  New content on Cloud Academy At any ...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— September 15, 2020

New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More

This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— August 28, 2020

AWS Certification Practice Exam: What to Expect from Test Questions

If you’re building applications on the AWS cloud or looking to get started in cloud computing, certification is a way to build deep knowledge in key services unique to the AWS platform. AWS currently offers 12 certifications that cover major cloud roles including Solutions Architect, De...

Read more
  • AWS
  • AWS Certifications
Patrick Navarro
Patrick Navarro
— August 25, 2020

Overcoming Unprecedented Business Challenges with AWS

From auto-scaling applications with high availability to video conferencing that’s used by everyone, every day —  cloud technology has never been more popular or in-demand. But what does this mean for experienced cloud professionals and the challenges they face as they carve out a new p...

Read more
  • AWS
  • Cloud Adoption
  • digital transformation
Avatar
Andrew Larkin
— August 18, 2020

Constant Content: Cloud Academy’s Q3 2020 Roadmap

Hello —  Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...

Read more
  • alibaba
  • AWS
  • Azure
  • content roadmap
  • Content updates
  • DevOps
  • GCP
  • Google Cloud
  • New content
Alisha Reyes
Alisha Reyes
— August 5, 2020

New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More

This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Alisha Reyes
Alisha Reyes
— July 16, 2020

Blog Digest: Which Certifications Should I Get?, The 12 Microsoft Azure Certifications, 6 Ways to Prevent a Data Breach, and More

This month, we were excited to announce that Cloud Academy was recognized in the G2 Summer 2020 reports! These reports highlight the top-rated solutions in the industry, as chosen by the source that matters most: customers. We're grateful to have been nominated as a High Performer in se...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Avatar
Cloud Academy Team
— July 9, 2020

Which Certifications Should I Get?

The old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and companies. With all that in mind, the s...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— June 19, 2020

Kickstart Your Tech Training With a Free Week on Cloud Academy

Are you looking to make a jump in your technical career? Want to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Kubernetes, Python, or another in-demand skill? Then you'll want to mark your calendar. Starting Monday, June 22 at 12:00 a.m. PDT (3:00 a.m. EDT), ...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Rebecca Willis
Rebecca Willis
— June 3, 2020

Azure vs. AWS: Which Certification Provides the Brighter Future?

More and more companies are using cloud services, prompting more and more people to switch their current IT position to something cloud-related. The problem is most people only have that much time after work to learn new technologies, and there are plenty of cloud services that you can ...

Read more
  • AWS
  • Azure
  • certification