Amazon Mechanical Turk: Help for Building Your Machine Learning Datasets

How to use Mechanical Turk in combination with Amazon ML for dataset labelling

Whether you build your own machine learning models in the Cloud or using complex mathematical tools, one of the most expensive and time consuming part of building your model is likely to be generating a high-quality dataset.

Sometimes you already have a large amount of historical data and a precise ground truth knowledge about each data point, in which case your dataset is already labelled and all you need to do is clean, normalize, sub-sample, analyze, and train a model, and then iterate until you achieve a good evaluation.

But more often, all you have is a big bucket of raw unlabelled data and the process of manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc), usually at a very high cost. Other contexts, for example in the case of multimedia annotations, are way harder to handle, and it turns out that crowdsourcing might be a great way to cut down both costs and time.

What is Amazon Mechanical Turk?

Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.

The platform provides useful tools to accurately describe your task, specify consensus rules, and the amount you will spend for each item. Roughly, considering a $0.30 reward for each task and only one submission for each item, you could label a 1,000 record-dataset for as little as $300 (plus fees) in a few hours. This might just be cheap, fast, and accurate enough.

In case your task is particularly tough, you can raise the number of submissions to two and eventually lower the reward to $0.20, resulting in a total cost of $400, and so on until you find the best trade off between quality and cost. As a general rule, one well-rewarded task usually brings more quality than two cheap ones.

A real-world labelling example

Let’s consider a simple use case. Suppose you want to understand whether your website users have uploaded a good-looking profile picture or something else (i.e. an abstract avatar, a landscape, a group picture, etc). This might make sense if your website is a hiring platform, or some kind of app where mutual trust and real human interactions are important elements. Of course there are plenty of “as-a-service” solutions out there that might also help you for this kind of project, but this is just an artificial example.

First of all, you’ll have to sign up on the official MTurk website and create a new project. The platform provides a useful set of preconfigured tasks. In our case we can select “Categorization“.

Mechanical Turk Project Start a New Project
Then we need to create a list of possible categories, optionally containing sub-categories. For our classification problem we will be totally OK with a binary classifier (i.e. “good profile picture” or “bad profile picture”), but since we are paying for the task we’d better retrieve as much data as possible. Therefore I defined a short list of categories so that we will have the flexibility of choosing which “good or bad” category afterwards.
Mechanical Turk Possible Categories
The next step is to describe your task and, optionally, provide additional information (like real examples or doubtful cases), so your workers will know what each category should include or exclude. The “general instructions” section is very important as well, as it should attract high-quality workers and accurately define the context, but preferably without being too verbose.
Mechanical Turk Instructions & Criteria
At the end of the task configuration phase, you can either upload a CSV file or use the Mechanical Turk API to provide the items to classify. In the case of images, you can only provide a public URL that will be served to a worker along with your task description and any additional fields you set as visible. I uploaded a simple CSV file with 3 rows, each one containing only a UserID (hidden) and an ImageURL.

Finally, you are shown a checkout preview where you can choose how much each single task will cost and how many times it should be processed to find consensus.
Mechanical Turk Cost Estimation
As soon as you confirm these options and proceed with the payment, your tasks will start being served until each record of your dataset is classified.

How to build a model from Mechanical Turk results

Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have achieved any consensus, so could either improve your task instructions or, if the remaining dataset is big and statistically distributed enough to generate a useful model, simply discard them.

Our next step will be to upload our labelled dataset into Amazon Machine Learning, create a DataSource, and go through the model training and evaluation phases.

But how do you classify images on Amazon Machine Learning?

Unfortunately, AmazonML doesn’t yet provide any high-level classification tools for multimedia objects like images, audio, or video. Hopefully they will add this kind of functionality soon, but until then you will have to take care of everything related to the features extraction process. Of course you can’t just give AmazonML a public URL or a binary string, so you will need to add some complexity to your dataset.

Generally speaking, each multimedia classification problem might need different features depending on which kind of classification you are trying to achieve (i.e. is color important? maybe shapes are more relevant?). In our case, I would say that both color and shape matter and we may decide to include features such as image dimensions, predominant colors, corners, and edges histograms.

Luckily, you don’t have to implement or know all these features, as many helpful languages, libraries and APIs like NumPy, MatLab and Rare, are available to automatically extract useful (arrays of) numerical features. As soon as you have a real dataset full of features, Amazon Machine Learning will take care of the rest.

The tricky part you should keep in mind is that the very same features extraction logic will have to be executed before each classification request and for each image, since your AmazonML model has been trained that way and will expect the same features at runtime. My suggestion would be to either implement the feature extraction functionality in the same language of your webapp (i.e. Python) or design it as a WebService/API, so that any component of your stack will be able to call it without worrying too much about the complex technology behind it.

Conclusions

Besides the complexity of multimedia classification, which will hopefully be addressed by AWS soon, I think that Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.

Other solutions could involve unsupervised learning techniques, such as clustering and neural networks, which are pretty good at identifying patterns and structures in unlabelled data. However for most tasks, they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade off between cost, complexity, and speed.

Browse Cloud Academy’s library for all machine learning training material.

Avatar

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.


Related Posts

Alisha Reyes
Alisha Reyes
— August 5, 2020

New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More

This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Alisha Reyes
Alisha Reyes
— July 16, 2020

Blog Digest: Which Certifications Should I Get?, The 12 Microsoft Azure Certifications, 6 Ways to Prevent a Data Breach, and More

This month, we were excited to announce that Cloud Academy was recognized in the G2 Summer 2020 reports! These reports highlight the top-rated solutions in the industry, as chosen by the source that matters most: customers. We're grateful to have been nominated as a High Performer in se...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Avatar
Cloud Academy Team
— July 9, 2020

Which Certifications Should I Get?

The old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and companies. With all that in mind, the s...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— June 19, 2020

Kickstart Your Tech Training With a Free Week on Cloud Academy

Are you looking to make a jump in your technical career? Want to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Kubernetes, Python, or another in-demand skill? Then you'll want to mark your calendar. Starting Monday, June 22 at 12:00 a.m. PDT (3:00 a.m. EDT), ...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Rebecca Willis
Rebecca Willis
— June 3, 2020

Azure vs. AWS: Which Certification Provides the Brighter Future?

More and more companies are using cloud services, prompting more and more people to switch their current IT position to something cloud-related. The problem is most people only have that much time after work to learn new technologies, and there are plenty of cloud services that you can ...

Read more
  • AWS
  • Azure
  • certification
Alisha Reyes
Alisha Reyes
— June 2, 2020

Blog Digest: 5 Reasons to Get AWS Certified, OWASP Top 10, Getting Started with VPCs, Top 10 Soft Skills, and More

Thank you for being a valued member of our community! We recently sent out a short survey to understand what type of content you would like us to add to Cloud Academy, and we want to thank everyone who gave us their input. If you would like to complete the survey, it's not too late. It ...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Alisha Reyes
Alisha Reyes
— May 11, 2020

New Content: Alibaba, Azure Cert Prep: AI-100, AZ-104, AZ-204 & AZ-400, Amazon Athena Playground, Google Cloud Developer Challenge, and much more

This month, our Content Team released 8 new learning paths, 4 courses, 7 labs in real cloud environments, and 4 new knowledge check assessments. Not only that, but we introduced our very first course on Alibaba Cloud, and our expert instructors are working 'round the clock to create 6 n...

Read more
  • alibaba
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Avatar
Rhonda Martinez
— May 4, 2020

Top 5 Reasons to Get AWS Certified Right Now

Cloud computing trends are on the rise and have been for some time already. Fortunately, it’s never too late to start learning cloud computing. Skills like AWS and others associated with cloud computing are in high demand because cloud technologies have become crucial for many businesse...

Read more
  • Amazon Elastic Book Store
  • Amazon Elastic Compute Cloud (EC2)
  • AWS
  • AWS Certifications
  • Glacier
Alisha Reyes
Alisha Reyes
— May 1, 2020

Introducing Our Newest Lab Environments: Lab Playgrounds

Want to train in a real cloud environment, but feel slowed down by spinning up your own deployments? When you consider security or pricing costs, it can be costly and challenging to get up to speed quickly for self-training. To solve this problem, Cloud Academy created a new suite of la...

Read more
  • AWS
  • Azure
  • Docker
  • Google Cloud Platform
  • Java
  • lab playgrounds
  • Python
Alisha Reyes
Alisha Reyes
— April 30, 2020

Blog Digest: AWS Breaking News, Azure DevOps, AWS Study Guide, 8 Ways to Prevent a Ransomware Attack, and More

  New articles by topic AWS Azure Data Science Google Cloud  Cloud Adoption Platform Updates & New Content Security Women in Tech AWS Breaking News: All AWS Certification Exams Now Available Online As an Advanced AWS Technology Partner, C...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • programming
  • Security