How to use Mechanical Turk in combination with Amazon ML for dataset labelling
Whether you build your own machine learning models in the Cloud or using complex mathematical tools, one of the most expensive and time consuming part of building your model is likely to be generating a high-quality dataset.
Sometimes you already have a large amount of historical data and a precise ground truth knowledge about each data point, in which case your dataset is already labelled and all you need to do is clean, normalize, sub-sample, analyze, and train a model, and then iterate until you achieve a good evaluation.
But more often, all you have is a big bucket of raw unlabelled data and the process of manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc), usually at a very high cost. Other contexts, for example in the case of multimedia annotations, are way harder to handle, and it turns out that crowdsourcing might be a great way to cut down both costs and time.
What is Amazon Mechanical Turk?
Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.
The platform provides useful tools to accurately describe your task, specify consensus rules, and the amount you will spend for each item. Roughly, considering a $0.30 reward for each task and only one submission for each item, you could label a 1,000 record-dataset for as little as $300 (plus fees) in a few hours. This might just be cheap, fast, and accurate enough.
In case your task is particularly tough, you can raise the number of submissions to two and eventually lower the reward to $0.20, resulting in a total cost of $400, and so on until you find the best trade off between quality and cost. As a general rule, one well-rewarded task usually brings more quality than two cheap ones.
A real-world labelling example
Let’s consider a simple use case. Suppose you want to understand whether your website users have uploaded a good-looking profile picture or something else (i.e. an abstract avatar, a landscape, a group picture, etc). This might make sense if your website is a hiring platform, or some kind of app where mutual trust and real human interactions are important elements. Of course there are plenty of “as-a-service” solutions out there that might also help you for this kind of project, but this is just an artificial example.
First of all, you’ll have to sign up on the official MTurk website and create a new project. The platform provides a useful set of preconfigured tasks. In our case we can select “Categorization“.
Then we need to create a list of possible categories, optionally containing sub-categories. For our classification problem we will be totally OK with a binary classifier (i.e. “good profile picture” or “bad profile picture”), but since we are paying for the task we’d better retrieve as much data as possible. Therefore I defined a short list of categories so that we will have the flexibility of choosing which “good or bad” category afterwards.
The next step is to describe your task and, optionally, provide additional information (like real examples or doubtful cases), so your workers will know what each category should include or exclude. The “general instructions” section is very important as well, as it should attract high-quality workers and accurately define the context, but preferably without being too verbose.
At the end of the task configuration phase, you can either upload a CSV file or use the Mechanical Turk API to provide the items to classify. In the case of images, you can only provide a public URL that will be served to a worker along with your task description and any additional fields you set as visible. I uploaded a simple CSV file with 3 rows, each one containing only a UserID (hidden) and an ImageURL.
Finally, you are shown a checkout preview where you can choose how much each single task will cost and how many times it should be processed to find consensus.
As soon as you confirm these options and proceed with the payment, your tasks will start being served until each record of your dataset is classified.
How to build a model from Mechanical Turk results
Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have achieved any consensus, so could either improve your task instructions or, if the remaining dataset is big and statistically distributed enough to generate a useful model, simply discard them.
Our next step will be to upload our labelled dataset into Amazon Machine Learning, create a DataSource, and go through the model training and evaluation phases.
But how do you classify images on Amazon Machine Learning?
Unfortunately, AmazonML doesn’t yet provide any high-level classification tools for multimedia objects like images, audio, or video. Hopefully they will add this kind of functionality soon, but until then you will have to take care of everything related to the features extraction process. Of course you can’t just give AmazonML a public URL or a binary string, so you will need to add some complexity to your dataset.
Generally speaking, each multimedia classification problem might need different features depending on which kind of classification you are trying to achieve (i.e. is color important? maybe shapes are more relevant?). In our case, I would say that both color and shape matter and we may decide to include features such as image dimensions, predominant colors, corners, and edges histograms.
Luckily, you don’t have to implement or know all these features, as many helpful languages, libraries and APIs like NumPy, MatLab and Rare, are available to automatically extract useful (arrays of) numerical features. As soon as you have a real dataset full of features, Amazon Machine Learning will take care of the rest.
The tricky part you should keep in mind is that the very same features extraction logic will have to be executed before each classification request and for each image, since your AmazonML model has been trained that way and will expect the same features at runtime. My suggestion would be to either implement the feature extraction functionality in the same language of your webapp (i.e. Python) or design it as a WebService/API, so that any component of your stack will be able to call it without worrying too much about the complex technology behind it.
Besides the complexity of multimedia classification, which will hopefully be addressed by AWS soon, I think that Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.
Other solutions could involve unsupervised learning techniques, such as clustering and neural networks, which are pretty good at identifying patterns and structures in unlabelled data. However for most tasks, they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade off between cost, complexity, and speed.
Browse Cloud Academy’s library for all machine learning training material.
What Exactly Is a Cloud Architect and How Do You Become One?
One of the buzzwords surrounding the cloud that I'm sure you've heard is "Cloud Architect." In this article, I will outline my understanding of what a cloud architect does and I'll analyze the skills and certifications necessary to become one. I will also list some of the types of jobs ...
Content Roadmap: AZ-500, ITIL 4, MS-100, Google Cloud Associate Engineer, and More
Last month, Cloud Academy joined forces with QA, the UK’s largest B2B skills provider, and it put us in an excellent position to solve a massive skills gap problem. As a result of this collaboration, you will see our training library grow with additions from QA’s massive catalog of 500+...
DevSecOps: How to Secure DevOps Environments
Security has been a friction point when discussing DevOps. This stems from the assumption that DevOps teams move too fast to handle security concerns. This makes sense if Information Security (InfoSec) is separate from the DevOps value stream, or if development velocity exceeds the band...
Test Your Cloud Knowledge on AWS, Azure, or Google Cloud Platform
Cloud skills are in demand | In today's digital era, employers are constantly seeking skilled professionals with working knowledge of AWS, Azure, and Google Cloud Platform. According to the 2019 Trends in Cloud Transformation report by 451 Research: Business and IT transformations re...
Disadvantages of Cloud Computing
If you want to deliver digital services of any kind, you’ll need to estimate all types of resources, not the least of which are CPU, memory, storage, and network connectivity. Which resources you choose for your delivery — cloud-based or local — is up to you. But you’ll definitely want...
Google Cloud vs AWS: A Comparison (or can they be compared?)
The "Google Cloud vs AWS" argument used to be a common discussion among our members, but is this still really a thing? You may already know that there are three major players in the public cloud platforms arena: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP)...
Deployment Orchestration with AWS Elastic Beanstalk
If you're responsible for the development and deployment of web applications within your AWS environment for your organization, then it's likely you've heard of AWS Elastic Beanstalk. If you are new to this service, or simply need to know a bit more about the service and the benefits th...
How to Use & Install the AWS CLI
What is the AWS CLI? | The AWS Command Line Interface (CLI) is for managing your AWS services from a terminal session on your own client, allowing you to control and configure multiple AWS services and implement a level of automation. If you’ve been using AWS for some time and feel...
Cloud Academy’s Blog Digest: July 2019
July has been a very exciting month for us at Cloud Academy. On July 10, we officially joined forces with QA, the UK’s largest B2B skills provider (read the announcement). Over the coming weeks, you will see additions from QA’s massive catalog of 500+ certification courses and 1500+ ins...
AWS Fundamentals: Understanding Compute, Storage, Database, Networking & Security
If you are just starting out on your journey toward mastering AWS cloud computing, then your first stop should be to understand the AWS fundamentals. This will enable you to get a solid foundation to then expand your knowledge across the entire AWS service catalog. It can be both d...
How to Become a DevOps Engineer
The DevOps Handbook introduces DevOps as a framework for improving the process for converting a business hypothesis into a technology-enabled service that delivers value to the customer. This process is called the value stream. Accelerate finds that applying DevOps principles of flow, f...
AWS AMI Virtualization Types: HVM vs PV (Paravirtual VS Hardware VM)
Amazon Machine Images (AWS AMI) offers two types of virtualization: Paravirtual (PV) and Hardware Virtual Machine (HVM). Each solution offers its own advantages. When we’re using AWS, it’s easy for someone — almost without thinking — to choose which AMI flavor seems best when spinning...