How to use Mechanical Turk in combination with Amazon ML for dataset labelling
Whether you build your own machine learning models in the Cloud or with complex mathematical tools, one of the most expensive and time-consuming parts of building your model is likely to be generating a high-quality dataset.
Sometimes you already have a large amount of historical data and a precise ground truth knowledge about each data point, in which case your dataset is already labelled and all you need to do is clean, normalize, sub-sample, analyze, and train a model, and then iterate until you achieve a good evaluation.
But more often, all you have is a big bucket of raw unlabelled data, and manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc.), usually at a very high cost. Other contexts, such as multimedia annotation, are much harder to handle, and it turns out that crowdsourcing might be a great way to cut down both cost and time.
What is Amazon Mechanical Turk?
Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.
The platform provides useful tools to accurately describe your task, specify consensus rules, and set the amount you will spend for each item. Roughly, considering a $0.30 reward for each task and only one submission for each item, you could label a 1,000-record dataset for as little as $300 (plus fees) in a few hours. This might just be cheap, fast, and accurate enough.
If your task is particularly tough, you can raise the number of submissions per item to two and possibly lower the reward to $0.20, resulting in a total cost of $400 (plus fees), and so on until you find the best trade-off between quality and cost. As a general rule, one well-rewarded task usually brings more quality than two cheap ones.
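The arithmetic behind these two scenarios can be sketched in a few lines of Python. The function name and its parameters are illustrative assumptions, and `fee_rate` defaults to zero because the actual MTurk fee is not stated here.

```python
def labelling_cost(num_items, reward_per_task, submissions_per_item=1, fee_rate=0.0):
    """Estimate the total spend (in dollars) for labelling a dataset.

    fee_rate is a hypothetical fractional platform fee (0.0 = fees ignored).
    """
    rewards = num_items * submissions_per_item * reward_per_task
    return round(rewards * (1 + fee_rate), 2)


# One submission per item at $0.30 each.
single_pass = labelling_cost(1000, 0.30)

# Two submissions per item at $0.20 each.
double_pass = labelling_cost(1000, 0.20, submissions_per_item=2)

print(single_pass, double_pass)
```

Trying a few combinations of reward and submissions this way makes the quality/cost trade-off easier to compare before you publish anything.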
A real-world labelling example
Let’s consider a simple use case. Suppose you want to understand whether your website users have uploaded a good-looking profile picture or something else (e.g. an abstract avatar, a landscape, a group picture, etc.). This might make sense if your website is a hiring platform, or some kind of app where mutual trust and real human interactions are important elements. Of course there are plenty of “as-a-service” solutions out there that might also help you with this kind of project, but this is just an artificial example.
First of all, you’ll have to sign up on the official MTurk website and create a new project. The platform provides a useful set of preconfigured tasks. In our case we can select “Categorization”.
Then we need to create a list of possible categories, optionally containing sub-categories. For our classification problem a binary classifier (i.e. “good profile picture” or “bad profile picture”) is all we need, but since we are paying for the task we had better collect as much data as possible. I therefore defined a short list of categories, which leaves us free to decide afterwards which of them count as “good” and which as “bad”.
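Collapsing the finer-grained categories into a binary label afterwards could look like the sketch below. The category names are hypothetical, loosely based on the examples mentioned earlier, and are not the actual list used for the task.

```python
# Hypothetical category names; the real list depends on your task definition.
ALL_CATEGORIES = {"single_person", "group_picture", "landscape", "abstract_avatar"}

# Deciding which categories count as "good" can be changed at any time
# without paying for a new round of labelling.
GOOD_CATEGORIES = {"single_person"}


def to_binary_label(category, good_categories=GOOD_CATEGORIES):
    """Map a fine-grained category to the binary target used for training."""
    return "good" if category in good_categories else "bad"


binary_labels = {category: to_binary_label(category) for category in sorted(ALL_CATEGORIES)}
print(binary_labels)
```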
The next step is to describe your task and, optionally, provide additional information (like real examples or doubtful cases), so your workers will know what each category should include or exclude. The “general instructions” section is very important as well, as it should attract high-quality workers and accurately define the context, but preferably without being too verbose.
At the end of the task configuration phase, you can either upload a CSV file or use the Mechanical Turk API to provide the items to classify. In the case of images, you can only provide a public URL that will be served to a worker along with your task description and any additional fields you set as visible. I uploaded a simple CSV file with 3 rows, each one containing only a UserID (hidden) and an ImageURL.
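A file like this can be generated with Python's standard `csv` module. The sketch below is only illustrative: the `UserID` and `ImageURL` column names come from the description above, while the function name, user IDs, and example.com URLs are made up.

```python
import csv
import io


def build_batch_csv(records):
    """Return CSV text with one row per (user_id, image_url) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UserID", "ImageURL"])
    for user_id, image_url in records:
        writer.writerow([user_id, image_url])
    return buffer.getvalue()


# Hypothetical records; the image URLs must be publicly reachable.
csv_text = build_batch_csv([
    (101, "https://example.com/images/101.jpg"),
    (102, "https://example.com/images/102.jpg"),
    (103, "https://example.com/images/103.jpg"),
])
print(csv_text)
```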
Finally, you are shown a checkout preview where you can choose how much each single task will cost and how many times it should be processed to find consensus.
As soon as you confirm these options and proceed with the payment, your tasks will start being served until each record of your dataset is classified.
How to build a model from Mechanical Turk results
Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have achieved any consensus, so you could either improve your task instructions and resubmit them or, if the remaining dataset is still large and well distributed enough to generate a useful model, simply discard them.
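If you collect several submissions per item and want to resolve consensus yourself, a simple majority vote is one option. The sketch below is an assumption, not MTurk's own consensus logic; the function name, the `min_agreement` threshold, and the sample answers are hypothetical.

```python
from collections import Counter


def consensus_label(answers, min_agreement=0.5):
    """Return the majority answer, or None when agreement is too low."""
    if not answers:
        return None
    label, votes = Counter(answers).most_common(1)[0]
    # Require a strict majority by default, so a 1-1 split has no consensus.
    return label if votes / len(answers) > min_agreement else None


# Hypothetical worker answers keyed by UserID.
submissions = {
    101: ["good", "good", "bad"],
    102: ["good", "bad"],
    103: ["bad", "bad"],
}

resolved = {user_id: consensus_label(answers) for user_id, answers in submissions.items()}
labelled = {user_id: label for user_id, label in resolved.items() if label is not None}
unresolved = [user_id for user_id, label in resolved.items() if label is None]
print(labelled, unresolved)
```

The `unresolved` items are the ones you would either resubmit with clearer instructions or drop.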
Our next step will be to upload our labelled dataset into Amazon Machine Learning, create a DataSource, and go through the model training and evaluation phases.
But how do you classify images on Amazon Machine Learning?
Unfortunately, AmazonML doesn’t yet provide any high-level classification tools for multimedia objects like images, audio, or video. Hopefully they will add this kind of functionality soon, but until then you will have to take care of everything related to the feature extraction process. Of course you can’t just give AmazonML a public URL or a binary string, so you will need to add some complexity to your dataset.
Generally speaking, each multimedia classification problem might need different features depending on which kind of classification you are trying to achieve (e.g. is color important? maybe shapes are more relevant?). In our case, I would say that both color and shape matter, so we may decide to include features such as image dimensions, predominant colors, and corner and edge histograms.
Luckily, you don’t have to implement or know all these features yourself, as many helpful languages, libraries, and APIs, like NumPy, MATLAB, and R, are available to automatically extract useful (arrays of) numerical features. As soon as you have a real dataset full of features, Amazon Machine Learning will take care of the rest.
The tricky part you should keep in mind is that the very same feature extraction logic will have to be executed before each classification request and for each image, since your AmazonML model has been trained that way and will expect the same features at runtime. My suggestion would be to either implement the feature extraction functionality in the same language as your webapp (e.g. Python) or design it as a WebService/API, so that any component of your stack will be able to call it without worrying too much about the complex technology behind it.
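As a rough illustration of what such numerical features might look like, here is a minimal NumPy sketch. It assumes the image has already been decoded into a height x width x 3 array of 8-bit RGB values; the function name, the number of bins, and the choice of a gradient-based edge histogram are illustrative assumptions rather than a recommended feature set.

```python
import numpy as np


def extract_features(image, bins=4):
    """Turn an H x W x 3 uint8 RGB array into a flat numeric feature vector."""
    height, width = image.shape[:2]
    features = [float(height), float(width)]

    # Colour: one normalised histogram per RGB channel.
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.extend(hist / image[:, :, channel].size)

    # Shape: histogram of gradient magnitudes on the grayscale image,
    # used here as a crude stand-in for an edge histogram.
    gray = image.astype(np.float64).mean(axis=2)
    grad_rows, grad_cols = np.gradient(gray)
    magnitude = np.hypot(grad_rows, grad_cols)
    max_magnitude = 255.0 * np.sqrt(2.0)
    edge_hist, _ = np.histogram(magnitude, bins=bins, range=(0.0, max_magnitude))
    features.extend(edge_hist / magnitude.size)

    return np.asarray(features, dtype=np.float64)


# A synthetic all-black 8 x 6 image stands in for a decoded profile picture.
example = extract_features(np.zeros((8, 6, 3), dtype=np.uint8))
print(example.shape)
```

With `bins=4` the vector has 2 + 3 * 4 + 4 = 18 entries, each of which would become one numeric column in the dataset.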
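One way to keep training and prediction consistent is to route both through a single function. The sketch below assumes that both the training data and a prediction request can be represented as a flat mapping of attribute names to string values; all names, including the toy extractor, are hypothetical.

```python
def to_record(features):
    """Name each feature positionally and serialise it as a string."""
    return {f"f{index}": str(float(value)) for index, value in enumerate(features)}


def training_row(image, label, extract):
    """Build one labelled row for the training dataset."""
    row = to_record(extract(image))
    row["label"] = label
    return row


def prediction_record(image, extract):
    """Build the record sent with a classification request."""
    return to_record(extract(image))


def toy_extract(image):
    """Hypothetical stand-in for the real feature extraction logic."""
    return [len(image), sum(image) / len(image)]


image = [10, 20, 30]
row = training_row(image, "good", toy_extract)
record = prediction_record(image, toy_extract)
print(row, record)
```

Because both paths call the same `extract` function, any change to the feature logic applies to training and to runtime requests at the same time.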
Besides the complexity of multimedia classification, which will hopefully be addressed by AWS soon, I think that Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.
Other solutions could involve unsupervised learning techniques, such as clustering and neural networks, which are pretty good at identifying patterns and structures in unlabelled data. However, for most tasks they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade-off between cost, complexity, and speed.