
How to Make the Most of Mechanical Turk

Russell Smith, Thursday October 12, 2017

Amazon’s Mechanical Turk, or MTurk, is a powerful, under-appreciated platform that allows you to allocate work to humans programmatically and at scale. Businesses get access to a vast, scalable workforce, and workers can select from a variety of tasks whenever they want to work. MTurk can get work done very quickly, with tasks performed in parallel by a multitude of workers. Part of its power lies in the fact that, while it’s programmable, tasks are written in plain English, meaning almost anyone can use it.

The use cases are almost infinite. If work can be sent and returned electronically, it can be performed through Mechanical Turk. Lately, companies have been using it to prepare data for machine learning and data science — think of tagging objects in images or cleaning data. Other uses include transcribing audio, translating text, extracting data from documents, and searching for information, such as phone numbers for all the restaurants in Seattle. MTurk can save you having to buy the data or tie up someone in your organization to find it, which is often a lot slower.

Where does the name come from? The Mechanical Turk was an 18th-century machine that purported to be a chess-playing automaton but was actually an elaborate hoax: a human chess master was hiding inside. Amazon Mechanical Turk is AWS’s modern equivalent, an API with an army of humans behind it. It was launched in 2005, making it one of Amazon’s oldest cloud services, though it’s surprisingly little known.

Mechanical Turk is a highly efficient, cost-effective service that enables you to outsource work 24x7. We’ve been using MTurk at my company for about six years to run software QA tests for our customers. It allows us to automate the unautomatable: we can run tests that require a human to perform them, like clicking a button on a website to make sure it works. And MTurk is superfast. Our tasks are typically completed in under 10 minutes.

How Mechanical Turk works

MTurk is a two-sided marketplace that connects workers and “requesters” via a web interface. As a requester, you just create an account and choose from a selection of templated tasks, such as tagging images, that make it easy to get started. You can also build your own tasks from scratch. The service allows you to drill down to the worker level to see who performed your task, and you can send messages telling them they did a good job or explaining how they can do it better next time.

Workers visit mturk.com to see a description of the tasks available and the payment offered. They can preview a task to see if they want to do it, and if not send it back to the pool for someone else. If they accept a task and don’t complete it in the allocated time, that too goes back to the pool.

Task design is critical. If you want good results, tasks need to be designed well and have clear instructions. Workers talk among themselves on message boards, and it’s easy to get a bad reputation if your tasks are consistently hard to perform, so it’s worth getting this part right.

My rules for task design:

  • Provide clear instructions. Tasks don’t need to be simple, but they need to be easily understood. Ask a colleague to check your task to ensure it’s clear.
  • Include a feedback field so workers can tell you if a task takes too long or is too difficult.
  • Take steps to guard against errors or fraud. One way is to throw in a few tests you know the answers to, to make sure they’re being done properly. We also build random sampling into our system.
  • Offer a fair price. This is tricky (see below).
  • Workers love to get into a groove, so give them the option to check the “Automatically go to the next task” button. This allows them to perform tasks quickly in succession, without going back to the pool each time.
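The error-guarding rule above can be sketched in a few lines. This is a minimal illustration, not our production system: the function names, the dictionary layout, and the 5 percent review rate are all assumptions made for the example.

```python
import random


def build_batch(real_items, gold_items, gold_count, rng):
    """Mix a few known-answer ("gold") items into a batch of real work."""
    chosen_gold = rng.sample(gold_items, gold_count)
    batch = [{"item": item, "gold": False} for item in real_items]
    batch += [{"item": item, "gold": True} for item in chosen_gold]
    # Shuffle so workers cannot spot the gold items by their position.
    rng.shuffle(batch)
    return batch


def sample_for_review(completed_ids, rate, rng):
    """Pick a random fraction of completed tasks for a manual spot check."""
    if not completed_ids:
        return []
    count = max(1, round(len(completed_ids) * rate))
    return rng.sample(completed_ids, count)


rng = random.Random(0)  # fixed seed only to make the example repeatable
batch = build_batch(["img1", "img2", "img3"], ["gold1", "gold2", "gold3"], 2, rng)
to_review = sample_for_review(list(range(100)), 0.05, rng)
```

Gold items catch careless or fraudulent answers immediately, while the random sample catches problems in tasks where no answer is known in advance.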

How do you set the right price? This is super-important, especially if you’re submitting high volumes of work. The payment can’t be higher than you can reasonably afford or too low to be worthwhile for the worker. Iterate in the following manner until you get it right.

  • Start by working out your budget for the smallest unit of work you’ll submit.
  • Do a small test run to verify that your pricing model works. Are enough people accepting tasks, and are the tasks being performed quickly enough?
  • Tweak as necessary, and repeat until you get it right.
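The first step is plain arithmetic, and a small helper keeps it honest. The sketch below assumes you start from a target hourly rate and a measured time per task; `fee_rate` is a placeholder for whatever commission MTurk currently charges, so check the pricing page rather than trusting the number used here.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def reward_per_task(target_hourly_rate, seconds_per_task):
    """Reward that pays the target hourly rate for a task of this length."""
    reward = Decimal(str(target_hourly_rate)) * Decimal(seconds_per_task) / Decimal(3600)
    return reward.quantize(CENT, rounding=ROUND_HALF_UP)


def batch_cost(reward, tasks, fee_rate):
    """Total spend for a batch, including the marketplace fee (assumed rate)."""
    total = reward * tasks * (1 + Decimal(str(fee_rate)))
    return total.quantize(CENT, rounding=ROUND_HALF_UP)


reward = reward_per_task(9, 40)          # $9.00 per hour, 40 seconds per task
cost = batch_cost(reward, 500, "0.20")   # 500 tasks, hypothetical 20% fee
```

Run the small test batch at this reward, measure acceptance and completion time, then adjust the hourly target and recompute.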

Mechanical Turk is people

Some people treat the workers on MTurk like an API. Remember that they’re human beings and treat them accordingly. Be fair, transparent, and communicative. Like most workers, they’re motivated by payment, status, and pride in what they do. Don’t hesitate to send an email telling them when they’ve done a good job, and you can even pay them a bonus. If you build a relationship with workers, they’re more likely to choose your work and perform it well. Most workers are aged 25 to 35, but there are older and younger people too.

Workers live in all regions. This means they don’t all have English as a first language, which is another reason tasks need to be clear. It also means there are workers online at all times, so you can get work done whenever you need it.

MTurk offers two special types of worker that you might choose to pay a little extra for. Master workers are those who have “demonstrated excellence” across a range of tasks as determined by Amazon’s statistical monitoring. You’re basically paying extra for what Amazon considers a premium worker.

Workers with Premium Qualifications have specific attributes you can request, such as living in a certain country or having an Android phone or an iPhone.

Retention is key. If you plan to use the service a lot, retention is important for your success. You don’t want to waste time training workers on your tasks only to have them leave, and it can take time for them to get up to speed and be efficient.

The workers form communities. Identify the leaders in your community; they’re your ear to the ground.

How to manage MTurk workers

We train our workers with simple YouTube videos created by our company, and we constantly retrain them, such as when they leave the platform for a month and come back. Here are some tips for enabling workers:

  • Workers love to train by videos, because it means they don’t have to read through documents. We’ve made more than a half-dozen videos describing our tasks.
  • Use an online forum or email to ask workers what they need, or what you can do better. We built our own community forum.
  • Listen to complaints. This has saved us some big headaches. We recently added a comment box at the end of our tasks.
  • Each of our customers’ tasks has its own Net Promoter Score, so we can tell them which tasks workers don’t enjoy doing and how to improve them.
  • Join external forums, such as the MTurk subreddit and Turker Nation, where you can sometimes find feedback about your tasks. Remember, your reputation matters, and workers will talk about you.

When a worker meets our qualifications for a task, instead of being immediately assigned work, we redirect them to our training course, which is automated to make it efficient. We train them on our tasks, then test them to make sure they were paying attention. If they stop working for a few months, we automatically assign them for training again. We’ll sometimes do very specific training for certain Rainforest customer tasks.
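The routing decision described here (train first, retrain after a long absence) might look something like the following. The field names and the 90-day threshold are hypothetical; the real cutoff is whatever suits your tasks.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: send workers back to training after this much inactivity.
RETRAIN_AFTER = timedelta(days=90)


def next_step(worker, now):
    """Decide whether a qualified worker goes to training or straight to work."""
    if not worker.get("passed_training"):
        return "training"
    if now - worker["last_active"] > RETRAIN_AFTER:
        return "training"
    return "work"


now = datetime(2017, 10, 12)
regular = {"passed_training": True, "last_active": now - timedelta(days=3)}
returning = {"passed_training": True, "last_active": now - timedelta(days=120)}
newcomer = {"passed_training": False}
steps = [next_step(w, now) for w in (regular, returning, newcomer)]
```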

How should you handle workers who aren’t doing tasks well? Start by emailing them. Tell them what they can do better and help them improve. If they can’t improve, you can reject them for tasks via an API or the MTurk interface. You can also use qualifications as a nice way to “soft-ban” workers, i.e. add a qualification you know will make a particular worker ineligible.
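One way to set up such a soft ban is to create a custom "excluded" qualification, assign it to the worker, and require on your HITs that it does not exist. The sketch below builds a requirement in the dictionary shape the MTurk API accepts, as far as I know, and simulates the eligibility check locally. The qualification ID is a made-up placeholder, and `is_eligible` is only an illustration of the logic MTurk applies on its side.

```python
def soft_ban_requirement(qualification_type_id):
    """Requirement that hides a HIT from anyone holding the given qualification."""
    return {
        "QualificationTypeId": qualification_type_id,
        "Comparator": "DoesNotExist",
        # Banned workers should not even see the HIT in search results.
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }


def is_eligible(worker_qualification_ids, requirement):
    """Local simulation of the check for the two existence comparators."""
    held = requirement["QualificationTypeId"] in worker_qualification_ids
    comparator = requirement["Comparator"]
    if comparator == "DoesNotExist":
        return not held
    if comparator == "Exists":
        return held
    raise ValueError(f"comparator not handled in this sketch: {comparator}")


EXCLUDED = "3EXAMPLEEXCLUDEDQUALID"  # hypothetical qualification type ID
requirement = soft_ban_requirement(EXCLUDED)
```

Unlike a block, this leaves no mark on the worker’s record, and removing the qualification later restores access.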

It is also possible to block a worker, but it is not an option to be taken lightly. Blocking will likely affect the worker’s reputation, and potentially yours, so be careful how you use it. We’ve blocked very few people over the years.

How MTurk tasks work

Amazon describes tasks as Human Intelligence Tasks, or HITs. A HITType is the highest-level building block, where you store the title, description, reward, and qualifications for a task. Each task becomes a HIT, or the individual thing you want done.

After a worker accepts a HIT, MTurk creates an Assignment to track that task through to completion. Notifications are sent over HTTP or SQS to tell you when the state changes on tasks. These are optional unless you’re doing integrations or other real time work where you need to know the status of tasks.
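If you do consume notifications, the handler is usually a small dispatcher on the event type. The sketch below assumes the message body is JSON with an `Events` list whose entries carry `EventType`, `HITId`, and `AssignmentId`; that shape and the event names are my reading of the MTurk notification format, so verify them against a real message before relying on them.

```python
import json

# Event names assumed from the MTurk notification documentation.
SUBMITTED = {"AssignmentSubmitted"}
BACK_IN_POOL = {"AssignmentReturned", "AssignmentAbandoned"}


def handle_notification(body, on_submitted, on_requeue):
    """Route each event in one notification message to the matching callback."""
    message = json.loads(body)
    for event in message.get("Events", []):
        kind = event.get("EventType")
        if kind in SUBMITTED:
            on_submitted(event["HITId"], event["AssignmentId"])
        elif kind in BACK_IN_POOL:
            on_requeue(event["HITId"])
        # Any other event type is ignored in this sketch.


submitted, requeued = [], []
sample = json.dumps({"Events": [
    {"EventType": "AssignmentSubmitted", "HITId": "H1", "AssignmentId": "A1"},
    {"EventType": "AssignmentReturned", "HITId": "H2", "AssignmentId": "A2"},
]})
handle_notification(sample, lambda h, a: submitted.append((h, a)), requeued.append)
```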

You can start off using the Developer Sandbox to experiment with creating and responding to HITs without actually spending money. Useful API operations you might need:

  • CreateHIT creates new tasks on MTurk
  • ListHITs retrieves a list of all of your HITs at any time
  • GetAccountBalance shows how much credit you have in the system for paying workers; check it regularly
  • RevokeQualification and GrantQualification modify the qualifications assigned to workers
  • ForceExpireHIT cancels a job if you don’t want it done
  • NotifyWorkers sends a message to your workers via their workerID
  • GrantBonus sends a bonus directly to a worker for a particular assignment (we use this extensively)
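Two of these operations are sketched below. The functions take an already-built client, so the logic can be tried with a stand-in object before touching a real account. The method names assume the current boto3 MTurk client, where, to my knowledge, GetAccountBalance is `get_account_balance` and GrantBonus has become `send_bonus`. The helper names and the request-token scheme are my own.

```python
from decimal import Decimal

# With boto3 installed, a sandbox client would be created roughly like this:
#   boto3.client("mturk", region_name="us-east-1",
#                endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com")


def has_funds(client, reward, assignments):
    """Check that the balance covers the rewards for a planned batch (fees excluded)."""
    balance = Decimal(client.get_account_balance()["AvailableBalance"])
    return balance >= Decimal(reward) * assignments


def pay_bonus(client, worker_id, assignment_id, amount, reason):
    """Send a bonus for one assignment, formatted as a two-decimal string."""
    client.send_bonus(
        WorkerId=worker_id,
        AssignmentId=assignment_id,
        BonusAmount=str(Decimal(amount).quantize(Decimal("0.01"))),
        Reason=reason,
        # Same token on a retry, so a network hiccup does not pay twice.
        UniqueRequestToken=f"bonus-{assignment_id}",
    )
```

Passing the client in keeps the sandbox-versus-production choice in one place, at the point where the client is created.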

MTurk offers templates for common uses of the site, such as image tagging. For other tasks, you can use the question types. QuestionForm is the simplest to create. It’s an XML-based form that you can customize with your task. HTMLQuestion allows for a bit more customization using HTML.

There is also an ExternalQuestion type. While the previous question types are hosted by AWS, an ExternalQuestion is hosted on your own server. MTurk shows it to the worker in an iframe, so you can embed whatever content you like. And because it’s hosted on your own server, you can change the order in which tasks are distributed among workers, allowing you to prioritize tasks.
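An ExternalQuestion is just a short XML document that points at your URL. Here is a minimal sketch of building one; the task URL is a placeholder, and the 600-pixel frame height is an arbitrary default.

```python
from xml.sax.saxutils import escape

EXTERNAL_QUESTION_NS = (
    "http://mechanicalturk.amazonaws.com/"
    "AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd"
)


def external_question(url, frame_height=600):
    """Build the XML passed as the question when creating a HIT."""
    return (
        f'<ExternalQuestion xmlns="{EXTERNAL_QUESTION_NS}">'
        f"<ExternalURL>{escape(url)}</ExternalURL>"  # escape & and < in the URL
        f"<FrameHeight>{int(frame_height)}</FrameHeight>"
        "</ExternalQuestion>"
    )


# Hypothetical task endpoint on your own server.
question_xml = external_question("https://tasks.example.com/next?queue=qa&v=2")
```

Because the URL can point at a generic "next task" endpoint rather than a specific job, your server decides what each worker sees at the moment they open the HIT.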

Review policies allow you to evaluate the work performed against a defined set of criteria, which helps identify work that’s not being done properly. There are two types of policies: Assignment-level and HIT-level.

Assignment-level policies can validate responses against known answers. You specify which questions in your HIT have a known answer, and the policy can reject the assignment when more than a certain number of those answers are incorrect.
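The known-answer check is easy to reproduce locally, which is useful for testing thresholds before configuring a policy. This is a sketch of the idea only, not MTurk’s own implementation; the question IDs and the `max_wrong` parameter are invented for the example.

```python
def score_known_answers(answers, answer_key):
    """Count how many known-answer questions the worker got right."""
    correct = sum(
        1 for question, expected in answer_key.items()
        if answers.get(question) == expected
    )
    return correct, len(answer_key)


def should_reject(answers, answer_key, max_wrong):
    """Reject when more than max_wrong known answers are incorrect or missing."""
    correct, total = score_known_answers(answers, answer_key)
    return (total - correct) > max_wrong


answer_key = {"q2": "yes", "q5": "red"}                  # questions with known answers
submission = {"q1": "maybe", "q2": "yes", "q5": "blue"}  # one known answer wrong
reject = should_reject(submission, answer_key, max_wrong=0)
```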

HIT-level policies look for consensus among workers on each HIT. You can automatically compare answers to detect if there’s a majority or consensus answer. You can then optionally reject assignments that don’t match the consensus.
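The consensus check can be sketched as a simple majority vote over the answers for one HIT. The worker IDs are placeholders, and the strict-majority rule is an assumption made for the example; a real policy may use a different agreement threshold.

```python
from collections import Counter


def find_consensus(answers_by_worker):
    """Return (majority_answer, dissenting_workers), or (None, []) without a majority."""
    if not answers_by_worker:
        return None, []
    counts = Counter(answers_by_worker.values())
    answer, votes = counts.most_common(1)[0]
    # Require a strict majority, so a tie never counts as consensus.
    if votes * 2 <= len(answers_by_worker):
        return None, []
    dissenters = sorted(w for w, a in answers_by_worker.items() if a != answer)
    return answer, dissenters


answer, dissenters = find_consensus({"W1": "cat", "W2": "cat", "W3": "dog"})
```

Rejecting the dissenters automatically is optional, and it is worth reviewing a few by hand first: on an ambiguous task, the minority answer is sometimes the right one.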

Because we run our business on MTurk, we also use a machine-learning-backed system to perform additional review beyond these policies and ensure testers are giving us the right answers.

Scaling Mechanical Turk

Publishing HITs in batch form can save a lot of time if you’re submitting numerous HITs of the same type. If you want workers to tag the objects in 1,000 images, for example, you could upload them together in a CSV file and MTurk will automatically create a separate HIT for each image, so they can be done in parallel.
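Conceptually, a batch upload expands one task template once per CSV row. The sketch below mimics that expansion locally, which is handy for checking a file before uploading it. The column name `image_url` and the template text are assumptions for the example.

```python
import csv
import io
from string import Template


def tasks_from_csv(csv_text, template):
    """Fill the template once per CSV row, giving one task description per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # substitute() raises KeyError if the template names a column the file lacks.
    return [Template(template).substitute(row) for row in reader]


csv_text = "image_url\nhttps://example.com/1.jpg\nhttps://example.com/2.jpg\n"
tasks = tasks_from_csv(csv_text, "Tag every object you can see in ${image_url}")
```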

This will be fine for most people, but it doesn’t allow you to prioritize the order in which tasks are performed. Using the ExternalQuestion format, where work is queued up on our own servers, we decouple our HITs from the actual assignments that MTurk distributes to workers. That means we can prioritize jobs if they need immediate attention, by changing the order in which our tasks are pushed into MTurk. This also makes it easier to cancel HITs if our customers decide to cancel a job they’ve given us.
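The server-side queue behind such a setup can be as simple as a priority heap with lazy cancellation. This is a minimal sketch of the idea, not our actual system: the class name, the priority scale (lower number served first), and the job IDs are all illustrative, and job IDs are assumed to be unique.

```python
import heapq
import itertools


class WorkQueue:
    """Hands out the most urgent queued job whenever a worker opens a HIT."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps equal priorities FIFO
        self._cancelled = set()

    def add(self, job_id, priority=10):
        heapq.heappush(self._heap, (priority, next(self._order), job_id))

    def cancel(self, job_id):
        # Lazy cancellation: the entry is skipped when it reaches the front.
        self._cancelled.add(job_id)

    def next_job(self):
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id in self._cancelled:
                self._cancelled.discard(job_id)
                continue
            return job_id
        return None


queue = WorkQueue()
queue.add("nightly-regression")
queue.add("customer-cancelled-run")
queue.add("urgent-hotfix-check", priority=1)
queue.cancel("customer-cancelled-run")
served = [queue.next_job(), queue.next_job(), queue.next_job()]
```

The HITs on MTurk stay generic, so neither reprioritizing nor cancelling requires touching anything already published.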

We want our tasks to be performed as quickly as possible, but if there’s a break in the work and our workers are offline or inactive for a while, it can take a few minutes to bring them back online when work comes in. To get around this, we list HITs on MTurk even when we don’t have any real work that needs to be performed. Instead, we give workers training videos to watch or we do repeat tests of our own software. That means when real work comes in, our workers are online already and they can get started on the real tasks in a matter of seconds. This obviously won’t be important to everyone, but it’s an incredibly useful trick if you need low latency work.
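The keep-warm trick comes down to one decision each time a worker asks for a task: serve real work if any exists, otherwise serve filler. A minimal sketch, with made-up task names:

```python
import itertools
from collections import deque


def pick_task(real_work, filler_cycle):
    """Serve real work first; fall back to filler so workers stay online."""
    if real_work:
        return "real", real_work.popleft()
    return "filler", next(filler_cycle)


real_work = deque(["customer-test-17"])
# Hypothetical filler: training videos and repeat tests of our own software.
filler = itertools.cycle(["training-video-3", "self-test-login-page"])
first = pick_task(real_work, filler)
second = pick_task(real_work, filler)
```

Filler tasks still cost money, so the trade-off is paying for idle-time work in exchange for a start time measured in seconds.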

Mechanical Turk is a powerful service for all kinds of work, and it has become even more useful given the rise of machine learning and the need to clean and prepare large datasets. The work is done by humans, so we get human results: Someone can actually describe a specific outcome they experienced, so we don’t have to dive into data to figure out what happened. But because they’re humans, you need to treat them nicely and train them well. If you do, you won’t be surprised to find they can be really effective.

Reprinted with permission. ©IDG Communications, Inc., 2017. All rights reserved. https://www.infoworld.com/article/3217674/cloud-computing/how-to-make-the-most-of-mechanical-turk.html

Filed under: mturk and crowdsourced testing