Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk

Published 2009 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Manual evaluation of translation quality is generally thought to be excessively time-consuming and expensive. We explore a fast and inexpensive way of doing it using Amazon's Mechanical Turk to pay small sums to a large number of non-expert annotators. For $10 we redundantly recreate judgments from a WMT08 translation task. We find that, when combined, non-expert judgments have a high level of agreement with the existing gold-standard judgments of machine translation quality, and correlate more strongly with expert judgments than Bleu does. We go on to show that Mechanical Turk can be used to calculate human-mediated translation edit rate (HTER), to conduct reading comprehension experiments with machine translation, and to create high-quality reference translations.

PUBLICATION RECORD

Publication date
2009-08-06
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.3115/1699510.1699548
Source metadata
Semantic Scholar

REFERENCES

Findings of the 2009 Workshop on Statistical Machine Translation
2009, influential reference
Feasibility of Human-in-the-loop Minimum Error Rate Training
2009, cited by this paper
Further Meta-Evaluation of Machine Translation
2008, cited by this paper
Decomposability of Translation Metrics for Improved Evaluation and Efficient Algorithms
2008, cited by this paper
Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
2008, cited by this paper
Combining Outputs from Multiple Machine Translation Systems
2007, cited by this paper
A Study of Translation Edit Rate with Targeted Human Annotation
2006, cited by this paper
Computing Consensus Translation for Multiple Machine Translation Systems Using Enhanced Hypothesis Alignment
2006, cited by this paper
Overview of the IWSLT06 evaluation campaign
2006, cited by this paper
Re-evaluating the Role of Bleu in Machine Translation Research
2006, cited by this paper
Measuring Translation Quality by Testing English Speakers with a New Defense Language Proficiency Test for Arabic
2005, cited by this paper
A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION
2005, cited by this paper
Extending the BLEU MT Evaluation Method with Frequency Weightings
2004, cited by this paper
A New Monotonic and Clone-Independent Single-Winner Election Method
2003, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, cited by this paper

CITED BY

Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study
2025, cites this paper
Assessing writing quality using crowdsourced non-expert comparative judgement ratings
2025, cites this paper
Mapping the Landscape of Abusive Content Detection in Social Networks: A Comprehensive and Scientometric Analysis
2025, cites this paper
LLMs outperform outsourced human coders on complex textual analysis
2025, cites this paper
Bridging Perceptual Gaps in Food NLP: A Structured Approach Using Sensory Anchors
2025, cites this paper
Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
2025, cites this paper
AraEyebility: Eye-Tracking Data for Arabic Text Readability
2025, cites this paper
Survey for Landing Generative AI in Social and E-commerce Recsys - the Industry Perspectives
2024, cites this paper
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
2024, cites this paper
RETRACTED: A quantitative evaluation of online translators using Hindi web queries
2024, cites this paper
Better Synthetic Data by Retrieving and Transforming Existing Datasets
2024, cites this paper
MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation
2024, cites this paper
Crowdsourcing lexical diversity
2024, cites this paper
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
2024, cites this paper
Belief Miner: A Methodology for Discovering Causal Beliefs and Causal Illusions from General Populations
2024, cites this paper
LLM Harmony: Multi-Agent Communication for Problem Solving
2024, cites this paper
Evaluation Briefs: Drawing on Translation Studies for Human Evaluation of MT
2024, cites this paper
Public opinion evaluation on social media platforms: a case study of High Speed 2 (HS2) rail infrastructure project
2023, cites this paper
LOUC: Leave-One-Out-Calibration Measure for Analyzing Human Matcher Performance
2023, cites this paper
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms
2023, cites this paper
Selecting the Optimal Number of Crowd Workers for Forecasting Tasks
2023, cites this paper
The Dark Side of Recruitment in Crowdsourcing: Ethics and Transparency in Micro-Task Marketplaces
2023, cites this paper
Tragedy of the Commons in Crowd Work-Based Research
2023, cites this paper
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
2023, cites this paper
Text Style Transfer Evaluation Using Large Language Models
2023, cites this paper
The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation
2023, cites this paper
Angler: Helping Machine Translation Practitioners Prioritize Model Improvements
2023, cites this paper
Cura: Curation at Social Media Scale
2023, cites this paper
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
2023, cites this paper
Explaining immigrant threat perceptions and pro‐immigrant collective action intentions through issue‐specific moral conviction and general need for closure: The case of the US–Mexico border wall
2022, cites this paper
Improving Label Quality by Jointly Modeling Items and Annotators
2022, cites this paper
Efficient and adaptive incentive selection for crowdsourcing contests
2022, cites this paper
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
2022, cites this paper
A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
2022, cites this paper
PostMe: Unsupervised Dynamic Microtask Posting For Efficient and Reliable Crowdsourcing
2022, cites this paper
How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
2022, cites this paper
The original sin of crowd work for human subjects research
2022, cites this paper
HumanAL: calibrating human matching beyond a single task
2022, cites this paper
Reliance and Automation for Human-AI Collaborative Data Labeling Conflict Resolution
2022, cites this paper
Crowdsourcing Formulaic
2022, cites this paper
Crowdsourced Adaptive Comparative Judgment: A Community‐Based Solution for Proficiency Rating
2022, cites this paper
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
2022, cites this paper
Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature
2022, cites this paper
What scholars and IRBs talk when they talk about the Belmont principles in crowd work‐based research
2022, cites this paper
The Effects of Feedback and Goal on the Quality of Crowdsourcing Tasks
2021, cites this paper
The Hive Mind at Work: Crowdsourcing E-Tourism Research
2021, cites this paper
EasyTurk: A User-Friendly Interface for High-Quality Linguistic Annotation with Amazon Mechanical Turk
2021, cites this paper
Better Crowdcoding: Strategies for Promoting Accuracy in Crowdsourced Content Analysis
2021, cites this paper
The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality
2021, cites this paper
Improving Human Text Simplification with Sentence Fusion
2021, cites this paper
Towards Hybrid Human-Machine Workflow for Natural Language Generation
2021, cites this paper
Can We Stop the Spread of False Information on Vaccination? How Online Comments on Vaccination News Affect Readers’ Credibility Assessments and Sharing Behaviors
2021, cites this paper
Emerging trends: A gentle introduction to fine-tuning
2021, cites this paper
Language Translation as a Socio-Technical System: Case-Studies of Mixed-Initiative Interactions
2021, cites this paper
Cost Effective Annotation Framework Using Zero-Shot Text Classification
2021, cites this paper
Improving Label Quality by Jointly Modeling Items and Annotators
2021, cites this paper
A review and experimental analysis of active learning over crowdsourced data
2021, cites this paper
Diversity in sociotechnical machine learning systems
2021, cites this paper
Training Strategies for Promoting Accuracy in Crowdsourced Content Analysis
2021, cites this paper
Performance Comparison of Bootstrapped Statistical Taggers on Urdu Tweets
2021, cites this paper
Bootstrapping Dependency Treebank of Urdu Noisy Text
2021, cites this paper
From collection curation to knowledge creation: Exploring new roles of academic librarians in digital humanities research
2021, cites this paper
LMTurk: Few-Shot Learners as Crowdsourcing Workers
2021, cites this paper
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
2021, cites this paper
Rethinking Crowdsourcing Annotation: Partial Annotation With Salient Labels for Multilabel Aerial Image Classification
2021, cites this paper
Modeling Performance in Open-Domain Dialogue with PARADISE
2021, cites this paper
Crowdsourcing formulaic phrases: towards a new type of spoken corpus
2020, cites this paper
Educational Content Linking for Enhancing Learning Need Remediation in MOOCs
2020, cites this paper
Human-in-the-Loop Learning From Crowdsourcing and Social Media
2020, influential citation
Learning to Characterize Matching Experts
2020, cites this paper
Machine translation models for Cantonese-English translation
2020, cites this paper
Find truth in the hands of the few: acquiring specific knowledge with crowdsourcing
2020, cites this paper
A crowdsourcing approach to construct mono-lingual plagiarism detection corpus
2020, cites this paper
A Survey of Evaluation Metrics Used for NLG Systems
2020, cites this paper
Learning from Human Feedback: Challenges for Real-World Reinforcement Learning in NLP
2020, cites this paper
BLEU Might Be Guilty but References Are Not Innocent
2020, cites this paper
Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization
2020, cites this paper
Information retrieval: a view from the Chinese IR community
2020, cites this paper
Image annotation: the effects of content, lexicon and annotation method
2020, cites this paper
Identifying and Modeling Code-Switched Language
2020, cites this paper
Offline Reinforcement Learning from Human Feedback in Real-World Sequence-to-Sequence Tasks
2020, cites this paper
GPM: A Generic Probabilistic Model to Recover Annotator's Behavior and Ground Truth Labeling
2020, cites this paper
A Set of Recommendations for Assessing Human-Machine Parity in Language Translation
2020, influential citation
Bootstrapping a Crosslingual Semantic Parser
2020, cites this paper
Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
2020, cites this paper
Automatic identification of eyewitness messages on twitter during disasters
2020, cites this paper
Crowdsourcing versus the laboratory: Towards crowd-based linguistic text quality assessment of query-based extractive summarization
2020, cites this paper
Egoistic and altruistic motivation: How to induce users' willingness to help for imperfect AI
2019, cites this paper
Twitter Job/Employment Corpus: A Dataset of Job-Related Discourse Built with Humans in the Loop
2019, cites this paper
"You Can Do It!" - Crowdsourcing Motivational Speech and Text Messages
2019, cites this paper
Crowdsourcing Images for Global Diversity
2019, cites this paper
Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses
2019, cites this paper
Learning to Predict Population-Level Label Distributions
2019, cites this paper
Japanese grammatical simplification with simplified corpus
2019, cites this paper
On the evaluation and selection of classifier learning algorithms with crowdsourced data
2019, cites this paper
Automatically Neutralizing Subjective Bias in Text
2019, cites this paper
An exploratory design science study on theory testing using crowdsourcing
2019, cites this paper
How to evaluate machine translation: A review of automated and human metrics
2019, influential citation
Pursuing Actionable Interpretations of Non-Literal Language
2019, cites this paper
The Practice of Crowdsourcing
2019, cites this paper