The Language Demographics of Amazon Mechanical Turk

Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, Chris Callison-Burch

Published 2014 in Transactions of the Association for Computational Linguistics

ABSTRACT

We present a large-scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.

PUBLICATION RECORD

Publication year: 2014
Venue: Transactions of the Association for Computational Linguistics
Publication date: 2014-02-28
Fields of study: Sociology, Linguistics, Computer Science
Identifiers: DOI 10.1162/tacl_a_00167
Source metadata: Semantic Scholar
