(Meta-) Evaluation of Machine Translation
Chris Callison-Burch, C. Fordyce, Philipp Koehn, Christof Monz, Josh Schroeder
Published 2007 in WMT@ACL

ABSTRACT
This paper evaluates the translation quality of machine translation systems for 8 language pairs: French, German, Spanish, and Czech translated to and from English. We carried out an extensive human evaluation that allowed us not only to rank the different MT systems but also to perform a higher-level analysis of the evaluation process itself. We measured timing and intra- and inter-annotator agreement for three types of subjective evaluation, and we measured the correlation of automatic evaluation metrics with human judgments. This meta-evaluation reveals surprising facts about the most commonly used methodologies.
PUBLICATION RECORD
- Publication date: 2007-06-23
- Venue: WMT@ACL
- Fields of study: Linguistics, Computer Science