Achieving Accurate Conclusions in Evaluation of Automatic Machine Translation Metrics

Yvette Graham, Qun Liu

Published 2016 in the North American Chapter of the Association for Computational Linguistics (NAACL)

ABSTRACT

Automatic Machine Translation metrics, such as BLEU, are widely used in empirical evaluation as a substitute for human assessment. The performance of a given metric is accordingly measured by the strength of its correlation with human judgment. When a newly proposed metric achieves a stronger correlation than that of a baseline, it is important to take into account the uncertainty inherent in correlation point estimates before concluding that metric performance has improved. Confidence intervals for correlations with human judgment are rarely reported in metric evaluations, however, and when they have been reported, the most suitable methods have unfortunately not been applied. For example, incorrect assumptions about correlation sampling distributions made in past evaluations risk over-estimation of significant differences in metric performance. In this paper, we analyze each of the issues that may lead to inaccurate conclusions before detailing a method that overcomes these challenges. In addition, we propose a new method of translation sampling that, in contrast, achieves genuinely high conclusivity in evaluation of the relative performance of metrics.
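
The statistical machinery the abstract refers to can be made concrete with a short sketch. The interval estimate below uses the standard Fisher z-transformation for a correlation point estimate, and the significance test is the Williams test for a difference between two dependent correlations that share the human-judgment variable, which the authors advocate in related work. The function names, toy correlation values, and sample size are illustrative assumptions, not values from the paper.

    # Minimal sketch (not the authors' released code): a Fisher-z confidence
    # interval for a correlation point estimate, and the Williams test for
    # the significance of a difference between two dependent correlations.
    import numpy as np
    from scipy import stats

    def fisher_ci(r, n, conf=0.95):
        """Confidence interval for Pearson's r via the Fisher z-transform."""
        z = np.arctanh(r)                      # variance-stabilising transform
        se = 1.0 / np.sqrt(n - 3)              # approximate standard error of z
        half = stats.norm.ppf(0.5 + conf / 2) * se
        return np.tanh(z - half), np.tanh(z + half)

    def williams_test(r12, r13, r23, n):
        """One-tailed Williams test: is r12 (metric A vs. human) significantly
        greater than r13 (metric B vs. human), given r23 (A vs. B) and n items?"""
        k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        denom = 2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
        t = (r12 - r13) * np.sqrt((n - 1) * (1 + r23) / denom)
        return stats.t.sf(t, df=n - 3)         # one-tailed p-value

    # Toy usage with made-up correlations over n = 560 translations:
    print(fisher_ci(r=0.60, n=560))
    print(williams_test(r12=0.60, r13=0.55, r23=0.85, n=560))

Because competing metrics score the same set of translations, the metric-to-metric correlation r23 is typically high, which makes the Williams test considerably more sensitive than naively comparing two independent confidence intervals.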

PUBLICATION RECORD

  • Publication year

    2016

  • Venue

    North American Chapter of the Association for Computational Linguistics

  • Publication date

    2016-06-01

  • Fields of study

    Computer Science

  • Source metadata

    Semantic Scholar

CITED BY

21 citing papers indexed by Semantic Scholar.