Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

Malik Marmonier, Benoît Sagot, Rachel Bawden

Published 2026 in Unknown venue

ABSTRACT

This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
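
As a rough illustration of the evaluation described in the abstract, the sketch below computes Kendall's rank correlation between a predictor's scores and a gold-standard score over the same segments. It is a minimal sketch, not the authors' code: the abstract does not say which variant of Kendall's coefficient is used, so the tie-aware tau-b is an assumption, and the function name `kendall_tau_b` and the score lists `qe_scores` and `ter_scores` are hypothetical, with made-up values rather than data from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    # Compare every pair of segments and classify how the two rankings relate.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from numerator and denominator
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    # If either list is constant the coefficient is undefined.
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical per-segment scores (illustrative values only).
qe_scores = [0.9, 0.7, 0.4, 0.2]   # predictor: higher means better expected quality
ter_scores = [0.1, 0.3, 0.2, 0.6]  # gold: higher TER means more post-editing

tau = kendall_tau_b(qe_scores, ter_scores)
print(f"Kendall tau-b (QE vs. TER): {tau:.3f}")
```

Because a higher TER indicates more post-editing effort, a useful quality predictor would be expected to correlate negatively with TER and positively with COMET, so the sign of the coefficient has to be read against the orientation of each gold-standard score.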

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-03-04
Fields of study
Linguistics, Computer Science
Identifiers
arXiv 2603.04083
Source metadata
Semantic Scholar
