An Empirical Analysis of Machine Translation for Expanding Multilingual Benchmarks
Sara Rajaee, Rochelle Choenni, Ekaterina Shutova, C. Monz
Published 2025 in Proceedings of the Tenth Conference on Machine Translation
ABSTRACT
The rapid advancement of large language models (LLMs) has introduced new challenges in their evaluation, particularly in multilingual settings. The shortage of evaluation data is most pronounced in low-resource languages, owing to the scarcity of professional annotators, and hinders fair progress across languages. In this work, we systematically investigate the viability of using machine translation (MT) as a proxy for evaluation in scenarios where human-annotated test sets are unavailable. Leveraging a state-of-the-art translation model, we translate datasets from four tasks into 198 languages and employ these translations to assess the quality and robustness of MT-based multilingual evaluation under different setups. We analyze task-specific error patterns, identifying when MT-based evaluation is reliable and when it produces misleading results. Our translated benchmark reveals that current language selections in multilingual datasets tend to overestimate LLM performance on low-resource languages. We conclude that although machine translation is not yet a fully reliable method for evaluating multilingual models, overlooking its potential means missing a valuable opportunity to track progress in non-English languages.
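The pipeline the abstract describes, translating an existing English test set into a target language with an MT model and then scoring a model on the translated examples, can be sketched roughly as below. This is a minimal sketch, not the authors' code: it assumes the Hugging Face transformers library and the NLLB-200 distilled checkpoint (facebook/nllb-200-distilled-600M), which covers roughly 200 languages; the paper's actual translation model, tasks, and evaluation setup are not specified in this record, so the dataset, language codes, and helper names here are illustrative.

```python
# Minimal sketch: expand an English benchmark into another language via MT,
# then score a model on the translated test set.
# Assumptions (not from the paper): NLLB-200 distilled checkpoint, FLORES-style
# language codes, and a toy classification-style test set.
from transformers import pipeline


def translate_test_set(examples, src_lang="eng_Latn", tgt_lang="swh_Latn"):
    """Translate the 'text' field of each example into the target language."""
    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    translations = translator([ex["text"] for ex in examples], max_length=512)
    return [
        {**ex, "text": t["translation_text"]}
        for ex, t in zip(examples, translations)
    ]


def accuracy(predict, examples):
    """Fraction of examples where the model's prediction matches the gold label."""
    return sum(predict(ex["text"]) == ex["label"] for ex in examples) / len(examples)


if __name__ == "__main__":
    english_set = [
        {"text": "The movie was wonderful.", "label": "positive"},
        {"text": "The plot made no sense at all.", "label": "negative"},
    ]
    swahili_set = translate_test_set(english_set, tgt_lang="swh_Latn")

    # In practice 'predict' would wrap an LLM call; here it is a placeholder.
    predict = lambda text: "positive"
    print("MT-expanded test accuracy:", accuracy(predict, swahili_set))
```

Comparing scores on such an MT-expanded set against scores on the original (or on a human-translated set, where one exists) is the kind of contrast the paper uses to judge when MT-based evaluation is reliable and when it misleads.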
PUBLICATION RECORD
- Publication year: 2025
- Venue: Proceedings of the Tenth Conference on Machine Translation
- Source metadata: Semantic Scholar