Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance

Published 2018 in Conference on Machine Translation

ABSTRACT

This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics. We collected scores of 10 metrics from 8 research groups. In addition, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternate outputs). This year, we employ a single kind of manual evaluation: direct assessment (DA).
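
The two evaluation levels described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: system-level agreement is computed here as a Pearson correlation between metric scores and human scores, and segment-level agreement as a Kendall's-tau-like ratio over human "better/worse" pairs, with metric ties counted as disagreements. The function names `pearson` and `pairwise_agreement` and all data are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(metric_scores, human_pairs):
    """Tau-like agreement: (concordant - discordant) / total pairs.

    metric_scores maps an output id to the metric's score for it.
    human_pairs lists (better_id, worse_id) tuples derived from human
    judgements of alternate outputs for the same source sentence.
    Assumption of this sketch: a metric tie counts as a disagreement.
    """
    concordant = 0
    discordant = 0
    for better, worse in human_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical system-level data: one metric score and one human score per system.
metric_by_system = [0.31, 0.28, 0.35, 0.22]
human_by_system = [0.12, 0.05, 0.20, -0.10]
system_level = pearson(metric_by_system, human_by_system)

# Hypothetical segment-level data: three outputs for one source sentence.
segment_scores = {"a": 0.9, "b": 0.5, "c": 0.7}
human_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
segment_level = pairwise_agreement(segment_scores, human_pairs)
```

Both values lie in the range from -1 to 1. For the system-level value, 1 means the metric scores are a perfect increasing linear function of the human scores; for the segment-level value, 1 means the metric orders every judged pair the same way the human annotators did.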

PUBLICATION RECORD

Publication year
2018
Venue
Conference on Machine Translation
Publication date
2018-10-01
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/W18-6450
Source metadata
Semantic Scholar

REFERENCES

Meteor++: Incorporating Copy Knowledge into Machine Translation Evaluation
2018, cited by this paper
EvalD Reference-Less Discourse Evaluation for WMT18
2018, influential reference
ITER: Improving Translation Edit Rate through Optimizable Edit Costs
2018, cited by this paper
RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation
2018, cited by this paper
Results of the WMT17 Metrics Shared Task
2017, cited by this paper
MEANT 2.0: Accurate semantic MT evaluation for any output language
2017, cited by this paper
Blend: a Novel Combined MT Metric Based on Direct Assessment — CASICT-DCU submission to WMT17 Metrics Task
2017, cited by this paper
chrF++: words helping character n-grams
2017, cited by this paper
UHH Submission to the WMT17 Metrics Shared Task
2017, cited by this paper
Achieving Accurate Conclusions in Evaluation of Automatic Machine Translation Metrics
2016, cited by this paper
CharacTer: Translation Edit Rate on Character Level
2016, cited by this paper
Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem”
2016, cited by this paper
Ten Years of WMT Evaluation Campaigns: Lessons Learnt
2016, cited by this paper
Can machine translation systems be evaluated by the crowd alone
2015, cited by this paper
Accurate Evaluation of Segment-level Machine Translation Metrics
2015, cited by this paper
BEER 1.1: ILLC UvA submission to metrics and tuning task
2015, influential reference
chrF: character n-gram F-score for automatic MT evaluation
2015, cited by this paper
Testing for Significance of Increased Correlation with Human Judgment
2014, cited by this paper
Randomized Significance Tests in Machine Translation
2014, cited by this paper
Meteor Universal: Language Specific Translation Evaluation for Any Target Language
2014, cited by this paper
Is Machine Translation Getting Better over Time?
2014, cited by this paper
Results of the WMT14 Metrics Shared Task
2014, cited by this paper
Results of the WMT13 Metrics Shared Task
2013, cited by this paper
Continuous Measurement Scales in Human Evaluation of Machine Translation
2013, cited by this paper
A Grain of Salt for the WMT Manual Evaluation
2011, cited by this paper
CDER: Efficient MT Evaluation Using Block Movements
2006, cited by this paper
A Study of Translation Edit Rate with Targeted Human Annotation
2006, cited by this paper
Manual and Automatic Evaluation of Machine Translation between European Languages
2006, cited by this paper
Statistical Significance Tests for Machine Translation Evaluation
2004, cited by this paper
The reliability of the ITU-T P.85 standard for the evaluation of text-to-speech systems
2002, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, influential reference
Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics
2002, influential reference

CITED BY

Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
2025, cites this paper
Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation
2025, cites this paper
Reward Models are Metrics in a Trench Coat
2025, cites this paper
Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation
2025, cites this paper
Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
2024, cites this paper
Evaluating Automatic Metrics with Incremental Machine Translation Systems
2024, influential citation
MT-Ranker: Reference-free machine translation evaluation by inter-system ranking
2024, cites this paper
Is Context Helpful for Chat Translation Evaluation?
2024, cites this paper
Constraints to Neural Machine Translation Quality, Human and Automated Evaluation, and Quality Improvement across Language Pairs: A Systematic Literature Review
2024, cites this paper
Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
2024, cites this paper
Assessing the Role of Context in Chat Translation Evaluation: Is Context Helpful and Under What Conditions?
2024, cites this paper
Improving Metrics for Speech Translation
2023, cites this paper
TransFool: An Adversarial Attack against Neural Machine Translation Models
2023, influential citation
Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences
2023, cites this paper
Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
2023, cites this paper
A Holistic Approach to Reference-Free Evaluation of Machine Translation
2023, cites this paper
Obscurity-Quantified Curriculum Learning for Machine Translation Evaluation
2023, cites this paper
Conformalizing Machine Translation Evaluation
2023, cites this paper
Ties Matter: Modifying Kendall's Tau for Modern Metric Meta-Evaluation
2023, cites this paper
Large Language Models are Diverse Role-Players for Summarization Evaluation
2023, cites this paper
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
2022, cites this paper
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
2022, cites this paper
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
2022, cites this paper
Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End
2022, cites this paper
SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation
2022, cites this paper
LENS: A Learnable Evaluation Metric for Text Simplification
2022, influential citation
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
2022, cites this paper
Reward Gaming in Conditional Text Generation
2022, cites this paper
Subspace Representations for Soft Set Operations and Sentence Similarities
2022, cites this paper
Alibaba-Translate China’s Submission for WMT 2022 Quality Estimation Shared Task
2022, cites this paper
Alibaba-Translate China’s Submission for WMT2022 Metrics Shared Task
2022, cites this paper
DATScore: Evaluating Translation with Data Augmented Translations
2022, influential citation
NLG-Metricverse: An End-to-End Library for Evaluating Natural Language Generation
2022, cites this paper
Out of the BLEU: how should we assess quality of the Code Generation models?
2022, cites this paper
Reproducibility Issues for BERT-based Evaluation Metrics
2022, influential citation
Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics
2022, cites this paper
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
2022, cites this paper
Better than Average: Paired Evaluation of NLP systems
2021, cites this paper
Human evaluation of automatically generated text: Current trends and best practice guidelines
2021, cites this paper
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
2021, cites this paper
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
2021, cites this paper
Evaluating the Morphosyntactic Well-formedness of Generated Texts
2021, cites this paper
Macro-Average: Rare Types Are Important Too
2021, cites this paper
Is This Translation Error Critical?: Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
2021, cites this paper
The statistical advantage of automatic NLG metrics at the system level
2021, cites this paper
Variance-Aware Machine Translation Test Sets
2021, cites this paper
BARTScore: Evaluating Generated Text as Text Generation
2021, cites this paper
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
2021, cites this paper
Multilingual Simultaneous Neural Machine Translation
2021, cites this paper
Text Style Transfer: Leveraging a Style Classifier on Entangled Latent Representations
2021, cites this paper
The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification
2021, cites this paper
Automatic Text Evaluation through the Lens of Wasserstein Barycenters
2021, cites this paper
A Large-Scale Study of Machine Translation in Turkic Languages
2021, cites this paper
Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems
2021, cites this paper
Learning Compact Metrics for MT
2021, cites this paper
A Survey of Automatic Text Summarization: Progress, Process and Challenges
2021, cites this paper
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
2021, cites this paper
Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task
2021, cites this paper
GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation
2021, influential citation
Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
2020, cites this paper
Evaluating Natural Language Generation via Unbalanced Optimal Transport
2020, cites this paper
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
2020, cites this paper
Unsupervised Quality Estimation for Neural Machine Translation
2020, cites this paper
Neural Polysynthetic Language Modelling
2020, cites this paper
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
2020, cites this paper
NUBIA: NeUral Based Interchangeability Assessor for Text Generation
2020, influential citation
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
2020, cites this paper
Revisiting Round-trip Translation for Quality Estimation
2020, influential citation
Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level
2020, cites this paper
Extended Study on Using Pretrained Language Models and YiSi-1 for Machine Translation Evaluation
2020, cites this paper
Results of the WMT20 Metrics Shared Task
2020, cites this paper
BLEURT: Learning Robust Metrics for Text Generation
2020, influential citation
Machine Translation with Unsupervised Length-Constraints
2020, cites this paper
AMR Similarity Metrics from Principles
2020, cites this paper
Exploring Benefits of Transfer Learning in Neural Machine Translation
2020, cites this paper
Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models
2020, cites this paper
Improving Semantic Similarity Calculation of Japanese Text for MT Evaluation
2020, cites this paper
Reflective Decoding: Unsupervised Paraphrasing and Abductive Reasoning
2020, cites this paper
Neural Machine Translation between similar South-Slavic languages
2020, cites this paper
TMUOU Submission for WMT20 Quality Estimation Shared Task
2020, cites this paper
Neural Machine Translation for translating into Croatian and Serbian
2020, cites this paper
Towards a Better Evaluation of Metrics for Machine Translation
2020, cites this paper
Extracting correctly aligned segments from unclean parallel data using character n-gram matching
2020, cites this paper
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
2020, cites this paper
COMET: A Neural Framework for MT Evaluation
2020, cites this paper
A Survey of Evaluation Metrics Used for NLG Systems
2020, cites this paper
Character-Level Transformer-Based Neural Machine Translation
2020, cites this paper
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
2019, cites this paper
BERTScore: Evaluating Text Generation with BERT
2019, cites this paper
Enabling Robust Grammatical Error Correction in New Domains: Data Sets, Metrics, and Analyses
2019, cites this paper
On conducting better validation studies of automatic metrics in natural language generation evaluation
2019, influential citation
Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
2019, cites this paper
Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings
2019, cites this paper
YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources
2019, influential citation
Multimodal machine translation through visuals and speech
2019, cites this paper
On the use of BERT for Neural Machine Translation
2019, cites this paper
EED: Extended Edit Distance Measure for Machine Translation
2019, cites this paper
LCEval: Learned Composite Metric for Caption Evaluation
2019, cites this paper
Machine Translation Evaluation with BERT Regressor
2019, influential citation
Machine Translation and the Evaluation of Its Quality
2019, cites this paper