Machine Translation Robustness to Natural Asemantic Variation

Published 2022 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Current Machine Translation (MT) models still struggle with more challenging input, such as noisy data and tail-end words and phrases. Several works have addressed this robustness issue by identifying specific categories of noise and variation then tuning models to perform better on them. An important yet under-studied category involves minor variations in nuance (non-typos) that preserve meaning w.r.t. the target language. We introduce and formalize this category as Natural Asemantic Variation (NAV) and investigate it in the context of MT robustness. We find that existing MT models fail when presented with NAV data, but we demonstrate strategies to improve performance on NAV by fine-tuning them with human-generated variations. We also show that NAV robustness can be transferred across languages and find that synthetic perturbations can achieve some but not all of the benefits of organic NAV data.

PUBLICATION RECORD

Publication year
2022
Venue
Conference on Empirical Methods in Natural Language Processing
Publication date
2022-05-25
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.48550/arXiv.2205.12514 arXiv 2205.12514
Source metadata
Semantic Scholar

REFERENCES

Robust Open-Vocabulary Translation from Visual Text Representations
2021, cited by this paper
Analyzing the Forgetting Problem in Pretrain-Finetuning of Open-domain Dialogue Response Models
2021, cited by this paper
High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics
2021, cited by this paper
Evaluating Robustness to Input Perturbations for Neural Machine Translation
2020, influential reference
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
2020, cited by this paper
Simulated Multiple Reference Training Improves Low-Resource Machine Translation
2020, cited by this paper
Beyond English-Centric Multilingual Machine Translation
2020, cited by this paper
Simultaneous Translation and Paraphrase for Language Education
2020, cited by this paper
Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems
2019, cited by this paper
ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation
2019, cited by this paper
Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation
2019, cited by this paper
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
2019, cited by this paper
Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation
2019, cited by this paper
Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness
2019, cited by this paper
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
2019, cited by this paper
A Massive Collection of Cross-Lingual Web-Document Pairs
2019, cited by this paper
A Call for Clarity in Reporting BLEU Scores
2018, cited by this paper
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
2018, cited by this paper
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
2018, cited by this paper
MTNT: A Testbed for Machine Translation of Noisy Text
2018, cited by this paper
On the Impact of Various Types of Noise on Neural Machine Translation
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017, cited by this paper
Improving Neural Machine Translation Models with Monolingual Data
2015, cited by this paper
Explaining and Harnessing Adversarial Examples
2014, cited by this paper
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, cited by this paper
of the Association for Computational Linguistics
year unknown, cited by this paper

CITED BY

AGI: the illusion that distorts and distracts digital governance
2025, cites this paper
Optimized Fine-tuning and Pseudo-Data Strategies for Cross-Domain Low-Resource Language Cantonese-English Neural Machine Translation
2025, cites this paper