Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

Dana Ruiter, D. Klakow, Josef van Genabith, C. España-Bonet

Published 2021 in Machine Translation Summit

ABSTRACT

For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that integrating UMT techniques into SSNMT significantly outperforms SSNMT (up to +4.3 BLEU, af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related, distantly-related, and unrelated language pairs.
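The noising mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common UMT recipe of word dropout plus a bounded local shuffle; the record does not state which noise functions or settings the paper uses, so the function name add_noise and the parameters drop_prob and max_shift are hypothetical choices for illustration only.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a noised copy of `tokens` (illustrative, not the paper's code).

    Assumed noise model: each token is dropped independently with
    probability `drop_prob`, then the survivors are locally shuffled so
    that no token moves more than `max_shift` positions.
    """
    rng = rng or random.Random()

    # Word dropout: keep a token when the draw is at least drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        # Avoid returning an empty sentence for a non-empty input.
        kept = [rng.choice(tokens)]

    # Local shuffle: perturb each index by noise in [0, max_shift) and
    # sort by the perturbed key, which bounds how far a token can travel.
    keys = [i + rng.random() * max_shift for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda pair: pair[0])]


if __name__ == "__main__":
    sentence = "parallel data is either scarce or simply unavailable".split()
    print(add_noise(sentence, rng=random.Random(0)))
```

In a denoising or back-translation setup, a noised sentence like this would typically serve as the source side and the clean sentence as the target; whether and how the paper combines the two is not recoverable from this record.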

PUBLICATION RECORD

Publication year: 2021
Venue: Machine Translation Summit
Publication date: 2021-07-19
Fields of study: Linguistics, Computer Science
Identifiers: arXiv 2107.08772
Source metadata: Semantic Scholar

REFERENCES

Unsupervised Machine Translation On Dravidian Languages (2021)
Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources (2021)
MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation (2021)
The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation (2021)
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning (2020)
When and Why is Unsupervised Neural Machine Translation Useless? (2020)
When Does Unsupervised Machine Translation Work? (2020)
Reference Language based Unsupervised Neural Machine Translation (2020)
Low Resource Neural Machine Translation: A Benchmark for Five African Languages (2020)
Multilingual Denoising Pre-training for Neural Machine Translation (2020)
Dataset for comparable evaluation of machine translation between 11 South African languages (2020)
Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020 (2020)
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation (2020, influential reference)
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution (2020)
Self-Supervised Neural Machine Translation (2019, influential reference)
Unsupervised Neural Machine Translation with SMT as Posterior Regularization (2019, influential reference)
Cross-lingual Language Model Pretraining (2019)
An Effective Approach to Unsupervised Machine Translation (2019, influential reference)
Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English (2019, influential reference)
Unsupervised Pivot Translation for Distant Languages (2019)
A Focus on Neural Machine Translation for African Languages (2019)
Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings (2019)
Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders (2019)
JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages (2019)
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (2019)
UdS-DFKI Participation at WMT 2019: Low-Resource (en-gu) and Coreference-Aware (en-de) Systems (2019)
Ordering Matters: Word Ordering Aware Unsupervised NMT (2019)
Unsupervised Cross-lingual Representation Learning at Scale (2019)
A Massive Collection of Cross-Lingual Web-Document Pairs (2019)
Unsupervised Neural Machine Translation with Weight Sharing (2018)
Phrase-Based & Neural Unsupervised Machine Translation (2018)
Joint Training for Neural Machine Translation Models with Monolingual Data (2018)
A Call for Clarity in Reporting BLEU Scores (2018)
Bi-Directional Neural Machine Translation with Synthetic Parallel Data (2018)
Iterative Back-Translation for Neural Machine Translation (2018)
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018)
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings (2018, influential reference)
Large Scale Myanmar to English Neural Machine Translation System (2018)
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (2018)
Unsupervised Machine Translation Using Monolingual Corpora Only (2017, influential reference)
Neural machine translation for low-resource languages without parallel corpora (2017)
URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors (2017)
Learning bilingual word embeddings with (almost) no bilingual data (2017)
Unsupervised Neural Machine Translation (2017)
Copied Monolingual Data Improves Low-Resource Neural Machine Translation (2017)
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (2016)
Transfer Learning for Low-Resource Neural Machine Translation (2016)
Unsupervised Pretraining for Sequence to Sequence Learning (2016)
Neural Machine Translation of Rare Words with Subword Units (2015, influential reference)
On Using Monolingual Corpora in Neural Machine Translation (2015)
Improving Neural Machine Translation Models with Monolingual Data (2015, influential reference)
Distributed Representations of Words and Phrases and their Compositionality (2013)
Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability (2011)
Improving Translation Model by Monolingual Data (2011)
Moses: Open Source Toolkit for Statistical Machine Translation (2007)
NLTK: The Natural Language Toolkit (2006)
Statistical Significance Tests for Machine Translation Evaluation (2004)
Bleu: a Method for Automatic Evaluation of Machine Translation (2002)

CITED BY

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction (2023)
Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction (2023)
Boosting Unsupervised Machine Translation with Pseudo-Parallel Data (2023)
A Voyage on Neural Machine Translation for Indic Languages (2023)
Introducing "Forecast Utterance" for Conversational Data Science (2023)
Neural Machine Translation for Kashmiri to English and Hindi using Pre-trained Embeddings (2022)
Exploiting Social Media Content for Self-Supervised Style Transfer (2022)
The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation (2021)
Unsupervised Named Entity Recognition for Hi-Tech Domain (2021)