Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, Francisco (Paco) Guzmán

Published 2020 in Conference on Machine Translation

ABSTRACT

Following two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018, 2019), we again posed the challenge of assigning sentence-level quality scores to very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data for training machine translation systems. This year, the task tackled the low-resource conditions of Pashto–English and Khmer–English and also included the challenge of sentence alignment from document pairs.

PUBLICATION RECORD

Publication year
2020
Venue
Conference on Machine Translation
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/2020.wmt-1.78
Source metadata
Semantic Scholar

REFERENCES

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance
2020cited by this paper
Multilingual Denoising Pre-training for Neural Machine Translation
2020cited by this paper
Filtering Noisy Parallel Corpus using Transformers with Proxy Task Learning
2020influential reference
An exploratory approach to the Parallel Corpus Filtering shared task WMT20
2020influential reference
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
2020influential reference
Dual Conditional Cross Entropy Scores and LASER Similarity Scores for the WMT20 Parallel Corpus Filtering Shared Task
2020influential reference
Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
2020influential reference
Volctrans Parallel Corpus Filtering System for WMT 2020
2020influential reference
Beyond English-Centric Multilingual Machine Translation
2020cited by this paper
Searching the Web for Cross-lingual Parallel Data
2020cited by this paper
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
2020cited by this paper
Exploiting Sentence Order in Document Alignment
2020cited by this paper
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
2019cited by this paper
Reinforcement Learning based Curriculum Optimization for Neural Machine Translation
2019cited by this paper
Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English
2019influential reference
A Massive Collection of Cross-Lingual Web-Document Pairs
2019cited by this paper
Vecalign: Improved Sentence Alignment in Linear Time and Space
2019influential reference
Unsupervised Cross-lingual Representation Learning at Scale
2019cited by this paper
Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task
2019cited by this paper
YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources
2019cited by this paper
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
2019cited by this paper
Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
2019cited by this paper
Massively Multilingual Neural Machine Translation
2019cited by this paper
On the Impact of Various Types of Noise on Neural Machine Translation
2018cited by this paper
Phrase-Based & Neural Unsupervised Machine Translation
2018cited by this paper
A Call for Clarity in Reporting BLEU Scores
2018cited by this paper
Filtering and Mining Parallel Data in a Joint Multilingual Space
2018cited by this paper
Iterative Back-Translation for Neural Machine Translation
2018cited by this paper
Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora
2018cited by this paper
Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection
2018cited by this paper
Alibaba Submission to the WMT18 Parallel Corpus Filtering Task
2018influential reference
Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering
2018cited by this paper
Attention is All you Need
2017cited by this paper
Findings of the 2017 Conference on Machine Translation (WMT17)
2017cited by this paper
Six Challenges for Neural Machine Translation
2017cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017cited by this paper
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
2017cited by this paper
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
2017cited by this paper
Dynamic Data Selection for Neural Machine Translation
2017cited by this paper
Findings of the WMT 2016 Bilingual Document Alignment Shared Task
2016cited by this paper
A Convolutional Encoder Model for Neural Machine Translation
2016cited by this paper
The United Nations Parallel Corpus v1.0
2016cited by this paper
First Steps Towards Coverage-Based Document Alignment
2016cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014cited by this paper
Dirt Cheap Web-Scale Parallel Text from the Common Crawl
2013cited by this paper
Bilingual Data Cleaning for SMT using Graph-based Random Walk
2013cited by this paper
Parallel Data, Tools and Interfaces in OPUS
2012cited by this paper
Domain Adaptation via Pseudo In-Domain Data Selection
2011cited by this paper
Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text
2011cited by this paper
Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation
2011cited by this paper
Parallel Corpus Refinement as an Outlier Detection Algorithm
2011cited by this paper
The Sentence-Aligned European Patent Corpus
2011cited by this paper
MT Detection in Web-Scraped Parallel Corpora
2011cited by this paper
Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora
2010cited by this paper
MT-based Sentence Alignment for OCR-generated Parallel Texts
2010cited by this paper
United Nations General Assembly Resolutions: A Six-Language Parallel Corpus
2009cited by this paper
Parallel corpora for medium density languages
2007cited by this paper
Europarl: A Parallel Corpus for Statistical Machine Translation
2005cited by this paper
Fast and accurate sentence alignment of bilingual corpora
2002cited by this paper
Mining the Web for Bilingual Text
1999cited by this paper
Aligning Sentences in Bilingual Corpora Using Lexical Information
1993cited by this paper
A Program for Aligning Sentences in Bilingual Corpora
1993cited by this paper

CITED BY

A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives
2025cites this paper
Granary: Speech Recognition and Translation Dataset in 25 European Languages
2025cites this paper
Call for Rigor in Reporting Quality of Instruction Tuning Data
2025cites this paper
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
2025cites this paper
Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
2025cites this paper
Improving Parallel Sentence Mining for Low-Resource and Endangered Languages
2025cites this paper
AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts
2025cites this paper
Multilingual Data Filtering using Synthetic Data from Large Language Models
2025cites this paper
Man vs. machine: can AI outperform ESL student translations?
2025cites this paper
Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
2024cites this paper
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
2024cites this paper
Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages
2024cites this paper
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
2024cites this paper
Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
2024cites this paper
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
2024cites this paper
Optimizing Machine Translation Algorithms through Empirical Study of Multi-modal Information Fusion
2024cites this paper
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
2024cites this paper
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
2024cites this paper
Hyper-BTS Dataset: Scalability and Enhanced Analysis of Back TranScription (BTS) for ASR Post-Processing
2024cites this paper
Improving Machine Translation with Phrase Pair Injection and Corpus Filtering
2023cites this paper
DMOps: Data Management Operation and Recipes
2023cites this paper
Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards
2023cites this paper
Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability
2023cites this paper
Filtering Matters: Experiments in Filtering Training Sets for Machine Translation
2023cites this paper
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction
2023cites this paper
Reinforced Self-Training (ReST) for Language Modeling
2023influential citation
Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction
2023cites this paper
Return to the Source: Assessing Machine Translation Suitability
2023cites this paper
Research on the Application of Translation Parallel Corpus in Interpretation Teaching
2023cites this paper
Translation Performance from the User's Perspective of Large Language Models and Neural Machine Translation Systems
2023cites this paper
Ask Language Model to Clean Your Noisy Translation Data
2023cites this paper
There’s No Data like Better Data: Using QE Metrics for MT Data Filtering
2023cites this paper
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
2023cites this paper
On-the-Fly Fusion of Large Language Models and Machine Translation
2023cites this paper
Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
2023cites this paper
A Sentence Alignment Approach to Document Alignment and Multi-faceted Filtering for Curating Parallel Sentence Pairs from Web-crawled Data
2023cites this paper
Findings of the WMT 2023 Shared Task on Parallel Data Curation
2023cites this paper
A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation
2023cites this paper
Data Pruning for Efficient Model Pruning in Neural Machine Translation
2023cites this paper
Parallel Corpus Filtering for Japanese Text Simplification
2022cites this paper
Can Synthetic Translations Improve Bitext Quality?
2022cites this paper
Building Machine Translation Systems for the Next Thousand Languages
2022cites this paper
Detecting Various Types of Noise for Neural Machine Translation
2022cites this paper
Human evaluation of web-crawled parallel corpora for machine translation
2022cites this paper
Unsupervised Geometric and Topological Approaches for Cross-Lingual Sentence Representation and Comparison
2022cites this paper
A Method of Chinese-Vietnamese Bilingual Corpus Construction for Machine Translation
2022cites this paper
Mismatching-aware unsupervised translation quality estimation for low-resource languages
2022influential citation
Bitext Mining for Low-Resource Languages via Contrastive Learning
2022influential citation
A Comparison of Data Filtering Methods for Neural Machine Translation
2022cites this paper
Improving Translation of Out Of Vocabulary Words using Bilingual Lexicon Induction in Low-Resource Machine Translation
2022cites this paper
Quality versus Quantity: Building Catalan-English MT Resources
2022influential citation
Bicleaner AI: Bicleaner Goes Neural
2022cites this paper
Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages
2022cites this paper
Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
2022cites this paper
The ASR Post-Processor Performance Challenges of BackTranScription (BTS): Data-Centric and Model-Centric Approaches
2022cites this paper
Multilingual Representation Distillation with Contrastive Learning
2022cites this paper
Overview of the 9th Workshop on Asian Translation
2022influential citation
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
2022cites this paper
Norm-based Noisy Corpora Filtering and Refurbishing in Neural Machine Translation
2022cites this paper
Empirical Evaluation of Language Agnostic Filtering of Parallel Data for Low Resource Languages
2022cites this paper
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
2021cites this paper
Recent Automatic Post Editing Research
2021cites this paper
BackTranScription (BTS)-based Jeju Automatic Speech Recognition Post-processor Research
2021cites this paper
On the Development of Customized Neural Machine Translation Models
2021cites this paper
Selecting the Best Data Filtering Method for NMT Training
2021cites this paper
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
2021influential citation
How should human translation coexist with NMT? Efficient tool for building high quality parallel corpus
2021cites this paper
Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC
2021influential citation
EdinSaar@WMT21: North-Germanic Low-Resource Multilingual NMT
2021cites this paper
On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation
2021cites this paper
Survey of Low-Resource Machine Translation
2021cites this paper
Surprise Language Challenge: Developing a Neural Machine Translation System between Pashto and English in Two Months
2021influential citation
BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text
2021cites this paper
Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation
2021cites this paper
Data Filtering using Cross-Lingual Word Embeddings
2021cites this paper
Findings of the 2020 Conference on Machine Translation (WMT20)
2020influential citation
Results of the WMT20 Metrics Shared Task
2020cites this paper
An exploratory approach to the Parallel Corpus Filtering shared task WMT20
2020cites this paper
Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
2020cites this paper