Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions

Philipp Koehn, Francisco (Paco) Guzmán, Vishrav Chaudhary, J. Pino

Published 2019 in Conference on Machine Translation

ABSTRACT

Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2% and 10% of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low-resource conditions of Nepali-English and Sinhala-English. Eleven participants from companies, national research labs, and universities took part in this task.
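
The sub-selection step described in the abstract can be illustrated with a minimal sketch. It assumes each sentence pair already has a quality score and that the fraction is measured in sentence pairs; the abstract does not state the unit, so that choice, along with the function name `select_top_fraction` and its parameters, is hypothetical and not taken from the task's tooling.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]


def select_top_fraction(
    pairs: Sequence[SentencePair],
    scores: Sequence[float],
    fraction: float,
) -> List[SentencePair]:
    """Return the highest-scoring `fraction` of sentence pairs.

    Assumption: the fraction is counted in sentence pairs, not words.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not pairs:
        return []

    # Keep at least one pair so a tiny corpus never yields an empty subset.
    keep = max(1, int(len(pairs) * fraction))

    # Rank indices by score, best first; ties keep their original order.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in ranked[:keep]]
```

Calling the function once with `fraction=0.02` and once with `fraction=0.10` would produce the two subsets the abstract mentions, each of which would then be used to train a separate translation system.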

PUBLICATION RECORD

Publication year
2019
Venue
Conference on Machine Translation
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/W19-5404
Source metadata
Semantic Scholar

REFERENCES

Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English
2019, influential reference
Noisy Parallel Corpus Filtering through Projected Word Embeddings
2019, cited by this paper
NRC Parallel Corpus Filtering System for WMT 2019
2019, influential reference
Webinterpret Submission to the WMT2019 Shared Task on Parallel Corpus Filtering
2019, cited by this paper
Parallel Corpus Filtering Based on Fuzzy String Matching
2019, influential reference
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
2019, cited by this paper
Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task
2019, influential reference
Filtering of Noisy Parallel Corpora Based on Hypothesis Generation
2019, influential reference
The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task
2019, influential reference
Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
2019, influential reference
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
2019, cited by this paper
Reinforcement Learning based Curriculum Optimization for Neural Machine Translation
2019, cited by this paper
UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation
2018, cited by this paper
A Call for Clarity in Reporting BLEU Scores
2018, cited by this paper
On the Impact of Various Types of Noise on Neural Machine Translation
2018, cited by this paper
Effective Parallel Corpus Mining using Bilingual Sentence Embeddings
2018, cited by this paper
Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora
2018, cited by this paper
Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection
2018, cited by this paper
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
2018, cited by this paper
Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task
2018, cited by this paper
Alibaba Submission to the WMT18 Parallel Corpus Filtering Task
2018, cited by this paper
SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering
2018, cited by this paper
Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering
2018, influential reference
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
2018, cited by this paper
An Unsupervised System for Parallel Corpus Filtering
2018, cited by this paper
The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task
2018, cited by this paper
Tilde’s Parallel Corpus Filtering Methods for WMT 2018
2018, cited by this paper
The Speechmatics Parallel Corpus Filtering System for WMT18
2018, cited by this paper
H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017, cited by this paper
The IIT Bombay English-Hindi Parallel Corpus
2017, cited by this paper
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
2017, cited by this paper
Six Challenges for Neural Machine Translation
2017, influential reference
Dynamic Data Selection for Neural Machine Translation
2017, cited by this paper
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
2017, cited by this paper
Findings of the WMT 2016 Bilingual Document Alignment Shared Task
2016, cited by this paper
The United Nations Parallel Corpus v1.0
2016, cited by this paper
A Convolutional Encoder Model for Neural Machine Translation
2016, cited by this paper
Neural Machine Translation by Jointly Learning to Align and Translate
2014, cited by this paper
The Language Demographics of Amazon Mechanical Turk
2014, cited by this paper
Bilingual Data Cleaning for SMT using Graph-based Random Walk
2013, cited by this paper
Parallel Data, Tools and Interfaces in OPUS
2012, cited by this paper
Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation
2011, cited by this paper
The Sentence-Aligned European Patent Corpus
2011, cited by this paper
Domain Adaptation via Pseudo In-Domain Data Selection
2011, cited by this paper
Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text
2011, cited by this paper
Parallel Corpus Refinement as an Outlier Detection Algorithm
2011, cited by this paper
MT Detection in Web-Scraped Parallel Corpora
2011, cited by this paper
United Nations General Assembly Resolutions: A Six-Language Parallel Corpus
2009, cited by this paper
Parallel corpora for medium density languages
2007, cited by this paper
Moses: Open Source Toolkit for Statistical Machine Translation
2007, influential reference
Europarl: A Parallel Corpus for Statistical Machine Translation
2005, cited by this paper
Mining the Web for Bilingual Text
1999, cited by this paper

CITED BY

Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
2025, cites this paper
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
2025, cites this paper
Improving Parallel Sentence Mining for Low-Resource and Endangered Languages
2025, cites this paper
Multilingual Data Filtering using Synthetic Data from Large Language Models
2025, cites this paper
Improving BERTScore for Machine Translation Evaluation Through Contrastive Learning
2024, cites this paper
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
2024, cites this paper
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
2024, cites this paper
Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages
2024, cites this paper
Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset
2024, cites this paper
There’s No Data like Better Data: Using QE Metrics for MT Data Filtering
2023, cites this paper
A Sentence Alignment Approach to Document Alignment and Multi-faceted Filtering for Curating Parallel Sentence Pairs from Web-crawled Data
2023, cites this paper
Findings of the WMT 2023 Shared Task on Parallel Data Curation
2023, cites this paper
Automatic language identification: a case study of Pahari languages
2023, cites this paper
Filtering Matters: Experiments in Filtering Training Sets for Machine Translation
2023, cites this paper
Mining parallel sentences from internet with multi-view knowledge distillation for low-resource language pairs
2023, cites this paper
A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation
2023, cites this paper
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
2023, cites this paper
On-the-Fly Fusion of Large Language Models and Machine Translation
2023, cites this paper
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
2022, cites this paper
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
2022, cites this paper
Can Synthetic Translations Improve Bitext Quality?
2022, cites this paper
Detecting Various Types of Noise for Neural Machine Translation
2022, cites this paper
Human evaluation of web-crawled parallel corpora for machine translation
2022, cites this paper
MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases
2022, cites this paper
UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation
2022, cites this paper
Bitext Mining for Low-Resource Languages via Contrastive Learning
2022, cites this paper
Bicleaner AI: Bicleaner Goes Neural
2022, cites this paper
Multilingual Representation Distillation with Contrastive Learning
2022, cites this paper
Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
2022, cites this paper
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
2022, cites this paper
Optimal Transport for Unsupervised Hallucination Detection in Neural Machine Translation
2022, cites this paper
Empirical Evaluation of Language Agnostic Filtering of Parallel Data for Low Resource Languages
2022, cites this paper
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
2021, influential citation
Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC
2021, cites this paper
Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer
2021, cites this paper
Survey of Low-Resource Machine Translation
2021, cites this paper
Neural Machine Translation for Low-resource Languages: A Survey
2021, cites this paper
Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment
2021, cites this paper
Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation
2021, cites this paper
Data Filtering using Cross-Lingual Word Embeddings
2021, cites this paper
Findings of the Shared Task on Machine Translation in Dravidian languages
2021, cites this paper
A Targeted Attack on Black-Box Neural Machine Translation with Parallel Data Poisoning
2020, influential citation
PMIndia - A Collection of Parallel Corpora of Languages of India
2020, cites this paper
Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
2020, cites this paper
Exploiting Sentence Order in Document Alignment
2020, cites this paper
Multilingual Unsupervised Sentence Simplification
2020, cites this paper
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
2020, cites this paper
Octanove Labs’ Japanese-Chinese Open Domain Translation System
2020, cites this paper
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
2020, cites this paper
Searching the Web for Cross-lingual Parallel Data
2020, cites this paper
Bifixer and Bicleaner: two open-source tools to clean your parallel data
2020, cites this paper
Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
2020, cites this paper
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
2020, cites this paper
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
2020, cites this paper
Unsupervised Expressive Rules Provide Explainability and Assist Human Experts Grasping New Domains
2020, cites this paper
Beyond English-Centric Multilingual Machine Translation
2020, cites this paper
Decoding Strategies for Improving Low-Resource Machine Translation
2020, cites this paper
Volctrans Parallel Corpus Filtering System for WMT 2020
2020, cites this paper
Machine Translation of Open Educational Resources: Evaluating Translation Quality and the Transition to Neural Machine Translation
2020, cites this paper
Targeted Poisoning Attacks on Black-Box Neural Machine Translation
2020, influential citation
Detecting Hallucinated Content in Conditional Neural Sequence Generation
2020, cites this paper
Extracting correctly aligned segments from unclean parallel data using character n-gram matching
2020, cites this paper
Self-Supervised Learning for Pairwise Data Refinement
2020, cites this paper
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
2020, cites this paper
Samsung R&D Institute Poland submission to WMT20 News Translation Task
2020, cites this paper
Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
2020, cites this paper
An exploratory approach to the Parallel Corpus Filtering shared task WMT20
2020, cites this paper
Dual Conditional Cross Entropy Scores and LASER Similarity Scores for the WMT20 Parallel Corpus Filtering Shared Task
2020, cites this paper
Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment
2020, cites this paper
Filtering Noisy Parallel Corpus using Transformers with Proxy Task Learning
2020, cites this paper
XMU Evaluation Report for CCMT2020
2020, cites this paper
Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
2019, influential citation
Learning a Multi-Domain Curriculum for Neural Machine Translation
2019, cites this paper
GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
2019, cites this paper
A Massive Collection of Cross-Lingual Web-Document Pairs
2019, cites this paper
Vecalign: Improved Sentence Alignment in Linear Time and Space
2019, cites this paper
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
2019, cites this paper
Facebook AI’s WAT19 Myanmar-English Translation Task Submission
2019, cites this paper
Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
2019, influential citation
Findings of the 2019 Conference on Machine Translation (WMT19)
2019, cites this paper
Parallel Corpus Filtering Based on Fuzzy String Matching
2019, cites this paper
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
2019, cites this paper
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
2019, influential citation