Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, M. Forcada

Published 2018 in Conference on Machine Translation

ABSTRACT

We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task.
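
The sub-selection step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' tooling: the function name `subselect_by_score`, the whitespace tokenization, and measuring the budget in source-side words are all hypothetical choices. The abstract only states that 1% and 10% of the data are selected according to sentence-level quality scores.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def subselect_by_score(
    pairs: Sequence[SentencePair],
    scores: Sequence[float],
    fraction: float,
) -> List[SentencePair]:
    """Keep the highest-scoring pairs until a word budget is reached.

    Assumption: the budget is `fraction` of the total number of
    whitespace-separated source-side tokens in the corpus.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    total_words = sum(len(src.split()) for src, _ in pairs)
    budget = fraction * total_words

    # Visit pairs from highest to lowest quality score.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected: List[SentencePair] = []
    used = 0
    for i in order:
        if used >= budget:
            break
        selected.append(pairs[i])
        used += len(pairs[i][0].split())
    return selected


# Toy corpus with 10 source-side words in total.
toy_pairs = [("a b", "x"), ("c d e f", "y"), ("g h", "z"), ("i j", "w")]
toy_scores = [0.9, 0.1, 0.5, 0.7]
top_20_percent = subselect_by_score(toy_pairs, toy_scores, 0.2)
top_40_percent = subselect_by_score(toy_pairs, toy_scores, 0.4)
```

In this sketch the same scores serve both subset sizes: a participant would produce one score per sentence pair, and the two training sets differ only in the `fraction` passed in.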

PUBLICATION RECORD

Publication year: 2018
Venue: Conference on Machine Translation
Publication date: 2018-10-31
Fields of study: Linguistics, Computer Science
Identifiers: DOI 10.18653/v1/W18-6453
Source metadata: Semantic Scholar

REFERENCES

On the Impact of Various Types of Noise on Neural Machine Translation
2018, influential reference
Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora
2018, cited by this paper
NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
2018, cited by this paper
Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task
2018, cited by this paper
The Speechmatics Parallel Corpus Filtering System for WMT18
2018, cited by this paper
Tilde’s Parallel Corpus Filtering Methods for WMT 2018
2018, cited by this paper
Marian: Fast Neural Machine Translation in C++
2018, cited by this paper
A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora
2018, cited by this paper
The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task
2018, cited by this paper
An Unsupervised System for Parallel Corpus Filtering
2018, cited by this paper
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
2018, cited by this paper
MAJE Submission to the WMT2018 Shared Task on Parallel Corpus Filtering
2018, cited by this paper
UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation
2018, cited by this paper
The JHU Parallel Corpus Filtering Systems for WMT 2018
2018, cited by this paper
SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering
2018, cited by this paper
Alibaba Submission to the WMT18 Parallel Corpus Filtering Task
2018, cited by this paper
STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering
2018, cited by this paper
The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task
2018, cited by this paper
Coverage and Cynicism: The AFRL Submission to the WMT 2018 Parallel Corpus Filtering Task
2018, cited by this paper
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
2017, cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017, cited by this paper
Dynamic Data Selection for Neural Machine Translation
2017, cited by this paper
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
2017, cited by this paper
Findings of the WMT 2016 Bilingual Document Alignment Shared Task
2016, cited by this paper
The United Nations Parallel Corpus v1.0
2016, cited by this paper
Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus
2014, cited by this paper
Bilingual Data Cleaning for SMT using Graph-based Random Walk
2013, cited by this paper
Parallel Data, Tools and Interfaces in OPUS
2012, cited by this paper
Parallel Corpus Refinement as an Outlier Detection Algorithm
2011, cited by this paper
The Sentence-Aligned European Patent Corpus
2011, cited by this paper
Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation
2011, cited by this paper
Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text
2011, cited by this paper
Domain Adaptation via Pseudo In-Domain Data Selection
2011, cited by this paper
MT Detection in Web-Scraped Parallel Corpora
2011, cited by this paper
United Nations General Assembly Resolutions: A Six-Language Parallel Corpus
2009, cited by this paper
Moses: Open Source Toolkit for Statistical Machine Translation
2007, influential reference
Parallel corpora for medium density languages
2007, cited by this paper
Europarl: A Parallel Corpus for Statistical Machine Translation
2005, cited by this paper
Mining the Web for Bilingual Text
1999, cited by this paper

CITED BY

AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts
2025, cites this paper
Evaluating the LLM and NMT Models in Translating Low-Resourced Languages
2025, cites this paper
Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
2025, cites this paper
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
2025, influential citation
Call for Rigor in Reporting Quality of Instruction Tuning Data
2025, cites this paper
GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task
2025, cites this paper
Multilingual Data Filtering using Synthetic Data from Large Language Models
2025, influential citation
Improving Speech Translation Through Data Augmentation with Data in Similar Languages
2025, cites this paper
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
2024, cites this paper
Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages
2024, cites this paper
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
2024, cites this paper
Incremental and Flexible Extraction of Parallel Corpus from the Web
2024, cites this paper
Improving Machine Translation using Corpus Filtering: A Survey
2023, cites this paper
A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation
2023, cites this paper
Findings of the WMT 2023 Shared Task on Parallel Data Curation
2023, cites this paper
A Sentence Alignment Approach to Document Alignment and Multi-faceted Filtering for Curating Parallel Sentence Pairs from Web-crawled Data
2023, cites this paper
Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation
2023, cites this paper
On-the-Fly Fusion of Large Language Models and Machine Translation
2023, cites this paper
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
2023, cites this paper
There’s No Data like Better Data: Using QE Metrics for MT Data Filtering
2023, cites this paper
Enhancing NLP Model Performance Through Data Filtering
2023, cites this paper
Filtering Matters: Experiments in Filtering Training Sets for Machine Translation
2023, cites this paper
Human evaluation of web-crawled parallel corpora for machine translation
2022, cites this paper
Unsupervised Parallel Sentences of Machine Translation for Asian Language Pairs
2022, cites this paper
Multi-Domain Adaptation in Neural Machine Translation with Dynamic Sampling Strategies
2022, cites this paper
VANT: A Visual Analytics System for Refining Parallel Corpora in Neural Machine Translation
2022, cites this paper
MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases
2022, cites this paper
Bitext Mining for Low-Resource Languages via Contrastive Learning
2022, cites this paper
Bicleaner AI: Bicleaner Goes Neural
2022, cites this paper
Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
2022, cites this paper
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
2022, cites this paper
Improving Khmer-Vietnamese Machine Translation with Data Augmentation methods
2022, cites this paper
Data Cartography for Low-Resource Neural Machine Translation
2022, cites this paper
Detecting Various Types of Noise for Neural Machine Translation
2022, cites this paper
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
2021, cites this paper
I Don’t Need an Expert! Making URL Phishing Features Human Comprehensible
2021, cites this paper
Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey
2021, cites this paper
Data Filtering using Cross-Lingual Word Embeddings
2021, cites this paper
Should we find another model?: Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification
2021, cites this paper
Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC
2021, cites this paper
Mixed Cross Entropy Loss for Neural Machine Translation
2021, influential citation
C3SL at SemEval-2021 Task 1: Predicting Lexical Complexity of Words in Specific Contexts with Sentence Embeddings
2021, cites this paper
Phenomenon-wise Evaluation Dataset Towards Analyzing Robustness of Machine Translation Models
2021, influential citation
Secoco: Self-Correcting Encoding for Neural Machine Translation
2021, cites this paper
Survey of Low-Resource Machine Translation
2021, cites this paper
ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain
2021, cites this paper
XMU Evaluation Report for CCMT2020
2020, cites this paper
Exploring Benefits of Transfer Learning in Neural Machine Translation
2020, cites this paper
The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction
2020, influential citation
Utterance Pair Scoring for Noisy Dialogue Data Filtering
2020, influential citation
Bilingual Text Extraction as Reading Comprehension
2020, cites this paper
Multilingual Unsupervised Sentence Simplification
2020, cites this paper
Parallel Corpus Filtering via Pre-trained Language Models
2020, influential citation
NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain
2020, cites this paper
SEDAR: a Large Scale French-English Financial Domain Parallel Corpus
2020, cites this paper
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
2020, cites this paper
Parallel Sentence Mining by Constrained Decoding
2020, cites this paper
Character Mapping and Ad-hoc Adaptation: Edinburgh’s IWSLT 2020 Open Domain Translation System
2020, cites this paper
Xiaomi’s Submissions for IWSLT 2020 Open Domain Translation Task
2020, cites this paper
Octanove Labs’ Japanese-Chinese Open Domain Translation System
2020, cites this paper
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
2020, cites this paper
TICO-19: the Translation Initiative for Covid-19
2020, cites this paper
Searching the Web for Cross-lingual Parallel Data
2020, cites this paper
Bifixer and Bicleaner: two open-source tools to clean your parallel data
2020, cites this paper
Extracting Parallel Sentences from Nonparallel Corpora Using Parallel Hierarchical Attention Network
2020, cites this paper
Which *BERT? A Survey Organizing Contextualized Encoders
2020, cites this paper
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
2020, cites this paper
A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
2020, influential citation
Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness
2020, cites this paper
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
2020, cites this paper
Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
2020, cites this paper
Beyond English-Centric Multilingual Machine Translation
2020, cites this paper
PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
2020, cites this paper
Extracting correctly aligned segments from unclean parallel data using character n-gram matching
2020, influential citation
Self-Supervised Learning for Pairwise Data Refinement
2020, cites this paper
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
2020, cites this paper
Tohoku-AIP-NTT at WMT 2020 News Translation Task
2020, cites this paper
Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
2020, cites this paper
Dual Conditional Cross Entropy Scores and LASER Similarity Scores for the WMT20 Parallel Corpus Filtering Shared Task
2020, cites this paper
Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment
2020, cites this paper
Selection of In-Domain Bilingual Sentence Pairs Based on Topic Information
2020, cites this paper
Neural Machine Translation
2020, cites this paper
Online Multilingual Neural Machine Translation
2019, cites this paper
Learning a Multi-Domain Curriculum for Neural Machine Translation
2019, cites this paper
Improving Neural Machine Translation by Filtering Synthetic Parallel Data
2019, cites this paper
Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data
2019, cites this paper
Neural Machine Translation: A Review
2019, cites this paper
A Massive Collection of Cross-Lingual Web-Document Pairs
2019, cites this paper
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
2019, cites this paper
Robust Neural Machine Translation for Clean and Noisy Speech Transcripts
2019, cites this paper
MeMAD Deliverable D4.2 Report on Discourse-Aware Machine Translation for Audiovisual Data
2019, cites this paper
Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus
2019, cites this paper
Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach
2019, cites this paper
Supervised and Nonlinear Alignment of Two Embedding Spaces for Dictionary Induction in Low Resourced Languages
2019, cites this paper
Improving Neural Machine Translation Using Noisy Parallel Data through Distillation
2019, cites this paper
Noisy Parallel Corpus Filtering through Projected Word Embeddings
2019, cites this paper
NRC Parallel Corpus Filtering System for WMT 2019
2019, cites this paper
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
2019, cites this paper
Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
2019, cites this paper
uniblock: Scoring and Filtering Corpus with Unicode Block Information
2019, cites this paper