The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

Published 2019 in Conference on Machine Translation

ABSTRACT

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove low-quality (‘bad’) sentences. Then, we produced a score for each sentence by using the filter outputs as features and weighting them with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
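
The two-step strategy in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two features (`length_ratio`, `digits_agree`), the thresholds in `passes_hard_filters`, the hand-set `weights` and `bias`, and the logistic form of the scoring function. In a real system the weights would be learned by a classifier rather than fixed by hand.

```python
import math
import re


def length_ratio(src: str, tgt: str) -> float:
    """Shorter-to-longer token-count ratio, in [0, 1] (hypothetical feature)."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if a and b else 0.0


def digits_agree(src: str, tgt: str) -> float:
    """1.0 when both sides contain the same digit strings (hypothetical feature)."""
    src_digits = sorted(re.findall(r"\d+", src))
    tgt_digits = sorted(re.findall(r"\d+", tgt))
    return float(src_digits == tgt_digits)


def passes_hard_filters(src: str, tgt: str,
                        max_tokens: int = 100,
                        min_ratio: float = 0.33) -> bool:
    """Step 1: discard pairs that fail simple filters (thresholds are assumed)."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if max(n_src, n_tgt) > max_tokens:
        return False
    return length_ratio(src, tgt) >= min_ratio


def score_pair(src: str, tgt: str,
               weights=(2.0, 1.5), bias: float = -1.5) -> float:
    """Step 2: weight the features with a logistic model (weights are made up)."""
    if not passes_hard_filters(src, tgt):
        return 0.0  # filtered-out pairs receive the lowest score
    feats = (length_ratio(src, tgt), digits_agree(src, tgt))
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    pairs = [
        ("I have 2 cats", "Minulla on 2 kissaa"),
        ("I have 2 cats", "Minulla on 3 kissaa"),
        ("Hello", "Tämä on aivan liian pitkä käännös"),
    ]
    for src, tgt in pairs:
        print(f"{score_pair(src, tgt):.3f}\t{src}\t{tgt}")
```

Keeping the hard filters separate from the weighted score means a pair that fails a filter never reaches the classifier, while the surviving pairs can still be ranked against each other.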

PUBLICATION RECORD

Publication year
2019
Venue
Conference on Machine Translation
Publication date
2019-07-29
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.18653/v1/W19-5441
Source metadata
Semantic Scholar

REFERENCES

Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering
2018 (influential reference)
On the Impact of Various Types of Noise on Neural Machine Translation
2018 (cited by this paper)
Efficient Word Alignment with Markov Chain Monte Carlo
2016 (cited by this paper)
Neural Machine Translation of Rare Words with Subword Units
2015 (cited by this paper)
langid.py: An Off-the-shelf Language Identification Tool
2012 (cited by this paper)
Morfessor and variKN machine learning tools for speech and language technology
2007 (cited by this paper)
On Growing and Pruning Kneser–Ney Smoothed $N$-Gram Models
2007 (cited by this paper)
Moses: Open Source Toolkit for Statistical Machine Translation
2007 (cited by this paper)
Estimating the Dimension of a Model
1978 (cited by this paper)
A new look at the statistical model identification
1974 (cited by this paper)

CITED BY

Conditional Unigram Tokenization with Parallel Data
2025 (cites this paper)
Beyond Literal Token Overlap: Token Alignability for Multilinguality
2025 (cites this paper)
Four Approaches to Low-Resource Multilingual NMT: The Helsinki Submission to the AmericasNLP 2023 Shared Task
2023 (influential citation)
Unsupervised Feature Selection for Effective Parallel Corpus Filtering
2023 (cites this paper)
Democratizing Machine Translation with OPUS-MT
2022 (cites this paper)
Democratizing neural machine translation with OPUS-MT
2022 (cites this paper)
Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation
2021 (influential citation)
The Helsinki submission to the AmericasNLP shared task
2021 (influential citation)
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
2020 (influential citation)
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
2020 (cites this paper)
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
2019 (influential citation)