Intersecting Multilingual Data for Faster and Better Statistical Translations

Published 2009 in North American Chapter of the Association for Computational Linguistics

ABSTRACT

In current phrase-based SMT systems, more training data is generally better than less. However, a larger data set produces a larger model, which enlarges the search space for the translation problem and consequently requires more time and resources to translate. We argue that redundant information in an SMT system may not only slow down computation but also degrade output quality. This paper proposes an approach that reduces model size by filtering out the less probable entries based on compatible data in an intermediate language, a novel use of triangulation, without sacrificing translation quality. Comprehensive experiments were conducted on standard data sets. We achieved significant quality improvements (up to 2.3 Bleu points) while translating with reduced models. In addition, we demonstrate a straightforward combination method for more progressive filtering. The model size can be reduced by up to 94% while translation quality is preserved.
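As a rough illustration of the filtering idea in the abstract, the sketch below keeps a source-target phrase pair only when both sides share at least one phrase in an intermediate (bridge) language. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy phrase tables, and the plain set-intersection criterion are all hypothetical, and the paper's actual compatibility test and scoring may differ.

```python
from collections import defaultdict


def bridge_index(pairs):
    """Map each phrase to the set of bridge-language phrases it is paired with."""
    index = defaultdict(set)
    for phrase, bridge in pairs:
        index[phrase].add(bridge)
    return index


def filter_by_bridge(src_tgt, src_bridge, tgt_bridge):
    """Keep only source-target entries supported by a shared bridge phrase.

    Assumption: "compatible" is modelled here as a non-empty intersection
    between the bridge phrases of the source side and of the target side.
    """
    src_idx = bridge_index(src_bridge)
    tgt_idx = bridge_index(tgt_bridge)
    kept = {}
    for (src, tgt), score in src_tgt.items():
        # .get avoids creating empty entries in the defaultdict.
        if src_idx.get(src, set()) & tgt_idx.get(tgt, set()):
            kept[(src, tgt)] = score
    return kept


# Hypothetical toy tables: French-English, with Spanish as the bridge language.
src_tgt = {
    ("maison", "house"): 0.7,
    ("maison", "home"): 0.2,
    ("maison", "the"): 0.1,  # noisy entry with no bridge support
}
src_bridge = [("maison", "casa")]
tgt_bridge = [("house", "casa"), ("home", "casa"), ("the", "el")]

reduced = filter_by_bridge(src_tgt, src_bridge, tgt_bridge)
reduction = 1 - len(reduced) / len(src_tgt)
```

In this toy case the unsupported pair ("maison", "the") is dropped while the two supported pairs keep their original scores, so the table shrinks without any surviving entry being rescored.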

PUBLICATION RECORD

Publication year
2009
Venue
North American Chapter of the Association for Computational Linguistics
Publication date
2009-05-31
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.3115/1620754.1620773
Source metadata
Semantic Scholar

REFERENCES

Improving Statistical Machine Translation Efficiency by Triangulation
2008, cited by this paper
Moses: Open Source Toolkit for Statistical Machine Translation
2007, cited by this paper
Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora
2007, cited by this paper
Improving Translation Quality by Discarding Most of the Phrasetable
2007, cited by this paper
Improving Word Alignment with Bridge Languages
2007, cited by this paper
Europarl: A Parallel Corpus for Statistical Machine Translation
2005, cited by this paper
Leveraging multiple languages to improve statistical MT word alignments
2005, cited by this paper
The Proper Place of Men and Machines in Language Translation
2004, cited by this paper
Noun phrase translation
2003, cited by this paper
The Proper Place of Men and Machines in Language Translation
2003, cited by this paper
Minimum Error Rate Training in Statistical Machine Translation
2003, influential reference
Bleu: a Method for Automatic Evaluation of Machine Translation
2002, influential reference
Statistical multi-source translation
2001, cited by this paper
Text-Translation Alignment: Three Languages Are Better Than Two
1999, cited by this paper

CITED BY

English – Igala Parallel Corpora for Natural Language Processing Applications
2017, cites this paper
Phrase Table Pruning via Submodular Function Maximization
2016, cites this paper
The 54th Annual Meeting of the Association for Computational Linguistics
2016, cites this paper
An overview of the European Union’s highly multilingual parallel corpora
2014, cites this paper
A Relationship: Word Alignment, Phrase Table, and Translation Quality
2014, cites this paper
An Investigation of the Sampling-Based Alignment Method and Its Contributions
2013, cites this paper
A Systematic Comparison of Phrase Table Pruning Techniques
2012, cites this paper
Improving SMT by Using Parallel Data of a Closely Related Language
2012, cites this paper
Conditional Significance Pruning: Discarding More of Huge Phrase Tables
2012, cites this paper
Learning to Simplify Sentences Using Wikipedia
2011, cites this paper
Mining Parallel Data from Comparable Corpora via Triangulation
2011, cites this paper
MultiUN: A Multilingual Corpus from United Nation Documents
2010, cites this paper
Phrase table pruning for Statistical Machine Translation
2010, cites this paper
Hitting the Right Paraphrases in Good Time
2010, cites this paper