Domain adaptation of statistical machine translation with domain-focused web crawling

Pavel Pecina, Antonio Toral, V. Papavassiliou, Prokopis Prokopidis, Ales Tamchyna, Andy Way, Josef van Genabith

Published 2014 in Language Resources and Evaluation

ABSTRACT

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for the automatic acquisition of monolingual and parallel text and its exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation) and two language pairs (English–French and English–Greek), in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data to adapt the language and translation models. The average observed improvement is substantial: 15.30 BLEU points absolute.
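
The abstract does not say how the general-domain and in-domain language models are combined. One common option, shown here purely as an illustrative assumption and not as the authors' procedure, is linear interpolation with the mixture weight chosen to minimise perplexity on a small in-domain development set. The sketch below uses toy unigram models with add-one smoothing; the data, function names, and grid search are all hypothetical.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(prob, tokens):
    """Perplexity of a token sequence under a word -> probability function."""
    log_sum = sum(math.log(prob(w)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

def interpolate(p_general, p_domain, lam):
    """Linear mixture: lam * in-domain + (1 - lam) * general."""
    return lambda w: lam * p_domain[w] + (1 - lam) * p_general[w]

def best_weight(p_general, p_domain, dev_tokens, steps=20):
    """Grid-search the mixture weight that minimises dev-set perplexity."""
    grid = [i / steps for i in range(steps + 1)]
    return min(
        grid,
        key=lambda lam: perplexity(interpolate(p_general, p_domain, lam), dev_tokens),
    )

# Toy data, invented for illustration only (not from the paper).
general = "the parliament adopted the report on the budget".split()
domain = "the directive limits emissions of greenhouse gases".split()
dev = "the directive limits greenhouse emissions".split()

vocab = set(general) | set(domain) | set(dev)
p_gen = unigram_lm(general, vocab)
p_dom = unigram_lm(domain, vocab)
lam = best_weight(p_gen, p_dom, dev)

ppl_general = perplexity(interpolate(p_gen, p_dom, 0.0), dev)
ppl_mixed = perplexity(interpolate(p_gen, p_dom, lam), dev)
print(f"lambda={lam:.2f} general-only ppl={ppl_general:.2f} mixed ppl={ppl_mixed:.2f}")
```

Because the grid includes a weight of 0 (general model only), the selected mixture can never score worse on the development set than the unadapted model. Real systems would use higher-order n-gram models built with a toolkit such as SRILM instead of these toy unigram tables.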

PUBLICATION RECORD

Publication year
2014
Venue
Language Resources and Evaluation
Publication date
2014-12-03
Fields of study
Medicine, Linguistics, Computer Science
Identifiers
DOI 10.1007/s10579-014-9282-3 PMID 26120290 PMCID 4479164
Source metadata
Semantic Scholar, PubMed
