Selecting relevant text subsets from web-data for building topic specific language models

A. Sethy, P. Georgiou, Shrikanth S. Narayanan

Published 2006 in North American Chapter of the Association for Computational Linguistics

ABSTRACT

In this paper we present a scheme to select relevant subsets of sentences from a large generic corpus, such as text acquired from the web. A relative entropy (R.E.) based criterion is used to incrementally select sentences whose distribution matches the domain of interest. Experimental results show that, using just 10% of the data, the proposed subset selection scheme yields significant improvements in both Word Error Rate (WER) and Perplexity (PPL) over models built from the entire web corpus. In addition, incremental data selection enables a significant reduction in the vocabulary size as well as the number of n-grams in the adapted language model. To demonstrate the gains from our method, we provide a comparative analysis with a number of methods proposed in recent language modeling literature for cleaning up text.
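
The abstract does not spell out the selection criterion, so the following is only a minimal sketch of what relative-entropy-driven incremental selection could look like. It assumes a unigram model, add-alpha smoothing, and a single greedy pass that keeps a sentence only if it lowers the KL divergence between the in-domain distribution and the distribution of the text selected so far. The names `kl_divergence` and `select_sentences`, the parameter `alpha`, and the toy sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def kl_divergence(domain_probs, counts, total, vocab_size, alpha=1.0):
    """KL(domain || selected), with add-alpha smoothing on the selected counts."""
    denom = total + alpha * vocab_size
    return sum(
        p * math.log(p / ((counts.get(word, 0) + alpha) / denom))
        for word, p in domain_probs.items()
    )


def select_sentences(domain_sentences, pool_sentences, alpha=1.0):
    """Greedy single pass: keep a pool sentence only if it reduces the KL divergence."""
    # Unigram distribution of the in-domain text (the target to match).
    domain_counts = Counter(w for s in domain_sentences for w in s.split())
    domain_total = sum(domain_counts.values())
    domain_probs = {w: c / domain_total for w, c in domain_counts.items()}

    # Shared vocabulary used for smoothing (an assumption of this sketch).
    vocab = set(domain_counts)
    for sentence in pool_sentences:
        vocab.update(sentence.split())
    vocab_size = len(vocab)

    selected = []
    counts = Counter()
    total = 0
    current = kl_divergence(domain_probs, counts, total, vocab_size, alpha)

    for sentence in pool_sentences:
        words = sentence.split()
        if not words:
            continue
        trial_counts = counts + Counter(words)
        trial_total = total + len(words)
        candidate = kl_divergence(
            domain_probs, trial_counts, trial_total, vocab_size, alpha
        )
        if candidate < current:  # the sentence moves the selection toward the domain
            selected.append(sentence)
            counts, total, current = trial_counts, trial_total, candidate
    return selected


domain = ["book a flight to boston", "flight to denver please"]
pool = [
    "cheap flight to boston",
    "the stock market fell today",
    "please book a flight",
]
print(select_sentences(domain, pool))
```

On this toy input the off-topic sentence is rejected because it only dilutes the probability mass on in-domain words. A greedy pass like this depends on the order of the pool, which is one reason to treat it as an illustration rather than a reproduction of the published method.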

PUBLICATION RECORD

Publication year
2006
Venue
North American Chapter of the Association for Computational Linguistics
Publication date
2006-06-04
Fields of study
Computer Science
Identifiers
DOI 10.3115/1614049.1614086
Source metadata
Semantic Scholar

REFERENCES

Web-data augmented language models for Mandarin conversational speech recognition
2005, cited by this paper
Web-based models for natural language processing
2005, cited by this paper
Rapid language model development using external resources for new spoken dialog domains
2005, cited by this paper
Building topic specific language models from webdata using competitive models
2005, cited by this paper
The Web as a Parallel Corpus
2003, influential reference
Building text classifiers using positive and unlabeled examples
2003, cited by this paper
Transonics: a speech to speech system for English-Persian interactions
2003, cited by this paper
Text Classification from Labeled and Unlabeled Documents using EM
2000, cited by this paper
Text Classification from Labeled and Unlabeled Documents using EM
1998, cited by this paper

CITED BY

Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach
2025, cites this paper
The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development
2024, cites this paper
Semi-Supervised Training and Data Augmentation for Adaptation of Automatic Broadcast News Captioning Systems
2019, cites this paper
Cynical Selection of Language Model Training Data
2017, influential citation
Classification-based spoken text selection for LVCSR language modeling
2017, cites this paper
JU-USAAR: A Domain Adaptive MT System
2016, cites this paper
Domain Control for Neural Machine Translation
2016, cites this paper
Interactive post-editing in machine translation
2016, cites this paper
Bilingual Language Model for English Arabic Technical Translation
2015, cites this paper
Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation
2013, cites this paper
MDI adaptation for the lazy: avoiding normalization in LM adaptation for lecture translation
2012, cites this paper
Topic Adaptation for Lecture Translation through Bilingual Latent Semantic Models
2011, cites this paper
Development of a WFST based Speech Recognition System for a Resource Deficient Language Using Machine Translation
2009, cites this paper
Development of a speech recognition system for Icelandic using machine translated text
2008, cites this paper
Language Model Adaptation Using Machine-Translated Text for Resource-Deficient Languages
2008, cites this paper
Natural Language Dialogue Architectures for Tactical Questioning Characters
2008, cites this paper