Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Ehsan Shareghi, M. Petri, Gholamreza Haffari, Trevor Cohn

Published in 2015 at the Conference on Empirical Methods in Natural Language Processing (EMNLP)

ABSTRACT

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index, a compressed suffix tree, which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and, although slower than leading LM toolkits, it shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
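To make the abstract's key idea concrete, the sketch below shows the interpolated Kneser-Ney bigram probability that the paper's index computes on the fly. This is a toy illustration with explicit count tables (`Counter` objects); the paper instead derives the same quantities (pattern counts, distinct-continuation counts) from compressed suffix tree queries without materializing any n-gram table. All function and variable names here are illustrative, not from the paper.

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Train a toy interpolated Kneser-Ney bigram model from a token list.

    Returns a function prob(word, context) giving P(word | context).
    The paper computes these same counts on demand from a compressed
    suffix tree rather than storing them explicitly.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Continuation count: number of distinct contexts each word follows.
    continuation = Counter(w for (_, w) in bigrams)
    # Number of distinct words observed after each context.
    following = Counter(c for (c, _) in bigrams)
    total_bigram_types = len(bigrams)

    def prob(word, context):
        # Lower-order continuation probability P_cont(word).
        p_cont = continuation[word] / total_bigram_types
        ctx_count = unigrams[context]
        if ctx_count == 0:
            return p_cont  # unseen context: back off entirely
        # Interpolation weight lambda(context) redistributes the
        # discounted mass over the continuation distribution.
        lam = discount * following[context] / ctx_count
        pair = bigrams[(context, word)]
        return max(pair - discount, 0) / ctx_count + lam * p_cont

    return prob
```

For any context whose every occurrence is followed by some token, the discounted mass plus the interpolated continuation mass sums to one over the vocabulary, which is the invariant the on-the-fly computation must preserve.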

PUBLICATION RECORD

  • Publication year

    2015

  • Venue

    Conference on Empirical Methods in Natural Language Processing

  • Publication date

    2015-09-01

  • Fields of study

    Mathematics, Computer Science


  • Source metadata

    Semantic Scholar

