Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees

Ehsan Shareghi, M. Petri, Gholamreza Haffari, Trevor Cohn

Published in 2015 at the Conference on Empirical Methods in Natural Language Processing (EMNLP)

ABSTRACT

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index, a compressed suffix tree, which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and, although slower than leading LM toolkits, it shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
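To make the abstract's key idea concrete, the sketch below shows the interpolated Kneser-Ney bigram probability that the paper's index computes on the fly. This is a toy illustration with explicit count tables (`Counter` objects); the paper instead derives the same quantities (pattern counts, distinct-continuation counts) from compressed suffix tree queries without materializing any n-gram table. All function and variable names here are illustrative, not from the paper.

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Train a toy interpolated Kneser-Ney bigram model from a token list.

    Returns a function prob(word, context) giving P(word | context).
    The paper computes these same counts on demand from a compressed
    suffix tree rather than storing them explicitly.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Continuation count: number of distinct contexts each word follows.
    continuation = Counter(w for (_, w) in bigrams)
    # Number of distinct words observed after each context.
    following = Counter(c for (c, _) in bigrams)
    total_bigram_types = len(bigrams)

    def prob(word, context):
        # Lower-order continuation probability P_cont(word).
        p_cont = continuation[word] / total_bigram_types
        ctx_count = unigrams[context]
        if ctx_count == 0:
            return p_cont  # unseen context: back off entirely
        # Interpolation weight lambda(context) redistributes the
        # discounted mass over the continuation distribution.
        lam = discount * following[context] / ctx_count
        pair = bigrams[(context, word)]
        return max(pair - discount, 0) / ctx_count + lam * p_cont

    return prob
```

For any context whose every occurrence is followed by some token, the discounted mass plus the interpolated continuation mass sums to one over the vocabulary, which is the invariant the on-the-fly computation must preserve.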

PUBLICATION RECORD

  • Publication year

    2015

  • Venue

    Conference on Empirical Methods in Natural Language Processing

  • Publication date

    2015-09-01

  • Fields of study

    Mathematics, Computer Science


  • Source metadata

    Semantic Scholar

