Self-indexing Based on LZ77

Published 2011 in Annual Symposium on Combinatorial Pattern Matching

ABSTRACT

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
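To make the compression claim concrete, here is a minimal sketch of a greedy LZ77-style parse in Python. It only illustrates why highly repetitive text yields few phrases. It is not the self-index described in the abstract, and the function names `lz77_parse` and `lz77_decode` are hypothetical. The quadratic search is chosen for clarity, not speed.

```python
def lz77_parse(text):
    """Greedy LZ77-style parse (illustrative only).

    Each phrase is (source_start, copy_length, next_char): copy `copy_length`
    characters starting at `source_start` in the already-seen text, then
    append `next_char`. source_start is -1 when nothing is copied.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        # Naive search over every earlier start position; the copied region
        # is allowed to overlap the phrase being produced.
        for src in range(i):
            length = 0
            # Stop one character early so a trailing next_char always exists.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            # Reading from `out` while appending handles overlapping copies.
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "abc" * 30
    parsed = lz77_parse(repetitive)
    print(len(repetitive), "characters ->", len(parsed), "phrases")
    assert lz77_decode(parsed) == repetitive
```

On a periodic input such as `"abc" * 30`, almost the whole text collapses into a single long self-overlapping copy. Collections of near-identical documents or genomes compress well under this scheme for the same reason.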

PUBLICATION RECORD

Publication year: 2011
Venue: Annual Symposium on Combinatorial Pattern Matching
Publication date: 2011-01-20
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.1007/978-3-642-21458-5_6; arXiv 1101.4065
Source metadata: Semantic Scholar
