Self-indexing Based on LZ77
Published 2011 in Annual Symposium on Combinatorial Pattern Matching
ABSTRACT
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
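To illustrate why highly repetitive text compresses so well under LZ77, here is a minimal sketch of a greedy LZ77 parse and its decoder. It is not the self-index described in the abstract, only the underlying factorization. The function names `lz77_parse` and `lz77_decode` are hypothetical, and the phrase convention used (longest earlier match, possibly overlapping the current position, followed by one explicit character) is an assumption about the LZ77 variant. The quadratic search is kept naive for readability.

```python
def lz77_parse(text):
    """Greedy LZ77 parse (illustrative, naive search).

    Each phrase is (source_start, copy_length, next_char): copy
    `copy_length` characters starting at `source_start` in the text
    decoded so far, then append `next_char`.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        # Try every earlier start position; the copy may overlap position i.
        for src in range(i):
            length = 0
            # Stop one character early so a trailing explicit symbol remains.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            # Character-by-character copy handles self-overlapping sources.
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "ACGTTGCA" * 20
    parsed = lz77_parse(repetitive)
    print(len(repetitive), "characters ->", len(parsed), "phrases")
    assert lz77_decode(parsed) == repetitive
```

On a repetitive input, the number of phrases grows far more slowly than the text length, which is the source of compressibility that an LZ77-based index aims to exploit.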
PUBLICATION RECORD
- Publication year
2011
- Venue
Annual Symposium on Combinatorial Pattern Matching
- Publication date
2011-01-20
- Fields of study
Mathematics, Computer Science
- Source metadata
Semantic Scholar