Universal indexes for highly repetitive document collections

Francisco Claude, A. Fariña, Miguel A. Martínez-Prieto, G. Navarro

Published 2016 in Information Systems

ABSTRACT

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
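
As a rough illustration of the run-length idea mentioned in the abstract (a minimal sketch, not the authors' implementation): when a term occurs in many consecutive versions of a document, the differential form of its inverted list contains long runs of the gap 1, which collapse into a few (gap, run length) pairs. The function names and the sample posting list below are hypothetical, and document ids are assumed to start at 1.

```python
from itertools import groupby


def to_gaps(postings):
    """Differential (gap) encoding of a strictly increasing list of document ids."""
    gaps = []
    previous = 0  # assumption: document ids start at 1
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse consecutive equal gaps into (gap, run_length) pairs."""
    return [(gap, len(list(group))) for gap, group in groupby(gaps)]


def run_length_decode(runs):
    """Rebuild the original posting list from (gap, run_length) pairs."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: a term present in versions 3..9 and 20..24.
postings = [3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24]
gaps = to_gaps(postings)          # [3, 1, 1, 1, 1, 1, 1, 11, 1, 1, 1, 1]
runs = run_length_encode(gaps)    # [(3, 1), (1, 6), (11, 1), (1, 4)]
```

Here twelve postings shrink to four pairs. Plain gap-encoding would still spend at least one code per posting, which is the regularity the abstract says classical techniques fail to exploit.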

PUBLICATION RECORD

Publication year
2016
Venue
Information Systems
Publication date
2016-04-29
Fields of study
Mathematics, Computer Science
Identifiers
DOI 10.1016/j.is.2016.04.002 arXiv 1604.08897
Source metadata
Semantic Scholar

REFERENCES

Practical compressed string dictionaries
2016, cited by this paper
Composite Repetition-Aware Data Structures
2015, cited by this paper
Locally Compressed Suffix Arrays
2015, cited by this paper
SIMD compression and the intersection of sorted integers
2014, cited by this paper
LZ77-Based Self-indexing with Faster Pattern Matching
2014, cited by this paper
Partitioned Elias-Fano indexes
2014, influential reference
Compression, SIMD, and Postings Lists
2014, cited by this paper
Document Retrieval on Repetitive Collections
2014, cited by this paper
On compressing and indexing repetitive sequences
2013, influential reference
Document Listing on Repetitive Collections
2013, cited by this paper
Optimizing top-k document retrieval strategies for block-max indexes
2013, cited by this paper
Faster and smaller inverted indices with treaps
2013, cited by this paper
Document Listing on Versioned Documents
2013, cited by this paper
DACs: Bringing direct access to variable-length codes
2013, cited by this paper
On position restricted substring searching in succinct space
2012, cited by this paper
Word-based self-indexes for natural language text
2012, cited by this paper
Optimizing positional index structures for versioned document collections
2012, cited by this paper
Fast Relative Lempel-Ziv Self-index for Similar Sequences
2012, cited by this paper
Faster top-k document retrieval using block-max indexes
2011, cited by this paper
SIMD-based decoding of posting lists
2011, cited by this paper
Improved Grammar-Based Compressed Indexes
2011, cited by this paper
A Faster Grammar-Based Self-index
2011, cited by this paper
Self-Indexed Grammar-Based Compression
2011, influential reference
Improved index compression techniques for versioned document collections
2010, influential reference
Engineering basic algorithms of an in-memory text search engine
2010, influential reference
Compressed q-Gram Indexing for Highly Repetitive Biological Sequences
2010, influential reference
Storage and Retrieval of Highly Repetitive Sequence Collections
2010, influential reference
Efficient set intersection for inverted indexing
2010, influential reference
Fast integer compression using SIMD instructions
2010, cited by this paper
Index compression using 64-bit words
2010, cited by this paper
Scalable techniques for document identifier assignment in inverted indexes
2010, cited by this paper
Information Retrieval: Implementing and Evaluating Search Engines
2010, cited by this paper
Compact full-text indexing of versioned document collections
2009, cited by this paper
Compressing term positions in web indexes
2009, cited by this paper
An experimental investigation of set intersection algorithms for text searching
2009, influential reference
Inverted index compression and query processing with optimized document ordering
2009, influential reference
Performance of compressed inverted list caching in search engines
2008, influential reference
Practical Rank/Select Queries over Arbitrary Sequences
2008, cited by this paper
Pushdown Compression
2007, cited by this paper
Efficient document retrieval in main memory
2007, cited by this paper
Compressed full-text indexes
2007, cited by this paper
Efficient search in large textual collections with redundancy
2007, cited by this paper
Inverted files for text search engines
2006, cited by this paper
Indexing Shared Content in Information Retrieval Systems
2006, cited by this paper
Super-Scalar RAM-CPU Cache Compression
2006, influential reference
Representing Trees of Higher Degree
2005, cited by this paper
Practical Implementation of Rank and Select Queries
2005, cited by this paper
Super-Scalar Database Compression between RAM and CPU Cache
2005, cited by this paper
Inverted Index Compression Using Word-Aligned Binary Codes
2004, influential reference
A Fast Set Intersection Algorithm for Sorted Sequences
2004, cited by this paper
A fully linear-time approximation algorithm for grammar-based compression
2003, cited by this paper
Inverted file compression through document identifier reassignment
2003, cited by this paper
New text indexing functionalities of the compressed suffix arrays
2003, cited by this paper
Succinct Representations of Permutations
2003, cited by this paper
Searching large text collections
2002, cited by this paper
Adaptive intersection and t-threshold problems
2002, cited by this paper
Application of Lempel-Ziv factorization to the approximation of grammar-based compression
2002, cited by this paper
Adaptive set intersections, unions, and differences
2000, cited by this paper
Fast and flexible word searching on compressed text
2000, cited by this paper
Binary Interpolative Coding for Effective Index Compression
2000, cited by this paper
Compressing Integers for Fast File Access
1999, cited by this paper
Off-line dictionary-based compression
1999, influential reference
Compact pat trees
1998, cited by this paper
Managing gigabytes
1994, cited by this paper
Versioning a full-text information retrieval system
1992, cited by this paper
A universal algorithm for sequential data compression
1977, cited by this paper
The source code control system
1975, cited by this paper
PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric
1968, cited by this paper
The Smallest Grammar Problem (IEEE Transactions on Information Theory)
year unknown, cited by this paper

CITED BY

Onto how compression yields energy-efficient text search
2025, cites this paper
Practical Parallel Block Tree Construction: First Results
2025, cites this paper
Faster Block Tree Construction
2023, cites this paper
Compressing Integer Lists with Contextual Arithmetic Trits
2022, cites this paper
Indexing Highly Repetitive String Collections, Part I
2021, cites this paper
FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns
2021, cites this paper
Big Data Full-Text Search Index Minimization Using Text Summarization
2021, cites this paper
Grammar Index By Induced Suffix Sorting
2021, cites this paper
Grammar-Compressed Indexes with Logarithmic Search Time
2020, cites this paper
Block Tree based Universal Self-Index for Repetitive Text Collections
2020, cites this paper
Indexing Highly Repetitive String Collections, Part II
2020, influential citation
Rpair: Rescaling RePair with Rsync
2019, cites this paper
Inverted Index Compression
2019, cites this paper
Set operations over compressed binary relations
2019, cites this paper
Compressed Indexes for Repetitive Textual Datasets
2019, cites this paper
On the Reproducibility of Experiments of Indexing Repetitive Document Collections
2019, influential citation
Practical Indexing of Repetitive Collections Using Relative Lempel-Ziv
2019, cites this paper
Techniques for Inverted Index Compression
2019, cites this paper
Indexing Highly Repetitive Collections via Grammar Compression
2019, cites this paper
On the Reproducibility of Experiments of Indexing Repetitive Document
2018, cites this paper
Universal Compressed Text Indexing
2018, cites this paper
Hybrid compression of inverted lists for reordered document collections
2018, cites this paper
Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space
2018, influential citation
About BIRDS project (Bioinformatics and Information Retrieval Data Structures Analysis and Design)
2018, cites this paper
Variable-Byte Encoding is Now Space-Efficient Too
2018, cites this paper
Compressed and efficient algorithms and data structures for strings
2018, cites this paper
Computation over Compressed Structured Data
2017, cites this paper
Optimal-Time Text Indexing in BWT-runs Bounded Space
2017, cites this paper
Compressed Computation for Text Indexing
2017, cites this paper
Time-space trade-offs for Lempel-Ziv compressed indexing
2017, cites this paper
LZ-End Parsing in Linear Time
2017, cites this paper
A Self-index on Block Trees
2016, cites this paper
Document retrieval on repetitive string collections
2016, cites this paper