Storyboard: Optimizing Precomputed Summaries for Aggregation

Published 2020 in arXiv.org

ABSTRACT

An emerging class of data systems partitions its data and precomputes approximate summaries (i.e., sketches and samples) for each segment to reduce query costs. These systems can then aggregate and combine the segment summaries to estimate results without scanning the raw data. However, given limited storage space, each summary introduces approximation errors that affect query accuracy. For instance, systems that use existing mergeable summaries cannot reduce query error below the error of an individual precomputed summary. We introduce Storyboard, a query system that optimizes item frequency and quantile summaries for accuracy when aggregating over multiple segments. Compared to conventional mergeable summaries, Storyboard leverages additional memory available for summary construction and aggregation to derive a more precise combined result. Compared to standard summarization methods, this reduces error by up to 25x over interval aggregations and 4.4x over data cube aggregations on industrial datasets, with provable worst-case error guarantees.
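The setting described in the abstract, per-segment precomputed summaries that are combined at query time, can be illustrated with a toy item frequency example. This is only a sketch of the general problem and not Storyboard's algorithm: the truncation-based summary, the function names `build_summary` and `estimate_count`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

def build_summary(segment, k):
    """Hypothetical lossy summary: keep only the k most frequent items of one segment."""
    return dict(Counter(segment).most_common(k))

def estimate_count(summaries, item):
    """Estimate an item's total count by summing its counts across segment summaries."""
    return sum(summary.get(item, 0) for summary in summaries)

# Three data segments (made-up data), each summarized independently ahead of time.
segments = [
    ["a", "a", "a", "b", "c"],
    ["a", "b", "b", "b", "c"],
    ["a", "c", "c", "c", "b"],
]
summaries = [build_summary(seg, k=1) for seg in segments]

# Query over all segments: compare the summary-based estimate with the exact count.
exact = Counter(x for seg in segments for x in seg)
estimate_a = estimate_count(summaries, "a")
error_a = exact["a"] - estimate_a
```

In this toy data, item "a" survives truncation in only one of the three segments, so the combined estimate misses the occurrences dropped by the other two summaries. The example only shows how per-segment approximation error carries into an aggregate over several segments, which is the kind of error the abstract says Storyboard is designed to reduce.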

PUBLICATION RECORD

Publication year: 2020
Venue: arXiv.org
Publication date: 2020-02-08
Fields of study: Computer Science
Identifiers: arXiv 2002.03063
Source metadata: Semantic Scholar

