BlinkDB: queries with bounded errors and bounded response times on very large data

Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, S. Madden, Ion Stoica

Published 2012 in European Conference on Computer Systems

ABSTRACT

In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
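The two ideas in the abstract (answering from a sample with error bars, and choosing a sample size that meets an accuracy requirement) can be illustrated with a small sketch. This is not BlinkDB's implementation: it uses plain uniform random samples rather than stratified ones, a normal-approximation confidence interval, and hypothetical helper names (`approx_mean`, `pick_sample`) that do not come from the paper.

```python
import math
import random
import statistics


def approx_mean(sample, z=1.96):
    """Return (estimate, half_width) for the mean of `sample`.

    The half-width is a normal-approximation error bar (z=1.96 is roughly
    95% confidence); this is an illustrative assumption, not the paper's
    error estimation procedure.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * stderr


def pick_sample(samples, max_rel_error):
    """Pick the smallest pre-built sample whose relative error meets the bound.

    `samples` is a list of lists ordered from smallest to largest
    (a hypothetical layout). Falls back to the largest sample if none
    satisfies the bound.
    """
    for sample in samples:
        mean, half_width = approx_mean(sample)
        if mean != 0 and half_width / abs(mean) <= max_rel_error:
            return sample, mean, half_width
    mean, half_width = approx_mean(samples[-1])
    return samples[-1], mean, half_width


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "table" column: 100,000 values around 100 with spread 20.
    data = [rng.gauss(100.0, 20.0) for _ in range(100_000)]
    # Pre-built uniform samples of increasing size.
    samples = [rng.sample(data, k) for k in (100, 1_000, 10_000)]
    chosen, mean, half_width = pick_sample(samples, max_rel_error=0.01)
    print(f"used {len(chosen)} rows: mean = {mean:.2f} +/- {half_width:.2f}")
```

A tighter error bound pushes the selection toward larger samples and therefore longer scans, which is the accuracy versus response time trade-off the abstract describes.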

PUBLICATION RECORD

Publication year
2012
Venue
European Conference on Computer Systems
Publication date
2012-03-25
Fields of study
Computer Science
Identifiers
DOI 10.1145/2465351.2465355 arXiv 1203.5485
Source metadata
Semantic Scholar

REFERENCES

MapReduce
2020influential reference
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
2012influential reference
Reoptimizing Data Parallel Computing
2012cited by this paper
Recurring job optimization in scope
2012cited by this paper
Shark: fast data analysis using coarse-grained distributed memory
2012cited by this paper
SciBORQ: Scientific data management with Bounds On Runtime and Quality
2011cited by this paper
Online aggregation for large MapReduce jobs
2011influential reference
Optimal Random Sampling from Distributed Streams Revisited
2011cited by this paper
Spark: Cluster Computing with Working Sets
2010influential reference
Reining in the Outliers in Map-Reduce Clusters using Mantri
2010cited by this paper
A comparison of join algorithms for log processing in MapReduce
2010cited by this paper
Interactive Analysis of Web-Scale Data
2009cited by this paper
Hive - A Warehousing Solution Over a Map-Reduce Framework
2009cited by this paper
Et al
2008cited by this paper
Improving MapReduce Performance in Heterogeneous Environments
2008cited by this paper
Optimized stratified sampling for approximate query processing
2007influential reference
Scalable approximate query processing with the DBO engine
2007cited by this paper
Data management projects at Google
2006cited by this paper
Window-aware load shedding for aggregation queries over data streams
2006cited by this paper
Dynamic sample selection for approximate query processing
2003cited by this paper
Approximate Query Processing: Taming the TeraBytes
2001cited by this paper
Sampling: Design and Analysis
2000influential reference
Informix under CONTROL: Online Query Processing
2000cited by this paper
Congressional samples for approximate answering of group-by queries
2000cited by this paper
PROMISE: Predicting Query Behavior to Enable Predictive Caching Strategies for OLAP Systems
2000cited by this paper
The Aqua approximate query answering system
1999cited by this paper
Join synopses for approximate query answering
1999influential reference
Ripple joins for online aggregation
1999cited by this paper
Approximate computation of multidimensional aggregates of sparse data using wavelets
1999cited by this paper
On random sampling over joins
1999cited by this paper
Optimization in operations research
1997cited by this paper
Online aggregation
1997influential reference

CITED BY

Synopsis-Alloyed Index for Exact and Approximate Query Processing
2026cites this paper
AHA: Scalable Alternative History Analysis for Operational Timeseries Applications
2026cites this paper
GRELA: Exploiting graph representation learning in effective approximate query processing
2025cites this paper
FAAQP: Fast and Accurate Approximate Query Processing based on Bitmap-augmented Sum-Product Network
2025cites this paper
Federated Approximate Query Processing Based on Deep Models
2025cites this paper
On Efficient Approximate Aggregate Nearest Neighbor Queries over Learned Representations
2025cites this paper
LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes
2025cites this paper
Hierarchical and Efficient Synopsis Construction for Bounded Approximate Query Processing
2025cites this paper
IDAT: An Interactive Data Exploration Tool
2025influential citation
Managing Data for Scalable and Interactive Event Sequence Visualization
2025cites this paper
DLAQP: A Data Lake-Based Healthcare Approximate Query Processing Framework
2025cites this paper
ConANN: Conformal Approximate Nearest Neighbor Search
2025cites this paper
SpareLLM: Automatically Selecting Task-Specific Minimum-Cost Large Language Models under Equivalence Constraint
2025cites this paper
Perception-aware Sampling for Scatterplot Visualizations
2025cites this paper
Heuristics for Energy-Efficient Instruction-Level Approximate Computing
2025cites this paper
PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees
2025cites this paper
Visualization-Oriented Progressive Time Series Transformation
2025cites this paper
Finding comparison insights in multidimensional datasets
2025cites this paper
A Decade of Systems for Human Data Interaction
2025cites this paper
Mayura: Exploiting Similarities in Motifs for Temporal Co-Mining
2025cites this paper
Adaptive Indexing for Approximate Query Processing in Exploratory Data Analysis
2025cites this paper
Niyama: Breaking the Silos of LLM Inference Serving
2025cites this paper
Approximation-First Timeseries Monitoring Query At Scale
2025cites this paper
Visualizing Big Data For Enhanced Exploration And Analysis
2025cites this paper
Physical Visualization Design: Decoupling Interface and System Design
2025cites this paper
SynopsisLake: Quality-aware Approximate Spatial Query Processing Using Data Synopses
2025cites this paper
GenIE - Simulator-Driven Iterative Data Exploration for Scientific Discovery
2025cites this paper
Holistic query Approximation via RL Modeling
2025cites this paper
Hippo: Accelerating Transaction Processing for Approximate Query Processing Engine with Sampling Semantics
2024cites this paper
Computing A Well-Representative Summary of Conjunctive Query Results
2024cites this paper
Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality
2024cites this paper
Towards Establishing Guaranteed Error for Learned Database Operations
2024cites this paper
Constrained Approximate Query Processing with Error and Response Time-Bound Guarantees for Efficient Big Data Analytics
2024cites this paper
Guaranteeing an Exact Error Bound for Bounded Approximate Query Processing
2024cites this paper
ThalamusDB: Approximate Query Processing on Multi-Modal Data
2024cites this paper
Private Approximate Query over Horizontal Data Federation
2024cites this paper
GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables
2024cites this paper
Technical Perspective: Efficient and Reusable Lazy Sampling
2024cites this paper
PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation
2024cites this paper
Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration
2024influential citation
Analysis of Parallel Optimisation Strategies Based on MapReduce Models
2024cites this paper
HeavyCache: A Generic Sketch for Summarizing Data Streams
2024cites this paper
Interactive visual query of density maps on latent space via flow‐based models
2024cites this paper
PairwiseHist: Fast, Accurate, and Space-Efficient Approximate Query Processing with Data Compression
2024cites this paper
Learning Approximation Sets for Exploratory Queries
2024cites this paper
The Moments Method for Approximate Data Cube Queries
2024cites this paper
Generalized Measure-Biased Sampling and Priority Sampling
2024cites this paper
Kondo: Efficient Provenance-Driven Data Debloating
2024cites this paper
CheetahTraj: Efficient Visualization for Large Trajectory Data
2024cites this paper
A blockchain datastore for scalable IoT workloads using data decaying
2024cites this paper
Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
2024influential citation
Efficient and Reusable Lazy Sampling
2024influential citation
DiApprox: Differential Privacy-based Online Range Queries Approximation for Multidimensional Data
2024cites this paper
Bouncer: Admission Control with Response Time Objectives for Low-latency Online Data Systems
2024cites this paper
Learned Optimizer for Online Approximate Query Processing in Data Exploration
2024influential citation
The Data Lakehouse: Data Warehousing and More
2023cites this paper
Turbo: Effective Caching in Differentially-Private Databases
2023cites this paper
Anser: Adaptive Information Sharing Framework of AnalyticDB
2023cites this paper
π-means: Granular Approach towards Interactive Data Exploration
2023cites this paper
Practical Dynamic Extension for Sampling Indexes
2023cites this paper
RALF: Accuracy-Aware Scheduling for Feature Store Maintenance
2023cites this paper
MOST: Model-Based Compression with Outlier Storage for Time Series Data
2023cites this paper
Identification of Individual Hanwoo Cattle by Muzzle Pattern Images through Deep Learning
2023cites this paper
the Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation
2023influential citation
Viper: Interactive Exploration of Large Satellite Data
2023cites this paper
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
2023cites this paper
Rotary: A Resource Arbitration Framework for Progressive Iterative Analytics
2023cites this paper
Efficient Dynamic Weighted Set Sampling and Its Extension
2023cites this paper
LAQy: Efficient and Reusable Query Approximations via Lazy Sampling
2023cites this paper
On Join Sampling and the Hardness of Combinatorial Output-Sensitive Join Algorithms
2023cites this paper
SEIDEN: Revisiting Query Processing in Video Database Systems
2023cites this paper
A Step Toward Deep Online Aggregation
2023influential citation
Arya: Arbitrary Graph Pattern Mining with Decomposition-based Sampling
2023cites this paper
Secure Sampling for Approximate Multi-party Query Processing
2023cites this paper
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
2023cites this paper
An Experimental Analysis of Quantile Sketches over Data Streams
2023cites this paper
Streaming Weighted Sampling over Join Queries
2023cites this paper
Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation
2023cites this paper
ShadowAQP: Efficient Approximate Group-by and Join Query via Attribute-oriented Sample Size Allocation and Data Generation
2023cites this paper
Supporting Complex Query Time Enrichment For Analytics
2023influential citation
Tinycubes: A modular technology for interactive visual analysis of historical and continuously updated spatiotemporal data
2023cites this paper
Using machine learning to create and capture value in the business models of small and medium-sized enterprises
2023cites this paper
Tailorable Sampling for Progressive Visual Analytics
2023cites this paper
Towards Optimizing Storage Costs on the Cloud
2023cites this paper
SynopsisDB: Distributed Synopsis-based Data Processing System
2023influential citation
Tuple Bubbles: Learned Tuple Representations for Tunable Approximate Query Processing
2023cites this paper
Confidence Intervals for Private Query Processing
2023cites this paper
Active Sampling for Sparse Table by Bayesian Optimization with Adaptive Resolution
2023cites this paper
Query Optimization for Inference-Based Graph Databases
2023cites this paper
ORB: Empowering Graph Queries through Inference
2023cites this paper
Weighted Random Sampling over Joins
2022cites this paper
The Subnetwork Investigation of Scale-Free Networks Based on the Self-Similarity
2022cites this paper
Towards Observability for Machine Learning Pipelines
2022cites this paper
Big Data Technologies and Management
2022cites this paper
Electra: Conditional Generative Model based Predicate-Aware Query Approximation
2022cites this paper
Resource-aware adaptive indexing for in situ visual exploration and analytics
2022cites this paper
Multi-Query Optimization Revisited: A Full-Query Algebraic Method
2022cites this paper
Workload Prediction for Adaptive Approximate Query Processing
2022cites this paper
Exploiting Machine Learning Models for Approximate Query Processing
2022cites this paper
High-dimensional Data Cubes
2022cites this paper