Column-Oriented Storage Techniques for MapReduce

Avrilia Floratou, J. Patel, E. Shekita, Sandeep Tata

Published 2011 in Proceedings of the VLDB Endowment

ABSTRACT

Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
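The idea behind the skip list column format and lazy record construction can be illustrated with a minimal sketch. This is not the paper's on-disk format or its Hadoop implementation: the block size `SKIP_EVERY`, the helper names (`encode_column`, `ColumnReader`, `LazyRecord`, `run_map`), and the toy data are all assumptions made for illustration. Each column is stored separately as length-prefixed values, with a skip pointer in front of every block, so a reader can jump over rows that the map function never touches.

```python
import struct

SKIP_EVERY = 4  # hypothetical block size: one skip pointer per 4 values


def encode_column(values):
    """Encode strings as length-prefixed values. Each block of SKIP_EVERY
    values is preceded by a skip pointer holding the block's byte length."""
    out = bytearray()
    for start in range(0, len(values), SKIP_EVERY):
        block = bytearray()
        for v in values[start:start + SKIP_EVERY]:
            raw = v.encode("utf-8")
            block += struct.pack(">I", len(raw)) + raw
        out += struct.pack(">I", len(block)) + block
    return bytes(out)


class ColumnReader:
    """Forward-only reader for one column that skips unread rows."""

    def __init__(self, data):
        self.data = data
        self.block_pos = 0  # byte offset of the current block's skip pointer
        self.block_row = 0  # row number of the first value in that block
        self.decoded = 0    # number of values actually deserialized

    def get(self, row):
        if row < self.block_row:
            raise ValueError("reader is forward-only")
        # Jump over whole blocks using the skip pointers.
        while row >= self.block_row + SKIP_EVERY:
            (block_len,) = struct.unpack_from(">I", self.data, self.block_pos)
            self.block_pos += 4 + block_len
            self.block_row += SKIP_EVERY
        # Inside the block, hop over earlier values by length only.
        pos = self.block_pos + 4
        for _ in range(row - self.block_row):
            (n,) = struct.unpack_from(">I", self.data, pos)
            pos += 4 + n
        (n,) = struct.unpack_from(">I", self.data, pos)
        self.decoded += 1
        return self.data[pos + 4:pos + 4 + n].decode("utf-8")


class LazyRecord:
    """A record whose fields are deserialized only when they are accessed."""

    def __init__(self, readers, row):
        self._readers = readers
        self._row = row

    def get(self, name):
        return self._readers[name].get(self._row)


def run_map(columns, num_rows):
    """Toy map phase: return the body of every row whose url ends in .pdf."""
    readers = {name: ColumnReader(encode_column(vals))
               for name, vals in columns.items()}
    matches = []
    for row in range(num_rows):
        rec = LazyRecord(readers, row)
        if rec.get("url").endswith(".pdf"):
            matches.append(rec.get("body"))  # body is decoded only here
    return matches, readers["body"].decoded


# Toy data: 10 rows, of which only rows 2 and 9 are PDFs.
columns = {
    "url": [f"http://host/{i}" + (".pdf" if i in (2, 9) else ".html")
            for i in range(10)],
    "body": [f"body{i}" for i in range(10)],
}
matches, body_decoded = run_map(columns, 10)
print(matches, body_decoded)  # prints ['body2', 'body9'] 2
```

In this sketch the `url` column is read for every row, but only two of the ten `body` values are ever deserialized; the rest are passed over by following skip pointers and length prefixes. The saving grows with the cost of deserializing the skipped column, which is why the approach matters most for complex types.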

PUBLICATION RECORD

Publication year: 2011
Venue: Proceedings of the VLDB Endowment
Publication date: 2011-04-01
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.14778/1988776.1988778, arXiv 1105.4252
Source metadata: Semantic Scholar
