Scheduling irregular parallel computations on hierarchical caches

G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, H. Simhadri

Published 2011 in ACM Symposium on Parallelism in Algorithms and Architectures

ABSTRACT

For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds in both models. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors; that is, the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
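
As a rough illustration of the work-efficiency statement in the abstract, the Python sketch below computes the balanced-runtime estimate: the total cost of cache misses summed over cache levels, divided by the number of processors. The function name, the per-level miss counts, and the per-miss costs are hypothetical values chosen for illustration; none of them come from the paper, and the sketch ignores the constant factor and the assumptions under which the bound holds.

```python
def balanced_runtime_estimate(misses_per_level, cost_per_level, num_processors):
    """Estimate runtime as (total cache-miss cost) / (number of processors).

    misses_per_level[i] is an assumed miss count at cache level i, and
    cost_per_level[i] is an assumed cost of one miss at that level.
    Both are illustrative inputs, not quantities defined by the paper.
    """
    if len(misses_per_level) != len(cost_per_level):
        raise ValueError("need one cost per cache level")
    if num_processors <= 0:
        raise ValueError("num_processors must be positive")
    # Total cost: each level's miss count weighted by that level's miss cost.
    total_cost = sum(q * c for q, c in zip(misses_per_level, cost_per_level))
    # Even balancing across processors gives the estimate (up to a constant).
    return total_cost / num_processors


# Hypothetical three-level hierarchy on 8 processors.
misses = [1_000_000, 50_000, 2_000]
costs = [1, 10, 100]
estimate = balanced_runtime_estimate(misses, costs, 8)
print(estimate)
```

With these made-up inputs the total miss cost is 1,700,000, so the estimate is 212,500 time units per processor.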

PUBLICATION RECORD

Publication year: 2011
Venue: ACM Symposium on Parallelism in Algorithms and Architectures
Publication date: 2011-06-04
Fields of study: Computer Science
DOI: 10.1145/1989493.1989553
Source metadata: Semantic Scholar

CITED BY

Teaching Parallel Algorithms Using the Binary-Forking Model
2024cites this paper
Automatic Parallelism Management
2024cites this paper
Cache-Oblivious Parallel Convex Hull in the Binary Forking Model
2023cites this paper
Provably Fast and Space-Efficient Parallel Biconnectivity
2023cites this paper
Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism
2023cites this paper
High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems
2023cites this paper
Many Sequential Iterative Algorithms Can Be Parallel and Work-efficient
2022cites this paper
A Work-Efficient Parallel Algorithm for Longest Increasing Subsequence
2022cites this paper
Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing
2022influential citation
Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient
2022cites this paper
Dynamic Boolean Formula Evaluation
2021cites this paper
Processor-Aware Cache-Oblivious Algorithms
2021cites this paper
Low-Span Parallel Algorithms for the Binary-Forking Model
2021cites this paper
Responsive Parallel Computation
2021cites this paper
Provably space-efficient parallel functional programming
2021cites this paper
Open problems in queueing theory inspired by datacenter computing
2021cites this paper
Parallel In-Place Algorithms: Theory and Practice
2021cites this paper
Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths
2021cites this paper
Modernizing Models and Management of the Memory Hierarchy for Non-Volatile Memory
2021cites this paper
Analysis of Work-Stealing and Parallel Cache Complexity
2021cites this paper
Data Oblivious Algorithms for Multicores
2020cites this paper
Low-Depth Parallel Algorithms for the Binary-Forking Model without Atomics
2020cites this paper
Balanced Partitioning of Several Cache-Oblivious Algorithms
2020cites this paper
Disentanglement in nested-parallel programs
2019cites this paper
Fairness in responsive parallelism
2019cites this paper
Optimal Parallel Algorithms in the Binary-Forking Model
2019cites this paper
Hierarchical memory management for mutable state
2018cites this paper
Heartbeat scheduling: provable efficiency for nested parallelism
2018cites this paper
Theory and Engineering of Scheduling Parallel Jobs
2018cites this paper
A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle
2018cites this paper
Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms
2018cites this paper
Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms
2017cites this paper
Shared-Memory Parallelism Can be Simple, Fast, and Scalable
2017influential citation
Work-Stealing for Multi-socket Architecture
2017cites this paper
Provably Efficient Scheduling of Dynamically Allocating Programs on Parallel Cache Hierarchies
2017cites this paper
Responsive parallel computation: bridging competitive and cooperative threading
2017cites this paper
Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
2017cites this paper
Remote Memory References at Block Granularity
2017cites this paper
Analysis of classic algorithms on highly-threaded many-core architectures
2017cites this paper
Laying Tiles Ornamentally: An approach to structuring container traversals
2016cites this paper
Experimental Analysis of Space-Bounded Schedulers
2016cites this paper
ICE: A General and Validated Energy Complexity Model for Multithreaded Algorithms
2016cites this paper
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
2016cites this paper
Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing
2016cites this paper
Latency-Hiding Work Stealing
2016cites this paper
Hierarchical memory management for parallel programs
2016cites this paper
Algorithmic λ-Calculus for the Design, Analysis, and Implementation of Parallel Algorithms
2016cites this paper
Space-Bounded Async Scheduling: A UPC++ Extension
2016influential citation
Automatic Discovery of Efficient Divide-&-Conquer Algorithms for Dynamic Programming Problems
2016cites this paper
Construcción paralela de estructuras de datos sucintas
2016cites this paper
Multicore triangle computations without tuning
2015cites this paper
Two-Level Main Memory Co-Design: Multi-threaded Algorithmic Primitives, Analysis, and Simulation
2015cites this paper
Coupling Memory and Computation for Locality Management
2015cites this paper
Using Symmetry to Schedule Classical Matrix Multiplication
2015influential citation
Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency
2015cites this paper
Big data: Scale down, scale up, scale out
2015cites this paper
Experimental analysis of space-bounded schedulers
2014cites this paper
Modeling Algorithm Performance on Highly-threaded Many-core Architectures
2014cites this paper
Measurement of the latency parameters of the Multi-BSP model: a multicore benchmarking approach
2014cites this paper
Program-Centric Cost Models for Locality and Parallelism
2013influential citation
Simple, Fast and Scalable Parallel Algorithms for Shared Memory (Thesis Proposal)
2013cites this paper
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
2013influential citation
Survey of the sequential and parallel models of computation: Technical report LUSY-2012/02
2013cites this paper
Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures
2013cites this paper
Betweenness centrality: algorithms and implementations
2013cites this paper
Ligra: a lightweight graph processing framework for shared memory
2013cites this paper
Parallel triangle counting in massive streaming graphs
2013cites this paper
Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures
2013cites this paper
Program-centric cost models for locality
2013cites this paper
Design and implementation of a customizable work stealing scheduler
2013cites this paper
A Bridging Model for Branch-and-Bound Algorithms on Multi-core Architectures
2012cites this paper
CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures
2012cites this paper
Parallel and I/O efficient set covering algorithms
2012cites this paper
Efficient Resource Oblivious Algorithms for Multicores with False Sharing
2012cites this paper
A Memory Access Model for Highly-threaded Many-core Architectures
2012cites this paper
Techniques for Parallel Memory Hierarchies
2011cites this paper
Resource Oblivious Sorting on Multicores
2010cites this paper
Oblivious algorithms for multicores and network of processors
2010cites this paper
Heartbeat scheduling: provable efficiency for nested parallelism
year unknowncites this paper
Oracle-guided Scheduling for Controlling Granularity in Implicitly Parallel Languages
year unknowncites this paper