Automatic Hierarchical Parallelization of Linear Recurrences

Published 2018 in International Conference on Architectural Support for Programming Languages and Operating Systems

ABSTRACT

Linear recurrences encompass many fundamental computations, including prefix sums and digital filters. Because later results in a recurrence depend on earlier results, computing them in parallel is challenging. We present a new work- and space-efficient algorithm for computing linear recurrences that is amenable to automatic parallelization and suited to hierarchical, massively parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of a memory copy for large inputs, an upper bound that cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.
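Why the dependence on earlier results does not rule out parallelism can be illustrated with a small serial model. The sketch below assumes the recurrence form y[i] = x[i] + a[0]*y[i-1] + ... + a[k-1]*y[i-k] with zero history; under that form a standard prefix sum is a = [1], a second-order prefix sum (a prefix sum applied twice) is a = [2, -1], and a stride-2 tuple-based prefix sum (two interleaved sums) is a = [0, 1]. It splits the input into chunks, solves each chunk independently as if its history were zero, and then fixes each chunk up using correction factors that depend only on the coefficients and the chunk size. This relies only on linearity. It is a hypothetical illustration, not the authors' algorithm or generated CUDA code: the function names `sequential`, `correction_factors`, and `chunked` are invented here, and the carry pass between chunks is a plain sequential loop rather than a hierarchical GPU scheme.

```python
from typing import List, Sequence


def sequential(x: Sequence[int], a: Sequence[int]) -> List[int]:
    """Serial reference: y[i] = x[i] + sum_{j=1..k} a[j-1] * y[i-j], zero history."""
    y: List[int] = []
    for i, xi in enumerate(x):
        acc = xi
        for j, aj in enumerate(a, start=1):
            if i - j >= 0:
                acc += aj * y[i - j]
        y.append(acc)
    return y


def correction_factors(a: Sequence[int], m: int) -> List[List[int]]:
    """factors[r-1][t] is the value at chunk offset t produced by zero input and a
    unit history value y[s-r] = 1 (all other history values 0)."""
    k = len(a)
    factors: List[List[int]] = []
    for r in range(1, k + 1):
        f: List[int] = []
        for t in range(m):
            val = 0
            for j, aj in enumerate(a, start=1):
                idx = t - j
                if idx >= 0:
                    val += aj * f[idx]   # earlier value inside the chunk
                elif idx == -r:
                    val += aj            # the single nonzero history value
            f.append(val)
        factors.append(f)
    return factors


def chunked(x: Sequence[int], a: Sequence[int], m: int) -> List[int]:
    """Two-phase chunked evaluation that relies only on linearity."""
    if m < 1:
        raise ValueError("chunk size m must be >= 1")
    k = len(a)
    chunks = [x[s:s + m] for s in range(0, len(x), m)]
    # Phase 1: solve every chunk as if its history were zero. Chunks are
    # independent here, so this is the part a GPU could run in parallel.
    local = [sequential(c, a) for c in chunks]
    # The factors depend only on the coefficients and the chunk size,
    # not on the input data, so they can be computed once up front.
    factors = correction_factors(a, m)
    # Phase 2: walk the chunks in order, adding each history value times
    # its factor. Within one chunk every element is corrected independently.
    y: List[int] = []
    for z in local:
        s = len(y)  # global index of this chunk's first element
        for t, zt in enumerate(z):
            corr = 0
            for r in range(1, k + 1):
                if s - r >= 0:
                    corr += y[s - r] * factors[r - 1][t]
            y.append(zt + corr)
    return y


print(chunked([1, 2, 3, 4, 5], [1], 2))         # [1, 3, 6, 10, 15]
print(chunked([1] * 5, [2, -1], 2))             # [1, 3, 6, 10, 15]
print(chunked([1, 2, 3, 4, 5, 6], [0, 1], 4))   # [1, 2, 4, 6, 9, 12]
```

Integers keep the comparison with the serial reference exact; the same code runs with float coefficients, as in an IIR filter, up to rounding differences. The point of the design is that only the k most recent values before each chunk have to travel between chunks, and everything else is chunk-local work.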

PUBLICATION RECORD

Publication year: 2018
Venue: International Conference on Architectural Support for Programming Languages and Operating Systems
Publication date: 2018-03-19
Fields of study: Computer Science
Identifiers: DOI 10.1145/3173162.3173168
Source metadata: Semantic Scholar

REFERENCES

Single-pass Parallel Prefix Scan with Decoupled Lookback (2016), cited by this paper
Higher-order and tuple-based massively-parallel prefix sums (2016), influential reference
Compiling high performance recursive filters (2015), cited by this paper
A Performance Comparison of Sort and Scan Libraries for GPUs (2015), cited by this paper
Discrete-time signal processing: Alan V. Oppenheim, 3rd edition (2011), cited by this paper
Efficient Parallel Scan Algorithms for GPUs (2011), cited by this paper
GPU-efficient recursive filtering and summed-area tables (2011), cited by this paper
Fast scan algorithms on graphics processors (2008), cited by this paper
Scan primitives for GPU computing (2007), cited by this paper
A Work-Efficient Step-Efficient Prefix Sum Algorithm (2006), cited by this paper
Fast Summed-Area Table Generation and its Applications (2005), cited by this paper
Discrete Time Signal Processing (2004), cited by this paper
Digital Signal Processing: A Practical Guide for Engineers and Scientists (2002), influential reference
StreamIt: A Language for Streaming Applications (2002), cited by this paper
Prefix sums and their applications (1990), influential reference
Scans as Primitive Parallel Operations (1989), cited by this paper
Efficient multi-processor implementation of recursive digital filters (1986), cited by this paper
Data parallel algorithms (1986), cited by this paper
An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations (1973), cited by this paper
A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations (1973), cited by this paper
The Organization of Computations for Uniform Recurrence Equations (1967), cited by this paper

CITED BY

Inductive Loop Analysis for Practical HPC Application Optimization (2025), cites this paper
A Memory-Efficient and Computation-Balanced Lossy Compressor on Wafer-Scale Engine (2025), cites this paper
GPU Lossy Compression for HPC Can Be Versatile and Ultra-Fast (2025), cites this paper
CUSZP2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio (2024), cites this paper
cuSZp: An Ultra-Fast GPU Error-Bounded Lossy Compression Framework with Optimized End-to-End Performance (2023), cites this paper
Parallel Tiled Code for Computing General Linear Recurrence Equations (2021), cites this paper
GPU efficient 1D and 3D recursive filtering (2021), cites this paper
BiPart: a parallel and deterministic hypergraph partitioner (2020), cites this paper
An efficient parallel strategy for high-cost prefix operation (2020), cites this paper
Scaling out speculative execution of finite-state machines with parallel merge (2020), cites this paper
Enabling prefix sum parallelism pattern for recurrences with principled function reconstruction (2019), cites this paper