A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Peng Chen, Mohamed Wahib, Shin'ichiro Takizawa, Ryousei Takano, S. Matsuoka

Published 2019 in International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model in which the computation shifts partial sums between threads using CUDA warp primitives. We also employ register files as a cache resource so that the entire model operates efficiently. We demonstrate the effectiveness and versatility of the proposed model on a wide variety of stencil kernels that commonly appear in HPC, as well as on convolution kernels, which are increasingly important in deep learning workloads. Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5× faster than Nvidia's NPP on V100 and P100 GPUs.
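
To illustrate the idea of shifting partial sums between lanes, here is a minimal CPU sketch in Python. It is not the authors' CUDA implementation: the helper `shfl_up` only emulates the lane-to-lane behaviour of CUDA's `__shfl_up_sync` within one 32-lane warp, and the names `systolic_conv1d` and `direct_conv1d` are hypothetical, introduced here for illustration. Each lane keeps one input value in a "register", multiplies it by the current filter coefficient, adds it to its partial sum, and passes that partial sum to the next lane.

```python
WARP_SIZE = 32  # number of lanes in one CUDA warp


def shfl_up(values, delta):
    """Emulate __shfl_up_sync for one warp: lane i reads the value held by
    lane i - delta; lanes with i < delta keep their own value."""
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_conv1d(x, w):
    """1D filter over one warp's worth of inputs by shifting partial sums.

    x: WARP_SIZE input values, one per lane (the per-lane "register").
    w: filter coefficients.
    Returns the len(x) - len(w) + 1 valid outputs, where
    out[j] = sum_k w[k] * x[j + k].
    """
    assert len(x) == WARP_SIZE
    partial = [0.0] * WARP_SIZE
    for k, coeff in enumerate(w):
        # Every lane accumulates its own input scaled by the current coefficient.
        partial = [p + coeff * xi for p, xi in zip(partial, x)]
        # Pass the partial sum one lane up, except after the last coefficient.
        if k < len(w) - 1:
            partial = shfl_up(partial, 1)
    # Lanes below len(w) - 1 never received a full set of contributions.
    return partial[len(w) - 1:]


def direct_conv1d(x, w):
    """Reference result computed directly, used only for comparison."""
    return [sum(w[k] * x[j + k] for k in range(len(w)))
            for j in range(len(x) - len(w) + 1)]
```

In this sketch the inputs never move; only the partial sums travel across lanes, which mirrors the systolic formulation described in the abstract. On a GPU the lists would be per-thread registers and the shift would be a warp shuffle, so no shared memory would be needed for the exchange.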

PUBLICATION RECORD

Publication year: 2019
Venue: International Conference for High Performance Computing, Networking, Storage and Analysis
Publication date: 2019-07-14
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.1145/3295500.3356162; arXiv 1907.06154
Source metadata: Semantic Scholar

REFERENCES

Dissecting the NVidia Turing T4 GPU via Microbenchmarking
2019, cited by this paper
RegDem: Increasing GPU Performance via Shared Memory Register Spilling
2019, cited by this paper
Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
2019, cited by this paper
Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures
2019, cited by this paper
Associative Instruction Reordering to Alleviate Register Pressure
2018, influential reference
Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks
2018, cited by this paper
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
2018, cited by this paper
Register optimizations for stencils on GPUs
2018, influential reference
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
2018, cited by this paper
Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations
2018, influential reference
Warp-Consolidation: A Novel Execution Model for GPUs
2018, cited by this paper
Efficient Algorithms for the Summed Area Tables Primitive on GPUs
2018, cited by this paper
Tessellating Stencils
2017, cited by this paper
In-datacenter performance analysis of a tensor processing unit
2017, cited by this paper
Fast segmented sort on GPUs
2017, cited by this paper
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning
2017, cited by this paper
Automatically scheduling halide image processing pipelines
2016, influential reference
Modeling the Performance of 2.5D Blocking of 3D Stencil Code on GPUs
2016, cited by this paper
Fast Multiplication in Binary Fields on GPUs via Register Cache
2016, influential reference
Implementation of the DWT in a GPU through a Register-based Strategy
2015, cited by this paper
Numerical Solution of Stochastic Differential Equations
2015, cited by this paper
Compiler-Directed Transformation for Higher-Order Stencils
2015, cited by this paper
Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates
2014, cited by this paper
cuDNN: Efficient Primitives for Deep Learning
2014, cited by this paper
Optimizing Stencil Computations for NVIDIA Kepler GPUs
2014, influential reference
Benchmarking the Memory Hierarchy of Modern GPUs
2014, cited by this paper
Optimizing and Auto-Tuning Iterative Stencil Loops for GPUs with the In-Plane Method
2013, cited by this paper
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs
2013, cited by this paper
Polyhedral parallel code generation for CUDA
2013, cited by this paper
Vectorized Higher Order Finite Difference Kernels
2012, cited by this paper
A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers
2012, cited by this paper
Parallel Prefix Sum (Scan) with CUDA
2011, cited by this paper
Register packing for cyclic reduction: a case study
2011, cited by this paper
The Polyhedral Model Is More Widely Applicable Than You Think
2010, cited by this paper
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
2010, cited by this paper
Demystifying GPU microarchitecture through microbenchmarking
2010, cited by this paper
3D finite difference computation on GPUs using CUDA
2009, cited by this paper
Control System Components
2008, influential reference
Effective automatic parallelization of stencil computations
2007, cited by this paper
Advanced FPGA design
2007, cited by this paper
Intel threading building blocks - outfitting C++ for multi-core processor parallelism
2007, cited by this paper
A novel systolic array structure for DCT
2005, cited by this paper
A LIBRARY FOR DOING POLYHEDRAL OPERATIONS
2000, cited by this paper
PolyLib: A Library for Manipulating Parameterized Polyhedra
1999, cited by this paper
Advanced ASIC chip synthesis: using Synopsys[R] Design Compiler[TM] Physical Compiler[TM] and PrimeTime[R]
1999, influential reference
Code generation in the polytope model
1998, cited by this paper
Numerical Solution of Stochastic Differential Equations
1992, cited by this paper
On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication
1991, cited by this paper
Simple systolic arrays for discrete cosine transform
1990, cited by this paper
ADVIS: A Software Package for the Design of Systolic Arrays
1987, cited by this paper
The systematic design of systolic arrays
1987, cited by this paper
The Design of Optimal Systolic Arrays
1985, cited by this paper
OPTIMISED BIT LEVEL SYSTOLIC ARRAY FOR CONVOLUTION.
1984, cited by this paper
Systolic Multipliers for Finite Fields GF(2^m)
1984, cited by this paper
Why systolic architectures?
1982, cited by this paper
Let's Design Algorithms for VLSI Systems
1979, influential reference
A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations
1973, cited by this paper
Numerical Solution of Differential Equations
1953, cited by this paper
the Parallel Computing Landscape
year unknown, cited by this paper

CITED BY

FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units
2025, cites this paper
Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled Registers
2025, cites this paper
SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation
2025, cites this paper
SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping
2025, cites this paper
A Sample-Free Compilation Framework for Efficient Dynamic Tensor Computation
2025, cites this paper
POPA: Expressing High and Portable Performance across Spatial and Vector Architectures for Tensor Computations
2024, cites this paper
ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores
2024, cites this paper
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
2023, cites this paper
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code
2023, cites this paper
Revisiting Temporal Blocking Stencil Optimizations
2023, cites this paper
Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors
2023, cites this paper
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code
2023, cites this paper
PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications
2022, influential citation
Optimizing Depthwise Separable Convolution Operations on GPUs
2022, cites this paper
PPOAccel: A High-Throughput Acceleration Framework for Proximal Policy Optimization
2022, cites this paper
Persistent Kernels for Iterative Memory-bound GPU Applications
2022, influential citation
Performance portability in a real world application: PHAST applied to Caffe
2022, cites this paper
QSketch: GPU-Aware Probabilistic Sketch Data Structures
2022, cites this paper
Flynn’s Reconciliation
2021, cites this paper
An efficient GPU implementation and scaling for higher-order 3D stencils
2021, cites this paper
Node-Aware Stencil Communication for Heterogeneous Supercomputers
2020, cites this paper
cuDTW++: Ultra-Fast Dynamic Time Warping on CUDA-Enabled GPUs
2020, cites this paper
OpenMP: Portable Multi-Level Parallelism on Modern Systems: 16th International Workshop on OpenMP, IWOMP 2020, Austin, TX, USA, September 22–24, 2020, Proceedings
2020, cites this paper
Supporting Data Shuffle Between Threads in OpenMP
2020, cites this paper
Systolic Computing on GPUs for Productive Performance
2020, cites this paper
SPTCStencil: Using Sparse Tensor Cores for Stencil Computation
year unknown, cites this paper