Caracal: A GPU-Resident Sparse LU Solver with Lightweight Fine-Grained Scheduling

Jie Ren, Tingxuan Zhong, Yuxi Hong, Guofeng Feng, Xincheng Wang, Weile Jia, H. Ltaief, David E. Keyes

Published 2025 in International Conference on Software Composition

ABSTRACT

We address inefficiencies in task scheduling, memory management, and scalability in GPU-resident sparse LU factorization with a two-level approach: coarse-grained blocks are scheduled sequentially, and the multiple fine-grained blocks within each coarse-grained block are managed by a lightweight static scheduler that enables multi-stream parallelism. Additionally, we design an intelligent memory caching mechanism for the fine-grained scheduler, which retains frequently accessed data in GPU memory. To further enhance scalability, we introduce a distributed-memory design that partitions the input matrix using a 1D block-cyclic distribution and optimizes inter-GPU communication via NVLink. The multi-GPU design reaches a computational throughput of 6.46 TFLOP/s on four A100 GPUs, demonstrating promising scalability. This corresponds to up to a 7x speedup over the latest SuperLU_DIST with 3D communication, a 94x speedup over PanguLU, a 16x speedup over PaStiX, and a 10x speedup over our own coarse-grained dynamic scheduling implementation, while reaching up to 21% of the A100’s theoretical peak performance.
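
To make the 1D block-cyclic distribution mentioned in the abstract concrete, the following is a minimal Python sketch of how such a mapping could assign blocks to GPUs. It is illustrative only and is not taken from the paper: the abstract does not say whether rows or columns are distributed, so block columns are assumed here, and the function names `block_cyclic_owner` and `partition_block_columns`, as well as the matrix and block sizes, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def block_cyclic_owner(block_index: int, num_gpus: int) -> int:
    """Return the GPU that owns a block under a 1D block-cyclic layout.

    Blocks are dealt out round-robin: block 0 -> GPU 0, block 1 -> GPU 1, ...
    wrapping around after num_gpus blocks.
    """
    return block_index % num_gpus


def partition_block_columns(
    n: int, block_size: int, num_gpus: int
) -> Dict[int, List[Tuple[int, int]]]:
    """Map each GPU to the half-open column ranges [start, stop) it owns.

    Assumption: the matrix is split into block columns of width block_size
    (the last block may be narrower). The paper's actual layout may differ.
    """
    owned = defaultdict(list)
    num_blocks = (n + block_size - 1) // block_size  # ceiling division
    for b in range(num_blocks):
        start = b * block_size
        stop = min(start + block_size, n)
        owned[block_cyclic_owner(b, num_gpus)].append((start, stop))
    return dict(owned)


# Hypothetical example: a 10-column matrix, block width 2, four GPUs.
layout = partition_block_columns(n=10, block_size=2, num_gpus=4)
for gpu in sorted(layout):
    print(gpu, layout[gpu])
```

In this example the fifth block column wraps back to GPU 0. That wrap-around is the usual motivation for a cyclic layout in LU factorization: as elimination proceeds and the active trailing matrix shrinks, the remaining blocks stay spread across all devices instead of concentrating on the last one.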

PUBLICATION RECORD

Publication year: 2025
Venue: International Conference on Software Composition
Publication date: 2025-11-15
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.1145/3712285.3759792
Source metadata: Semantic Scholar

REFERENCES

Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation (2024)
PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems (2023)
Accelerating Sparse LU Factorization with Density-Aware Adaptive Matrix Multiplication for Circuit Simulation (2023)
2.5 Million-Atom Ab Initio Electronic-Structure Simulation of Complex Metallic Heterostructures with DGDFT (2022)
MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment (2021)
Computational Fluid Dynamics (2020)
A communication-avoiding 3D sparse triangular solver (2019)
Heat Transfer (2018)
A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices (2018)
Dynamic GPU Parallel Sparse LU Factorization for Fast Circuit Simulation (2018)
GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling (2015)
A Distributed CPU-GPU Sparse Direct Solver (2014)
An adaptive LU factorization algorithm for parallel circuit simulation (2012)
Sparse LU factorization for parallel circuit simulation on GPU (2012)
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures (2011)
Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems (2010)
StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines (2010)
Direct numerical simulation of turbulent channel flows using a stabilized finite element method (2009)
Direct methods for sparse linear systems (2006)
Algorithm 832: UMFPACK V4.3, an unsymmetric-pattern multifrontal method (2004)
A column approximate minimum degree ordering algorithm (2004)
An overview of SuperLU: Algorithms, implementation, and user interface (2003)
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems (2003)
PaStiX: a high-performance parallel direct solver for sparse symmetric positive definite systems (2002)
A column approximate minimum degree ordering algorithm (2000)
MUMPS: A General Purpose Distributed Memory Sparse Solver (2000)
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions (2000)
An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination (1997)
An Approximate Minimum Degree Ordering Algorithm (1996)
Preconditioned Krylov solvers for BEA (1994)
Exploiting Structural Symmetry in a Sparse Partial Pivoting Code (1993)
Exploiting Structural Symmetry in Unsymmetric Sparse Symbolic Factorization (1992)
Domain Decomposition Methods in Computational Fluid Dynamics (1991)
Flame sheet starting estimates for counterflow diffusion flame problems (1987)
Direct Methods for Sparse Matrices (1987)
A New Approximate LU Factorization Scheme for the Reynolds-Averaged Navier-Stokes Equations (1986)
The Multifrontal Solution of Indefinite Sparse Symmetric Linear (1983)
Structural Analysis (1979)
An Introduction to Fluid Dynamics (1968)

CITED BY

Parallel Sparse and Data-Sparse Factorization-based Linear Solvers (2026)
Session Summary Podcast: Session 23: Algorithms: Sparse Matrix and Tensor Computation (2025)