Caracal: A GPU-Resident Sparse LU Solver with Lightweight Fine-Grained Scheduling
Jie Ren, Tingxuan Zhong, Yuxi Hong, Guofeng Feng, Xincheng Wang, Weile Jia, H. Ltaief, David E. Keyes
Published 2025 in International Conference on Software Composition
ABSTRACT
We address inefficiencies in task scheduling, memory management, and scalability in GPU-resident sparse LU factorization with a two-level approach: coarse-grained blocks are scheduled sequentially, and each contains multiple fine-grained blocks managed by a lightweight static scheduler that enables multi-stream parallelism. Additionally, we design a memory caching mechanism for the fine-grained scheduler that retains frequently accessed data in GPU memory. To further enhance scalability, we introduce a distributed-memory design that partitions the input matrix using a 1D block-cyclic distribution and optimizes inter-GPU communication via NVLink. The multi-GPU design reaches a computational throughput of 6.46 TFLOP/s on four A100 GPUs, demonstrating promising scalability. This is up to a 7x speedup over the latest SuperLU_DIST with 3D communication, a 94x speedup over PanguLU, a 16x speedup over PaStiX, and a 10x speedup over our own coarse-grained dynamic scheduling implementation, while reaching up to 21% of the A100’s theoretical peak performance.
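The 1D block-cyclic distribution named in the abstract can be illustrated with a minimal sketch. The abstract does not say whether rows or columns are distributed, how large the blocks are, or how Caracal implements the mapping, so the choice of block columns and the function names `block_cyclic_owner` and `local_blocks` are assumptions made only for illustration.

```python
# Hypothetical illustration of a 1D block-cyclic layout; not the paper's code.
# Assumption: the matrix is split into block columns that are dealt out to
# GPUs in round-robin order.

def block_cyclic_owner(block_col: int, num_gpus: int) -> int:
    """Return the index of the GPU that owns the given block column."""
    return block_col % num_gpus


def local_blocks(gpu: int, num_block_cols: int, num_gpus: int) -> list[int]:
    """Return the block columns stored on one GPU, in increasing order."""
    return list(range(gpu, num_block_cols, num_gpus))


# Example: 10 block columns spread over 4 GPUs.
layout = {gpu: local_blocks(gpu, 10, 4) for gpu in range(4)}
```

Dealing blocks out cyclically, rather than giving each GPU one contiguous slab, keeps every GPU supplied with work as the factorization moves toward the trailing part of the matrix.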
PUBLICATION RECORD
- Publication year: 2025
- Venue: International Conference on Software Composition
- Publication date: 2025-11-15
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar