Caracal: A GPU-Resident Sparse LU Solver with Lightweight Fine-Grained Scheduling

Jie Ren, Tingxuan Zhong, Yuxi Hong, Guofeng Feng, Xincheng Wang, Weile Jia, H. Ltaief, David E. Keyes

Published 2025 in International Conference on Software Composition

ABSTRACT

We address inefficiencies in task scheduling, memory management, and scalability in GPU-resident sparse LU factorization with a two-level approach: coarse-grained blocks are scheduled sequentially, and the multiple fine-grained blocks within each are managed by a lightweight static scheduler that enables multi-stream parallelism. Additionally, we design an intelligent memory caching mechanism for the fine-grained scheduler, which retains frequently accessed data in GPU memory. To further enhance scalability, we introduce a distributed-memory design that partitions the input matrix using a 1D block-cyclic distribution and optimizes inter-GPU communication via NVLink. The multi-GPU design reaches a computational throughput of 6.46 TFLOP/s on four A100 GPUs, demonstrating promising scalability. This is up to a 7x speedup over the latest SuperLU_DIST with 3D communication, 94x over PanguLU, 16x over PaStiX, and 10x over our own coarse-grained dynamic scheduling implementation, while reaching up to 21% of the A100’s theoretical peak performance.
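As a rough illustration of the 1D block-cyclic distribution mentioned in the abstract, the sketch below assigns fixed-width column blocks to GPUs in round-robin order. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say whether blocks are formed along rows or columns, and the names `block_cyclic_owner` and `partition_columns`, the block size, and the column-wise layout are all hypothetical choices made for illustration.

```python
def block_cyclic_owner(block_index, num_gpus):
    """Return the GPU that owns a block under a 1D block-cyclic layout.

    Assumption: blocks are dealt out round-robin, so block b goes to GPU b mod P.
    """
    return block_index % num_gpus


def partition_columns(n_cols, block_size, num_gpus):
    """Map each GPU id to the half-open column ranges [start, stop) it owns.

    Assumption: the matrix is split into column blocks of width block_size;
    the final block may be narrower when n_cols is not a multiple of it.
    """
    owned = {gpu: [] for gpu in range(num_gpus)}
    num_blocks = (n_cols + block_size - 1) // block_size  # ceiling division
    for b in range(num_blocks):
        start = b * block_size
        stop = min(start + block_size, n_cols)
        owned[block_cyclic_owner(b, num_gpus)].append((start, stop))
    return owned


# Hypothetical example: 10 columns, blocks of width 2, four GPUs.
layout = partition_columns(n_cols=10, block_size=2, num_gpus=4)
```

With these example values the fifth block wraps around to GPU 0, which is the point of a cyclic layout: later blocks, where LU factorization typically concentrates its remaining work, stay spread across all devices instead of landing on the last one.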

PUBLICATION RECORD

  • Publication date

    2025-11-15

  • Fields of study

    Computer Science, Engineering

  • Source metadata

    Semantic Scholar
