SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization

Arya Tschand, Muhammad A. Awad, Ryan Swann, Kesavan Ramakrishnan, Jeffrey Ma, Keith Lowery, Ganesh Dasika, V. Reddi

Published 2025 in arXiv.org

ABSTRACT

Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Existing approaches lack a key characteristic that human performance engineers rely on for near-optimal utilization: hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness. For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and science kernels, SwizzlePerf generates swizzling patterns for 9 of the kernels, achieving up to a 2.06x speedup and a 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents.
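
The abstract does not spell out the swizzling pattern itself. As a rough illustration only, the sketch below assumes a chiplet GPU whose scheduler hands consecutive workgroup IDs to chiplets (XCDs) in round-robin order, and remaps those IDs so that each chiplet processes a contiguous block of tiles that can reuse data in its own L2 cache. The function names, the round-robin assumption, and the default of 8 XCDs are hypothetical and are not taken from the paper.

```python
def swizzle_workgroup_id(pid: int, num_workgroups: int, num_xcds: int = 8) -> int:
    """Remap a hardware workgroup id to a tile index (illustrative sketch).

    Assumption (not from the paper): workgroup `pid` is scheduled on
    chiplet `pid % num_xcds`, i.e. round-robin across chiplets.
    """
    if num_workgroups % num_xcds != 0:
        raise ValueError("this sketch only handles evenly divisible grids")
    xcd = pid % num_xcds                  # chiplet assumed to run this workgroup
    slot = pid // num_xcds                # arrival order within that chiplet
    per_xcd = num_workgroups // num_xcds  # tiles handled by each chiplet
    return xcd * per_xcd + slot           # contiguous block per chiplet


def tiles_per_xcd(num_workgroups: int, num_xcds: int) -> dict[int, list[int]]:
    """Group the remapped tile indices by the chiplet assumed to run them."""
    groups: dict[int, list[int]] = {x: [] for x in range(num_xcds)}
    for pid in range(num_workgroups):
        groups[pid % num_xcds].append(
            swizzle_workgroup_id(pid, num_workgroups, num_xcds)
        )
    return groups


# With 16 workgroups on 4 chiplets, chiplet 1 gets tiles 4..7 instead of 1, 5, 9, 13.
print(tiles_per_xcd(16, 4)[1])  # [4, 5, 6, 7]
```

Without the remap, neighbouring tiles would be spread across different chiplets under the stated scheduling assumption; with it, tiles that are likely to share inputs sit behind the same L2 cache.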

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-08-27
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.48550/arXiv.2508.20258; arXiv 2508.20258
Source metadata: Semantic Scholar

REFERENCES

Kevin: Multi-Turn RL for Generating CUDA Kernels (2025)
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture (2025)
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025)
GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization (2025)
KernelBench: Can LLMs Write Efficient GPU Kernels? (2025)
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers (2024)
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI (2024)
Flex Attention: A Programming Model for Generating Optimized Attention Kernels (2024)
11.1 AMD Instinct™ MI300 Series Modular Chiplet Package – HPC and AI Accelerator for Exa-Class Systems (2024)
Learning Performance-Improving Code Edits (2023)
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023)
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (2023)
Rigorous Evaluation of Computer Processors with Statistical Model Checking (2023)
Competition-level code generation with AlphaCode (2022)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022)
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022)
Multi-Head Attention: Collaborate Instead of Concatenate (2020)
Ansor: Generating High-Performance Tensor Programs for Deep Learning (2020)
Learning to optimize Halide with tree search and random programs (2019)
Learning to Optimize Tensor Programs (2018)
OpenTuner: An extensible framework for program autotuning (2014)
Measuring Energy and Power with PAPI (2012)
An integrated GPU power and performance model (2010)
The Design and Implementation of FFTW3 (2005)
Automated empirical optimizations of software and the ATLAS project (2001)

CITED BY

Towards Automated Kernel Generation in the Era of LLMs (2026)
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization (2025)
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning (2025)
Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization (2025)
cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution (2025)
KernelBand: Steering LLM-based Kernel Optimization via Hardware-Aware Multi-Armed Bandits (2025)
Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators (2025)