ABSTRACT
TiledAttention is a scaled dot-product attention (SDPA) forward operator for attention-kernel research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K, V$ streaming. The kernel is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness, comparing against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and can be called directly from PyTorch workflows, offering a practical balance between performance and customizability.
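The core mechanism the abstract names, online softmax over streamed $K, V$ tiles, can be sketched in plain PyTorch. The sketch below is illustrative only: it is a single-head eager-mode reference, not the cuTile/TileIR kernel, and the function name `tiled_attention` and the `tile` parameter are assumptions for exposition.

```python
# Minimal single-head sketch of tiled attention with online softmax.
# Assumed names (tiled_attention, tile) are for exposition; this is not
# the paper's cuTile/TileIR implementation.
import torch

def tiled_attention(q, k, v, tile=128):
    """Numerically stable SDPA computed by streaming K/V in tiles.

    q: (Sq, d), k: (Sk, d), v: (Sk, d). Returns (Sq, d).
    """
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    # Running online-softmax statistics: per-row max and normalizer.
    row_max = torch.full((q.shape[0], 1), float("-inf"),
                         device=q.device, dtype=q.dtype)
    row_sum = torch.zeros((q.shape[0], 1), device=q.device, dtype=q.dtype)
    for start in range(0, k.shape[0], tile):
        k_t = k[start:start + tile]           # stream one K tile
        v_t = v[start:start + tile]           # and the matching V tile
        s = (q @ k_t.T) * scale               # partial scores, (Sq, tile)
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)            # probabilities vs. new max
        alpha = torch.exp(row_max - new_max)  # rescale old accumulators
        row_sum = row_sum * alpha + p.sum(dim=-1, keepdim=True)
        out = out * alpha + p @ v_t
        row_max = new_max
    return out / row_sum                      # apply deferred normalization
```

As a sanity check, the sketch matches PyTorch's fused SDPA entry point on random inputs (batch and head dimensions added, since `scaled_dot_product_attention` expects at least 3-D tensors):

```python
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.nn.functional.scaled_dot_product_attention(
    q[None], k[None], v[None])[0]
torch.testing.assert_close(tiled_attention(q, k, v), ref,
                           rtol=1e-4, atol=1e-4)
```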
PUBLICATION RECORD
- Publication year: 2026
- Publication date: 2026-03-02
- Venue: Unknown
- Fields of study: Computer Science
- Source metadata: Semantic Scholar