RingX: Scalable Parallel Attention for Long-Context Learning on HPC
Junqi Yin, M. Palash, M. Shankar, Feiyi Wang
Published 2025 in the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25)
ABSTRACT
The attention mechanism has been foundational to remarkable AI breakthroughs since the introduction of the Transformer, driving demand for increasingly long contexts to power frontier models such as large-scale reasoning language models and high-resolution image/video generators. However, its quadratic computational and memory complexity presents substantial challenges. Current state-of-the-art parallel attention methods, such as ring attention, are widely adopted for long-context training but rely on a point-to-point communication strategy that fails to fully exploit the capabilities of modern HPC network architectures. In this work, we propose ringX, a scalable family of parallel attention methods optimized explicitly for HPC systems. By enhancing workload partitioning, refining communication patterns, and improving load balancing, ringX achieves up to a 3.4× speedup over conventional ring attention on the Frontier supercomputer. Optimized for both bi-directional and causal attention, ringX demonstrates its effectiveness in training benchmarks of a Vision Transformer (ViT) on a climate dataset and a Generative Pre-trained Transformer (GPT) model, Llama3 8B. Our method attains an end-to-end training speedup of approximately 1.5× in both scenarios. To our knowledge, the achieved 38% model FLOPs utilization (MFU) for training Llama3 8B with a 1M-token sequence length on 4,096 GPUs represents one of the highest training efficiencies reported for long-context learning on HPC systems. Our code is available at https://github.com/jqyin/ringX-attention.
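For context on the baseline the abstract contrasts against, below is a minimal sketch of ring attention's point-to-point communication pattern: each rank keeps its query shard and rotates key/value shards around a ring, merging partial results with an online softmax. This is an illustration under stated assumptions, not the paper's ringX implementation; all function and variable names are hypothetical, and causal masking plus the paper's partitioning and load-balancing optimizations are omitted.

```python
# Illustrative sketch (NOT the paper's ringX code) of baseline ring
# attention's point-to-point pattern. Assumes torch.distributed is
# initialized and the sequence is sharded evenly across ranks.
import torch
import torch.distributed as dist


def ring_attention_shard(q, k, v):
    """Bi-directional attention for one rank's query shard.

    q, k, v: [seq_shard, head_dim] tensors holding this rank's shards.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)  # running output accumulator
    lse = torch.full((q.shape[0], 1), float("-inf"),
                     dtype=q.dtype, device=q.device)  # running log-sum-exp

    for step in range(world):
        # Partial attention of local queries against the current k/v shard.
        scores = (q @ k.T) * scale                    # [seq_shard, seq_shard]
        blk_lse = scores.logsumexp(dim=-1, keepdim=True)
        blk_out = torch.softmax(scores, dim=-1) @ v

        # Online-softmax merge of this block into the running result.
        new_lse = torch.logaddexp(lse, blk_lse)
        out = out * (lse - new_lse).exp() + blk_out * (blk_lse - new_lse).exp()
        lse = new_lse

        if step < world - 1:
            # The P2P ring step: send k/v to the next rank, receive from
            # the previous one. This nearest-neighbor exchange is the
            # pattern the paper argues underuses modern HPC fabrics.
            nxt, prv = (rank + 1) % world, (rank - 1) % world
            k_next, v_next = torch.empty_like(k), torch.empty_like(v)
            reqs = [dist.isend(k, nxt), dist.isend(v, nxt),
                    dist.irecv(k_next, prv), dist.irecv(v_next, prv)]
            for r in reqs:
                r.wait()
            k, v = k_next, v_next
    return out
```

Each of the world − 1 rotations is a nearest-neighbor send/receive, so a single iteration never uses more than two links per rank; replacing or reorganizing this exchange is the kind of communication refinement the abstract attributes to ringX.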
PUBLICATION RECORD
- Publication year: 2025
- Venue: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25)
- Publication date: 2025-11-15
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar