LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, Akarsh Srivastava, Harshitha Menon, C. Jekel, A. Bhatele
Published 2025 in arXiv.org
ABSTRACT
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
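The abstract's key algorithmic ingredient is recursive doubling, the pairwise-exchange pattern on which NVRAR builds. The sketch below illustrates that classic pattern in plain Python over simulated ranks; it is a minimal illustration, not the paper's NVSHMEM implementation, and the function name, buffer layout, and power-of-two rank count are assumptions made for clarity.

# Minimal sketch of the recursive-doubling all-reduce pattern that, per the
# abstract, underlies NVRAR. This simulates the communication rounds among
# `num_ranks` processes in plain Python; the real NVRAR uses NVSHMEM for
# GPU-initiated communication, which is not shown here.

def recursive_doubling_allreduce(buffers):
    """All-reduce (sum) over per-rank buffers in log2(p) pairwise rounds.

    buffers: list of lists, where buffers[r] is rank r's local vector.
    After the final round, every rank holds the full elementwise sum.
    """
    num_ranks = len(buffers)
    assert num_ranks & (num_ranks - 1) == 0, "assumes a power-of-two rank count"

    distance = 1
    while distance < num_ranks:
        # Snapshot so both partners read each other's pre-round values,
        # mimicking a simultaneous pairwise exchange.
        snapshot = [buf[:] for buf in buffers]
        for rank in range(num_ranks):
            partner = rank ^ distance  # XOR with the distance gives this round's partner
            buffers[rank] = [a + b for a, b in zip(snapshot[rank], snapshot[partner])]
        distance <<= 1
    return buffers


if __name__ == "__main__":
    # Each of 8 simulated ranks contributes a small vector.
    ranks = 8
    bufs = [[float(r), 1.0] for r in range(ranks)]
    result = recursive_doubling_allreduce(bufs)
    # Every rank ends with the same reduced vector: [sum(0..7), 8.0].
    assert all(buf == [28.0, 8.0] for buf in result)
    print(result[0])

Recursive doubling completes in log2(p) rounds, each exchanging a full-size message, so its cost is dominated by per-step latency rather than bandwidth. This fits the abstract's reported regime: NVRAR's gains over NCCL appear for small-to-medium messages (128 KB to 2 MB), which the abstract's results suggest is the range relevant to decode-heavy tensor-parallel inference.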
PUBLICATION RECORD
- Publication year: 2025
- Venue: arXiv.org
- Publication date: 2025-11-12
- Fields of study: Computer Science
- Source metadata: Semantic Scholar