BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang

Published 2025 in International Conference on Machine Learning

ABSTRACT

Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
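
The bandit view described in the abstract can be illustrated with a small sketch. It is not the paper's UCBSpec algorithm: it is the textbook UCB1 rule applied to a toy simulation, and every name in it (`ucb1_select`, `simulate_accept_fraction`, `run_bandit`, the per-arm acceptance probabilities, and the choice of reward as the fraction of drafted tokens accepted) is an assumption made for illustration. Each arm stands for one hypothetical speculative decoding configuration, and each round stands for one draft-and-verify step.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Return the arm with the highest UCB1 index at round t (t >= 1)."""
    # Play every arm once before trusting the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def simulate_accept_fraction(accept_prob, draft_len, rng):
    """Toy stand-in for one draft-and-verify round (an assumption, not a real LLM).

    Each drafted token is accepted independently with probability accept_prob,
    and verification stops at the first rejection. The reward is the accepted
    fraction of the draft, so it lies in [0, 1].
    """
    accepted = 0
    for _ in range(draft_len):
        if rng.random() >= accept_prob:
            break
        accepted += 1
    return accepted / draft_len


def run_bandit(accept_probs, draft_len=4, rounds=2000, seed=0):
    """Run UCB1 over hypothetical configurations; return pull counts and mean rewards."""
    rng = random.Random(seed)
    k = len(accept_probs)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = simulate_accept_fraction(accept_probs[arm], draft_len, rng)
        counts[arm] += 1
        # Incremental update of the empirical mean reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.3, 0.5, 0.9])
    print("pulls per configuration:", counts)
    print("mean reward per configuration:", [round(m, 3) for m in means])
```

In this toy setting the configuration with the highest acceptance probability ends up with most of the pulls, which is the behaviour a training-free online selection rule is meant to produce. The adversarial-reward setting mentioned in the abstract would call for an EXP3-style rule instead, which is not sketched here.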

PUBLICATION RECORD

Publication year: 2025
Venue: International Conference on Machine Learning
Publication date: 2025-05-21
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.48550/arXiv.2505.15141; arXiv 2505.15141
Source metadata: Semantic Scholar

REFERENCES

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
2024, influential reference
SAM Decoding: Speculative Decoding via Suffix Automaton
2024, cited by this paper
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
2024, cited by this paper
A Theoretical Perspective for Speculative Decoding Algorithm
2024, cited by this paper
Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits
2024, cited by this paper
Almost Minimax Optimal Best Arm Identification in Piecewise Stationary Linear Bandits
2024, cited by this paper
Accelerated Speculative Sampling Based on Tree Monte Carlo
2024, cited by this paper
The Llama 3 Herd of Models
2024, cited by this paper
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
2024, influential reference
Mixture-of-Agents Enhances Large Language Model Capabilities
2024, cited by this paper
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
2024, cited by this paper
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
2024, cited by this paper
Block Verification Accelerates Speculative Decoding
2024, cited by this paper
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
2024, influential reference
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
2024, cited by this paper
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
2024, cited by this paper
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
2024, influential reference
REST: Retrieval-Based Speculative Decoding
2023, cited by this paper
Accelerating Large Language Model Decoding with Speculative Sampling
2023, influential reference
Multiplier Bootstrap-based Exploration
2023, cited by this paper
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
2023, cited by this paper
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023, cited by this paper
Online Speculative Decoding
2023, cited by this paper
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
2023, cited by this paper
SpecTr: Fast Speculative Decoding via Optimal Transport
2023, cited by this paper
Fast Inference from Transformers via Speculative Decoding
2022, influential reference
The Role of Contextual Information in Best Arm Identification
2021, cited by this paper
Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks
2021, cited by this paper
Language Models are Few-Shot Learners
2020, cited by this paper
Bandit Algorithms
2020, influential reference
Probabilistic Sequential Shrinking: A Best Arm Identification Algorithm for Stochastic Bandits with Corruptions
2020, cited by this paper
Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit
2018, cited by this paper
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
2018, cited by this paper
Efficient Contextual Bandits in Non-stationary Worlds
2017, cited by this paper
Near-Optimal Regret Bounds for Thompson Sampling
2017, cited by this paper
A Tutorial on Thompson Sampling
2017, cited by this paper
Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms
2014, cited by this paper
Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
2014, cited by this paper
Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret
2012, cited by this paper
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
2011, influential reference
Improved Algorithms for Linear Stochastic Bandits
2011, cited by this paper
Active learning in heteroscedastic noise
2010, cited by this paper
The Nonstochastic Multiarmed Bandit Problem
2002, influential reference
Finite-time Analysis of the Multi-armed Bandit Problem
2000, influential reference
Regret bounds for prediction problems
1999, cited by this paper
Some aspects of the sequential design of experiments
1952, cited by this paper
Asymptotically Efficient Adaptive Allocation Rules
year unknown, cited by this paper

CITED BY

A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits
2026, cites this paper
Speculative Sampling with Reinforcement Learning
2026, cites this paper
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs
2025, influential citation
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
2025, cites this paper
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
2025, influential citation
TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
2025, cites this paper
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
2025, cites this paper