Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem

Junpei Komiyama, J. Honda, H. Kashima, H. Nakagawa

Published 2015 in Annual Conference Computational Learning Theory

ABSTRACT

We study the $K$-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight asymptotic regret lower bound that is based on the information divergence. An algorithm that is inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one with a regret upper bound that matches the lower bound. Experimental comparisons of dueling bandit algorithms show that the proposed algorithm significantly outperforms existing ones.
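The two ingredients named in the abstract, pairwise comparison feedback and an information divergence, can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes the divergence is the KL divergence between Bernoulli distributions (the usual choice for binary comparison outcomes), and the preference matrix `pref` and the helper names `bernoulli_kl` and `duel` are hypothetical, chosen only for this illustration.

```python
import math
import random


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def duel(pref, i, j, rng):
    """Return True if arm i wins one noisy comparison against arm j."""
    return rng.random() < pref[i][j]


# Hypothetical 3-arm preference matrix (an assumption, not from the paper):
# pref[i][j] is the probability that arm i beats arm j.
pref = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]

rng = random.Random(0)
n_duels = 10_000

# The learner never sees a numeric reward, only which arm won each duel.
wins = sum(duel(pref, 0, 1, rng) for _ in range(n_duels))
empirical = wins / n_duels

# Divergence between arm 1's true win rate against arm 0 and a coin flip.
# Small values mean many duels are needed to tell which arm is better.
gap_divergence = bernoulli_kl(pref[1][0], 0.5)
```

Here `empirical` estimates `pref[0][1]` from win/loss outcomes alone, and `gap_divergence` measures how statistically distinguishable the pair is from an even match.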

PUBLICATION RECORD

Publication year: 2015
Venue: Annual Conference Computational Learning Theory
Publication date: 2015-06-08
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1506.02550
Source metadata: Semantic Scholar

REFERENCES

MergeRUCB: A Method for Large-Scale Online Ranker Evaluation (2015, cited by this paper)
Reducing Dueling Bandits to Cardinal Bandits (2014, cited by this paper)
Relative confidence sampling for efficient on-line ranker evaluation (2014, influential reference)
Generic Exploration and K-armed Voting Bandits (2013, influential reference)
Fidelity, Soundness, and Efficiency of Interleaved Comparison Methods (2013, cited by this paper)
Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem (2013, cited by this paper)
The K-armed Dueling Bandits Problem (2012, cited by this paper)
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond (2011, cited by this paper)
Beat the Mean Bandit (2011, influential reference)
Crowdsourcing Translation: Professional Quality from Non-Professionals (2011, cited by this paper)
LETOR: A benchmark collection for research on learning to rank for information retrieval (2010, cited by this paper)
A Bayesian interactive optimization approach to procedural animation design (2010, cited by this paper)
Learning Preference Models in Recommender Systems (2010, cited by this paper)
Bandits Games and Clustering Foundations (2010, cited by this paper)
An Asymptotically Optimal Bandit Algorithm for Bounded Support Models (2010, cited by this paper)
Preference Learning in Recommender Systems (2009, cited by this paper)
Nantonac collaborative filtering: recommendation based on order responses (2003, cited by this paper)
Finite-time Analysis of the Multiarmed Bandit Problem (2002, cited by this paper)
Sample mean based index policies by O(log n) regret for the multi-armed bandit problem (1995, cited by this paper)
Adaptive treatment allocation and the multi-armed bandit problem (1987, cited by this paper)
Asymptotically Efficient Adaptive Allocation Rules (year unknown, cited by this paper)

CITED BY

Bandit Learning in Matching Markets with Relative Feedback (2026, cites this paper)
Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers (2026, cites this paper)
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences (2026, cites this paper)
Reinforcement Learning from Adversarial Preferences in Tabular MDPs (2025, cites this paper)
Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design (2025, cites this paper)
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment (2025, cites this paper)
Sample Complexity of Identifying the Nonredundancy of Nontransitive Games in Dueling Bandits (2025, cites this paper)
Fusing Reward and Dueling Feedback in Stochastic Bandits (2025, influential citation)
Active Human Feedback Collection via Neural Contextual Dueling Bandits (2025, cites this paper)
Achieving Nearly-Optimal Regret and Sample Complexity in Dueling Bandits with Applications in Online Recommendations (2025, cites this paper)
Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback (2025, cites this paper)
Online Clustering of Dueling Bandits (2025, cites this paper)
Federated Linear Dueling Bandits (2025, cites this paper)
Constrained Dueling Bandits for Edge Intelligence (2025, cites this paper)
Multimodal Bandits: Regret Lower Bounds and Optimal Algorithms (2025, cites this paper)
Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options (2025, cites this paper)
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization (2025, cites this paper)
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF (2024, cites this paper)
Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents (2024, cites this paper)
On Weak Regret Analysis for Dueling Bandits (2024, cites this paper)
Optimal Design for Reward Modeling in RLHF (2024, cites this paper)
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback (2024, cites this paper)
Biased Dueling Bandits with Stochastic Delayed Feedback (2024, influential citation)
Neural Dueling Bandits (2024, cites this paper)
Adversarial Multi-dueling Bandits (2024, cites this paper)
Multi-Player Approaches for Dueling Bandits (2024, influential citation)
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback (2024, cites this paper)
Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling (2024, influential citation)
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback (2024, cites this paper)
Feel-Good Thompson Sampling for Contextual Dueling Bandits (2024, cites this paper)
Non-Stationary Dueling Bandits Under a Weighted Borda Criterion (2024, cites this paper)
Dueling Optimization with a Monotone Adversary (2023, cites this paper)
Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit (2023, cites this paper)
Green Dueling Bandits (2023, cites this paper)
Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits (2023, cites this paper)
Identifying Copeland Winners in Dueling Bandits with Indifferences (2023, cites this paper)
Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems (2023, cites this paper)
Contextual Bandits and Imitation Learning via Preference-Based Active Queries (2023, cites this paper)
When Can We Track Significant Preference Shifts in Dueling Bandits? (2023, cites this paper)
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons (2023, cites this paper)
Versatile Dueling Bandits: Best-of-both World Analyses for Learning from Relative Preferences (2022, influential citation)
An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem (2022, influential citation)
One Arrow, Two Kills: An Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits (2022, cites this paper)
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons (2022, influential citation)
Non-Stationary Dueling Bandits (2022, influential citation)
ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits (2022, cites this paper)
Rate-Matching the Regret Lower-Bound in the Linear Quadratic Regulator with Unknown Dynamics (2022, cites this paper)
Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences (2022, influential citation)
Batched Dueling Bandits (2022, influential citation)
Dirichlet–Luce choice model for learning from interactions (2022, cites this paper)
Optimal Algorithms for Stochastic Contextual Preference Bandits (2021, cites this paper)
Stochastic Dueling Bandits with Adversarial Corruption (2021, cites this paper)
Dueling Bandits with Adversarial Sleeping (2021, influential citation)
Dueling Convex Optimization (2021, cites this paper)
Testification of Condorcet Winners in dueling bandits (2021, cites this paper)
Learning the Optimal Recommendation from Explorative Users (2021, cites this paper)
Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits (2021, cites this paper)
Dueling RL: Reinforcement Learning with Trajectory Preferences (2021, cites this paper)
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability (2021, cites this paper)
Comparing Robot and Human guided Personalization: Adaptive Exercise Robots are Perceived as more Competent and Trustworthy (2020, cites this paper)
Choice Bandits (2020, cites this paper)
Global optimization based on active preference learning with radial basis functions (2020, cites this paper)
Regret Minimization in Stochastic Contextual Dueling Bandits (2020, cites this paper)
DUELING BANDIT PROBLEMS (2020, cites this paper)
Bandit Algorithms (2020, cites this paper)
Adversarial Dueling Bandits (2020, cites this paper)
Combinatorial Pure Exploration of Dueling Bandit (2020, influential citation)
Learning to Diversify for E-commerce Search with Multi-Armed Bandit (2019, cites this paper)
Simple Algorithms for Dueling Bandits (2019, cites this paper)
Bandit Algorithms in Information Retrieval (2019, cites this paper)
Regret Minimisation in Multinomial Logit Bandits (2019, cites this paper)
Bandit algorithms in information retrieval evaluation and ranking (2019, cites this paper)
Combinatorial Bandits with Relative Feedback (2019, cites this paper)
A Bayesian Choice Model for Eliminating Feedback Loops (2019, influential citation)
Linear Stochastic Bandits with Heavy-Tailed Payoffs (2019, cites this paper)
Dueling Bandits with Qualitative Feedback (2018, cites this paper)
Advancements in Dueling Bandits (2018, cites this paper)
Factored Bandits (2018, cites this paper)
Efficient Online Learning under Bandit Feedback (2018, cites this paper)
MergeDTS (2018, influential citation)
Online Evaluation of Rankers Using Multileaving (2018, cites this paper)
On Incomplete Noisy Sorting (2018, cites this paper)
Merge Double Thompson Sampling for Large Scale Online Ranker Evaluation (2018, influential citation)
Duelling Bandits with Weak Regret in Adversarial Environments (2018, cites this paper)
Battle of Bandits (2018, cites this paper)
Preference-based Online Learning with Dueling Bandits: A Survey (2018, influential citation)
Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces (2017, influential citation)
Dueling bandits for online ranker evaluation (2017, cites this paper)
Online Learning for the Control of Human Standing via Spinal Cord Stimulation (2017, cites this paper)
Bandits Dueling on Partially Ordered Sets (2017, cites this paper)
Theory of Randomized Optimization Heuristics (Dagstuhl Seminar 17191) (2017, cites this paper)
Multi-dueling Bandits with Dependent Arms (2017, cites this paper)
Minimal Exploration in Structured Stochastic Bandits (2017, cites this paper)
Dueling Bandits with Weak Regret (2017, influential citation)
Dueling Bandits with Dependent Arms (2016, cites this paper)
Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions (2016, cites this paper)
Verification Based Solution for Structured MAB Problems (2016, cites this paper)
Double Thompson Sampling for Dueling Bandits (2016, cites this paper)
Instance-dependent Regret Bounds for Dueling Bandits (2016, cites this paper)
Multi-Dueling Bandits and Their Application to Online Ranker Evaluation (2016, influential citation)