A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

Published 2015 in International Conference on Machine Learning

ABSTRACT

We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose an efficient algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. We prove a finite-time expected regret upper bound of order O(√(K ln(K) T)) for this algorithm and a general lower bound of order Ω(√(KT)). Finally, we provide experimental results using real data from information retrieval applications.
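
As a rough numerical illustration of the two regret orders quoted in the abstract, the sketch below evaluates sqrt(K ln(K) T) and sqrt(K T) for a few values of K and T. It is not taken from the paper: the function names `upper_order`, `lower_order`, and `gap` are hypothetical, the sample values of K and T are arbitrary, and the constants hidden by the O and Ω notation are ignored, so only the growth rates are meaningful.

```python
import math


def upper_order(k: int, t: int) -> float:
    """Order of the stated upper bound, sqrt(K ln(K) T), with constants dropped."""
    return math.sqrt(k * math.log(k) * t)


def lower_order(k: int, t: int) -> float:
    """Order of the stated lower bound, sqrt(K T), with constants dropped."""
    return math.sqrt(k * t)


def gap(k: int, t: int) -> float:
    """Ratio of the two orders; algebraically this is sqrt(ln K) for K >= 2."""
    return upper_order(k, t) / lower_order(k, t)


if __name__ == "__main__":
    # Arbitrary illustrative values of K (arms) and T (rounds).
    for k, t in [(10, 10_000), (10, 1_000_000), (100, 10_000)]:
        print(f"K={k} T={t} upper~{upper_order(k, t):.1f} "
              f"lower~{lower_order(k, t):.1f} ratio~{gap(k, t):.3f}")
```

Under these assumptions the ratio of the two orders depends only on K, through the factor sqrt(ln K), and does not grow with the horizon T.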

PUBLICATION RECORD

Publication year
2015
Venue
International Conference on Machine Learning
Publication date
2015-07-06
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1601.03855
Source metadata
Semantic Scholar

REFERENCES

MergeRUCB: A Method for Large-Scale Online Ranker Evaluation
2015, cited by this paper
Reducing Dueling Bandits to Cardinal Bandits
2014, cited by this paper
Relative confidence sampling for efficient on-line ranker evaluation
2014, cited by this paper
Partial Monitoring - Classification, Regret Bounds, and Algorithms
2014, cited by this paper
Preference-Based Rank Elicitation using Statistical Models: The Case of Mallows
2014, cited by this paper
A Survey of Preference-Based Online Learning with Bandit Algorithms
2014, cited by this paper
Generic Exploration and K-armed Voting Bandits
2013, cited by this paper
Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments
2013, cited by this paper
Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
2013, influential reference
Top-k Selection based on Adaptive Sampling of Noisy Preferences
2013, cited by this paper
A near-optimal algorithm for finite partial-monitoring games against adversarial opponents
2013, cited by this paper
Combinatorial Bandits
2012, cited by this paper
Large-scale validation and analysis of interleaved search evaluation
2012, cited by this paper
The K-armed Dueling Bandits Problem
2012, cited by this paper
Beat the Mean Bandit
2011, cited by this paper
Preference Learning
2010, cited by this paper
An updated survey on the linear ordering problem for weighted or unweighted tournaments
2010, cited by this paper
Interactively optimizing information retrieval systems as a dueling bandits problem
2009, cited by this paper
Active exploration for learning rankings from clickthrough data
2007, cited by this paper
Noisy binary search and its applications
2007, cited by this paper
Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search
2007, cited by this paper
LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval
2007, cited by this paper
Prediction, learning, and games
2006, cited by this paper
Finite-time Analysis of the Multiarmed Bandit Problem
2002, cited by this paper
The Nonstochastic Multiarmed Bandit Problem
2002, cited by this paper
Discrete Prediction Games with Arbitrary Feedback and Loss
2001, cited by this paper
An Efficient Boosting Algorithm for Combining Preferences
2001, cited by this paper
Adaptive game playing using multiplicative weights
1999, cited by this paper
An Efficient Boosting Algorithm for Combining Preferences
1998, cited by this paper

CITED BY

Regularized Online RLHF with Generalized Bilinear Preferences
2026, cites this paper
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
2025, cites this paper
Active Human Feedback Collection via Neural Contextual Dueling Bandits
2025, cites this paper
Reinforcement Learning from Adversarial Preferences in Tabular MDPs
2025, cites this paper
Federated Linear Dueling Bandits
2025, cites this paper
Heterogeneous Adversarial Play in Interactive Environments
2025, cites this paper
Online Clustering of Dueling Bandits
2025, cites this paper
DP-Dueling: Learning from Preference Feedback without Compromising User Privacy
2024, cites this paper
Non-Stationary Dueling Bandits Under a Weighted Borda Criterion
2024, cites this paper
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
2024, cites this paper
Principled Preferential Bayesian Optimization
2024, cites this paper
Finding Bayesian Nash Equilibrium in DHR
2024, cites this paper
Biased Dueling Bandits with Stochastic Delayed Feedback
2024, cites this paper
Neural Dueling Bandits
2024, cites this paper
Adversarial Multi-dueling Bandits
2024, cites this paper
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback
2024, cites this paper
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback
2024, cites this paper
When Can We Track Significant Preference Shifts in Dueling Bandits?
2023, cites this paper
Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources
2023, cites this paper
Identifying Copeland Winners in Dueling Bandits with Indifferences
2023, cites this paper
Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems
2023, cites this paper
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
2023, cites this paper
Versatile Dueling Bandits: Best-of-both World Analyses for Learning from Relative Preferences
2022, influential citation
Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences
2022, influential citation
Exploiting Correlation to Achieve Faster Learning Rates in Low-Rank Preference Bandits
2022, cites this paper
Dueling Convex Optimization with General Preferences
2022, cites this paper
ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits
2022, influential citation
Stochastic Dueling Bandits with Adversarial Corruption
2021, cites this paper
Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits
2021, influential citation
Dueling RL: Reinforcement Learning with Trajectory Preferences
2021, cites this paper
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
2021, cites this paper
Dueling Bandits with Adversarial Sleeping
2021, cites this paper
Adversarial Dueling Bandits
2020, influential citation
Best-item Learning in Random Utility Models with Subset Choices
2020, cites this paper
Bandits in the Plackett-Luce Model
2019, cites this paper
Duelling Bandits with Weak Regret in Adversarial Environments
2018, influential citation
Advancements in Dueling Bandits
2018, cites this paper
MergeDTS
2018, influential citation
Preference-based Online Learning with Dueling Bandits: A Survey
2018, influential citation
PAC Battling Bandits in the Plackett-Luce Model
2018, cites this paper
PAC-Battling Bandits with Plackett-Luce: Tradeoff between Sample Complexity and Subset Size
2018, cites this paper
Online Evaluation of Rankers Using Multileaving
2018, cites this paper
Multi-dueling Bandits with Dependent Arms
2017, cites this paper
Bandits multi-armés avec rétroaction partielle
2017, cites this paper
Dueling bandits for online ranker evaluation
2017, cites this paper
Online Learning for the Control of Human Standing via Spinal Cord Stimulation
2017, cites this paper
Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces
2017, cites this paper
Towards Conversational Recommender Systems
2016, cites this paper
Dynamic social cloud management scheme based on transformable Stackelberg game
2016, cites this paper
Utility-based Dueling Bandits as a Partial Monitoring Game
2015, cites this paper