(More) Efficient Reinforcement Learning via Posterior Sampling

Ian Osband, Daniel Russo, Benjamin Van Roy

Published 2013 in Neural Information Processing Systems

ABSTRACT

Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
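
The episode loop described in the abstract (sample one MDP from the posterior, solve it, follow the resulting policy for τ steps) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Dirichlet prior over transitions, the Bernoulli rewards with Beta priors, the fixed start state 0, and the names `solve_finite_horizon` and `psrl` are all assumptions made for this example.

```python
import numpy as np


def solve_finite_horizon(P, R, tau):
    """Backward induction for a known tabular MDP.

    P has shape (S, A, S), R has shape (S, A).
    Returns a time-dependent greedy policy of shape (tau, S).
    """
    S, A, _ = P.shape
    V = np.zeros(S)                      # value after the final step
    policy = np.zeros((tau, S), dtype=int)
    for t in reversed(range(tau)):
        Q = R + P @ V                    # (S, A) action values at step t
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(P_true, R_true, tau, n_episodes, rng):
    """Posterior-sampling loop on a simulated MDP (illustrative only).

    Assumed model: Dirichlet(1, ..., 1) prior on each transition row and
    Bernoulli rewards with Beta(1, 1) priors. R_true holds success
    probabilities in [0, 1].
    """
    S, A, _ = P_true.shape
    trans_counts = np.ones((S, A, S))    # Dirichlet parameters
    r_alpha = np.ones((S, A))            # Beta parameters for rewards
    r_beta = np.ones((S, A))
    total_reward = 0.0

    for _ in range(n_episodes):
        # Draw a single MDP from the current posterior.
        P_sample = np.empty((S, A, S))
        for s in range(S):
            for a in range(A):
                P_sample[s, a] = rng.dirichlet(trans_counts[s, a])
        R_sample = rng.beta(r_alpha, r_beta)

        # Act optimally for the sampled MDP for the whole episode.
        policy = solve_finite_horizon(P_sample, R_sample, tau)
        s = 0                            # assumed fixed start state
        for t in range(tau):
            a = policy[t, s]
            r = float(rng.random() < R_true[s, a])
            s_next = rng.choice(S, p=P_true[s, a])
            # Record the observation; it only affects later episodes,
            # because the policy is fixed until the next sample.
            trans_counts[s, a, s_next] += 1
            r_alpha[s, a] += r
            r_beta[s, a] += 1.0 - r
            total_reward += r
            s = s_next

    return total_reward, trans_counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P_true = rng.dirichlet(np.ones(3), size=(3, 2))   # 3 states, 2 actions
    R_true = rng.random((3, 2))
    reward, _ = psrl(P_true, R_true, tau=5, n_episodes=200, rng=rng)
    print("total reward:", reward)
```

Conjugate priors are used here only because they keep the posterior update to a count increment; any prior from which an MDP can be sampled would fit the same loop.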

PUBLICATION RECORD

Publication year
2013
Venue
Neural Information Processing Systems
Publication date
2013-06-04
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1306.0940
Source metadata
Semantic Scholar

REFERENCES

Learning to Optimize via Posterior Sampling
2013cited by this paper
Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
2012cited by this paper
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
2012cited by this paper
Thompson Sampling for Contextual Bandits with Linear Payoffs
2012cited by this paper
Further Optimal Regret Bounds for Thompson Sampling
2012cited by this paper
An Empirical Evaluation of Thompson Sampling
2011cited by this paper
Approaching Bayes-optimality using Monte-Carlo tree search
2011cited by this paper
Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence
2010influential reference
Optimism in reinforcement learning and Kullback-Leibler divergence
2010cited by this paper
A modern Bayesian look at the multi-armed bandit
2010cited by this paper
REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs
2009cited by this paper
Near-Bayesian exploration in polynomial time
2009cited by this paper
Near-optimal Regret Bounds for Reinforcement Learning
2008cited by this paper
An analysis of model-based Interval Estimation for Markov Decision Processes
2008influential reference
Bayesian sparse sampling for on-line reward optimization
2005cited by this paper
On the sample complexity of reinforcement learning.
2003cited by this paper
Near-Optimal Reinforcement Learning in Polynomial Time
2002cited by this paper
R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
2001cited by this paper
A Bayesian Framework for Reinforcement Learning
2000influential reference
Optimal Adaptive Policies for Markov Decision Processes
1997cited by this paper
Stochastic Systems: Estimation, Identification, and Adaptive Control
1986cited by this paper
Asymptotically efficient adaptive allocation rules
1985cited by this paper
On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples
1933cited by this paper

CITED BY

No One Size Fits All: QueryBandits for Hallucination Mitigation
2026cites this paper
Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers
2026cites this paper
Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
2026influential citation
Distributional Active Inference
2026influential citation
From Wasserstein to Maximum Mean Discrepancy Barycenters: A Novel Framework for Uncertainty Propagation in Model-Free Reinforcement Learning
2026cites this paper
VariBASed: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning
2026cites this paper
Smart Exploration in Reinforcement Learning Using Bounded Uncertainty Models
2025cites this paper
Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors
2025cites this paper
Safe and Near-Optimal Control with Online Dynamics Learning
2025cites this paper
EVaDE : Event-Based Variational Thompson Sampling for Model-Based Reinforcement Learning
2025cites this paper
Functional Critics Are Essential for Actor-Critic: From Off-Policy Stability to Efficient Exploration
2025cites this paper
Exploration via Feature Perturbation in Contextual Bandits
2025cites this paper
Bayesian Optimization for Dynamic Pricing and Learning
2025cites this paper
Online Bayesian Risk-Averse Reinforcement Learning
2025cites this paper
Concurrent Learning with Aggregated States via Randomized Least Squares Value Iteration
2025cites this paper
Posterior Sampling for Reinforcement Learning on Graphs
2025influential citation
Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments
2025cites this paper
Reinforcement Learning from Multi-level and Episodic Human Feedback
2025cites this paper
Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
2025cites this paper
Spectral Bellman Method: Unifying Representation and Exploration in RL
2025cites this paper
Sample-Efficient Reinforcement Learning From Human Feedback via Information-Directed Sampling
2025cites this paper
The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective
2025influential citation
Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions
2025cites this paper
Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
2025cites this paper
Towards AI-based precision rehabilitation via contextual model-based reinforcement learning
2025cites this paper
Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits
2025cites this paper
Sample Efficient Exploration Policy for Asynchronous Q-Learning
2025cites this paper
Deep Actor-Critics with Tight Risk Certificates
2025cites this paper
Autonomous Landing of the Quadrotor on the Mobile Platform via Meta Reinforcement Learning
2025cites this paper
Stochastic Path Planning in Correlated Obstacle Fields
2025cites this paper
Priors Matter: Addressing Misspecification in Bayesian Deep Q-Learning
2025cites this paper
Toward Efficient Exploration by Large Language Model Agents
2025influential citation
QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting
2025cites this paper
Q-learning with Posterior Sampling
2025cites this paper
Adaptive Exploration for Multi-Reward Multi-Policy Evaluation
2025cites this paper
Best Policy Learning from Trajectory Preference Feedback
2025cites this paper
Sequential Bayesian Replacement With Unknown Transition Probabilities
2025cites this paper
A plug-and-play fully on-the-job real-time reinforcement learning algorithm for a direct-drive tandem-wing experiment platforms under multiple random operating conditions
2025cites this paper
Bayesian Meta-Reinforcement Learning with Laplace Variational Recurrent Networks
2025cites this paper
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems
2025cites this paper
Studying Exploration in RL: An Optimal Transport Analysis of Occupancy Measure Trajectories
2025influential citation
Epistemically-guided forward-backward exploration
2025influential citation
Outcome-based Exploration for LLM Reasoning
2025cites this paper
Minimax Optimal Reinforcement Learning with Quasi-Optimism
2025influential citation
ALLC: autonomous lightweight distributed ledger constructor for securing IoT information
2025cites this paper
Large Language Models Think Too Fast To Explore Effectively
2025cites this paper
Leveraging priors on distribution functions for multi-arm bandits
2025cites this paper
UAMDP: Uncertainty-Aware Markov Decision Process for Risk-Constrained Reinforcement Learning from Probabilistic Forecasts
2025cites this paper
Learning Task Belief Similarity with Latent Dynamics for Meta-Reinforcement Learning
2025cites this paper
Provably Efficient and Agile Randomized Q-Learning
2025cites this paper
No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes
2025influential citation
The Confusing Instance Principle for Online Linear Quadratic Control
2025cites this paper
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
2025influential citation
Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems
2025cites this paper
Random Latent Exploration for Deep Reinforcement Learning
2024cites this paper
Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning
2024cites this paper
Optimistic Q-learning for average reward and episodic reinforcement learning
2024cites this paper
Demystifying Linear MDPs and Novel Dynamics Aggregation Framework
2024cites this paper
More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling
2024cites this paper
Beyond Optimism: Exploration With Partially Observable Rewards
2024cites this paper
Heavy-Tailed Reinforcement Learning With Penalized Robust Estimator
2024cites this paper
Satisficing Exploration for Deep Reinforcement Learning
2024influential citation
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
2024cites this paper
Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling
2024cites this paper
Model-based Policy Optimization under Approximate Bayesian Inference
2024cites this paper
Hierarchical reinforcement Thompson composition
2024cites this paper
Reinforcement Learning and Regret Bounds for Admission Control
2024influential citation
Preparing for Black Swans: The Antifragility Imperative for Machine Learning
2024influential citation
HyperAgent: A Simple, Scalable, Efficient and Provable Reinforcement Learning Framework for Complex Environments
2024cites this paper
Function-space Parameterization of Neural Networks for Sequential Learning
2024cites this paper
Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference
2024cites this paper
Model-Free Active Exploration in Reinforcement Learning
2024influential citation
Model-Free Approximate Bayesian Learning for Large-Scale Conversion Funnel Optimization
2024cites this paper
Utilizing Maximum Mean Discrepancy Barycenter for Propagating the Uncertainty of Value Functions in Reinforcement Learning
2024cites this paper
Deep Exploration with PAC-Bayes
2024cites this paper
Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity
2024cites this paper
Opponent Modeling with In-context Search
2024cites this paper
Isoperimetry is All We Need: Langevin Posterior Sampling for RL with Sublinear Regret
2024influential citation
SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies
2024cites this paper
Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions
2024cites this paper
Risk-Aware Decision Making in Restless Bandits: Theory and Algorithms for Planning and Learning
2024cites this paper
Near-Optimal Reinforcement Learning with Shuffle Differential Privacy
2024cites this paper
Practical Bayesian Algorithm Execution via Posterior Sampling
2024cites this paper
EVOLvE: Evaluating and Optimizing LLMs For Exploration
2024cites this paper
Random Policy Enables In-Context Reinforcement Learning within Trust Horizons
2024cites this paper
Generalized Bayesian deep reinforcement learning
2024cites this paper
Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions for Regularization in Reinforcement Learning
2024cites this paper
Implicit Human Perception Learning in Complex and Unknown Environments
2024cites this paper
SHIRE: Enhancing Sample Efficiency using Human Intuition in REinforcement Learning
2024cites this paper
Goal-Conditioned Hierarchical Reinforcement Learning With High-Level Model Approximation
2024cites this paper
Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits
2024cites this paper
How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories
2024influential citation
Online Planning of Power Flows for Power Systems Against Bushfires Using Spatial Context
2024cites this paper
Prior-dependent analysis of posterior sampling reinforcement learning with function approximation
2024cites this paper
Information-directed policy sampling for episodic Bayesian Markov decision processes
2024cites this paper
Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
2024cites this paper
Online MDP with Transition Prototypes: A Robust Adaptive Approach
2024cites this paper
Robot Motion Planning with Uncertainty and Urgency
2023influential citation
Thompson sampling for improved exploration in GFlowNets
2023cites this paper
Supervised Pretraining Can Learn In-Context Reinforcement Learning
2023influential citation