A Tutorial on Thompson Sampling

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband

Published 2017 in Found. Trends Mach. Learn.

ABSTRACT

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
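The simplest example named above, the Bernoulli bandit, makes the sample-then-act loop concrete. The sketch below is not the tutorial's own code. It assumes independent Beta(1, 1) priors on each arm's success probability and a simulated environment. The function name `thompson_bernoulli` and the arm probabilities in `true_probs` are hypothetical and exist only to generate rewards.

```python
import random


def thompson_bernoulli(true_probs, num_steps, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated K-armed bandit.

    true_probs: hypothetical success probabilities, used only to simulate rewards.
    Returns the Beta posterior parameters and the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta(1, 1) is the uniform prior on each arm's mean
    beta = [1.0] * k
    pulls = [0] * k
    for _ in range(num_steps):
        # 1. Draw one plausible mean for each arm from its current posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # 2. Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda i: samples[i])
        # 3. Observe a Bernoulli reward from the simulated environment.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # 4. Conjugate update: a success increments alpha, a failure increments beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls


alpha, beta, pulls = thompson_bernoulli([0.2, 0.5, 0.8], num_steps=1000)
print(pulls)
```

Because each posterior is a Beta distribution, the update is just a count increment. Exploration comes entirely from the randomness of the posterior draw: an arm with few observations has a wide posterior, so it still wins the sampled comparison occasionally, while a well-estimated poor arm rarely does.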

PUBLICATION RECORD

Publication year: 2017
Venue: Found. Trends Mach. Learn.
Publication date: 2017-07-07
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.1561/2200000070, arXiv 1707.02038
Source metadata: Semantic Scholar

REFERENCES

Choosing a good toolkit, II: Bayes-rule based heuristics (2020)
Online Network Revenue Management Using Thompson Sampling (2018)
Parallelised Bayesian Optimisation via Thompson Sampling (2018)
Coordinated Exploration in Concurrent Reinforcement Learning (2018)
Exploiting the Natural Exploration In Contextual Bandits (2017)
Choosing a Good Toolkit: Reinforcement Learning (2017)
Thompson Sampling for Stochastic Control: The Finite Parameter Case (2017)
Learning to Price with Reference Effects (2017)
Time-Sensitive Bandit Learning and Satisficing Thompson Sampling (2017)
On Optimistic versus Randomized Exploration in Reinforcement Learning (2017)
Asynchronous Parallel Bayesian Optimisation via Thompson Sampling (2017)
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach (2017)
Thompson Sampling for the MNL-Bandit (2017)
Information Directed Sampling for Stochastic Bandits with Graph Feedback (2017)
Convergence of Langevin MCMC in KL-divergence (2017)
Ensemble Sampling (2017)
Deep Exploration via Randomized Value Functions (2017), influential reference
An Efficient Bandit Algorithm for Realtime Multivariate Optimization (2017)
Customer Acquisition via Display Advertising Using Multi-Armed Bandit Experiments (2017)
Online Algorithms For Parameter Mean And Variance Estimation In Dynamic Regression Models (2016)
Simple Bayesian Algorithms for Best Arm Identification (2016)
Sampling from a strongly log-concave distribution with the Unadjusted Langevin Algorithm (2016)
Why is Posterior Sampling Better than Optimism for Reinforcement Learning? (2016)
High-dimensional Bayesian inference via the unadjusted Langevin algorithm (2016)
Linear Thompson Sampling Revisited (2016)
Deep Exploration via Bootstrapped DQN (2016)
Bayesian Reinforcement Learning: A Survey (2015)
Sampling from a Log-Concave Distribution with Projected Langevin Monte Carlo (2015)
Multi-scale exploration of convex functions and bandit convex optimization (2015)
Cascading Bandits: Learning to Rank in the Cascade Model (2015)
Reinforcement learning improves behaviour from evaluative feedback (2015)
Efficient Thompson Sampling for Online Matrix-Factorization Recommendation (2015)
Multi-armed bandit experiments in the online service economy (2015)
Near-optimal Reinforcement Learning in Factored MDPs (2014)
An Information-Theoretic Analysis of Thompson Sampling (2014)
Learning to Optimize via Information-Directed Sampling (2014)
Thompson sampling with the online bootstrap (2014)
Consistency and Fluctuations For Stochastic Gradient Langevin Dynamics (2014)
Thompson Sampling for Learning Parameterized Markov Decision Processes (2014)
Model-based Reinforcement Learning and the Eluder Dimension (2014)
Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards (2014)
Generalization and Exploration via Randomized Value Functions (2014)
LASER: a scalable response prediction platform for online advertising (2014)
Computational advertising: the LinkedIn way (2013)
Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors (2013)
Eluder Dimension and the Sample Complexity of Optimistic Exploration (2013)
(More) Efficient Reinforcement Learning via Posterior Sampling (2013)
Thompson Sampling for Complex Online Problems (2013)
Learning to Optimize via Posterior Sampling (2013)
Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search (2013)
Kullback–Leibler upper confidence bounds for optimal sequential allocation (2012)
Further Optimal Regret Bounds for Thompson Sampling (2012)
On Bayesian Upper Confidence Bounds for Bandit Problems (2012)
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems (2012)
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis (2012)
Thompson Sampling for Contextual Bandits with Linear Payoffs (2012)
Improved Algorithms for Linear Stochastic Bandits (2011)
An Empirical Evaluation of Thompson Sampling (2011)
Bayesian Learning via Stochastic Gradient Langevin Dynamics (2011)
Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine (2010), influential reference
A modern Bayesian look at the multi-armed bandit (2010)
A contextual-bandit approach to personalized news article recommendation (2010)
Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting (2009)
Index to Volume 63 (2009), influential reference
The Knowledge-Gradient Policy for Correlated Normal Beliefs (2009)
A Knowledge-Gradient Policy for Sequential Information Collection (2008)
An experimental comparison of click position-bias models (2008)
Stochastic Linear Optimization under Bandit Feedback (2008), influential reference
Near-optimal Regret Bounds for Reinforcement Learning (2008)
Multi-armed bandits in metric spaces (2008)
Linearly Parameterized Bandits (2008)
Finite-time Analysis of the Multiarmed Bandit Problem (2002)
Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise (2002)
A Bayesian Framework for Reinforcement Learning (2000)
Exploration and inference in learning from reinforcement (1998)
Introduction to Reinforcement Learning (1998)
Reinforcement Learning: An Introduction (1998)
Optimal scaling of discrete approximations to Langevin diffusions (1998)
Exponential convergence of Langevin distributions and their discrete approximations (1996)
Explaining the Gibbs Sampler (1992)
Multi-Armed Bandit Allocation Indices (1990)
Multi-armed Bandit Allocation Indices (1989)
The Multi-Armed Bandit Problem: Decomposition and Computation (1987)
Asymptotically efficient adaptive allocation rules (1985)
A dynamic allocation index for the discounted multiarmed bandit problem (1979)
On the Theory of Apportionment (1935)
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples (1933)
–armed Bandits (year unknown)
Analysis of Thompson Sampling for the Multi-armed Bandit Problem, 25th Annual Conference on Learning Theory (year unknown)

CITED BY

Bayesian Response-Adaptive Randomization for Cluster Randomized Controlled Trials (2026)
Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation? (2026)
Advancing green lithium-ion battery supply chains: A two-stage framework integrating reinforcement learning and mathematical modeling (2026)
EvoRoute: Experience-Driven Self-Routing LLM Agent Systems (2026)
Bandit Allocational Instability (2026)
In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior (2026)
TNCOA: Efficient Exploration via Observation-Action Constraint on Trajectory-Based Intrinsic Reward (2026)
Bayesian Online Model Selection (2026)
Statistical Reinforcement Learning in the Real World: A Survey of Challenges and Future Directions (2026)
In-Context Reinforcement Learning From Suboptimal Historical Data (2026), influential citation
Bandit Social Learning with Exploration Episodes (2026)
Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents (2026)
TSPINN: Thompson sampling-based adaptive training for physics-informed neural networks (2026)
Laplacian Kernelized Bandit (2026)
Integrating Contextual Causal Deep Networks and LLM-Guided Policies for Sequential Decision-Making (2026)
Expected Improvement via Gradient Norms (2026)
Adaptive Experimental Design Using Shrinkage Estimators (2026)
MoEPlan: A Lazy Learned Query-Selection Optimizer via Mixture of Optimizer Experts (2026)
Beyond the Node: Clade-level Selection for Efficient MCTS in Automatic Heuristic Design (2026)
Certificate-Guided Pruning for Stochastic Lipschitz Optimization (2026)
Integrating Multi-Armed Bandit, Active Learning, and Distributed Computing for Scalable Optimization (2026), influential citation
Evolution of behavioral flexibility and the forming and breaking of habits (2026), influential citation
Regime-Adaptive Bayesian Optimization via Dirichlet Process Mixtures of Gaussian Processes (2026)
A Survey of Multi-Armed Bandit Algorithms: From Theoretical Foundations to Modern Applications (2026)
Compression Efficiency and Structural Learning as a Computational Model of DLN Cognitive Stages (2026)
Sensorimotor Mechanisms of Decisions and Actions (2026)
Optimism Stabilizes Thompson Sampling for Adaptive Inference (2026)
Towards Provable Emergence of In-Context Reinforcement Learning (2025)
Thompson Sampling vs OLS: Analyzing Tariff Impact on Export Decisions (2025)
LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs (2025)
Graph Random Features for Scalable Gaussian Processes (2025)
Swallow Search Algorithm (SWSO): A Swarm Intelligence Optimization Approach Inspired by Swallow Bird Behavior (2025)
Dual-driven optimization of collaborative multi-agent via case learning and curiosity (2025)
GLMFuzz: Vulnerability Knowledge Guided Prompting for Efficient Network Protocol Fuzzing (2025)
Latent Preference Bandits (2025)
Passive Detection of Fat Users in WiFi Networks using Thompson Sampling (2025)
Game-Theoretic and Reinforcement Learning-Based Cluster Head Selection for Energy-Efficient Wireless Sensor Network (2025)
TS-Insight: Visualizing Thompson Sampling for Verification and XAI (2025), influential citation
Dual-Directed Algorithm Design for Efficient Pure Exploration (2025)
Spectral Bayesian Optimization Using a Physics-Informed Rational Szegö Kernel for Microwave Design (2025)
Energy-Efficient Routing Algorithm for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach (2025)
Fine-tuning LLMs with variational Bayesian last layer for high-dimensional Bayesian optimization (2025)
Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards (2025)
Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback (2025)
FreeWavm: Enhanced WebAssembly Runtime Fuzzing Guided by Parse Tree Mutation and Snapshot (2025)
Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds (2025)
Resource Management for Stochastic Parallel Synchronous Tasks: Bandits to the Rescue (2025)
Counterfactual Explanation of Shapley Value in Data Coalitions (2025)
RCFuzzer: Recommendation-based Collaborative Fuzzer (2025)
Sampling from Gaussian processes: a tutorial and applications in global sensitivity analysis and optimization (2025)
AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor Towards Scaling Workloads (2025)
Practical Contextual Bandits for Large-Scale Structured Discrete Constrained Optimization Problems (2025)
Optimizing Spectrum and Energy Efficiency in a Wifi-Based Industrial IoT Network (2025)
SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models (2025)
Information, certainty, and learning (2025)
Stochastic Information Geometry: Characterization of Fréchet Means of Gaussian Fields in Poisson Networks (2025)
Knowledge graph-aided Bayesian active learning for top-K genetic interaction discovery (2025)
Incentivized Lipschitz Bandits (2025)
MoDAF: A Multi-objective Divide-and-Conquer Parameter Tuning Framework for CGRAs (2025)
A Minimalist Bayesian Framework for Stochastic Optimization (2025), influential citation
Optimizing Ad Recommendations Using A Bayesian Multi-Armed Bandit Approach (2025)
Online Bayesian Risk-Averse Reinforcement Learning (2025)
Optimization of Epsilon-Greedy Exploration (2025)
Latent Thompson Sampling-Based mmWave Receive Beam Measurement and Selection to Tackle User Orientation Changes and Mobility (2025)
Multilevel Monte Carlo for asymptotically efficient path tracing (2025)
Parallel constrained Bayesian optimization via batched Thompson sampling with enhanced active learning process for reliability-based design optimization (2025)
Bayesian adaptive randomization in the I-SPY2 sequential multiple assignment randomized trial (2025)
A Survey on Causal Inference-Driven Data Bias Optimization in Recommendation Systems: Principles, Opportunities and Challenges (2025)
Using LLMs to improve RL policies in personalized health adaptive interventions (2025)
Toward Efficient Exploration by Large Language Model Agents (2025)
Addressing Missing Data Issue for Diffusion-based Recommendation (2025), influential citation
DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning (2025)
Gaussian Process with Vine Copula-Based Context Modeling for Contextual Multi-Armed Bandits (2025)
Reducing Computational Time in Pixel-Based Path Planning for GMA-DED by Using Multi-Armed Bandit Reinforcement Learning Algorithm (2025)
Computationally and Sample Efficient Safe Reinforcement Learning Using Adaptive Conformal Prediction (2025)
Adversarial Geometric Attacks for 3D Point Cloud Object Tracking (2025)
Data-Driven Non-Parametric Model Learning and Adaptive Control of MDPs with Borel spaces: Identifiability and Near Optimal Design (2025), influential citation
Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping (2025)
RL-MAB-Based Resource Allocation for Efficient Bandwidth Utilization in Industrial IoT Networks (2025)
Like Adding a Small Weight to a Scale About to Tip: Personalizing Micro-Financial Incentives for Digital Wellbeing (2025)
Finding Friendly Neighborhood: Optimal D2D Relaying in mmWave IoT Networks (2025)
Learning-Based Bundling Strategy for Two Products Under Uncertain Consumer's Valuations (2025)
Steering Generative Models with Experimental Data for Protein Fitness Optimization (2025)
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms (2025)
Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype (2025)
Optimizing URLLC in Open RAN: A Deep Reinforcement Learning-Based Trade-Off Analysis (2025)
Online radar screening pulse width allocation strategy based on non-stationary bandit (2025)
Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks (2025)
A Review of Simulation Optimization with Connection to Artificial Intelligence (2025)
Designing digital health interventions with causal inference and multi-armed bandits: a review (2025)
Provably Learning from Language Feedback (2025)
Martingale Posterior Neural Networks for Fast Sequential Decision Making (2025)
Strategic Scaling of Test-Time Compute: A Bandit Learning Approach (2025)
Agentic Personalisation of Cross-Channel Marketing Experiences (2025)
A novel active learning stochastic Kriging metamodel for improving reliability and stability of additive manufacturing processes (2025)
Dynamic Care Unit Placements Under Unknown Demand with Learning (2025)
Enhancing Adaptive Behavioral Interventions with LLM Inference from Participant-Described States (2025), influential citation
BandFuzz: An ML-powered Collaborative Fuzzing Framework (2025)
An Efficient and Accurate Random Forest Node-Splitting Algorithm Based on Dynamic Bayesian Methods (2025)
Age-of-information minimization under energy harvesting and non-stationary environment (2025)