SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

Subhojyoti Mukherjee, Qiaomin Xie, Josiah P. Hanna, R. Nowak

Published 2023 in International Conference on Artificial Intelligence and Statistics

ABSTRACT

We study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. This is the first work to study optimal data collection strategies for policy evaluation with heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the mean squared error (MSE) of the estimated value of the target policy. We then use this formulation to derive the optimal allocation of samples per action during data collection. Next, we introduce SPEED (Structured Policy Evaluation Experimental Design), a novel algorithm that tracks the optimal design, and derive its regret with respect to that design. Finally, we empirically validate that SPEED yields policy evaluation with MSE comparable to that of the oracle strategy and significantly lower than that of simply running the target policy.
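To make the role of the sample allocation concrete, here is a minimal Python sketch. It is not the SPEED algorithm or the paper's exact design objective. It only uses the standard fact that, when the noise variances are known, the weighted least squares estimate of a linear policy value w^T theta has variance w^T A^{-1} w, where A is the variance-weighted design matrix. The function name `wls_value_variance`, the toy features, the variances, and the two allocations are all assumptions chosen for illustration.

```python
import numpy as np


def wls_value_variance(X, sigma2, pi, counts):
    """Variance of the WLS plug-in estimate of the target policy's value.

    X: (K, d) action features; sigma2: (K,) noise variances (assumed known);
    pi: (K,) target policy probabilities; counts: (K,) samples per action.
    """
    X = np.asarray(X, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    pi = np.asarray(pi, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Weighted design matrix A = sum_a n_a * x_a x_a^T / sigma_a^2.
    A = (X * (counts / sigma2)[:, None]).T @ X
    # The policy value is w^T theta with w = sum_a pi(a) * x_a.
    w = X.T @ pi
    return float(w @ np.linalg.solve(A, w))


# Hypothetical toy instance: one-hot features, i.e. a plain 3-armed bandit.
X = np.eye(3)
sigma2 = np.array([1.0, 4.0, 9.0])
pi = np.array([0.5, 0.3, 0.2])
n = 600

# Allocation 1: collect data by simply running the target policy.
on_policy = n * pi
# Allocation 2: sample each action in proportion to pi(a) * sigma(a).
b = pi * np.sqrt(sigma2)
variance_aware = n * b / b.sum()

var_on_policy = wls_value_variance(X, sigma2, pi, on_policy)
var_aware = wls_value_variance(X, sigma2, pi, variance_aware)
print(var_on_policy, var_aware)
```

In this toy instance the variance-aware allocation gives a variance of 2.89/600, against 3.5/600 for on-policy sampling, which illustrates why steering samples toward noisier actions can lower the error of the value estimate. With general features the best allocation has no such simple closed form and must be found by solving a design problem.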

PUBLICATION RECORD

Publication date
2023-01-29
Fields of study
Mathematics, Computer Science, Economics
Identifiers
DOI: 10.48550/arXiv.2301.12357; arXiv: 2301.12357

REFERENCES

Multi-task Representation Learning for Pure Exploration in Bilinear Bandits (2023, cited by this paper)
Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making (2023, cited by this paper)
Efficient and Interpretable Bandit Algorithms (2023, cited by this paper)
Bandit Learning with General Function Classes: Heteroscedastic Noise and Variance-dependent Regret Bounds (2022, influential reference)
ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling (2022, cited by this paper)
Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs (2022, cited by this paper)
Safe Optimal Design with Applications in Policy Learning (2022, cited by this paper)
Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits (2022, cited by this paper)
Safe Exploration for Efficient Policy Evaluation and Comparison (2022, influential reference)
Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP (2021, cited by this paper)
Safe Data Collection for Offline and Online Policy Learning (2021, cited by this paper)
Nearly Optimal Algorithms for Level Set Estimation (2021, cited by this paper)
Supplementary material to “Online A-Optimal Design and Active Linear Regression” (2021, influential reference)
A Unified Approach to Translate Classical Bandit Algorithms to Structured Bandits (2021, cited by this paper)
Improved Algorithms for Agnostic Pool-based Active Classification (2021, cited by this paper)
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning (2021, cited by this paper)
Optimal Off-Policy Evaluation from Multiple Logging Policies (2020, cited by this paper)
An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits (2020, cited by this paper)
Bandit Algorithms (2020, cited by this paper)
Chernoff Sampling for Active Testing and Extension to Active Regression (2020, cited by this paper)
Deep Jump Learning for Off-Policy Evaluation in Continuous Treatment Settings (2020, cited by this paper)
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes (2020, cited by this paper)
Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking (2020, cited by this paper)
Doubly Robust Off-Policy Evaluation with Shrinkage (2019, cited by this paper)
Sequential Experimental Design for Transductive Linear Bandits (2019, cited by this paper)
Information Directed Sampling and Bandits with Heteroscedastic Noise (2018, influential reference)
Data-Efficient Policy Evaluation Through Behavior Policy Search (2017, influential reference)
Active Heteroscedastic Regression (2017, influential reference)
Online Controlled Experiments and A/B Testing (2017, cited by this paper)
Minimal Exploration in Structured Stochastic Bandits (2017, cited by this paper)
Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe (2017, cited by this paper)
Structured Best Arm Identification with Fixed Confidence (2017, cited by this paper)
Active Learning for Accurate Estimation of Linear Models (2017, cited by this paper)
OFFER: Off-Environment Reinforcement Learning (2017, cited by this paper)
Efficient-UCBV: An Almost Optimal Algorithm using Variance Estimates (2017, cited by this paper)
Off-Policy Evaluation for Slate Recommendation (2016, cited by this paper)
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits (2016, cited by this paper)
Residual Weighted Learning for Estimating Individualized Treatment Rules (2015, cited by this paper)
Adaptive Strategy for Stratified Monte Carlo Sampling (2015, influential reference)
Toward Minimax Off-policy Value Estimation (2015, cited by this paper)
Online Learning to Sample (2015, influential reference)
Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees (2014, cited by this paper)
Doubly Robust Policy Evaluation and Optimization (2014, cited by this paper)
High-Dimensional Statistics (2014, influential reference)
An Affine Invariant Linear Convergence Analysis for Frank-Wolfe Algorithms (2013, cited by this paper)
Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising (2012, influential reference)
Minimax Number of Strata for Online Stratified Sampling Given Noisy Samples (2012, cited by this paper)
Improved Algorithms for Linear Stochastic Bandits (2011, cited by this paper)
Finite Time Analysis of Stratified Sampling for Monte Carlo (2011, influential reference)
Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms (2010, cited by this paper)
Modeling Wine Preferences by Data Mining from Physicochemical Properties (2009, cited by this paper)
Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits (2009, cited by this paper)
On Field Calibration of an Electronic Nose for Benzene Estimation in an Urban Pollution Monitoring Scenario (2008, cited by this paper)
Introduction to Nonparametric Estimation (2008, influential reference)
Active Learning in Multi-armed Bandits (2008, influential reference)
Linearly Parameterized Bandits (2008, cited by this paper)
Interactive Machine Learning (2003, cited by this paper)
Finite-time Analysis of the Multiarmed Bandit Problem (2002, cited by this paper)
Reinforcement Learning: An Introduction (1998, cited by this paper)
Inequalities for the Trace of Matrix Product (1994, cited by this paper)
Optimal Design of Experiments (1994, influential reference)
Theory of Optimal Experiments (1972, cited by this paper)
The Equivalence of Two Extremum Problems (1960, cited by this paper)
A Multivariate Generalization of Tchebichev's Inequality (1958, cited by this paper)
On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples (1933, cited by this paper)
Analysis of Thompson Sampling for the Multi-armed Bandit Problem, 25th Annual Conference on Learning Theory (year unknown, cited by this paper)
Asymptotically Efficient Adaptive Allocation Rules (year unknown, cited by this paper)

CITED BY

Experimental Design for Active Transductive Inference in Large Language Models (2024, cites this paper)
SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP (2024, cites this paper)
Optimal Design for Human Feedback (2024, cites this paper)
Experimental Designs for Heteroskedastic Variance (2023, cites this paper)
Multi-task Representation Learning for Pure Exploration in Bilinear Bandits (2023, influential citation)