Data-Efficient Policy Evaluation Through Behavior Policy Search

Josiah P. Hanna, P. Thomas, P. Stone, S. Niekum

Published 2017 in International Conference on Machine Learning

ABSTRACT

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy, that is, the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
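
The claim that off-policy data can give unbiased estimates with lower error than deploying the evaluated policy can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a hypothetical one-step problem with two actions and deterministic positive rewards (all numbers are made up), and it uses the standard importance-sampling estimate, which reweights each reward by the ratio of evaluation-policy to behavior-policy probability. Mean and variance are computed exactly rather than sampled. The behavior policy is chosen proportional to `pi_e(a) * r(a)`, a textbook importance-sampling result for this simple setting; the names `rewards`, `pi_e`, `pi_b`, and `is_estimator_moments` are illustrative.

```python
# Toy one-step problem. All numbers here are illustrative assumptions.
rewards = [1.0, 10.0]   # deterministic, positive reward for each action
pi_e = [0.9, 0.1]       # evaluation policy whose expected reward we want


def is_estimator_moments(pi_b):
    """Exact mean and variance of the single-sample importance-sampling
    estimate (pi_e(a) / pi_b(a)) * r(a), where the action a is drawn from pi_b.

    pi_b must give nonzero probability to every action that pi_e can take.
    """
    values = [pe / pb * r for pe, pb, r in zip(pi_e, pi_b, rewards)]
    mean = sum(pb * v for pb, v in zip(pi_b, values))
    var = sum(pb * (v - mean) ** 2 for pb, v in zip(pi_b, values))
    return mean, var


# Ground truth: expected reward under the evaluation policy.
true_value = sum(pe * r for pe, r in zip(pi_e, rewards))

# Standard technique: deploy pi_e itself (all importance weights equal 1).
on_policy_mean, on_policy_var = is_estimator_moments(pi_e)

# Alternative: sample from a behavior policy proportional to pi_e(a) * r(a).
weights = [pe * r for pe, r in zip(pi_e, rewards)]
pi_b = [w / sum(weights) for w in weights]
off_policy_mean, off_policy_var = is_estimator_moments(pi_b)

print("true value:", true_value)
print("on-policy  mean, variance:", on_policy_mean, on_policy_var)
print("off-policy mean, variance:", off_policy_mean, off_policy_var)
```

Both estimators have the same mean as the true value, so both are unbiased, but the reweighted estimate takes the same value for either action and its variance collapses to (numerically) zero. Building this behavior policy required knowing the rewards in advance, which mirrors the abstract's point that the optimal behavior policy depends on terms unknown in practice and therefore has to be searched for.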

PUBLICATION RECORD

Publication year: 2017
Venue: International Conference on Machine Learning
Publication date: 2017-06-12
Fields of study: Computer Science
Identifiers: arXiv:1706.03469

REFERENCES

OFFER: Off-Environment Reinforcement Learning
2017, influential reference
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
2016, cited by this paper
Benchmarking Deep Reinforcement Learning for Continuous Control
2016, influential reference
High-Confidence Off-Policy Evaluation
2015, cited by this paper
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
2015, cited by this paper
Doubly Robust Off-policy Evaluation for Reinforcement Learning
2015, cited by this paper
Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees
2015, cited by this paper
A Notation for Markov Decision Processes
2015, cited by this paper
Trust Region Policy Optimization
2015, cited by this paper
Continuous control with deep reinforcement learning
2015, cited by this paper
Model-Free Intelligent Diabetes Management Using Machine Learning
2014, cited by this paper
True online TD(λ)
2014, cited by this paper
Variance Reduction in Monte-Carlo Tree Search
2011, cited by this paper
Reinforcement Learning in Finite MDPs: PAC Analysis
2009, cited by this paper
Learning a Value Analysis Tool for Agent Evaluation
2009, cited by this paper
Reinforcement learning in the presence of rare events
2008, cited by this paper
Optimal Unbiased Estimators for Evaluating Agent Performance
2006, cited by this paper
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
2004, cited by this paper
Near-Optimal Reinforcement Learning in Polynomial Time
2002, cited by this paper
Simulation in optimization and optimization in simulation: a markov chain perspective on adaptive Monte Carlo algorithms
2001, cited by this paper
Eligibility Traces for Off-Policy Policy Evaluation
2000, cited by this paper
Policy Gradient Methods for Reinforcement Learning with Function Approximation
1999, cited by this paper
Gradient Convergence in Gradient methods with Errors
1999, cited by this paper
Reinforcement Learning: An Introduction
1998, cited by this paper
Large Sample Methods in Statistics: An Introduction with Applications
1993, cited by this paper

CITED BY

Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning
2026, cites this paper
Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning
2026, cites this paper
Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning
2025, cites this paper
Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation
2025, cites this paper
Balancing Interference and Correlation in Spatial Experimental Designs: A Causal Graph Cut Approach
2025, cites this paper
Clustered KL-barycenter design for policy evaluation
2025, cites this paper
Adaptive Exploration for Multi-Reward Multi-Policy Evaluation
2025, influential citation
Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect
2025, cites this paper
Navigating the Intersection of AI and Healthcare: Building Customer Trust Through Ethical Principles and Legislative Frameworks
2025, cites this paper
Policy Gradient with Active Importance Sampling
2024, cites this paper
Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect
2024, cites this paper
Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach
2024, cites this paper
Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning
2024, influential citation
Doubly Optimal Policy Evaluation for Reinforcement Learning
2024, influential citation
SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP
2024, cites this paper
Adaptive Exploration for Data-Efficient General Value Function Evaluations
2024, cites this paper
Unraveling the Interplay between Carryover Effects and Reward Autocorrelations in Switchback Experiments
2024, cites this paper
Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
2023, influential citation
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits
2023, influential citation
Improving Monte Carlo Evaluation with Offline Data
2023, influential citation
A New Challenge in Policy Evaluation
2023, influential citation
On the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy Evaluation
2023, cites this paper
Reinforcement Learning Algorithms with Selector, Tuner, or Estimator
2023, cites this paper
Efficient Open-world Reinforcement Learning via Knowledge Distillation and Autonomous Rule Discovery
2023, cites this paper
Improving Monte Carlo Evaluation with Offline Data
2023, influential citation
ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling
2022, cites this paper
Lessons on off-policy methods from a notification component of a chatbot
2021, cites this paper
Subgaussian Importance Sampling for Off-Policy Evaluation and Learning
2021, cites this paper
Robust On-Policy Data Collection for Data-Efficient Policy Evaluation
2021, influential citation
Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning
2021, cites this paper
Behavior Policy Search for Risk Estimators in RL
2021, cites this paper
Importance sampling in reinforcement learning with an estimated behavior policy
2021, cites this paper
Deep Reinforcement Learning for the Control of Robotic Manipulation: A Focussed Mini-Review
2021, cites this paper
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
2021, cites this paper
Iterative residual tuning for system identification and sim-to-real robot learning
2020, cites this paper
Beyond variance reduction: Understanding the true impact of baselines on policy optimization
2020, cites this paper
Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey
2020, cites this paper
Causality and Batch Reinforcement Learning: Complementary Approaches To Planning In Unknown Domains
2020, cites this paper
Reinforcement Learning Architectures: SAC, TAC, and ESAC
2020, cites this paper
Adaptive Off-Policy Policy Gradient Methods
2019, cites this paper
Enhancing the performance of energy harvesting wireless communications using optimization and machine learning
2019, cites this paper
TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer
2019, cites this paper
Provably Efficient Q-Learning with Low Switching Cost
2019, cites this paper
Data efficient reinforcement learning with off-policy and simulated data
2019, cites this paper
Independence-aware Advantage Estimation
2019, cites this paper
Selector-Actor-Critic and Tuner-Actor-Critic Algorithms for Reinforcement Learning
2019, cites this paper
Diverse Exploration for Fast and Safe Policy Improvement
2018, cites this paper
Towards a Data Efficient Off-Policy Policy Gradient
2018, cites this paper
Importance Sampling Policy Evaluation with an Estimated Behavior Policy
2018, cites this paper