Reinforcement Learning in Reward-Mixing MDPs

Jeongyeol Kwon, Yonathan Efroni, C. Caramanis, Shie Mannor

Published 2021 in Neural Information Processing Systems

ABSTRACT

Learning a near-optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near-optimal policy for reward-mixing MDPs with two reward models. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S$ and $A$ are the numbers of states and actions, respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.
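The episodic setup described in the abstract can be illustrated with a small simulator. The following is a minimal sketch, not the authors' algorithm or code: the class name `RewardMixingMDP`, its parameters, the Bernoulli rewards, the fixed initial state, and the randomly generated transition kernel are all assumptions made for illustration. It only shows the key structural point: one of two reward models is drawn at the start of each episode, the transition kernel is shared, and the agent never observes which model is active.

```python
import random


class RewardMixingMDP:
    """Toy tabular reward-mixing MDP with two latent reward models.

    Hypothetical sketch: names, Bernoulli rewards, and the fixed start
    state are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, num_states, num_actions, horizon, mixing_weight=0.5, seed=0):
        self.S, self.A, self.H = num_states, num_actions, horizon
        self.w = mixing_weight  # probability of drawing reward model 0
        self.rng = random.Random(seed)
        # Shared transition kernel: P[s][a] is a distribution over next states.
        self.P = [[self._random_dist(self.S) for _ in range(self.A)]
                  for _ in range(self.S)]
        # Two reward models: R[m][s][a] is the mean of a Bernoulli reward.
        self.R = [[[self.rng.random() for _ in range(self.A)]
                   for _ in range(self.S)] for _ in range(2)]

    def _random_dist(self, n):
        weights = [self.rng.random() + 1e-9 for _ in range(n)]
        total = sum(weights)
        return [x / total for x in weights]

    def run_episode(self, policy):
        """Roll out one episode; returns a list of (state, action, reward)."""
        # The latent reward model is drawn once per episode and is never
        # passed to the policy.
        m = 0 if self.rng.random() < self.w else 1
        s = 0  # assumed fixed initial state
        trajectory = []
        for h in range(self.H):
            # The policy may depend on the whole observed history, since the
            # observed process is not Markovian in the state alone.
            a = policy(h, s, tuple(trajectory))
            r = 1 if self.rng.random() < self.R[m][s][a] else 0
            trajectory.append((s, a, r))
            s = self.rng.choices(range(self.S), weights=self.P[s][a])[0]
        return trajectory


# Usage: roll out a uniformly random policy for one episode.
env = RewardMixingMDP(num_states=3, num_actions=2, horizon=5)
episode = env.run_episode(lambda h, s, history: random.randrange(2))
```

Because the policy receives the observed history but not the latent index `m`, any information about the active reward model has to be inferred from the rewards seen earlier in the same episode.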

PUBLICATION RECORD

Publication year
2021
Venue
Neural Information Processing Systems
Publication date
2021-10-07
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 2110.03743
