Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff (2405.17796v1)
Abstract: Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon H (also known as a CMDP with H layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class M containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm that requires only O(H log T) calls to an offline density estimation algorithm (or oracle) across all T rounds of interaction. This number can be further reduced to O(H log log T) if T is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is its layerwise exploration-exploitation tradeoff, designed to exploit the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure-exploration tasks in reward-free reinforcement learning.
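The oracle-call counts above are consistent with an epoch-based design in which the offline oracle is invoked only when a new epoch begins, once per layer. The Python sketch below is purely illustrative and is not the paper's construction: it shows two standard epoch grids, a doubling grid (in the spirit of Simchi-Levi and Xu, 2021) with O(log T) epochs, and a known-horizon grid t_k ≈ T^(1 - 2^-k) with O(log log T) epochs; the function names and the exact grids are our assumptions.

```python
import math

def epoch_starts_doubling(T):
    """Anytime doubling grid: a new epoch begins at rounds 1, 2, 4, 8, ...
    One oracle call per layer at each epoch start gives O(H log T)
    oracle calls over T rounds."""
    starts, t = [], 1
    while t <= T:
        starts.append(t)
        t *= 2
    return starts

def epoch_starts_known_T(T):
    """Known-horizon grid t_k = ceil(T^(1 - 2^-k)), which satisfies
    t_k ~ sqrt(t_{k-1} * T).  Reaching a constant fraction of T takes
    only O(log log T) epochs; the final epoch covers the leftover
    rounds, for O(H log log T) oracle calls in total."""
    starts, k = [1], 1
    while starts[-1] <= T // 2:
        starts.append(min(T, math.ceil(T ** (1.0 - 2.0 ** (-k)))))
        k += 1
    return starts

if __name__ == "__main__":
    T = 10 ** 6
    print(len(epoch_starts_doubling(T)))  # 20 epochs, about log2(T)
    print(len(epoch_starts_known_T(T)))   # 6 epochs, about log2(log2(T))
```

Multiplying the epoch count by one oracle call per layer recovers the O(H log T) and O(H log log T) totals claimed in the abstract.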
- Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, 2012.
- Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
- FLAMBE: Structural complexity and representation learning of low rank MDPs. Neural Information Processing Systems (NeurIPS), 2020.
- Scalable online exploration via coverability, 2024.
- Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
- Programming with linear fractional functionals. Naval Research Logistics Quarterly, 9(3-4):181–186, 1962.
- On the statistical efficiency of reward-free exploration in non-linear RL. Advances in Neural Information Processing Systems, 35:20960–20973, 2022.
- Improved sample complexity for reward-free reinforcement learning under low-rank MDPs. arXiv preprint arXiv:2303.10859, 2023.
- Sample complexity characterization for linear contextual MDPs. arXiv preprint arXiv:2402.02700, 2024.
- Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8060–8070, 2019.
- Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169–178. AUAI Press, 2011.
- Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning (ICML), 2020.
- Practical contextual bandits with regression oracles. International Conference on Machine Learning, 2018.
- The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
- On the complexity of adversarial decision making. Advances in Neural Information Processing Systems, 35:35404–35417, 2022.
- Online estimation via offline estimation: An information-theoretic framework. arXiv preprint arXiv:2404.10122, 2024.
- Contextual Markov decision processes, 2015.
- Towards minimax optimal reward-free reinforcement learning in linear MDPs. In The Eleventh International Conference on Learning Representations, 2022.
- Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
- Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
- Solving linear programs with sqrt(rank) linear system solves. arXiv preprint arXiv:1910.08033, 2019.
- Optimism in face of a context: Regret guarantees for stochastic contextual MDP, 2023.
- Optimal reward-agnostic exploration in reinforcement learning, 2023.
- Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
- Representation learning with multi-step inverse kinematics: An efficient and optimal approach to rich-observation RL. In International Conference on Machine Learning, pages 24659–24700. PMLR, 2023.
- Efficient model-free exploration in low-rank MDPs, 2024.
- A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pages 15666–15698. PMLR, 2022.
- Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618. PMLR, 2018.
- Lecture notes on information theory. Lecture Notes for ECE563, UIUC, 2014.
- Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Beyond no regret: Instance-dependent PAC reinforcement learning, 2022.
- On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.
- Upper counterfactual confidence bounds: A new optimism principle for contextual bandits. arXiv preprint arXiv:2007.07876, 2020.
- Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, pages 26517–26547. PMLR, 2022a.
- Nearly minimax optimal reward-free reinforcement learning. arXiv preprint arXiv:2010.05901, 2020.
- Near-optimal regret bounds for multi-batch reinforcement learning. Advances in Neural Information Processing Systems, 35:24586–24596, 2022b.