
Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff (2405.17796v1)

Published 28 May 2024 in cs.LG and stat.ML

Abstract: Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon H (also known as a CMDP with H layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class M containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only O(H log T) calls to an offline density estimation algorithm (or oracle) across all T rounds of interaction. This number can be further reduced to O(H log log T) if T is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
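
As a rough illustration of where the O(H log T) and O(H log log T) oracle-call counts come from, the sketch below shows the two standard epoch schedules: a doubling schedule, which gives O(log T) epochs, and a known-horizon grid with boundaries near T^(1 - 2^{-m}), which gives O(log log T) epochs (a schedule of this type appears in Simchi-Levi and Xu, 2021). This is not the paper's algorithm; the function names and the "one oracle call per layer per epoch" accounting are assumptions made purely for illustration.

```python
import math

def doubling_schedule(T):
    """Epoch end-points 2, 4, 8, ..., capped at T: O(log T) epochs in total."""
    ends, t = [], 2
    while t < T:
        ends.append(t)
        t *= 2
    ends.append(T)
    return ends

def loglog_schedule(T):
    """With T known in advance, fix M = O(log log T) epochs and place the
    m-th boundary near T**(1 - 2**(-m)); the last epoch ends at T."""
    M = max(1, math.ceil(math.log2(max(2.0, math.log2(T)))))
    ends = [math.ceil(T ** (1 - 2 ** (-m))) for m in range(1, M)]
    ends.append(T)
    # Drop any duplicates created by the ceiling while keeping the order.
    out = []
    for t in ends:
        if not out or t > out[-1]:
            out.append(t)
    return out

if __name__ == "__main__":
    T, H = 10**6, 5
    # One oracle call per layer per epoch (illustrative accounting only).
    print(H * len(doubling_schedule(T)))  # roughly H * log2(T)        -> 100
    print(H * len(loglog_schedule(T)))    # roughly H * log2(log2(T))  -> 25
```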

References (39)
  1. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, 2012.
  2. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
  3. FLAMBE: Structural complexity and representation learning of low rank MDPs. Neural Information Processing Systems (NeurIPS), 2020.
  4. Scalable online exploration via coverability, 2024.
  5. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996.
  6. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 9(3-4):181–186, 1962.
  7. On the statistical efficiency of reward-free exploration in non-linear RL. Advances in Neural Information Processing Systems, 35:20960–20973, 2022.
  8. Improved sample complexity for reward-free reinforcement learning under low-rank MDPs. arXiv preprint arXiv:2303.10859, 2023.
  9. Sample complexity characterization for linear contextual MDPs. arXiv preprint arXiv:2402.02700, 2024.
  10. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8060–8070, 2019.
  11. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169–178. AUAI Press, 2011.
  12. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning (ICML), 2020.
  13. Practical contextual bandits with regression oracles. International Conference on Machine Learning, 2018.
  14. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
  15. On the complexity of adversarial decision making. Advances in Neural Information Processing Systems, 35:35404–35417, 2022.
  16. Online estimation via offline estimation: An information-theoretic framework. arXiv preprint arXiv:2404.10122, 2024.
  17. Contextual Markov decision processes, 2015.
  18. Towards minimax optimal reward-free reinforcement learning in linear MDPs. In The Eleventh International Conference on Learning Representations, 2022.
  19. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
  20. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
  21. Solving linear programs with sqrt(rank) linear system solves. arXiv preprint arXiv:1910.08033, 2019.
  22. Optimism in face of a context: Regret guarantees for stochastic contextual MDP, 2023.
  23. Optimal reward-agnostic exploration in reinforcement learning. 2023.
  24. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
  25. Representation learning with multi-step inverse kinematics: An efficient and optimal approach to rich-observation RL. In International Conference on Machine Learning, pages 24659–24700. PMLR, 2023.
  26. Efficient model-free exploration in low-rank MDPs, 2024.
  27. A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pages 15666–15698. PMLR, 2022.
  28. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618. PMLR, 2018.
  29. Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012–2016.
  30. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  31. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021.
  32. Reinforcement learning: An introduction. MIT press, 2018.
  33. Beyond no regret: Instance-dependent PAC reinforcement learning, 2022.
  34. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.
  35. Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. arXiv preprint arXiv:2007.07876, 2020.
  36. Upper counterfactual confidence bounds: a new optimism principle for contextual bandits, 2024.
  37. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, pages 26517–26547. PMLR, 2022a.
  38. Nearly minimax optimal reward-free reinforcement learning. arXiv preprint arXiv:2010.05901, 2020.
  39. Near-optimal regret bounds for multi-batch reinforcement learning. Advances in Neural Information Processing Systems, 35:24586–24596, 2022b.
