Prior-dependent analysis of posterior sampling reinforcement learning with function approximation (2403.11175v1)
Abstract: This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation, and we refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of $\mathcal{O}(d\sqrt{H^3 T \log T})$, where $d$ is the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This improves upon the previous benchmark (Osband and Van Roy, 2014), specialized to linear mixture MDPs, by an $\mathcal{O}(\sqrt{\log T})$ factor. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses that rely on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.
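To make the setting concrete, the sketch below outlines the generic PSRL loop for a linear mixture MDP, where the expected next-state value satisfies $\mathbb{E}[V(s_{h+1}) \mid s, a] = \langle \phi_V(s, a), \theta^* \rangle$ and a Gaussian posterior over $\theta^*$ is updated by value-targeted regression. This is a minimal sketch under assumed interfaces (`env`, `phi`, `plan`), not the authors' implementation; rewards are assumed known and handled inside the planner.

```python
import numpy as np

def psrl_linear_mixture(env, phi, plan, d, H, K, noise_var=1.0, prior_var=1.0, seed=0):
    """Minimal PSRL sketch for a linear mixture MDP with a Gaussian posterior over theta.

    Assumed (illustrative) interfaces -- not part of the paper:
      env.reset() -> state; env.step(action) -> next_state
      phi(s, a, V) -> d-dim value-targeted feature sum_{s'} phi(s'|s, a) * V(s')
      plan(theta, phi, H) -> (policy, V), where policy(s, h) -> action and
        V is a list of H + 1 value functions with V[H] identically zero.
    """
    rng = np.random.default_rng(seed)
    precision = np.eye(d) / prior_var   # posterior precision (inverse covariance)
    b = np.zeros(d)                     # precision-weighted sum of value targets

    for _ in range(K):                  # K episodes of length H
        # 1. Sample a transition-model parameter from the current Gaussian posterior.
        cov = np.linalg.inv(precision)
        theta = rng.multivariate_normal(cov @ b, cov)

        # 2. Plan in the sampled MDP (e.g., value iteration on the sampled model).
        policy, V = plan(theta, phi, H)

        # 3. Execute the sampled-optimal policy for one episode.
        s = env.reset()
        for h in range(H):
            a = policy(s, h)
            s_next = env.step(a)

            # 4. Value-targeted regression: regress the realized target V[h+1](s')
            #    onto the feature phi(s, a, V[h+1]) and fold it into the posterior.
            x = phi(s, a, V[h + 1])
            y = V[h + 1](s_next)
            precision += np.outer(x, x) / noise_var
            b += x * y / noise_var
            s = s_next

    return np.linalg.inv(precision) @ b   # posterior mean after K episodes
```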
- VOQL: Towards optimal regret in model-free RL with nonlinear function approximation. In The Thirty Sixth Annual Conference on Learning Theory, pages 987–1063. PMLR.
- An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR.
- Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages III-1220–III-1228. JMLR.org.
- Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422.
- Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474.
- Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349.
- Neuro-dynamic programming. Athena Scientific.
- Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR.
- An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257.
- Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366.
- The randomized elliptical potential lemma with an application to linear Thompson sampling. arXiv preprint arXiv:2102.07987.
- Randomized exploration in reinforcement learning with general value function approximation. In International Conference on Machine Learning, pages 4607–4616. PMLR.
- Provable and practical: Efficient exploration in reinforcement learning via Langevin Monte Carlo. In The Twelfth International Conference on Learning Representations.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143.
- An improved regret bound for Thompson sampling in the Gaussian linear bandit setting. In IEEE International Symposium on Information Theory, pages 2783–2788.
- Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232.
- PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334.
- Nearly minimax-optimal regret for linearly parameterized bandits. In Beygelzimer, A. and Hsu, D., editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2173–2174. PMLR.
- HyperAgent: A simple, scalable, efficient and provable reinforcement learning framework for complex environments. arXiv preprint arXiv:2402.10228.
- HyperDQN: A randomized exploration method for deep reinforcement learning. In International Conference on Learning Representations (ICLR).
- Information-theoretic confidence bounds for reinforcement learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Reinforcement learning, bit by bit. arXiv preprint arXiv:2103.04047.
- Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020.
- Influence and variance of a Markov chain: Application to adaptive discretization in optimal control. In IEEE Conference on Decision and Control, pages 1464–1469.
- Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474.
- Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710.
- (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011.
- Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62.
- Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
- Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.
- An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471.
- Strens, M. (2000). A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950.
- Reinforcement learning: An introduction. MIT Press.
- Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
- Divergence-augmented policy optimization. Advances in Neural Information Processing Systems, 32.
- Exponentially weighted imitation learning for batched historical data. Advances in Neural Information Processing Systems, 31.
- Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR.
- Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129.
- Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964.
- Zhang, T. (2021). Feel-good Thompson sampling for contextual bandits and reinforcement learning. arXiv preprint arXiv:2110.00871.
- Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Belkin, M. and Kpotufe, S., editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4532–4576. PMLR.
Authors: Yingru Li, Zhi-Quan Luo