
Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL (2305.11032v2)

Published 18 May 2023 in cs.LG and stat.ML

Abstract: While policy optimization algorithms have played an important role in the recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- existing results are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL where exploration is necessary. This paper proposes a simple, efficient policy optimization framework -- Optimistic NPG -- for online RL. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, making it the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over the state-of-the-art results for policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. In the realm of general function approximation, which subsumes linear MDPs, Optimistic NPG is, to the best of our knowledge, the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
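
The abstract describes Optimistic NPG as natural policy gradient interleaved with optimistic policy evaluation. The snippet below is a minimal tabular sketch of that template only, not the paper's algorithm: the paper works with linear and general function approximation and uses optimistic least-squares value estimates, whereas here the state/action/horizon sizes, the constant exploration bonus, and the step size `eta` are all hypothetical placeholders chosen for illustration.

```python
import numpy as np

# Toy sketch of the Optimistic NPG template: alternate (i) optimistic policy
# evaluation with (ii) a natural policy gradient / softmax mirror-descent update.
# All quantities below (S, A, H, eta, bonus, the random MDP) are assumptions.

S, A, H = 5, 3, 4          # states, actions, horizon (toy sizes)
eta, bonus = 0.1, 0.05     # NPG step size and optimism bonus (assumed)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel, shape (S, A, S)
R = rng.uniform(size=(S, A))                 # toy reward

logits = np.zeros((H, S, A))                 # softmax policy parameters per step

def policy(h):
    z = np.exp(logits[h] - logits[h].max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

for t in range(50):                          # NPG iterations
    # (i) optimistic policy evaluation: here optimism is a constant bonus,
    #     standing in for the paper's optimistic evaluation subroutine
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q[h] = np.clip(R + bonus + P @ V_next, 0.0, H)
        V_next = (policy(h) * Q[h]).sum(axis=1)
    # (ii) NPG / softmax mirror-descent update on the optimistic Q-values
    for h in range(H):
        logits[h] += eta * Q[h]

print("greedy actions at h=0:", policy(0).argmax(axis=1))
```

In this softmax parameterization, adding `eta * Q[h]` to the logits is exactly the closed-form NPG (mirror-descent) step, which is what makes the framework simple to state; the substance of the paper lies in how the optimistic Q-estimates are constructed beyond the tabular setting.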

References (26)
  1. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  2. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR, 2019.
  3. Pc-pg: Policy cover directed exploration for provable policy gradient learning. Advances in neural information processing systems, 33:13399–13412, 2020.
  4. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1–76, 2021.
  5. VOQL: Towards optimal regret in model-free RL with nonlinear function approximation. arXiv preprint arXiv:2212.06069, 2022.
  6. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  7. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
  8. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
  9. Guided cost learning: Deep inverse optimal control via policy optimization. In International conference on machine learning, pages 49–58. PMLR, 2016.
  10. Nearly minimax optimal reinforcement learning for linear Markov decision processes. arXiv preprint arXiv:2212.06132, 2022.
  11. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  12. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in neural information processing systems, 34:13406–13418, 2021.
  13. Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
  14. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
  15. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
  16. OpenAI. ChatGPT: Optimizing language models for dialogue, 2022. URL https://openai.com/blog/chatgpt/.
  17. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
  18. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
  19. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  20. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
  21. Reward-free RL is no harder than reward-aware RL in linear Markov decision processes. In International Conference on Machine Learning, pages 22430–22456. PMLR, 2022.
  22. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33:6123–6135, 2020.
  23. Nearly optimal policy optimization with stable at any time guarantee. In International Conference on Machine Learning, pages 24243–24265. PMLR, 2022.
  24. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
  25. Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory, pages 4473–4525. PMLR, 2021.
  26. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
