Provably Efficient Exploration in Policy Optimization (1912.05830v4)
Abstract: While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
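The abstract describes OPPO only at a high level. As a rough illustration of the two computations it alternates per episode, the Python sketch below implements (i) an optimistic least-squares Q-estimate with a UCB-style exploration bonus in the linear-feature setting, and (ii) a KL-regularized (mirror-descent) policy improvement step. This is a minimal sketch under assumed interfaces, not the paper's pseudocode: the names `phi`, `optimistic_q`, `mirror_descent_step`, `lam`, `alpha`, and `beta` are all illustrative, and the paper additionally runs the evaluation step backward over the $H$ steps of each episode.

```python
import numpy as np

def optimistic_q(phi, states, actions, targets, lam=1.0, beta=1.0):
    """Ridge-regression estimate of Q plus a UCB-style bonus.

    phi(s, a) -> R^d is an assumed-known feature map (linear MDP setting).
    (states, actions, targets) come from earlier episodes; each target is
    a regression target of the form r + V_next(s').
    """
    d = phi(states[0], actions[0]).shape[0]
    Lambda = lam * np.eye(d)          # regularized Gram matrix
    b = np.zeros(d)
    for s, a, y in zip(states, actions, targets):
        f = phi(s, a)
        Lambda += np.outer(f, f)
        b += y * f
    w = np.linalg.solve(Lambda, b)    # least-squares weights
    Lambda_inv = np.linalg.inv(Lambda)

    def q(s, a):
        f = phi(s, a)
        # Bonus = width of the confidence set; this is what makes the
        # evaluation step "optimistic".
        bonus = beta * np.sqrt(f @ Lambda_inv @ f)
        return f @ w + bonus
    return q

def mirror_descent_step(pi, q, s, actions, alpha=0.1):
    """KL-regularized policy update at state s:
    pi_new(a|s) proportional to pi(a|s) * exp(alpha * Q(s, a))."""
    logits = np.log(pi(s) + 1e-12) + alpha * np.array([q(s, a) for a in actions])
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The bonus term $\beta\,\|\phi(s,a)\|_{\Lambda^{-1}}$ plays the role of optimism in the face of uncertainty, and the exponential reweighting is the PPO/mirror-descent-style improvement step; in the paper, the stepsize $\alpha$ and bonus scale $\beta$ are set by the confidence-bound analysis to obtain the stated regret.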