Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds (2309.13915v2)

Published 25 Sep 2023 in cs.LG and stat.ML

Abstract: Policy gradient methods equipped with deep neural networks have achieved great success in solving high-dimensional reinforcement learning (RL) problems. However, current analyses cannot explain why these methods are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNNs). Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the previous networks can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment. Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape the curse of dimensionality, explaining the efficacy of deep policy gradient algorithms.
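The abstract refers to the NPMD iteration without spelling it out. Under the standard policy mirror descent formulation, one iteration estimates $Q^{\pi_k}$ with a CNN critic and then applies the KL-proximal update $\pi_{k+1}(a\mid s)\propto \pi_k(a\mid s)\exp\big(\eta\, Q_k(s,a)\big)$. The sketch below illustrates this single step in PyTorch; the `CNNCritic` architecture, the step size `eta`, and the batch setup are illustrative assumptions rather than the authors' exact construction, and the paper's additional step of fitting a CNN policy to the updated distribution is omitted.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNCritic(nn.Module):
    """Illustrative CNN Q-function approximator for image-like states (assumed architecture)."""

    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, num_actions)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, channels, height, width) -> Q-values: (batch, num_actions)
        return self.head(self.features(states).flatten(1))


def pmd_update(q_values: torch.Tensor, prev_log_probs: torch.Tensor, eta: float) -> torch.Tensor:
    """One KL-proximal (mirror descent) step: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q_k(s, a)).

    Both inputs have shape (batch, num_actions); log-softmax performs the normalization.
    """
    return F.log_softmax(prev_log_probs + eta * q_values, dim=-1)


if __name__ == "__main__":
    critic = CNNCritic(in_channels=3, num_actions=4)
    states = torch.randn(8, 3, 32, 32)                    # placeholder image states
    prev_log_probs = torch.full((8, 4), -math.log(4.0))   # uniform initial policy pi_0
    with torch.no_grad():
        q = critic(states)                                 # stands in for the fitted Q^{pi_k}
    new_log_probs = pmd_update(q, prev_log_probs, eta=0.5)
    print(new_log_probs.exp().sum(dim=-1))                 # each row sums to 1
```

In the full NPMD algorithm, the updated distribution would itself be represented by a CNN policy network fit to this mirror-descent target; the sketch above replaces that fitting step with an exact softmax update for clarity.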
