Mollification Effects of Policy Gradient Methods (2405.17832v1)
Abstract: Policy gradient methods have enabled deep reinforcement learning (RL) to tackle challenging continuous control problems, even when the underlying systems involve highly nonlinear dynamics that generate complex non-smooth optimization landscapes. We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes to enable effective policy search, as well as its downside: while making the objective smoother and easier to optimize, mollification causes the stochastic objective to deviate further from the original problem. We demonstrate the equivalence between policy gradient methods and solving backward heat equations. Since backward heat equations are known from PDE theory to be ill-posed, this equivalence presents a fundamental challenge to the use of policy gradients under stochasticity. We further connect this limitation to the uncertainty principle in harmonic analysis to understand the effects of exploration with stochastic policies in RL, and we provide experimental results illustrating both the positive and negative aspects of mollification effects in practice.
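The mollification the abstract refers to can be made concrete in one dimension: averaging a non-smooth objective J over Gaussian perturbations yields the smoothed objective J_sigma(theta) = E[J(theta + sigma * w)], i.e. the convolution of J with a Gaussian kernel of width sigma, and the likelihood-ratio (REINFORCE-style) estimator differentiates this smoothed objective without ever differentiating J itself. The sketch below is a minimal illustration of that mechanism, not code from the paper; the names `objective`, `mollified`, and `score_gradient` are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(theta):
    # Illustrative non-smooth landscape: a kink at theta = 1
    # plus a high-frequency, discontinuous ripple.
    return np.abs(theta - 1.0) + 0.1 * np.sign(np.sin(50.0 * theta))

def mollified(theta, sigma, n=10_000):
    # Monte Carlo estimate of J_sigma(theta) = E[J(theta + sigma * w)],
    # i.e. J convolved with a Gaussian mollifier of width sigma.
    w = rng.standard_normal(n)
    return objective(theta + sigma * w).mean()

def score_gradient(theta, sigma, n=10_000):
    # Likelihood-ratio estimator of d/dtheta J_sigma(theta):
    #   E[J(theta + sigma * w) * w / sigma].
    # It needs no derivative of J, so it is well defined even at
    # points where J is non-differentiable.
    w = rng.standard_normal(n)
    return (objective(theta + sigma * w) * w / sigma).mean()

for sigma in (0.05, 0.3, 1.0):
    print(f"sigma={sigma:4.2f}  "
          f"J_sigma near kink = {mollified(1.0, sigma):6.3f}  "
          f"grad at theta=0: {score_gradient(0.0, sigma):+6.3f}")
# As sigma grows, the kink and ripple are smoothed away (easier
# optimization), but the mollified value near the true minimizer
# drifts upward, roughly by sigma * sqrt(2/pi): the mollification bias
# the abstract describes.
```

Reversing this smoothing, i.e. recovering J from J_sigma, amounts to running a heat equation backward in time, which is the ill-posed inverse problem the abstract invokes.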