On the Second-Order Convergence of Biased Policy Gradient Algorithms (2311.02546v4)

Published 5 Nov 2023 in cs.LG

Abstract: Since the objective functions of reinforcement learning problems are typically highly nonconvex, it is desirable that policy gradient, the most popular algorithm, escapes saddle points and arrives at second-order stationary points. Existing results only consider vanilla policy gradient algorithms with unbiased gradient estimators, but practical implementations under the infinite-horizon discounted reward setting are biased due to finite-horizon sampling. Moreover, actor-critic methods, whose second-order convergence has not yet been established, are also biased due to the critic approximation of the value function. We provide a novel second-order analysis of biased policy gradient methods, including the vanilla gradient estimator computed from Monte-Carlo sampling of trajectories as well as the double-loop actor-critic algorithm, where in the inner loop the critic improves the approximation of the value function via TD(0) learning. Separately, we also establish the convergence of TD(0) on Markov chains irrespective of initial state distribution.
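
The bias discussed in the abstract comes from estimating an infinite-horizon discounted return with trajectories truncated at a finite horizon. As a concrete illustration (not code from the paper), the sketch below implements a truncated-horizon Monte-Carlo (REINFORCE-style) policy-gradient estimator for a tabular softmax policy on a randomly generated toy MDP; the horizon H, the toy MDP, and all hyperparameters are illustrative assumptions. The double-loop actor-critic analyzed in the paper would instead replace the Monte-Carlo reward-to-go with a critic trained by TD(0) in an inner loop, which introduces its own approximation bias.

```python
# Minimal sketch (illustrative, not the paper's code): a truncated-horizon
# Monte-Carlo policy-gradient estimator for a tabular softmax policy.
# Truncating the infinite-horizon discounted return at horizon H is what
# makes this estimator biased.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 3 states, 2 actions, random transition kernel and rewards.
S, A, gamma, H = 3, 2, 0.99, 200
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))       # deterministic rewards r(s, a)
theta = np.zeros((S, A))                     # softmax policy parameters

def policy(s, theta):
    """Action probabilities of the softmax policy at state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def sample_trajectory(theta, s0=0):
    """Roll out the policy for H steps (finite-horizon truncation)."""
    traj, s = [], s0
    for _ in range(H):
        p = policy(s, theta)
        a = rng.choice(A, p=p)
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, R[s, a]))
        s = s_next
    return traj

def reinforce_gradient(theta, n_traj=16):
    """Biased Monte-Carlo estimate of the discounted policy gradient.

    Each trajectory contributes sum_t gamma^t * G_t * grad log pi(a_t|s_t),
    where G_t is the discounted reward-to-go computed only up to step H.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        traj = sample_trajectory(theta)
        G = 0.0
        # Backward pass accumulates the discounted reward-to-go.
        for t in reversed(range(len(traj))):
            s, a, r = traj[t]
            G = r + gamma * G
            # grad log pi(a|s) for a softmax policy: e_a - pi(.|s)
            g_logpi = -policy(s, theta)
            g_logpi[a] += 1.0
            grad[s] += (gamma ** t) * G * g_logpi
    return grad / n_traj

# One policy-gradient ascent step with the biased estimator.
theta += 0.1 * reinforce_gradient(theta)
print("updated parameters:\n", theta)
```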

Authors (2)
  1. Siqiao Mu
  2. Diego Klabjan
