Enhancing Policy Gradient with the Polyak Step-Size Adaption (2404.07525v1)
Abstract: Policy gradient is a widely used and foundational algorithm in reinforcement learning (RL). Although it is renowned for its convergence guarantees and stability relative to other RL algorithms, its practical use is often hindered by sensitivity to hyper-parameters, particularly the step-size. In this paper, we integrate the Polyak step-size into RL, which adjusts the step-size automatically without prior knowledge. To adapt this method to the RL setting, we address several issues, including the unknown optimal objective value f* required by the Polyak step-size. We then demonstrate the performance of the Polyak step-size in RL through experiments, showing faster convergence and more stable resulting policies.
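To make the mechanism concrete, the sketch below applies a Polyak-style step-size to a plain REINFORCE update on a toy two-armed bandit. Everything here is illustrative and assumed rather than taken from the paper: the bandit, the `pull` function, the optimistic constant `J_STAR_BOUND` standing in for the unknown optimal return f*, and the step-size cap are choices made for this example, not the construction the authors propose.

```python
# Minimal sketch: REINFORCE with a Polyak-style step-size on a toy
# two-armed bandit. Illustration only -- the surrogate J_STAR_BOUND for
# the unknown optimal return f*, and the step cap, are assumptions of
# this example, not the mechanism proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)

def pull(arm):
    """Toy bandit: arm 1 pays ~1.0 on average, arm 0 pays ~0.2."""
    return rng.normal(1.0 if arm == 1 else 0.2, 0.1)

theta = np.zeros(2)      # logits of a softmax policy over the two arms
J_STAR_BOUND = 2.0       # assumed optimistic stand-in for the unknown J*
MAX_STEP = 10.0          # cap to keep this toy example numerically tame
EPS = 1e-8

for t in range(200):
    # Current policy and a batch of on-policy samples.
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    arms = rng.choice(2, size=32, p=probs)
    rewards = np.array([pull(a) for a in arms])

    # REINFORCE estimate of the gradient of J(theta) = E[reward].
    grad = np.zeros(2)
    for a, r in zip(arms, rewards):
        grad_log_pi = -probs          # d/dtheta log softmax(theta)[a]
        grad_log_pi[a] += 1.0
        grad += r * grad_log_pi
    grad /= len(arms)

    # Polyak step-size for ascent: (J* - J(theta)) / ||grad||^2,
    # with J* replaced by the assumed bound above.
    j_hat = rewards.mean()
    step = min((J_STAR_BOUND - j_hat) / (np.dot(grad, grad) + EPS), MAX_STEP)

    theta += step * grad              # gradient-ascent update

print("final policy:", np.round(probs, 3))
```

The only change relative to vanilla REINFORCE is the step-size line: instead of a fixed learning rate, the step is (J* - J(θ)) / ||∇J(θ)||², so updates shrink automatically as the estimated return approaches the assumed optimum; how the paper handles the unknown f* in place of this crude bound is its central contribution.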