Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate (2403.00675v1)
Abstract: Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well in practice. However, the existing literature often neglects the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, that the resulting algorithm converges, and that reusing past trajectories improves the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.
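To make the idea of trajectory reuse concrete, below is a minimal Python sketch of an importance-sampling-weighted policy gradient estimator that averages over trajectories collected under earlier policies. It is not the paper's estimator: the tabular softmax policy, the `history`/`reuse_window` structure, and all helper names are illustrative assumptions. The only part grounded in the abstract is the mechanism itself: each historical trajectory is reweighted by the likelihood ratio of the current policy to the policy that generated it.

```python
# Minimal sketch (not the paper's exact estimator): a REINFORCE-style policy
# gradient that reuses trajectories generated under earlier policies by
# reweighting each trajectory with the importance ratio
#     w(tau) = prod_t pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t).
# The transition kernel cancels in this ratio because both policies act in
# the same environment. All names below are hypothetical.

import numpy as np


def log_prob(theta, state, action):
    """Log-probability of `action` in `state` under a tabular softmax policy."""
    logits = theta[state] - np.max(theta[state])  # stabilized softmax
    return logits[action] - np.log(np.sum(np.exp(logits)))


def grad_log_prob(theta, state, action):
    """Gradient of log pi_theta(action | state) w.r.t. theta (same shape as theta)."""
    probs = np.exp(theta[state] - np.max(theta[state]))
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[state] = -probs
    g[state, action] += 1.0
    return g


def reuse_gradient(theta, history, reuse_window=5):
    """Average importance-weighted policy gradients over recent historical batches.

    `history` is a list of (theta_old, trajectories) pairs; each trajectory is a
    list of (state, action, return_to_go) tuples collected under theta_old.
    """
    grad = np.zeros_like(theta)
    count = 0
    for theta_old, trajectories in history[-reuse_window:]:
        for traj in trajectories:
            # Trajectory-level likelihood ratio of current vs. behavior policy.
            log_w = sum(log_prob(theta, s, a) - log_prob(theta_old, s, a)
                        for s, a, _ in traj)
            w = np.exp(log_w)
            # Importance-weighted score-function (REINFORCE) contribution.
            for s, a, ret in traj:
                grad += w * grad_log_prob(theta, s, a) * ret
            count += 1
    return grad / max(count, 1)
```

In the natural policy gradient variant studied in the paper, such a gradient estimate would additionally be preconditioned by an (estimated) inverse Fisher information matrix, and the paper's analysis accounts for the dependence between trajectories collected across iterations, which the naive sketch above ignores.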