Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model (2102.11513v2)

Published 23 Feb 2021 in cs.LG

Abstract: Reinforcement learning (RL) shows great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven; they usually achieve better asymptotic performance than model-driven methods but converge much more slowly. This paper proposes the mixed policy gradient (MPG) algorithm, which fuses empirical data and the transition model in the policy gradient (PG) to accelerate convergence without degrading performance. Formally, MPG is constructed as a weighted average of the data-driven and model-driven PGs, where the former is the derivative of the learned Q-value function and the latter is that of the model-predictive return. To guide the weight design, we analyze and compare the upper bound of each PG's error. Relying on this analysis, a rule-based method is employed to heuristically adjust the weights: the weight of the data-driven PG is designed to grow along the learning process, while that of the model-driven PG decreases. Simulation results show that MPG achieves better asymptotic performance and faster convergence than the baseline algorithms.
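
To make the weighting concrete, the following is a minimal Python sketch of the MPG update described in the abstract: a weighted average of a data-driven PG (the derivative of a learned Q-value function) and a model-driven PG (the derivative of the model-predictive return), with the data-driven weight growing as training proceeds. The helper names (`data_driven_pg`, `model_driven_pg`, `data_weight`), the dummy gradients, and the linear weight schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def data_driven_pg(theta):
    """Stand-in for the data-driven PG: in the paper, the derivative of the
    learned Q-value function w.r.t. the policy parameters. Here it is a
    hypothetical placeholder returning a dummy gradient of matching shape."""
    return rng.standard_normal(theta.shape)

def model_driven_pg(theta):
    """Stand-in for the model-driven PG: the derivative of the model-predictive
    return, obtained by differentiating through the transition model. Also a
    hypothetical placeholder here."""
    return rng.standard_normal(theta.shape)

def data_weight(step, total_steps):
    """Illustrative rule-based schedule (assumed, not the paper's exact rule):
    the data-driven weight grows linearly from 0 to 1 over training."""
    return min(1.0, step / total_steps)

def mpg_update(theta, step, total_steps, lr=1e-3):
    """One MPG step: gradient ascent along the weighted average of the two PGs."""
    w = data_weight(step, total_steps)
    mixed_grad = w * data_driven_pg(theta) + (1.0 - w) * model_driven_pg(theta)
    return theta + lr * mixed_grad

# Usage: early updates are dominated by the model-driven PG,
# late updates by the data-driven PG.
theta = np.zeros(8)
for step in range(1, 1001):
    theta = mpg_update(theta, step, total_steps=1000)
```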

