Proximal Policy Optimization with Relative Pearson Divergence (2010.03290v2)

Published 7 Oct 2020 in cs.LG and cs.RO

Abstract: The recent remarkable progress of deep reinforcement learning (DRL) rests on policy regularization for stable and efficient learning. Proximal policy optimization (PPO) is a popular method introduced for this purpose. PPO clips the density ratio between the latest and baseline policies at a given threshold, but the divergence it minimizes is unclear. A further problem is that the threshold is specified as a symmetric numerical range even though the density ratio lives in an asymmetric domain, causing unbalanced regularization of the policy. This paper therefore proposes a new variant of PPO, called PPO-RPE, derived from a regularization problem based on the relative Pearson (RPE) divergence. This regularization yields a clear minimization target that constrains the latest policy toward the baseline policy. Its analysis leads to an intuitive threshold-based design that is consistent with the asymmetric domain of the density ratio. On four benchmark tasks, PPO-RPE performed as well as or better than conventional methods in terms of the task performance of the learned policy.
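
The regularization idea can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch snippet: ppo_clip_loss is the standard clipped surrogate of Schulman et al. (2017), and rpe_penalty estimates the alpha-relative Pearson divergence between the latest and baseline policies from their density ratios, using the relative density ratio of Yamada et al. (2013) with samples drawn from the baseline policy. The combination in ppo_rpe_loss, together with the weight beta and mixture parameter alpha, is an assumption made here for illustration only; the actual PPO-RPE objective and its asymmetric threshold design are given in the paper.

import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate (Schulman et al., 2017);
    # ratio = pi_latest(a|s) / pi_baseline(a|s).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

def rpe_penalty(ratio, alpha=0.5):
    # Sample-based estimate of the alpha-relative Pearson divergence
    #   PE_alpha(p || q) = 0.5 * E_{q_alpha}[(r_alpha - 1)^2],
    # where q_alpha = alpha*p + (1-alpha)*q and r_alpha = p / q_alpha
    # is the relative density ratio (Yamada et al., 2013).
    # Expanding gives PE_alpha = 0.5*alpha*E_p[r_alpha^2]
    #   + 0.5*(1-alpha)*E_q[r_alpha^2] - 0.5.
    # Samples come from the baseline policy q, so the E_p[.] term is
    # importance-weighted by ratio = p / q.
    r_alpha = ratio / (alpha * ratio + (1.0 - alpha))
    e_p = (ratio * r_alpha ** 2).mean()   # approximates E_p[r_alpha^2]
    e_q = (r_alpha ** 2).mean()           # approximates E_q[r_alpha^2]
    return 0.5 * alpha * e_p + 0.5 * (1.0 - alpha) * e_q - 0.5

def ppo_rpe_loss(ratio, advantage, beta=1.0, alpha=0.5):
    # Illustrative combination (assumed here, not taken from the paper):
    # the unclipped policy-gradient surrogate plus an RPE-divergence
    # regularizer that pulls the latest policy toward the baseline.
    surrogate = -(ratio * advantage).mean()
    return surrogate + beta * rpe_penalty(ratio, alpha)

Keeping the divergence explicit, rather than implicit in a clipping rule, is what gives the method a well-defined minimization target; how the paper's threshold parameters map to quantities like alpha and beta is worked out there and is not reproduced in this sketch.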
