Proximal Policy Optimization with Adaptive Threshold for Symmetric Relative Density Ratio (2203.09809v1)
Abstract: Deep reinforcement learning (DRL) is a promising approach for deploying robots in complicated environments. Much of the recent progress in DRL rests on policy regularization, which allows the policy to improve stably and efficiently. A popular method, proximal policy optimization (PPO), and its variants constrain the density ratio between the latest and baseline policies whenever that ratio exceeds a given threshold. This threshold can be designed fairly intuitively, and a recommended value range has in fact been suggested. However, the density ratio is asymmetric about its center, and the plausible error scale from that center, which the threshold should match, depends on how the baseline policy is given. To make the most of policy regularization, this paper proposes a new PPO variant derived from the relative Pearson (RPE) divergence, called PPO-RPE, which designs the threshold adaptively. In PPO-RPE, a relative density ratio, which can be made symmetric, replaces the raw density ratio. Thanks to this symmetry, the error scale from the center can easily be estimated, so the threshold can be adapted to the estimated error scale. Three simple benchmark simulations reveal the importance of algorithm-dependent threshold design, and four additional locomotion tasks verify that the proposed method statistically contributes to task accomplishment by appropriately restricting policy updates.
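To make the clipping idea concrete, the following Python sketch shows a PPO-style clipped surrogate loss in which the raw density ratio is replaced by a relative density ratio of the form r/(βr + 1 − β); with β = 0.5, swapping the two policies maps the ratio r to 1/r and the relative ratio to 2 minus its value, so it is symmetric about its center 1. This is only an illustrative sketch under those assumptions: the function name `clipped_surrogate_rpe`, the argument names, the default β, and the fixed clipping rule are the editor's, and the paper's actual adaptive-threshold design is not reproduced here.

```python
import numpy as np

def clipped_surrogate_rpe(logp_new, logp_old, advantages, beta=0.5, eps=0.2):
    """Illustrative PPO-style clipped surrogate loss using a relative
    density ratio instead of the raw ratio (names and defaults are
    assumptions, not the paper's implementation).

    logp_new, logp_old : log-probabilities of sampled actions under the
                         latest and baseline policies, shape (batch,)
    advantages         : advantage estimates, shape (batch,)
    beta               : mixture coefficient of the relative density ratio
    eps                : clipping threshold around the ratio's center
    """
    ratio = np.exp(logp_new - logp_old)              # raw density ratio pi / b
    rel_ratio = ratio / (beta * ratio + 1.0 - beta)  # relative ratio, bounded in (0, 1/beta)
    # With beta = 0.5, swapping pi and b maps rel_ratio to 2 - rel_ratio,
    # i.e. the relative ratio is symmetric about its center 1.
    clipped = np.clip(rel_ratio, 1.0 - eps, 1.0 + eps)
    # A batch-adaptive eps could be tied to the spread of |rel_ratio - 1|;
    # the paper's exact adaptation rule is not reproduced here.
    return -np.mean(np.minimum(rel_ratio * advantages, clipped * advantages))

# Minimal usage example with random data
rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.1, size=64)
logp_new = logp_old + rng.normal(0.0, 0.05, size=64)
adv = rng.normal(0.0, 1.0, size=64)
print(clipped_surrogate_rpe(logp_new, logp_old, adv))
```

The key design point the sketch illustrates is that, because the relative ratio is bounded and symmetric about 1, a single threshold on either side of the center treats over- and under-estimation of the policy change evenly, which the raw ratio (unbounded above, bounded below by 0) cannot do.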