Simple Policy Optimization (2401.16025v6)
Abstract: As one of the most important and influential algorithms in reinforcement learning, the Proximal Policy Optimization (PPO) algorithm has demonstrated outstanding performance across various domains. It simplifies the constrained optimization procedure of the Trust Region Policy Optimization (TRPO) algorithm by clipping the importance sampling ratio. However, this simplification with ratio clipping does not always effectively enforce trust region constraints. In this paper, we introduce an algorithm named \textit{Simple Policy Optimization} (SPO), which incorporates a novel clipping method for the KL divergence between the old and new policies. Extensive experimental results in both \textit{Atari 2600} and \textit{MuJoCo} environments show that, compared to PPO, SPO achieves better sample efficiency, extremely low KL divergence, and higher policy entropy, while also being robust to increases in network depth or complexity. More importantly, SPO maintains the simplicity of an unconstrained first-order algorithm. Our code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
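The abstract contrasts PPO's ratio clipping with SPO's clipping of the KL divergence between the old and new policies. As a rough illustration only, the PyTorch sketch below implements the standard PPO clipped surrogate alongside an assumed KL-thresholded variant; the `kl_clipped_loss` form and the `kl_max` parameter are illustrative assumptions, not the SPO objective from the paper (see the linked repository for the authors' actual implementation).

```python
# Minimal sketch: PPO's ratio-clipped surrogate, shown for contrast with a
# KL-based clipping idea. The KL-thresholded variant below is an assumed,
# illustrative form and is NOT the SPO objective defined in the paper.
import torch


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)  # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


def kl_clipped_loss(logp_new, logp_old, advantages, kl_max=0.02):
    """Illustrative surrogate that masks out samples whose per-sample KL
    estimate exceeds a threshold (assumed form, not the authors' method)."""
    ratio = torch.exp(logp_new - logp_old)
    # Nonnegative estimator of KL(old || new): (r - 1) - log r
    approx_kl = (ratio - 1.0) - (logp_new - logp_old)
    mask = (approx_kl <= kl_max).float()
    return -(mask * ratio * advantages).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    logp_new = logp_old + 0.1 * torch.randn(8)
    adv = torch.randn(8)
    print(ppo_clip_loss(logp_new, logp_old, adv))
    print(kl_clipped_loss(logp_new, logp_old, adv))
```

The intended point of the contrast is that PPO bounds the probability ratio itself, whereas SPO constrains the divergence between policies directly; the masking scheme above is only one simple way to express such a constraint as an unconstrained first-order loss.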
Authors: Zhengpeng Xie, Qiang Zhang, Renjing Xu