Absolute Policy Optimization (2310.13230v5)
Abstract: In recent years, trust-region on-policy reinforcement learning has achieved impressive results on complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms in this category primarily emphasize improvement in expected performance and lack control over worst-case performance outcomes. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in the lower probability bound of performance with high confidence. Building on this theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO, as well as its efficient variant Proximal Absolute Policy Optimization (PAPO), significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in worst-case as well as expected performance.
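The abstract's notion of a "lower probability bound of performance with high confidence" can be illustrated with a concentration-inequality sketch. The snippet below is an assumption on our part, not the paper's actual objective: it uses Cantelli's (one-sided Chebyshev) inequality to bound how far a return can fall below its mean, so that a policy with equal expected return but lower return variance gets a higher worst-case bound. APO's true objective is defined over the policy's return distribution during optimization, not over an empirical sample like this.

```python
import numpy as np

def absolute_performance_bound(returns, delta=0.1):
    """Lower probability bound on performance via Cantelli's inequality.

    With probability >= 1 - delta, a return drawn from the same
    distribution exceeds mean - k * std, where k = sqrt((1 - delta) / delta).
    Illustrative sketch only; not the objective defined in the paper.
    """
    returns = np.asarray(returns, dtype=float)
    mean, std = returns.mean(), returns.std()
    k = np.sqrt((1.0 - delta) / delta)
    return mean - k * std

# Two hypothetical policies with equal mean return but different variance:
# the low-variance one receives the better (higher) worst-case bound.
rng = np.random.default_rng(0)
stable = rng.normal(100.0, 5.0, size=1000)   # low-variance returns
risky = rng.normal(100.0, 40.0, size=1000)   # high-variance returns
assert absolute_performance_bound(stable) > absolute_performance_bound(risky)
```

This captures the trade-off the abstract describes: optimizing only the mean ignores the spread, whereas a mean-minus-deviation bound rewards policies whose bad episodes are not much worse than their typical ones.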
Authors: Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei, Changliu Liu