Guaranteed Trust Region Optimization via Two-Phase KL Penalization (2312.05405v1)
Abstract: On-policy reinforcement learning (RL) has become a popular framework for solving sequential decision problems due to its computational efficiency and theoretical simplicity. Some on-policy methods guarantee every policy update is constrained to a trust region relative to the prior policy to ensure training stability. These methods often require computationally intensive non-linear optimization or require a particular form of action distribution. In this work, we show that applying KL penalization alone is nearly sufficient to enforce such trust regions. Then, we show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update while adding fewer than 5% additional gradient steps in practice. The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces, is easy to implement, and produces results competitive with other trust region methods.
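The abstract describes FixPO's two-phase structure only at a high level, so here is a minimal PyTorch-style sketch of that structure under stated assumptions, not the paper's implementation. All names and hyperparameters (`fixpo_update`, `KL_TARGET`, `BETA`, `MAX_FIXUP_STEPS`) are illustrative: a penalized policy-gradient phase, followed by a "fixup" phase that takes extra gradient steps on the KL term alone until the mean KL is back inside the trust region.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Illustrative hyperparameters, not the paper's reported settings.
KL_TARGET = 0.01      # trust-region radius on the mean KL
BETA = 3.0            # KL penalty coefficient
MAX_FIXUP_STEPS = 50  # safety cap on extra fixup gradient steps

def policy_dist(policy, obs):
    """Map observations to a diagonal Gaussian action distribution."""
    mean, log_std = policy(obs)
    return Normal(mean, log_std.exp())

def fixpo_update(policy, optimizer, obs, actions, advantages, old_dist):
    """Two-phase update: KL-penalized surrogate loss, then a KL-only fixup phase.

    `old_dist` should be built from detached parameters of the pre-update policy.
    """
    # Phase 1: maximize the surrogate objective with a KL penalty.
    dist = policy_dist(policy, obs)
    ratio = (dist.log_prob(actions).sum(-1)
             - old_dist.log_prob(actions).sum(-1)).exp()
    kl = kl_divergence(old_dist, dist).sum(-1).mean()
    loss = -(ratio * advantages).mean() + BETA * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Phase 2 ("fixup"): if the penalty alone did not keep the update inside
    # the trust region, take additional gradient steps on the KL term alone
    # until the mean KL falls back below the target.
    for _ in range(MAX_FIXUP_STEPS):
        dist = policy_dist(policy, obs)
        kl = kl_divergence(old_dist, dist).sum(-1).mean()
        if kl.item() <= KL_TARGET:
            break
        optimizer.zero_grad()
        (BETA * kl).backward()
        optimizer.step()
```

In this sketch the fixup loop usually terminates after few iterations, which is consistent with the abstract's claim that the fixup phase adds fewer than 5% additional gradient steps in practice; the exact penalty-coefficient handling in FixPO may differ from the fixed `BETA` assumed here.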