
Guaranteed Trust Region Optimization via Two-Phase KL Penalization (2312.05405v1)

Published 8 Dec 2023 in cs.LG

Abstract: On-policy reinforcement learning (RL) has become a popular framework for solving sequential decision problems due to its computational efficiency and theoretical simplicity. Some on-policy methods guarantee that every policy update is constrained to a trust region relative to the prior policy to ensure training stability. These methods often require computationally intensive non-linear optimization or a particular form of action distribution. In this work, we show that applying KL penalization alone is nearly sufficient to enforce such trust regions. We then show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update, while adding fewer than 5% additional gradient steps in practice. The resulting algorithm, which we call FixPO, can train a variety of policy architectures and action spaces, is easy to implement, and produces results competitive with other trust region methods.
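The two-phase idea in the abstract can be sketched concretely: after a penalized policy-gradient step, a "fixup" loop takes extra gradient steps on the KL penalty alone until the trust-region bound holds. The sketch below is illustrative, not the paper's implementation; it assumes diagonal Gaussian policies with a shared fixed standard deviation, and the bound, learning rate, and function names are assumptions.

```python
# Minimal sketch of a two-phase KL-penalized update in the spirit of FixPO.
# Assumptions (not from the paper's code): diagonal Gaussian policies with a
# shared fixed std, an illustrative KL bound EPS_KL, and a plain gradient
# descent fixup loop.
import numpy as np

SIGMA = 1.0     # fixed policy std (assumption)
EPS_KL = 0.01   # trust-region bound on the mean KL (illustrative value)
LR = 0.5        # fixup-phase learning rate (illustrative value)

def kl_gaussian(mu_new, mu_old, sigma=SIGMA):
    """Mean KL(pi_new || pi_old) over action dims for Gaussians with shared std."""
    return float(np.mean((mu_new - mu_old) ** 2) / (2 * sigma ** 2))

def fixup_phase(mu_new, mu_old, eps=EPS_KL, lr=LR, max_steps=100):
    """Phase 2: extra gradient steps on the KL term alone until the bound holds."""
    steps = 0
    while kl_gaussian(mu_new, mu_old) > eps and steps < max_steps:
        # Gradient of the mean KL with respect to the new policy mean.
        grad = (mu_new - mu_old) / (SIGMA ** 2 * mu_new.size)
        mu_new = mu_new - lr * grad
        steps += 1
    return mu_new, steps

mu_old = np.zeros(4)                        # prior policy mean
mu_new = np.array([0.3, -0.2, 0.1, 0.4])    # mean after a penalized PG step
mu_fixed, n_steps = fixup_phase(mu_new, mu_old)
assert kl_gaussian(mu_fixed, mu_old) <= EPS_KL
```

Because the fixup loop only runs when the penalized update overshoots the bound, it adds gradient steps only occasionally, which is consistent with the paper's reported overhead of under 5% extra gradient steps.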

