Simple Policy Optimization (2401.16025v6)
Abstract: As one of the most important and influential algorithms in reinforcement learning, the Proximal Policy Optimization (PPO) algorithm has demonstrated outstanding performance across various domains. It simplifies the constrained optimization procedure of the Trust Region Policy Optimization (TRPO) algorithm by clipping the importance sampling ratio. However, this simplification with ratio clipping does not always effectively enforce trust region constraints. In this paper, we introduce an algorithm named \textit{Simple Policy Optimization} (SPO), which incorporates a novel clipping method for the KL divergence between the old and new policies. Extensive experimental results in both \textit{Atari 2600} and \textit{MuJoCo} environments show that, compared to PPO, SPO achieves better sample efficiency, extremely low KL divergence, and higher policy entropy, while also being robust to increases in network depth or complexity. More importantly, SPO maintains the simplicity of an unconstrained first-order algorithm. Our code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
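The abstract contrasts PPO's ratio clipping with SPO's clipping of the KL divergence between the old and new policies. As a rough illustration only, the PyTorch sketch below implements the standard PPO clipped surrogate alongside an assumed KL-thresholded variant; the `kl_clipped_loss` form and the `kl_max` parameter are illustrative assumptions, not the SPO objective from the paper (see the linked repository for the authors' actual implementation).

```python
# Minimal sketch: PPO's ratio-clipped surrogate, shown for contrast with a
# KL-based clipping idea. The KL-thresholded variant below is an assumed,
# illustrative form and is NOT the SPO objective defined in the paper.
import torch


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)  # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


def kl_clipped_loss(logp_new, logp_old, advantages, kl_max=0.02):
    """Illustrative surrogate that masks out samples whose per-sample KL
    estimate exceeds a threshold (assumed form, not the authors' method)."""
    ratio = torch.exp(logp_new - logp_old)
    # Nonnegative estimator of KL(old || new): (r - 1) - log r
    approx_kl = (ratio - 1.0) - (logp_new - logp_old)
    mask = (approx_kl <= kl_max).float()
    return -(mask * ratio * advantages).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    logp_new = logp_old + 0.1 * torch.randn(8)
    adv = torch.randn(8)
    print(ppo_clip_loss(logp_new, logp_old, adv))
    print(kl_clipped_loss(logp_new, logp_old, adv))
```

The intended point of the contrast is that PPO bounds the probability ratio itself, whereas SPO constrains the divergence between policies directly; the masking scheme above is only one simple way to express such a constraint as an unconstrained first-order loss.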
Authors: Zhengpeng Xie, Qiang Zhang, Renjing Xu