Proximal Policy Optimization (PPO) Algorithm
- PPO is a policy-gradient algorithm featuring a clipped surrogate objective that ensures controlled updates and improved sample efficiency in continuous control tasks.
- It employs multiple epochs of stochastic gradient ascent on on-policy samples while balancing exploration through entropy regularization and careful hyperparameter tuning.
- Extensions like IEM-PPO and PPO-KL enhance exploration and stability, establishing PPO as a benchmark method for reinforcement learning research.
Proximal Policy Optimization (PPO) is a policy-gradient algorithm in reinforcement learning that achieves stable and efficient updates by constraining the deviation of successive policies through a clipped surrogate objective. PPO is renowned for its empirical performance and simplicity, especially in high-dimensional continuous control tasks. It is widely regarded as a standard for on-policy reinforcement learning, serving as a benchmark and foundation for numerous methodological innovations, extensions, and theoretical investigations (Schulman et al., 2017, Zhang et al., 2020).
1. Algorithmic Formulation and Surrogate Objective
PPO's core mechanism is to maximize an objective that balances improving estimated returns with restricting policy updates to remain close to the data-generating policy. The defining feature is the "clipped" surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $\hat{A}_t$ is an estimator of the advantage function and $\epsilon$ is a trust-region parameter (typically 0.1–0.3) that bounds the policy ratio. The policy and value functions are updated using multiple epochs of stochastic gradient ascent over batches of on-policy samples, enabling efficient reuse of data (Schulman et al., 2017).
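As a concrete illustration, the per-sample clipped term can be computed as follows (a minimal NumPy sketch, not a full training implementation):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio past 1 + eps earns no extra credit;
# with a negative advantage, the min takes the more pessimistic (unclipped) value.
pos = clipped_surrogate(np.array([0.5, 1.0, 1.5]), np.ones(3))
neg = clipped_surrogate(np.array([0.5, 1.0, 1.5]), -np.ones(3))
```

Note the asymmetry: for negative advantages the objective is not capped below, so moves that make bad actions more likely are penalized without bound.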
Typical pseudocode:
```
Initialize θ₀ (policy parameters), φ₀ (value parameters)
for k = 0, 1, 2, ...:
    1. Collect trajectories D_k by running π_{θ_k}
    2. Compute rewards-to-go R̂_t and advantage estimates Â_t
    3. Update policy:
       θ_{k+1} = argmax_θ (1 / (|D_k| T)) Σ_{(s_t, a_t) ∈ D_k}
                 min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)
    4. Update value function by fitting V_φ(s_t) to R̂_t
```
Entropy regularization is often used (weighted by a bonus coefficient) to encourage policy exploration. Key hyperparameters include the clipping range $\epsilon$, the policy and value learning rates, the entropy bonus coefficient, the discount factor $\gamma$, and the GAE parameter $\lambda$.
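The advantage computation in step 2 can be sketched with a simplified, single-trajectory version of Generalized Advantage Estimation (the default $\gamma$ and $\lambda$ below are common choices, not values fixed by the text):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has length len(rewards) + 1 (bootstrap value appended)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # One-step TD residual, then exponentially weighted backward sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# With a zero value function, advantages reduce to discounted reward sums.
adv = gae_advantages(np.array([1.0, 1.0, 1.0]), np.zeros(4))
```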
2. Theoretical Insights and Approximate Trust-Region
PPO was originally motivated as a first-order variant of Trust Region Policy Optimization (TRPO), which imposes a hard Kullback-Leibler (KL) constraint to ensure monotonic expected return improvement. Rather than solving a constrained problem, PPO approximates the trust region by clipping the policy ratio, directly penalizing updates that would violate the trust region (Schulman et al., 2017). When $r_t(\theta)$ remains within $[1-\epsilon, 1+\epsilon]$, the surrogate reduces to a standard objective; outside this range, the min operation “flattens” the learning signal, discouraging large steps that could harm policy performance.
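This flattening can be checked numerically: a finite-difference sketch (illustrative only) shows the objective's slope vanishing once the ratio leaves the clipping interval for a positive advantage:

```python
import numpy as np

def clip_obj(r, A, eps=0.2):
    """Scalar clipped surrogate for a single sample."""
    return min(r * A, float(np.clip(r, 1.0 - eps, 1.0 + eps)) * A)

def grad(r, A, eps=0.2, h=1e-6):
    """Central finite-difference derivative of the objective w.r.t. the ratio."""
    return (clip_obj(r + h, A, eps) - clip_obj(r - h, A, eps)) / (2 * h)

g_inside = grad(1.0, 1.0)   # inside the trust region: slope follows A
g_outside = grad(1.5, 1.0)  # beyond 1 + eps: the signal is flattened to zero
```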
While PPO's clipped objective is a heuristic, it performs comparably to or better than TRPO in empirical settings, with substantial improvements in simplicity and wall-clock efficiency. However, unlike TRPO, PPO does not provide a formal guarantee of monotonic improvement in general, although monotonicity can sometimes be established for related variants using alternative geometric or trust-region penalties (Lascu et al., 4 Jun 2025, Zhu et al., 2020).
3. Exploration Characteristics and Limitations
Standard PPO employs a Gaussian policy for continuous actions, sampling $a_t \sim \mathcal{N}(\mu_\theta(s_t), \sigma^2 I)$, which leads to isotropic exploration. This mechanism covers the action space uniformly but does not target high-uncertainty or high-potential areas, resulting in inefficient utilization of samples and susceptibility to local optima (Zhang et al., 2020). Sensitivity to the exploration scale (the standard deviation $\sigma$) means that exploration can either be too narrow (leading to premature convergence) or overly broad (introducing high variance in returns).
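A minimal sketch of this sampling scheme (assuming a hypothetical 3-dimensional action space and a fixed, state-independent $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mean, sigma):
    """Isotropic Gaussian exploration: a ~ N(mu(s), sigma^2 * I)."""
    return mean + sigma * rng.standard_normal(mean.shape)

mu = np.zeros(3)  # hypothetical policy mean for one state
# A too-small sigma explores a narrow region (premature convergence);
# a too-large sigma injects high variance into returns.
samples = np.array([sample_action(mu, 0.1) for _ in range(2000)])
```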
Attempts to alleviate these issues have included uncertainty-based intrinsic rewards, learned curiosity signals, or enhancement of exploration through dedicated modules. For example, the Intrinsic Exploration Module (IEM) in IEM-PPO uses an auxiliary neural network to estimate transition uncertainty, and rewards the agent for entering novel states, thereby directly augmenting PPO's exploration capacity (Zhang et al., 2020).
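A schematic version of such an intrinsic bonus (a prediction-error stand-in, not IEM's actual learned module; `beta` is a hypothetical weighting coefficient):

```python
import numpy as np

def intrinsic_bonus(pred_next_state, true_next_state, beta=0.1):
    """Prediction-error intrinsic reward: larger model error => more novel state.

    Simplified stand-in for a learned transition-uncertainty estimate; the
    actual IEM trains an auxiliary neural network (Zhang et al., 2020)."""
    error = np.sum((pred_next_state - true_next_state) ** 2)
    return beta * error

# Extrinsic reward of 1.0 augmented by the novelty bonus for this transition.
r_total = 1.0 + intrinsic_bonus(np.zeros(2), np.array([0.3, 0.4]))
```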
4. Algorithmic Variants and Extensions
Numerous PPO-based variants have been proposed to address specific limitations or exploit additional structure:
- IEM-PPO: Augments PPO with an intrinsic uncertainty-based reward computed by a learned transition-uncertainty estimator, resulting in improved sample efficiency and robustness on MuJoCo benchmarks (Zhang et al., 2020).
- ICM-PPO: Incorporates curiosity-driven intrinsic rewards via forward prediction of observations.
- PPO-KL: Replaces clipping by a soft or adaptive KL-divergence penalty, controlling the update step by regularizing toward the previous policy (Guo et al., 2021).
- PPO-Clip in RKHS: Correntropy-induced metrics (CIM) have been proposed to replace the asymmetric KL penalty with a symmetric RKHS-based metric for trust-region regularization (Guo et al., 2021).
- Pb-PPO: Applies a bi-level, preference-optimization approach, dynamically selecting the clipping bound via a multi-armed bandit scheme to maximize cumulative return (Zhang et al., 2023).
- GI-PPO: Integrates analytical gradients in differentiable environments, adaptively controlling the contribution of reparameterization-based policy improvement by estimating analytical-gradient reliability (Son et al., 2023).
- ToPPO: Enables theoretically justified, trust-region-aware off-policy data reuse by mixing current and past-policy trajectories in the PPO update, with monotonicity guarantees subject to a rolling trust-region policy set (Gan et al., 2024).
- Functional/Smoothed Clipping: The PPOS algorithm substitutes tanh-based smooth clipping for the original flat regime, improving gradient flow and convergence (Zhu et al., 2020).
- Advantage Modulation (AM-PPO): Applies adaptive, nonlinear scaling to advantage estimates prior to policy and value updates for improved optimization stability and sample efficiency (Sane, 21 May 2025).
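As one concrete example among these variants, the penalty-based alternative (PPO-KL) can be sketched with the adaptive-coefficient rule from Schulman et al. (2017):

```python
import numpy as np

def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate with a KL penalty in place of clipping: E[r * A] - beta * KL."""
    return np.mean(ratio * advantage) - beta * kl

def adapt_beta(beta, kl, kl_target=0.01):
    """Adaptive rule: tighten the penalty when KL overshoots the target,
    relax it when KL undershoots (Schulman et al., 2017)."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta

beta = adapt_beta(1.0, kl=0.05)  # KL too large, so the penalty doubles
```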
5. Empirical Evaluation and Performance in Continuous Control
PPO provides state-of-the-art performance on standard continuous control benchmarks (e.g., MuJoCo's HalfCheetah-v2, Hopper-v2, Walker2d-v2, Swimmer-v2) (Zhang et al., 2020, Schulman et al., 2017). Notable outcomes from recent experimentation include:
| Algorithm | HalfCheetah | Swimmer | Hopper | Walker2d |
|---|---|---|---|---|
| PPO | 4824 ± 545 | 242 ± 4 | 2060 ± 883 | 2761 ± 1203 |
| ICM-PPO | 4834 ± 571 | 324 ± 3 | 2018 ± 828 | 2801 ± 1214 |
| IEM-PPO | 5074 ± 360 | 368 ± 2 | 2159 ± 770 | 2971 ± 1108 |
IEM-PPO demonstrates higher final return and lower variance relative to both vanilla PPO and curiosity-based alternatives. Enhanced exploration leads to accelerated early learning and greater asymptotic performance, albeit at the cost of an approximately 20–30% increase in training time (due to additional network forward/backward passes per step). Robustness to choice of action noise is also improved under uncertainty-augmented exploration (Zhang et al., 2020).
6. Practical Considerations and Best Practices
- Hyperparameter selection: The clipping parameter $\epsilon$ should typically fall within [0.1, 0.3]. Separate Adam learning rates are used for the policy and value networks, commonly on the order of $10^{-4}$–$10^{-3}$. Entropy regularization (coefficient within [0.0, 0.01]) tangibly aids exploration and policy entropy maintenance.
- Policy/value architecture: Two-layer neural networks with moderate width (e.g., 64 units and tanh activation) are commonly employed.
- Normalization and regularization: Normalizing advantages and value targets, together with early-stopping criteria based on average KL divergence (e.g., halting updates if KL exceeds $0.015$), contributes to improved training stability.
- When to deploy enhanced exploration: In tasks with dense rewards but significant risk of suboptimal convergence due to poor exploration, augmenting PPO with uncertainty-based intrinsic rewards (e.g., IEM-PPO) is warranted. This is especially vital when operating in complex, high-dimensional environments where isotropic Gaussian action sampling is inefficient.
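The KL-based early-stopping criterion can be sketched as follows (a sample-based KL estimate over stored log-probabilities; the $0.015$ threshold follows the practice noted above):

```python
import numpy as np

def approx_kl(logp_old, logp_new):
    """Sample-based estimate of KL(pi_old || pi_new) from stored log-probs."""
    return float(np.mean(logp_old - logp_new))

def should_stop(logp_old, logp_new, kl_limit=0.015):
    """Halt further epoch updates for this batch once mean KL exceeds the limit."""
    return approx_kl(logp_old, logp_new) > kl_limit

# Mean KL here is 0.035 > 0.015, so updating would stop for this batch.
stop = should_stop(np.array([-1.0, -1.0]), np.array([-1.05, -1.02]))
```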
7. Limitations, Open Questions, and Future Directions
PPO's primary weakness is its assumption that isotropic Gaussian-action exploration suffices for efficient learning. This regime fails in environments with multimodal reward landscapes or significant local optima (Zhang et al., 2020). Extensions leveraging multi-modal policies, automatic adaptation of intrinsic reward coefficients, and integration into hierarchical or multi-agent frameworks remain open avenues.
Potential future research includes:
- Incorporation of multi-modal or non-Gaussian policy families to enhance exploration in multi-peak reward structures.
- Automated, task-driven tuning of reward balancing parameters.
- Extension of uncertainty-driven exploration to settings requiring coordinated or temporally extended exploration, such as in hierarchical or multi-agent RL.
PPO retains its centrality owing to its empirical reliability, theoretical connection to trust region schemes, and extensibility to a wide array of domains and architectures (Schulman et al., 2017, Zhang et al., 2020, Gan et al., 2024).