Proximal Policy Optimisation (PPO)
- Proximal Policy Optimisation is a reinforcement learning method that uses clipped surrogate objectives to limit policy update magnitudes.
- The algorithm alternates between collecting data and performing repeated first-order gradient updates, achieving high sample efficiency and stability.
- Empirical results show PPO's robust performance across diverse control tasks and its applicability to both discrete and continuous action domains.
Proximal Policy Optimisation (PPO) is an influential family of policy gradient algorithms for reinforcement learning (RL) that combines aspects of trust-region methods with first-order stochastic optimization to achieve high empirical sample efficiency and robustness across a wide range of control tasks. Introduced by Schulman et al. (2017), PPO has become the predominant on-policy RL algorithm for both continuous and discrete domains due to its simplicity, generality, and stable performance characteristics.
1. Surrogate Objectives and Core Algorithmic Principle
PPO centers its updates around surrogate objectives that maintain policy iterates within a soft trust region, preventing destructive updates but allowing effective optimisation with standard stochastic gradient methods. The canonical form is the "clipped" surrogate objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\,\hat{A}_t\right)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the per-timestep probability ratio and $\hat{A}_t$ is an estimator of the advantage function (Schulman et al., 2017). The hyperparameter $\epsilon$ bounds the allowed policy deviation per update.
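As an illustration, the following is a minimal sketch of this objective in PyTorch; the tensor names (`new_log_probs`, `old_log_probs`, `advantages`) are assumed inputs rather than notation from the paper:

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP: mean over timesteps of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    # Probability ratio r_t(theta), computed in log space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum is a pessimistic bound that removes any incentive
    # to push the ratio outside [1 - eps, 1 + eps].
    return torch.min(unclipped, clipped).mean()
```

Gradient ascent on this quantity (equivalently, descent on its negation) constitutes the clipped PPO policy update.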
A KL-penalty alternative augments the REINFORCE/actor-critic surrogate with an explicit regularizer,

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t - \beta\, \mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\big]\right].$$

Unlike TRPO, PPO does not rely on constrained second-order updates; instead, the coefficient $\beta$ is adaptively tuned to keep the empirical KL divergence near a target value $d_{\text{targ}}$ (Schulman et al., 2017).
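The adaptive schedule doubles or halves $\beta$ depending on whether the measured KL divergence overshoots or undershoots the target; a minimal sketch (the threshold factor 1.5 and multiplier 2 follow Schulman et al., 2017):

```python
def update_kl_penalty(beta, measured_kl, kl_target):
    """Adapt the KL-penalty coefficient beta after each policy update."""
    if measured_kl > 1.5 * kl_target:
        beta *= 2.0   # the policy moved too far: penalise KL more strongly
    elif measured_kl < kl_target / 1.5:
        beta /= 2.0   # the policy barely moved: relax the penalty
    return beta
```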
2. Theoretical Motivation and Intuition
PPO is motivated by challenges inherent to vanilla policy gradient and strict trust region algorithms. Unconstrained gradient steps can push the policy distribution too far, causing severe performance collapse. Trust Region Policy Optimization (TRPO) constrains updates to a region of small average KL-divergence, but introduces complex constrained optimization machinery.
PPO approximates the trust region principle with either the clipped ratio or the KL regularizer, achieving a soft (per-sample) bound on policy update magnitude. The clipping operation removes any incentive to move the per-state probability ratio outside $[1-\epsilon,\ 1+\epsilon]$, preventing ratio explosions due to rare-event likelihoods while preserving the simplicity of first-order updates and general applicability to both discrete and continuous action spaces. This property underpins the empirical stability and broad adoption of PPO (Schulman et al., 2017).
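A small numerical check of this effect (values chosen purely for illustration): with $\epsilon = 0.2$ and a positive advantage, increasing the ratio beyond $1+\epsilon = 1.2$ no longer increases the objective.

```python
import numpy as np

eps, advantage = 0.2, 1.0
for ratio in [0.5, 1.0, 1.2, 1.5, 3.0]:
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    objective = min(ratio * advantage, clipped)
    print(f"ratio={ratio:3.1f} -> objective={objective:.2f}")
# ratio=0.5 -> 0.50, ratio=1.0 -> 1.00, ratio=1.2 -> 1.20,
# ratio=1.5 -> 1.20, ratio=3.0 -> 1.20   (flat beyond 1 + eps)
```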
3. Algorithmic Workflow and Implementation
PPO alternates between two phases:
- Data Collection: Using the current policy $\pi_{\theta_{\text{old}}}$, interact with the environment to gather a trajectory batch of $T$ timesteps.
- Optimisation: For $N$ epochs, repeatedly sample minibatches of size $M$ from the collected batch, compute the importance ratio $r_t(\theta)$ and the surrogate loss $L^{CLIP}$ for each minibatch, and update $\theta$ via first-order gradient ascent (typically with Adam). The value function is concurrently updated with a squared-error loss, and optionally an entropy bonus is added to promote exploration (Schulman et al., 2017); a minimal sketch of one such minibatch update follows this list.
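The sketch below combines the clipped surrogate, the value loss, and the entropy bonus into a single minibatch update, assuming PyTorch actor-critic networks `policy` (returning a `torch.distributions` object) and `value_fn`; the coefficients `vf_coef` and `ent_coef` are generic placeholders rather than values from the paper:

```python
import torch
import torch.nn.functional as F

def ppo_minibatch_update(policy, value_fn, optimizer, batch,
                         eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """One gradient step on a PPO minibatch: clipped surrogate + value loss + entropy bonus."""
    dist = policy(batch["obs"])                      # action distribution pi_theta(.|s)
    new_log_probs = dist.log_prob(batch["actions"])
    ratio = torch.exp(new_log_probs - batch["old_log_probs"])

    # Clipped surrogate objective (to be maximised).
    unclipped = ratio * batch["advantages"]
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * batch["advantages"]
    policy_obj = torch.min(unclipped, clipped).mean()

    # Squared-error value loss and entropy bonus.
    value_loss = F.mse_loss(value_fn(batch["obs"]), batch["returns"])
    entropy = dist.entropy().mean()

    # Minimise the negated objective plus auxiliary terms.
    loss = -policy_obj + vf_coef * value_loss - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```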
The pseudocode structure is as follows:
for iteration = 1, 2, ... do
    1. Collect trajectories {tau_i} with policy pi_{theta_old}, for a total of T timesteps
    2. Compute GAE advantages Ahat_t and value targets R_t for each timestep t
    3. theta <- theta_old
    4. for epoch = 1 ... N do
           Shuffle the T samples into minibatches of size M
           for each minibatch B do
               - Compute r_t(theta) for t in B
               - Compute the loss L^{CLIP}_B(theta)
               - Backpropagate; update theta and the value-function parameters
           end
       end
    5. theta_old <- theta
end
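Step 2 of the loop relies on Generalised Advantage Estimation (GAE); a minimal, self-contained sketch is given below (it assumes a single trajectory segment with a bootstrap value appended to `values`; episode-termination masking is omitted for brevity):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages Ahat_t and value targets R_t for one trajectory segment.

    `values` must contain one extra bootstrap entry V(s_T) at the end,
    i.e. len(values) == len(rewards) + 1.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: Ahat_t = delta_t + gamma * lambda * Ahat_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]   # value-function targets R_t
    return advantages, returns
```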
Reported defaults for the MuJoCo continuous-control experiments include $\epsilon = 0.2$, $\gamma = 0.99$, GAE $\lambda = 0.95$, horizon $T = 2048$, minibatch size $M = 64$, $N = 10$ epochs per batch, and an Adam step size of $3 \times 10^{-4}$; the Atari configuration uses a shorter horizon of 128 steps per actor, $\epsilon = 0.1$, and 3 epochs per batch (Schulman et al., 2017).
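Collected as a configuration sketch (the values mirror the MuJoCo defaults above; the dictionary keys are illustrative, not taken from the paper):

```python
ppo_mujoco_defaults = {
    "horizon_T": 2048,       # timesteps collected per policy iteration
    "clip_epsilon": 0.2,     # clipping range for the probability ratio
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE parameter
    "num_epochs_N": 10,      # optimisation epochs per collected batch
    "minibatch_size_M": 64,
    "adam_step_size": 3e-4,
}
```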
4. Empirical Results and Performance Trade-offs
PPO matches or surpasses TRPO on benchmark suites such as MuJoCo robotic locomotion (Hopper, Walker2d, HalfCheetah, Ant) and Atari 2600 discrete-control environments. It outperforms alternative on-policy policy gradient methods on final reward, sample complexity, and wall-clock speed, in part because it avoids expensive second-order computation (Schulman et al., 2017). The clipped surrogate offers wall-clock performance comparable to or better than simple actor-critic methods, with greater robustness.
The KL-penalty variant typically yields slightly lower final reward compared to clipping; however, it allows for more tunable tradeoffs in update size. In practice, PPO requires fewer simulator steps than TRPO to reach a given performance threshold and is significantly faster per iteration, particularly in GPU-accelerated settings.
Main empirical findings:
- Sample efficiency: fewer environment interactions are needed to reach a given performance level than with A2C, ACKTR, or TRPO.
- Stability: consistent updates prevent collapse even in high-variance reward settings (Schulman et al., 2017).
5. Strengths, Weaknesses, and Hyperparameter Sensitivity
Strengths
- Simplicity: Requires only standard stochastic gradients; no conjugate gradient, Fisher-matrix or Hessian computations.
- Generality: Works across discrete and continuous domains without domain-specific adaptation.
- Stability: Surrogates with clipping/KL-penalty deliver soft trust regions, curbing destructive updates and promoting safe exploration.
- Empirical sample efficiency: On numerous tasks, PPO achieves or exceeds the efficiency of both first- and second-order methods (Schulman et al., 2017).
Weaknesses
- Hyperparameter Sensitivity: Performance is sensitive to the clipping parameter $\epsilon$, the batch size, and the learning rate. Poor choices can result in suboptimal learning or instability.
- Lack of Hard Guarantee: The surrogate mechanism does not formally guarantee monotonic improvement (in contrast to TRPO), and—under rare conditions—may diverge if ratios are not correctly bounded (Schulman et al., 2017).
- Empirical Trade-offs: The clipped surrogate generally outperforms the KL-penalty variant, but practitioners must still choose between the two formulations and tune the associated hyperparameters.
6. Extensions and Influence
PPO's introduction catalyzed a large suite of extensions across distributed, multi-agent, and off-policy settings, as well as innovations in surrogate objectives and exploration strategies. The mixed-distributed PPO variant (MDPPO) introduces multiple independently updated policies that share successful trajectories, expediting convergence and further stabilizing updates, especially in sparse-reward and high-variance domains (Zhang et al., 2019). Other work explores alternative penalty methods, interior-point surrogates, predictive-processing losses, and integration with analytical gradients for differentiable environments. These extended frameworks continue to leverage PPO's core principle of robust, stable first-order optimization while addressing limitations such as sample inefficiency, poor exploration, and difficult credit assignment (Schulman et al., 2017; Zhang et al., 2019).
7. Comparative Summary and Adoption
PPO has become the default RL algorithm in many practical and research applications, combining robust empirical performance with ease of use and broad applicability. Its clipped surrogate mechanism effectively augments the REINFORCE/A2C family, enabling multiple SGD passes per batch while mitigating the risk of overfitting or policy collapse—issues characteristic of classical policy gradient approaches. Its strong empirical record across continuous control and high-dimensional, discrete-action environments cements its status as a standard baseline and point of departure for contemporary RL research and benchmarking (Schulman et al., 2017).
References:
- Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
- Zhang et al. (2019). "Proximal Policy Optimization with Mixed Distributed Training."