
On-Policy Optimization in RL

Updated 30 November 2025
  • On-policy optimization is a reinforcement learning framework that updates policies using data exclusively generated by the current or recent policies.
  • It employs surrogate objective functions with trust-region constraints and clipping mechanisms to ensure monotonic improvement at each update.
  • Variants such as TRPO, PPO, and GePPO enhance stability and sample efficiency through trust-region regularization, ratio clipping, and controlled sample reuse.

On-policy optimization denotes a family of reinforcement learning (RL) algorithms wherein all policy updates are based strictly on data generated by the current policy or a small mixture of recent policies. This paradigm underpins key trust-region and policy-gradient approaches in sequential decision-making, favoring robust monotonic improvement at each step, often at the expense of sample efficiency relative to off-policy methods. On-policy optimization encompasses explicit policy-iteration, policy-gradient, and their modern variants equipped with trust-region and clipping mechanisms. Theoretical guarantees for monotonic improvement, the precise structure of surrogate objectives, trade-offs between step-size and policy divergence, and recent extensions for accelerated or robust learning form the foundations of the field.

1. Core Principles and Theoretical Foundations

On-policy optimization is fundamentally derived from policy improvement theory in MDPs, which quantifies the expected return difference between a new candidate policy π and the current policy πₖ via a surrogate lower bound. For discounted problems,

$$J(\pi) - J(\pi_k) \;\geq\; \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi_k}}\!\left[\frac{\pi(a|s)}{\pi_k(a|s)}\,A^{\pi_k}(s,a)\right] \;-\; \frac{2\gamma\, C^{\pi,\pi_k}}{(1-\gamma)^2}\,\mathbb{E}_{s\sim d^{\pi_k}}\!\left[\mathrm{TV}(\pi,\pi_k)(s)\right]$$

where $A^{\pi_k}(s,a)$ is the advantage, $C^{\pi,\pi_k}$ is a bound on advantage variation, and $\mathrm{TV}$ is the total-variation distance over actions. This guarantee ensures that, provided the policy shift is constrained (i.e., the TV or KL divergence is small), every update results in non-negative improvement (Queeney et al., 2021).
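
The bound can be estimated directly from on-policy samples. The NumPy sketch below is a minimal illustration, assuming precomputed advantage estimates and per-state total-variation estimates; the coefficient C is a placeholder that must upper-bound the advantage variation in practice.

```python
import numpy as np

def surrogate_lower_bound(ratios, advantages, tv_per_state, gamma, C):
    """Monte-Carlo estimate of the policy-improvement lower bound.

    ratios       : pi(a|s) / pi_k(a|s) for sampled (s, a) pairs
    advantages   : estimates of A^{pi_k}(s, a) for the same samples
    tv_per_state : estimates of TV(pi, pi_k)(s) for sampled states
    C            : assumed upper bound on advantage variation (placeholder)
    """
    surrogate = np.mean(np.asarray(ratios) * np.asarray(advantages)) / (1.0 - gamma)
    penalty = 2.0 * gamma * C / (1.0 - gamma) ** 2 * np.mean(tv_per_state)
    return surrogate - penalty

# At pi == pi_k the ratios are 1, the advantages average to ~0, and TV is 0,
# so the estimate is approximately zero, consistent with J(pi_k) - J(pi_k) = 0.
```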

To maintain monotonic improvement, modern on-policy algorithms constrain policy updates via explicit trust-region penalties or surrogate clipping:

  • Surrogate Optimization: Replace $J(\pi)$ with a sample-based lower bound or a local linear/quadratic surrogate, maximizing tractable objectives that match gradients at the current policy.
  • Trust-Region Constraint: Enforce a KL-divergence or total-variation constraint on $\pi(\cdot|s)$ to limit distributional shift per update (Schulman et al., 2015); a minimal check of this kind is sketched after this list.
  • Clipping: Instead of explicit constraints, clip the likelihood ratio between the candidate and behavior policy within a set range, thereby approximating a trust region (Schulman et al., 2017).
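
As a concrete illustration of the trust-region idea, the sketch below checks whether the state-averaged KL divergence between two categorical policies stays within an assumed bound delta; this is a hedged toy check, not the constrained optimizer TRPO actually uses (which relies on a natural-gradient step with line search).

```python
import numpy as np

def within_trust_region(p_old, p_new, delta=0.01):
    """Check that the state-averaged KL(pi_old || pi_new) stays below the bound delta."""
    per_state_kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return per_state_kl.mean() <= delta

# Toy example: two states, three discrete actions
p_old = np.array([[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]])
p_new = np.array([[0.45, 0.35, 0.20], [0.55, 0.25, 0.20]])
print(within_trust_region(p_old, p_new))   # True for this small perturbation
```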

Extensions to mixtures of recent policies or off-policy data require additional penalties for the increased statistical divergence, but the same lower-bounding and monotonicity logic applies (Queeney et al., 2021, Iwaki et al., 2017).

2. Algorithmic Structures: Surrogate Objectives and Update Rules

The operational workflow of on-policy optimization typically involves the following phases:

  1. On-Policy Data Collection: Interact with the environment under policy πₖ for $n$ steps, generating a batch for update.
  2. Advantage Estimation: Compute advantage estimates, e.g., via Generalized Advantage Estimation (GAE) or group-relative/empirical means (Schulman et al., 2017, Mroueh et al., 28 May 2025).
  3. Surrogate Objective Construction: Formulate the core optimization objective, such as:

$$L^{\text{PPO}}_k(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi_k}}\!\left[\min\Big(r_k(s,a)\,A^{\pi_k}(s,a),\; \mathrm{clip}\big(r_k(s,a),\,1-\epsilon,\,1+\epsilon\big)\,A^{\pi_k}(s,a)\Big)\right]$$

where $r_k(s,a) = \pi(a|s)/\pi_k(a|s)$ (Schulman et al., 2017, Queeney et al., 2021).

  4. Policy Update with Regularization: Optimize the surrogate via stochastic gradient methods, subject to TV/KL or clip-based trust-region enforcement; a sketch of steps 2–3 appears below.
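
A minimal NumPy sketch of steps 2–3, assuming a single trajectory with a bootstrap value appended (so `values` has length T + 1): it computes GAE advantages and evaluates the clipped surrogate $L^{\text{PPO}}_k$ on precomputed likelihood ratios. It illustrates the quantities only and is not a differentiable training graph.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory; `values` has length T + 1."""
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Sample estimate of L^PPO_k: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratios, advantages = np.asarray(ratios), np.asarray(advantages)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratios * advantages, clipped * advantages))
```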

Proximal algorithms (TRPO, PPO, GePPO) and their derivatives perform multiple epochs of minibatch updates per data batch, computing gradients with respect to the surrogate, and regulating the policy shift via hyperparameters such as the KL bound $\delta$ or clip range $\epsilon$ (Schulman et al., 2015, Queeney et al., 2021).
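
A hedged PyTorch sketch of this multi-epoch minibatch loop with an approximate-KL early stop; the `policy` object and its `log_prob(states, actions)` method are assumed interfaces, and `target_kl` is an illustrative threshold rather than a value prescribed by the cited papers.

```python
import torch

def ppo_update(policy, optimizer, batch, epochs=10, minibatch_size=64,
               clip_eps=0.2, target_kl=0.01):
    """Several epochs of minibatch SGD on one on-policy batch collected under pi_k."""
    states, actions, old_log_probs, advantages = batch   # tensors stored at collection time
    n = states.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            new_log_probs = policy.log_prob(states[idx], actions[idx])   # assumed policy API
            ratios = torch.exp(new_log_probs - old_log_probs[idx])       # r_k(s, a)
            clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
            loss = -torch.min(ratios * advantages[idx],
                              clipped * advantages[idx]).mean()          # negative clipped surrogate
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Approximate KL(pi_k || pi): stop early if the policy has drifted too far.
        with torch.no_grad():
            approx_kl = (old_log_probs - policy.log_prob(states, actions)).mean()
        if approx_kl > target_kl:
            break
```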

Recent variants employ additional mechanisms:

  • Reflective Policy Optimization (RPO) introduces future-trajectory introspection via additional surrogate terms to contract the feasible policy update set, thus accelerating convergence while maintaining monotonicity (Gan et al., 6 Jun 2024).
  • Generalized PPO (GePPO) interpolates between pure on-policy and partial off-policy update regimes by incorporating mixture batches of previous policies, with adaptive learning rates to control cumulative policy divergence (Queeney et al., 2021).
  • V-MPO and related EM-style algorithms alternate between building a nonparametric “target” policy by reweighting high-advantage actions and fitting the parametric policy via constrained maximum likelihood under a KL bound (Song et al., 2019); a hedged sketch of the reweighting step appears below.
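
A hedged sketch of that reweighting (E-step): a nonparametric target is built by softmax-weighting the highest-advantage samples, and the resulting weights then serve as sample weights in the KL-constrained maximum-likelihood fit of the parametric policy. The top-half selection and the temperature eta follow the V-MPO description only loosely and should be read as assumptions.

```python
import numpy as np

def target_weights(advantages, eta=1.0, top_fraction=0.5):
    """Nonparametric target: softmax-weight the highest-advantage samples; others get weight 0."""
    adv = np.asarray(advantages, dtype=float)
    k = max(1, int(top_fraction * len(adv)))
    top_idx = np.argsort(adv)[-k:]            # indices of the top-advantage samples
    logits = adv[top_idx] / eta
    logits -= logits.max()                    # numerical stability
    weights = np.zeros_like(adv)
    weights[top_idx] = np.exp(logits) / np.exp(logits).sum()
    return weights                            # used as weights in the KL-constrained MLE step
```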

3. Stability–Efficiency Trade-offs and Sample Reuse

Balancing stability (monotonic improvement, low-variance updates) and sample efficiency is central in on-policy optimization:

  • Stability: Strict trust-region enforcement or conservative clipping ensures every update is safe, even for high-dimensional, nonconvex policies (Schulman et al., 2015, Schulman et al., 2017).
  • Sample Efficiency: Traditional on-policy methods discard previously collected data at each update; however, empirical and theoretical advances such as GePPO show that principled reuse of up to $M$ recent policy batches (with appropriate mixture weights) can boost the effective batch size or update frequency without violating stability (Queeney et al., 2021).
  • Adaptive Schedules: Methods like P3O adaptively toggle between on-policy and off-policy gradient steps based on the effective sample size (ESS), automatically regularizing the update as overlap with old policies diminishes (Fakoor et al., 2019); one standard way to compute the ESS is sketched below.
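
The effective sample size that drives such adaptive schedules has a standard self-normalized importance-sampling form; the sketch below is one common way to compute it and is not necessarily the exact estimator used in P3O.

```python
import numpy as np

def effective_sample_size(log_ratios):
    """ESS of importance weights w_i = pi(a_i|s_i) / mu(a_i|s_i), supplied as log-ratios."""
    lr = np.asarray(log_ratios, dtype=float)
    w = np.exp(lr - lr.max())       # unnormalized weights, numerically stable
    w /= w.sum()                    # self-normalized
    return 1.0 / np.sum(w ** 2)     # ranges from 1 (no overlap) to n (perfect overlap)
```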

The table below illustrates three key approaches:

| Algorithm | Policy Shift Constraint     | Sample Reuse                |
|-----------|-----------------------------|-----------------------------|
| TRPO      | KL-divergence hard bound    | No (pure on-policy)         |
| PPO       | Clipped likelihood ratio    | Bounded epoch reuse         |
| GePPO     | TV/clip + mixture penalties | Recent batch mixture ($M$)  |

By varying the depth of sample reuse and the strength of trust-region penalties, modern approaches can traverse the stability–efficiency Pareto frontier (Queeney et al., 2021, Fakoor et al., 2019).

4. Unified Perspectives: Gradient Forms and Surrogate Parameterizations

A unified analytical perspective classifies on-policy algorithms by their “form” (direction of the ascent, e.g., vanilla PG vs. advantage-centered) and “scale” (how the advantage/estimator modulates the update, including reweighting and clipping):

  • Form: PG (likelihood ratio gradient), PGPB (advantage-centered, control-variated), or entropy-augmented variants.
  • Scale: Choices range from the identity (REINFORCE) and first-order corrections (TRPO) to maximum-likelihood-inspired scaling and clipped forms (PPO) (Gummadi et al., 2022); an illustrative sketch of these scalings follows this list.
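
As an illustrative, deliberately loose reading of this form × scale view, the sketch below applies three different per-sample scalings to the same score-function direction; these stand-ins do not reproduce the exact taxonomy of Gummadi et al. (2022).

```python
import numpy as np

def update_scale(returns, advantages, ratios, kind="clipped", eps=0.2):
    """Per-sample factor multiplying the score-function direction grad log pi(a|s)."""
    returns, advantages, ratios = map(np.asarray, (returns, advantages, ratios))
    if kind == "identity":      # REINFORCE: raw return
        return returns
    if kind == "centered":      # advantage-centered (baseline / control-variate) form
        return advantages
    if kind == "clipped":       # PPO-like: the gradient vanishes once the ratio has moved
        # past the clip boundary in the direction the advantage favors
        off = ((advantages > 0) & (ratios > 1.0 + eps)) | \
              ((advantages < 0) & (ratios < 1.0 - eps))
        return np.where(off, 0.0, ratios * advantages)
    raise ValueError(kind)
```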

All classical and recent on-policy methods (REINFORCE, TRPO, PPO, V-MPO, SIL, GePPO) fit within this two-dimensional space, with semantics for bias–variance trade-off, step-size selection, and trust-region management (Gummadi et al., 2022, Schulman et al., 2017, Song et al., 2019).

5. Structural Variants and Extensions

On-policy optimization exhibits wide methodological diversity:

  • Discrete Action Parameterization: Discretizing continuous action spaces into factorized categorical distributions, optionally with ordinal stick-breaking, yields richer, more expressive policies that outperform Gaussians on high-dimensional control tasks while remaining compatible with on-policy objectives (Tang et al., 2019); see the sketch after this list.
  • Wasserstein Trust Regions: Viewing policy updates as Wasserstein gradient flows over probability measures yields a geometric justification for step-size and trust-region choices, generalizing natural gradients and suggesting new proximal surrogates (Zhang et al., 2018).
  • Entropy Regularization: Adding entropy bonuses smooths the policy landscape, aiding optimization by connecting local minima, expanding the range of stable learning rates, and improving exploration (Ahmed et al., 2018).
  • Robust MDPs and Optimistic Updates: On-policy approaches incorporating robustness to MDP kernel uncertainty make use of Fenchel-conjugate duality, maintaining monotonic regret under adversarial transition dynamics (Dong et al., 2022).
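
A hedged PyTorch sketch of the first bullet, combined with an entropy term from the third: a factorized categorical policy over discretized action bins whose per-dimension entropies can be summed into an entropy bonus. The network width, bin count, and the [-1, 1] action range are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiscretizedPolicy(nn.Module):
    """Factorized categorical policy: one independent K-way categorical per action dimension."""
    def __init__(self, obs_dim, act_dim, bins=11, hidden=64):
        super().__init__()
        self.act_dim, self.bins = act_dim, bins
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, act_dim * bins))
        # map bin index -> action value in [-1, 1]
        self.register_buffer("grid", torch.linspace(-1.0, 1.0, bins))

    def forward(self, obs):
        logits = self.net(obs).view(-1, self.act_dim, self.bins)
        return torch.distributions.Categorical(logits=logits)   # batched over dimensions

    def sample(self, obs):
        dist = self(obs)
        idx = dist.sample()                            # (batch, act_dim) bin indices
        log_prob = dist.log_prob(idx).sum(-1)          # product of per-dimension factors
        entropy = dist.entropy().sum(-1)               # e.g. add +0.01 * entropy to the surrogate
        return self.grid[idx], log_prob, entropy
```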

Recent research geometrizes policy learning (Wasserstein flows), systematizes parameter update schedule adaptation (meta-gradient optimism), and blends on- and off-policy regimes to push the sample-efficiency envelope while maintaining performance guarantees (Chelu et al., 2023, Queeney et al., 2021, Fakoor et al., 2019).

6. Empirical Outcomes and Practice

On-policy optimization algorithms, when equipped with advanced surrogates and carefully tuned trust-region or clipping parameters, consistently deliver strong performance across high-dimensional simulated benchmarks:

  • PPO and TRPO attain robust, monotonic improvements with minimal hyperparameter tuning, as shown in simulated locomotion (MuJoCo), complex games (Atari-2600), and real-world-inspired control tasks (Schulman et al., 2017, Schulman et al., 2015).
  • GePPO achieves 8–65% higher average returns than PPO with 15–77% fewer samples on diverse MuJoCo tasks, through safe sample reuse and adaptive learning-rate control (Queeney et al., 2021).
  • Reflective Policy Optimization accelerates convergence and improves sample efficiency by contracting the solution space, outperforming prior on-policy baselines across a wide suite of continuous and discrete environments (Gan et al., 6 Jun 2024).
  • Group Relative Policy Optimization avoids reliance on value critics by using on-policy group-normalized reward advantages, matching or exceeding off-policy alternatives in sequence modeling tasks (Mroueh et al., 28 May 2025); the group-relative advantage is sketched after this list.
  • Architectural integration with interpretable models (e.g., ANFIS) is feasible; PPO reliably trains neuro-fuzzy controllers with faster convergence and reduced variance versus off-policy (DQN) frameworks (Shankar et al., 22 Jun 2025).
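
The group-relative advantage used by that critic-free approach has a simple form: rewards from a group of rollouts of the same prompt or state are standardized against the group's empirical mean and standard deviation, which stands in for a learned value baseline. The sketch below is a minimal illustration; the group size and the epsilon smoothing are assumptions.

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """Critic-free advantages: standardize rewards within a group of rollouts."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. four completions sampled for the same prompt
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```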

Empirical ablations consistently support the view that tight trust-region enforcement, judicious surrogate construction, and, when possible, controlled sample reuse and regularization are critical to stability, efficiency, and ultimate performance.

7. Limitations, Open Problems, and Future Directions

Despite strong theoretical and empirical grounding, on-policy optimization methods face several well-established limitations:

  • Sample Efficiency: Pure on-policy algorithms remain substantially less sample efficient than off-policy variants unless augmented with safe sample reuse (e.g., GePPO, P3O).
  • Hyperparameter Tuning: Performance can be sensitive to trust-region or clipping thresholds, learning rates, and advantage estimation parameters; adaptive schedules remain an important area of research (Queeney et al., 2021, Schulman et al., 2017).
  • Variance/Bias Trade-off: The choice of surrogate, advantage estimator, and scaling affects the balance between variance reduction and estimator bias; principled meta-learning of update rules is an active area (Chelu et al., 2023).
  • Architectural Restrictions: Some EM-based and surrogate-tightening methods presuppose log-concave policies or may struggle with arbitrary deep-net architectures (Roux, 2016).
  • Robustness: Standard on-policy methods are not robust to model misspecification, prompting extensions for adversarial MDP uncertainty (Dong et al., 2022).
  • Generalization to Off-Policy Regimes: While analytical update rules now permit direct application of on-policy surrogates in off-policy data settings, controlling divergence penalties and estimation error remains nontrivial (Li et al., 2021).

Future research targets include deeper meta-adaptation of optimization schedules (optimism/adaptivity), robustified policy surrogates, principled blending of on- and off-policy data, interpretability, and unifying geometric perspectives on policy space optimization. The interplay between surrogate construction, trust-region management, and algorithmic acceleration continues to drive innovation in on-policy RL.


References:

(Schulman et al., 2015, Schulman et al., 2017, Queeney et al., 2021, Gan et al., 6 Jun 2024, Gummadi et al., 2022, Tang et al., 2019, Zhang et al., 2018, Mroueh et al., 28 May 2025, Song et al., 2019, Shankar et al., 22 Jun 2025, Fakoor et al., 2019, Dong et al., 2022, Ahmed et al., 2018, Chelu et al., 2023, Roux, 2016, Li et al., 2021, Dann et al., 2023, Iwaki et al., 2017)
