PPO and GRPO in Deep Reinforcement Learning
- PPO and GRPO are advanced policy gradient methods in deep RL that use clipped updates and group normalization to enhance stability.
- PPO simplifies trust region methods by employing a clipping mechanism, leading to faster convergence in fields like robotics and continuous control.
- GRPO replaces value-based advantage estimation with group-normalized statistics, enabling scalable training for large language models and structured control tasks.
Proximal and Group Relative Policy Optimization (PPO and GRPO) are influential classes of policy gradient algorithms in deep reinforcement learning, developed to address stability, efficiency, and scalability in policy optimization. PPO, introduced as a simplified yet robust alternative to Trust Region Policy Optimization (TRPO), has become a default method in numerous single-agent and multi-agent RL applications. Group Relative Policy Optimization extends the PPO paradigm by replacing value-based advantage estimation with group-normalized empirical statistics, enabling effective large-scale optimization without a critic function, most notably for post-training LLMs and for structured control problems whose reward feedback varies widely across samples.
1. Proximal Policy Optimization: Methodology and Principles
Proximal Policy Optimization (PPO) (Schulman et al., 2017) is built around the principle of constraining the magnitude of policy updates to avoid destructive large steps and oscillations in policy-gradient RL. The key innovation is the introduction of a clipped surrogate objective, which modifies the standard policy gradient update:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the importance sampling ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is a hyperparameter controlling the size of the trust region.
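The clipping logic is compact in practice. Below is a minimal sketch of the clipped surrogate loss in PyTorch, assuming per-sample log-probabilities under the current and data-collecting policies and precomputed advantage estimates are available; the function and argument names are illustrative rather than taken from any particular library.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = torch.exp(logp_new - logp_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes min(unclipped, clipped); return the negation for SGD minimizers.
    return -torch.mean(torch.min(unclipped, clipped))
```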
PPO alternates between data collection (running the current policy to obtain trajectories) and policy optimization (performing multiple epochs of mini-batch SGD on the fixed batch, with the above objective). This decouples data sampling and optimization, whereas classic policy-gradient approaches perform only one gradient step per sample.
Relative to TRPO, PPO avoids complex Hessian-vector product computations and second-order constraints by using this first-order, easily-implemented clipping mechanism. Empirically, PPO shows faster convergence and higher average returns across a broad set of benchmarks such as continuous control (MuJoCo), Atari, and robotics tasks. The method is especially advantageous where sample complexity and wall-clock efficiency are at a premium (Schulman et al., 2017).
2. Group Relative Policy Optimization: Key Mechanisms and Motivation
Group Relative Policy Optimization (GRPO) emerged in the context of critic-free RL and preference-based LLM alignment (Pang et al., 4 Aug 2025, Mroueh et al., 28 May 2025, Wu et al., 1 Oct 2025). GRPO replaces the value-based advantage function typically used in PPO with a group-normalized reward structure. For a given input $q$ (e.g., a prompt for an LLM), a group of $G$ trajectories (rollouts) $\{o_1, \dots, o_G\}$ is generated under the sampling ("old") policy $\pi_{\theta_{\mathrm{old}}}$; the advantage for trajectory $i$ is computed as

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\}) + \epsilon},$$

with $r_i$ the scalar reward, $\mathrm{mean}(\cdot)$ the group mean, $\mathrm{std}(\cdot)$ the group standard deviation, and $\epsilon$ a small constant for numerical stability.
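As a concrete illustration, the group normalization above takes only a few lines; this hedged sketch assumes the $G$ scalar rewards for one prompt are collected in a single tensor, with names chosen for illustration.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """A_i = (r_i - mean) / (std + eps), computed within one group of rollouts."""
    mean = rewards.mean()
    std = rewards.std(unbiased=False)   # population std over the group
    return (rewards - mean) / (std + eps)
```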
The PPO-style update rule is then repurposed:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\left(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\left[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right]\right)\right].$$

Here, $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance ratio and $\pi_{\mathrm{ref}}$ optionally acts as a reference policy for the KL regularization term weighted by $\beta$.
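Putting the pieces together, a minimal sketch of a GRPO-style surrogate with token-level ratios and an optional KL penalty toward a frozen reference policy might look as follows. The tensor shapes, the per-token KL estimator, and the `kl_coef` knob are assumptions made for illustration, not a fixed implementation from the cited works.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,    # [G, T] token log-probs, current policy
              logp_old: torch.Tensor,    # [G, T] token log-probs, sampling policy
              logp_ref: torch.Tensor,    # [G, T] token log-probs, reference policy
              advantages: torch.Tensor,  # [G] group-normalized advantages
              mask: torch.Tensor,        # [G, T] 1 for response tokens, 0 for padding
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                       # token-level r_{i,t}
    adv = advantages.unsqueeze(-1)                               # broadcast over tokens
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Nonnegative per-token estimate of KL(pi_theta || pi_ref): r - log r - 1.
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    per_token = surr - kl_coef * kl
    return -(per_token * mask).sum() / mask.sum()
```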
GRPO does not require estimation or backpropagation of a value function (critic network), instead relying on intra-group statistics of sampled rewards. This reduces memory footprint and computational resources and simplifies implementation, particularly in LLM-scale models (Pang et al., 4 Aug 2025, Mroueh et al., 28 May 2025).
3. Trajectory-Based and Contrastive Variants of GRPO
A significant theoretical development is the reinterpretation of GRPO as a form of contrastive learning, closely related to Direct Preference Optimization (DPO) (Wu et al., 1 Oct 2025). The contrastive perspective frames learning as maximizing the difference between positive and negative samples. With a minimal group size of two (2-GRPO)—i.e., one positive and one negative rollout—the unbiasedness and efficacy of the gradient estimation are preserved, despite the dramatic reduction in computational overhead. Specifically, the paper demonstrates that, for binary rewards, the 2-GRPO update suffices to provide normalized contrastive feedback, allowing sublinear scaling of complexity with respect to group size.
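As a quick numerical illustration of this point (not a reproduction of the paper's analysis): with binary rewards and a group of two containing one success and one failure, the group-normalized advantages collapse to $+1$ and $-1$, i.e., a pure positive/negative contrast.

```python
import torch

# One positive and one negative rollout; normalization yields a +/-1 contrast.
rewards = torch.tensor([1.0, 0.0])
mean, std = rewards.mean(), rewards.std(unbiased=False)
advantages = (rewards - mean) / (std + 1e-8)
print(advantages)   # tensor([ 1.0000, -1.0000])
```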
Furthermore, TIC-GRPO (Trajectory-level Importance Corrected GRPO) was proposed to address bias inherent in token-level importance sampling by replacing it with trajectory-level ratios (Pang et al., 4 Aug 2025). The update keeps the group-normalized advantages $\hat{A}_i$ but substitutes a single per-trajectory ratio for the token-level one,

$$\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} = \prod_{t=1}^{|o_i|} \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},$$

where $\rho_i(\theta)$ is the full-trajectory importance ratio. This update yields an unbiased estimator of the policy gradient at the current iterate, as opposed to standard GRPO which (due to its staleness in sampling) technically tracks the gradient at the fixed "old" policy (Pang et al., 4 Aug 2025).
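A hedged sketch of how the trajectory-level ratio can be formed from token log-probabilities is shown below: summing masked log-ratios over the response and exponentiating gives one ratio per rollout. The clipping applied at the end is an implementation choice assumed here for stability, not necessarily part of the cited formulation.

```python
import torch

def trajectory_ratio(logp_new: torch.Tensor,   # [G, T] token log-probs, current policy
                     logp_old: torch.Tensor,   # [G, T] token log-probs, old policy
                     mask: torch.Tensor,       # [G, T] response-token mask
                     clip_eps: float = 0.2) -> torch.Tensor:
    """One importance ratio per trajectory, optionally clipped."""
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1)   # scalar per rollout
    ratio = torch.exp(log_ratio)
    return torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
```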
4. Theoretical Guarantees and Empirical Performance
Convergence results for GRPO-style algorithms indicate that, under assumptions of Lipschitz continuity, smooth KL regularization, and bounded rewards, the averaged squared norm of the policy gradient decays at a rate governed by the learning rate, the number of inner update steps, and the group size (Pang et al., 4 Aug 2025). Performance degradation due to smaller group sizes is mild; empirical findings for both LLM post-training and decision-making tasks show that 2-GRPO achieves accuracy and "pass" rates nearly identical to standard large-group GRPO, with a substantially reduced training time and $1/8$ the number of rollouts (Wu et al., 1 Oct 2025).
Empirically, GRPO-based methods rival or outperform PPO in RL domains such as fluid antenna system optimization (Zhang et al., 18 Sep 2025), where GRPO achieved higher sum-rate performance than PPO while requiring only a fraction of the computational resources. In structured control domains, increasing the group size or trajectory length did not lead to significant further improvements, suggesting that conservative parameter choices suffice.
5. Extensions and Practical Implementations
Hybrid approaches that integrate empirical multi-sample evaluation with value-based stability have been proposed (Sane, 30 Jan 2025), blending PPO's bootstrapped value function with group-based empirical statistics (see the sketch after the list below). These "Hybrid GRPO" variants deliver faster convergence, improved stability, and enhanced sample efficiency in simulation. Further extensions include:
- Entropy regularization for robust exploration.
- Hierarchical or multi-step trajectory subsampling to better capture long-horizon dependencies.
- Adaptive normalization and learned value models to guide action sampling and mitigate variance.
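The sketch below illustrates one way such a blend could look, assuming a learned scalar value baseline is available alongside the sampled group of rewards; the mixing weight `alpha` and the exact combination rule are assumptions introduced for illustration, not the formulation of the cited work.

```python
import torch

def hybrid_advantages(rewards: torch.Tensor,         # [G] rewards of the sampled group
                      value_baseline: torch.Tensor,  # scalar V(s) from a critic
                      alpha: float = 0.5,
                      eps: float = 1e-8) -> torch.Tensor:
    """Blend group-normalized and value-baseline advantages (illustrative only)."""
    group_adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)
    value_adv = rewards - value_baseline              # critic-baseline advantage
    return alpha * group_adv + (1.0 - alpha) * value_adv
```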
VoiceGRPO incorporates Mixture-of-Experts Transformers and GRPO to achieve near-perfect classification on synthetic voice pathology datasets, demonstrating F1 and ROC-AUC metrics surpassing those of PPO-based baselines. Group-wise normalization and trust region updates again play a central role in stabilizing learning (Togootogtokh et al., 5 Mar 2025).
Off-policy GRPO generalizes by employing importance-sampled ratios with respect to a (potentially stale) sampling policy, further improving training stability, communication efficiency, and sample reuse—critical in large-scale LLM server environments (Mroueh et al., 28 May 2025). With proper masking of zero-variance group samples, off-policy GRPO stably matches or exceeds on-policy performance, reducing sample and communication costs.
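The zero-variance masking mentioned above is straightforward to express; the sketch below assumes rewards are batched as [batch, group] and simply flags groups whose rewards have no spread, since group normalization yields no learning signal for such groups.

```python
import torch

def zero_variance_mask(rewards: torch.Tensor, tol: float = 1e-8) -> torch.Tensor:
    """Boolean mask over groups; True where the group's rewards have nonzero spread."""
    return rewards.std(dim=-1, unbiased=False) > tol
```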
6. Relationship to PPO, Trust Region Methods, and Further Adaptations
PPO and its trust region relatives (such as TRPO, adaptive and bidirectional trust-region updates, and policy optimization with dynamic clipping (Zhang et al., 2023, Rahman, 23 May 2025)) remain foundational. Bidirectional trust-region variants fuse entropy-driven exploration and reward-guided convergence into a unified adaptive mechanism, outperforming PPO and GRPO in sample efficiency and stability (e.g., PPO-BR achieves faster convergence and lower reward variance than PPO (Rahman, 23 May 2025)).
GRPO and its trajectory-corrected variants stand out for cases where critic parameterization is costly or infeasible and group-level or relative rewards are fundamental, such as preference-based fine-tuning of LLMs, communication systems, and gray-box control. The theoretical equivalence between GRPO and DPO in the minimal two-rollout case clarifies the underlying contrastive information structure present in both algorithms (Wu et al., 1 Oct 2025).
7. Significance, Limitations, and Prospects
PPO's core generality and robust engineering properties make it the preferred baseline for deep RL, while GRPO and related group-normalized methods enable scalable, critic-free optimization in high-variance, large-scale, and preference-based environments. These developments have immediate applications in reinforcement learning for LLMs, adaptive wireless systems, multi-agent robotics, and domain-specific control. Key trade-offs remain in the choice of group size (balancing computation and gradient variance), off-policy versus on-policy sampling, and the balance between empirical rewards and value-based bootstrapping.
Continued research on theoretical guarantees, variance reduction, and extensions to hierarchical and multi-agent domains will further define the landscape of proximal and group relative policy optimization methods.