ExGRPO: Efficient RL Experience Optimization
- ExGRPO is a reinforcement learning framework that reuses high-value trajectories to improve sample efficiency and training stability.
- It prioritizes medium-difficulty experiences using correctness scoring and entropy-based selection to optimize policy updates.
- The framework integrates on-policy exploration with off-policy replay through importance sampling and policy shaping for robust learning.
ExGRPO (Experiential Group Relative Policy Optimization) designates a family of reinforcement learning algorithms—centered on group-normalized policy optimization with experience management and replay—aimed at improving the reasoning capabilities of LLMs and agentic systems. It builds on Group Relative Policy Optimization (GRPO), addressing sample inefficiency and instability in reinforcement learning with verifiable rewards (RLVR) by prioritizing, reusing, and shaping past valuable experiences. ExGRPO integrates on-policy and off-policy optimization, importance sampling corrections, and policy shaping, yielding state-of-the-art sample efficiency and stability in mathematical reasoning and sequence generation tasks (Zhan et al., 2 Oct 2025).
1. Core Principles and Motivation
ExGRPO extends the foundational GRPO approach by explicitly exploiting the value of prior rollout trajectories. In classical RLVR (reinforcement learning with verifiable rewards), rollouts from the current policy are used for a single update and then discarded, neglecting the considerable information conveyed by successful or partially correct prior reasoning paths. ExGRPO’s principal innovation is to systematically collect, score, organize, and selectively replay these trajectories in a manner that enhances both sample efficiency and learning stability (Zhan et al., 2 Oct 2025).
Key design motivations:
- Principled experience reuse: Exploiting the learning signal in prior (often rare, high-quality) trajectories, in contrast to the discard-everything approach of on-policy RL.
- Difficulty-based bucketing: Empirical findings indicate that replaying experiences from medium-difficulty (i.e., ‘informative’) prompts yields the strongest policy improvements.
- Low entropy preference: Reasoning trajectories with lower average action entropy are more likely to reflect confident, high-quality outputs and should be prioritized in replay selection.
- Mixed on-policy/off-policy training: Integrates exploration (fresh rollouts) with exploitation (archived high-value experience), employing corrections for distribution shift and policy drift.
2. Experience Management and Selection
The heart of ExGRPO is its experience management module, which operationalizes the notion of “valuable” learning experiences.
Experience scoring and bucketing:
- Each query is assigned an online correctness value, i.e., the rolling fraction of correct completions observed for that query over training.
- Prompts are bucketed into three broad categories: easy, medium, and hard, based on their rolling correctness statistics.
- Sampling for replay is governed by a Gaussian weighting curve favoring medium-difficulty groups (accuracy ≈ 0.5), as these have empirically been shown to maximize learning signal.
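As a concrete illustration, the following is a minimal Python sketch of difficulty bucketing and Gaussian-weighted replay sampling. The bucket cutoffs, the Gaussian mean of 0.5, and the standard deviation are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def bucket(correctness: float) -> str:
    """Map a prompt's rolling correctness rate to a difficulty bucket (illustrative cutoffs)."""
    if correctness >= 0.75:
        return "easy"
    if correctness >= 0.25:
        return "medium"
    return "hard"

def gaussian_weight(correctness: float, mu: float = 0.5, sigma: float = 0.2) -> float:
    """Weight a prompt by how close its rolling accuracy is to ~0.5 (medium difficulty)."""
    return math.exp(-((correctness - mu) ** 2) / (2 * sigma ** 2))

def sample_replay_prompts(correctness_by_prompt: dict, k: int) -> list:
    """Sample k prompt ids for replay, favoring medium-difficulty prompts."""
    prompts = list(correctness_by_prompt)
    weights = [gaussian_weight(correctness_by_prompt[p]) for p in prompts]
    return random.choices(prompts, weights=weights, k=k)

# Toy usage: prompts with accuracy near 0.5 dominate the replay sample.
stats = {"q1": 0.9, "q2": 0.5, "q3": 0.1, "q4": 0.45}
print({p: bucket(c) for p, c in stats.items()})
print(sample_replay_prompts(stats, k=2))
```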
Entropy-based selection:
- For each sampled prompt, the candidate trajectory with the lowest mean token-level entropy is chosen.
- This operationalizes the empirical finding that lower-entropy (i.e., more confident) rollouts are more reliable for policy improvement.
This composite selection mechanism avoids the “snowball” problem of accumulating low-quality or high-entropy rollouts in the buffer and keeps replayed experiences highly informative (Zhan et al., 2 Oct 2025).
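A minimal sketch of the entropy-based pick follows, assuming each stored trajectory retains its per-token entropies and verifiable reward from generation time; the data layout and field names are hypothetical, for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StoredTrajectory:
    prompt_id: str
    tokens: List[int]
    token_entropies: List[float]  # per-token entropies recorded at generation time
    reward: float                 # verifiable reward, e.g. 1.0 if the final answer checked out

    @property
    def mean_entropy(self) -> float:
        return sum(self.token_entropies) / max(len(self.token_entropies), 1)

def select_for_replay(candidates: List[StoredTrajectory]) -> StoredTrajectory:
    """Among a prompt's stored rollouts, prefer correct ones and replay the most
    'confident' trajectory, i.e. the one with the lowest mean token-level entropy."""
    correct = [t for t in candidates if t.reward > 0]
    pool = correct or candidates
    return min(pool, key=lambda t: t.mean_entropy)
```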
3. Mixed-Policy Objective and Off-Policy Correction
ExGRPO’s learning signal combines on-policy exploration with exploitation of prioritized, off-policy stored rollouts. This is formalized in its mixed-policy optimization objective:
- Batch formation: Each update operates over a mini-batch containing both fresh on-policy rollouts and prioritized experiential data, with the balance controlled by a replay ratio $\rho$.
- Advantage estimation: Within each prompt-group of $G$ rollouts, the GRPO “whitened” (normalized) advantage is
  $$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},$$
  where $r_i$ is the verifiable reward of rollout $i$, and the mean and standard deviation are computed within the group, as in GRPO.
- Off-policy correction: For rollouts generated under a historical policy $\pi_{\mathrm{old}}$, the policy shift is corrected via per-token importance weighting,
  $$w_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})}.$$
  To avoid the instability associated with large importance ratios, these weights are transformed by a non-linear “policy shaping” function, typically $f(w) = \frac{w}{w + \gamma}$ for a small constant $\gamma > 0$.
The loss blends on-policy advantage-weighted likelihood ratios with off-policy (replayed) corrections, all within the familiar GRPO clipped objective structure, ensuring both theoretical soundness and empirical robustness.
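The sketch below illustrates how the per-token surrogate terms could be assembled under these definitions: group-whitened advantages, a clipped ratio for fresh on-policy tokens, and a shaped importance weight for replayed tokens. The clip range, shaping constant, and function names are assumptions for illustration, not the reference implementation.

```python
import math
from typing import List

EPS_CLIP = 0.2  # GRPO/PPO-style clip range (illustrative value)
GAMMA = 0.1     # policy-shaping constant (illustrative value)

def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO 'whitened' advantages: normalize verifiable rewards within one prompt group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def on_policy_term(logp_new: float, logp_old: float, adv: float) -> float:
    """Clipped surrogate term (to be maximized) for a token from a fresh on-policy rollout."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + EPS_CLIP), 1.0 - EPS_CLIP)
    return min(ratio * adv, clipped * adv)

def off_policy_term(logp_new: float, logp_behavior: float, adv: float) -> float:
    """Replayed token: the importance weight w is passed through the shaping
    f(w) = w / (w + gamma) instead of being clipped, taming large ratios from policy drift."""
    w = math.exp(logp_new - logp_behavior)
    return (w / (w + GAMMA)) * adv
```

In a full update, these per-token terms would be averaged within each group, mixed according to the replay ratio $\rho$, and maximized by gradient ascent (equivalently, their negation used as the loss).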
4. Empirical Performance and Benchmarking
Empirical studies across five base architectures (1.5B–8B parameters, covering Qwen and Llama variants) and a spectrum of mathematical/logic reasoning benchmarks demonstrate the practical impact of ExGRPO (Zhan et al., 2 Oct 2025):
- Sample and data efficiency: Averaged gains of +3.5 (in-distribution) and +7.6 (out-of-distribution) test points over on-policy RLVR baselines are reported for tasks such as AIME, AMC, OlympiadBench, MATH-500, and Minerva.
- Training stability: Weak base models (e.g., Llama-3.1 8B Base) that were prone to collapse under on-policy GRPO exhibited robust, monotonic improvement with ExGRPO.
- Continual learning: In ongoing adaptation scenarios (e.g., LUFFY model), ExGRPO-trained models outperformed both simple on-policy runs and those relying solely on external data.
- Ablation studies: Analysis confirms that correctness-based bucketing and low-entropy selection are critical for the observed performance improvements.
A representative outcome is that ExGRPO achieves better reasoning generalization on both in-distribution and held-out challenge sets while exhibiting reduced sample complexity.
5. Mathematical Formulation
The formal training objective in ExGRPO, blending on-policy and off-policy components, is given by
$$\mathcal{J}_{\mathrm{ExGRPO}}(\theta) = (1-\rho)\,\mathcal{J}_{\mathrm{on}}(\theta) + \rho\,\mathcal{J}_{\mathrm{exp}}(\theta),$$
where $\rho$ is the replay ratio and $\mathcal{J}_{\mathrm{on}}$ is the standard GRPO clipped surrogate over fresh rollouts,
$$\mathcal{J}_{\mathrm{on}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\Big(w_{i,t}\hat{A}_i,\ \operatorname{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big)\right], \qquad w_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}.$$
For replayed (“off-policy”) trajectories, $\mathcal{J}_{\mathrm{exp}}$ takes the same group-normalized, advantage-weighted form, but the non-linear policy shaping $f(w_{i,t})$ replaces direct clipping of the importance ratio.
Key features:
- Combined on-policy (fresh group) and off-policy (replay) samples.
- Per-token importance corrections for policy drift.
- Advantage normalization and group structuring per classical GRPO.
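As a toy worked example of the group-normalized advantage (illustrative numbers, not drawn from the paper): for a group of $G = 4$ rollouts with verifiable rewards $(1, 1, 0, 0)$, the group mean is $0.5$ and the standard deviation is $0.5$, so the whitened advantages are $(+1, +1, -1, -1)$; correct rollouts are reinforced and incorrect ones penalized by symmetric amounts before any importance correction is applied.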
6. Extensions, Generalizations, and Future Perspective
The ExGRPO framework is architected for extensibility across domains:
- Multimodal and agentic RL: The emphasis on experience selection and policy mixing is naturally applicable to multimodal tasks or decision-making agents requiring both exploration and exploitation of previously observed rare successes.
- Generalization beyond reasoning LLMs: While the empirical focus is on mathematical reasoning, the experience-aware structure is general, with anticipated utility for multimodal challenges, robotics, and interactive control, especially where verifiable rewards are sparse.
- Potential integration: Further combining ExGRPO’s experience mechanisms with auxiliary credit assignment frameworks (e.g., process reward models or continuous control variants) is an open area, with promising results in promoting sample efficiency and generalization across distributions.
- Theoretical analysis: The effect of experience buffer size, update schedule, and policy shaping on convergence and stability presents a set of open theoretical challenges with potential for optimization.
7. Significance in the Context of RLVR and Policy Optimization
ExGRPO exemplifies a paradigm shift in RLVR—leveraging experience characteristics (correctness, entropy) for sample prioritization and reuse, and harmonizing on-policy and off-policy learning. It builds directly on the statistical structure of GRPO, generalizing to a highly modular, data-efficient, and stable policy optimization toolkit (Zhan et al., 2 Oct 2025). By addressing mode collapse, reward sparsity, and learning instability, ExGRPO establishes a new standard for experiential reinforcement learning frameworks in reasoning and beyond.
This approach is positioned as a cornerstone for scalable and robust RL-based training in LLMs, with potential for broad application in any setting requiring integration of exploration and principled, value-driven exploitation of historical experience.