Replay-Enhanced PPO for LLM Fine-Tuning

Updated 22 June 2026

The paper introduces RePO, integrating off-policy replay with on-policy PPO to boost learning efficiency.
It employs replay buffer strategies with robust advantage estimation and importance weighting for stable LLM fine-tuning.
Empirical results highlight enhanced sample efficiency and performance gains, especially on challenging long-horizon reasoning tasks.

Replay-Enhanced Proximal Policy Optimization (RePO) is a reinforcement learning (RL) framework for fine-tuning LLMs that integrates off-policy experience replay into the standard on-policy PPO/GRPO paradigm. RePO was introduced to address the sample inefficiency and high computational cost that arise from the demand for multiple on-policy rollouts per prompt, enabling stable and data-efficient policy optimization by combining replay buffer techniques with robust advantage estimation and importance weighting. This approach significantly improves optimization step efficiency and final LLM performance, especially on complex reasoning tasks involving long-horizon credit assignment and exploration.

1. Motivation and Background

On-policy RL methods such as Proximal Policy Optimization (PPO) and its variant Group Relative Policy Optimization (GRPO) are widely adopted for aligning LLMs with human preferences or specialized benchmarks. These approaches require fresh trajectories from the current policy for each update. That requirement is expensive because LLM decoding is autoregressive, and it is compounded by frequent reward collapse (zero-advantage minibatches) when model outputs homogenize. GRPO partially alleviates variance issues by computing group-normalized advantages over multiple rollouts per prompt, but it still incurs substantial sampling cost.

Replay-Enhanced PPO (RePO) is motivated by the observation that many valuable trajectories are collected in early training or in previous policy states, and standard PPO discards these entirely once the underlying policy drifts, resulting in wasted compute and slow progress. By reusing off-policy rollouts from a replay buffer with principled weighting and selective replay strategies, RePO seeks to break this inefficient cycle and accelerate learning while maintaining policy stability (Li et al., 11 Jun 2025).

2. Algorithmic Framework

RePO augments the standard PPO/GRPO update cycle with a replay buffer $\mathcal B$ that stores previously collected trajectories and their original behavior policy probabilities. Each training iteration recomputes the PPO surrogate objective using a mixture of freshly sampled (on-policy) and buffer-retrieved (off-policy) rollouts, both subjected to importance weighting and ratio clipping.

Given a prompt $q$, the current policy $\pi_\theta$, and $G_\mathrm{on}$ on-policy and $G_\mathrm{off}$ off-policy samples, the total RePO objective is

$\mathcal{J}_\mathrm{RePO}(\theta;S) = \mathcal{J}_\mathrm{on}(\theta) + \mathcal{J}_\mathrm{off}(\theta;S),$

where each term uses the standard GRPO-style PPO clipping:

$\mathcal{J}_\mathrm{on}(\theta) = \mathbb{E}_{q}\left[ \frac{1}{G_\mathrm{on}}\sum_{i=1}^{G_\mathrm{on}} \frac{1}{|o_i^\mathrm{on}|}\sum_{t=1}^{|o_i^\mathrm{on}|} \min\left(r_{i,t}^\mathrm{on} A_{i,t}^\mathrm{on}, \operatorname{clip}(r_{i,t}^\mathrm{on},1-\epsilon,1+\epsilon)A_{i,t}^\mathrm{on}\right) \right].$

A symmetric term is used for $\mathcal{J}_\mathrm{off}$. Advantage estimation is performed groupwise: the group-normalized advantage for each token is

$A_{i,t} = \frac{R(o_i) - \operatorname{mean}(G)}{\operatorname{std}(G)}$

where $G$ is the group of rewards over which the advantage is normalized. Off-policy samples, retrieved according to a replay strategy $S$, are corrected by the ratio of current to behavior policy probabilities.
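
As a worked illustration with made-up rewards (not taken from the paper), consider a group of four rollouts with binary rewards $R = (1, 0, 0, 1)$. The population standard deviation is used here, which is an assumption, since the text does not say which estimator is meant:

$\operatorname{mean}(G) = \frac{1+0+0+1}{4} = 0.5, \qquad \operatorname{std}(G) = \sqrt{\frac{4 \cdot 0.5^2}{4}} = 0.5,$

$A_{1,t} = A_{4,t} = \frac{1 - 0.5}{0.5} = +1, \qquad A_{2,t} = A_{3,t} = \frac{0 - 0.5}{0.5} = -1.$

Every token of a rollout shares that rollout's advantage. If all four rewards were equal, the numerator would vanish for every rollout and the group would contribute no learning signal, which is the zero-advantage case described in Section 1.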

RePO alternates between on-policy and off-policy updates, then stores new rollouts in $\mathcal B$ for future reuse (Li et al., 11 Jun 2025).
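
The objective above can be sketched for a single prompt as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default `eps=0.2`, and the example numbers are assumptions, and the expectation over prompts is omitted.

```python
def clipped_term(ratio, advantage, eps):
    """PPO-style clipped surrogate for one token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def group_objective(ratios, advantages, eps):
    """Average over rollouts of the length-normalized sum of token terms.

    ratios[i][t] is the policy ratio of token t in rollout i;
    advantages[i] is the (token-shared) advantage of rollout i.
    """
    total = 0.0
    for token_ratios, adv in zip(ratios, advantages):
        per_token = [clipped_term(r, adv, eps) for r in token_ratios]
        total += sum(per_token) / len(per_token)
    return total / len(ratios)


def repo_objective(on_ratios, on_adv, off_ratios, off_adv, eps=0.2):
    """J_on + J_off for one prompt; eps=0.2 is an assumed default."""
    j_on = group_objective(on_ratios, on_adv, eps)
    # With no replayed rollouts the objective reduces to the on-policy term.
    j_off = group_objective(off_ratios, off_adv, eps) if off_ratios else 0.0
    return j_on + j_off


# Hypothetical numbers: two fresh rollouts (ratio 1) and one replayed rollout.
value = repo_objective(
    on_ratios=[[1.0, 1.0], [1.0]], on_adv=[1.0, -1.0],
    off_ratios=[[1.5, 0.5]], off_adv=[1.0],
)
```

In this toy case the on-policy term cancels to zero, so the whole objective value comes from the replayed rollout, whose first token is clipped at $1+\epsilon$.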

3. Replay Buffer Strategies and Advantage Estimation

Efficient and effective utilization of the replay buffer is central to RePO’s performance. Four strategies for selecting off-policy samples are implemented:

Full-scope: Use all past rollouts for each prompt.
Recency-based: Sample the $G_\mathrm{off}$ most recent rollouts.
Reward-oriented: Retrieve the top-$G_\mathrm{off}$ rollouts by reward.
Variance-driven: Choose $G_\mathrm{off}$ rollouts with the highest sample reward variance.
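
Three of the strategies listed above can be sketched as simple selection rules over a per-prompt buffer. The `Rollout` record, its field names, and `select_replay` are hypothetical names chosen for illustration. The variance-driven rule is left out because the text does not define how a per-rollout variance is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    text: str
    reward: float
    step: int  # training step at which the rollout was sampled
    # Per-token log-probs under the policy that generated the rollout.
    behavior_logprobs: list = field(default_factory=list)


def select_replay(buffer, strategy, k):
    """Pick off-policy rollouts for one prompt from its replay buffer."""
    if strategy == "full":
        return list(buffer)  # every stored rollout, k is ignored
    if strategy == "recency":
        return sorted(buffer, key=lambda r: r.step, reverse=True)[:k]
    if strategy == "reward":
        return sorted(buffer, key=lambda r: r.reward, reverse=True)[:k]
    raise ValueError(f"unknown replay strategy: {strategy!r}")


# Hypothetical buffer contents for a single prompt.
buffer = [
    Rollout("a", 0.0, step=1),
    Rollout("b", 1.0, step=2),
    Rollout("c", 1.0, step=3),
    Rollout("d", 0.0, step=4),
]
recent = [r.text for r in select_replay(buffer, "recency", k=2)]
best = [r.text for r in select_replay(buffer, "reward", k=2)]
```

Ties in reward keep their insertion order because Python's sort is stable, so the reward-oriented rule here favours older rollouts among equals. A real implementation might break ties by recency instead.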

Reward-oriented and recency-based replay yield the best empirical gains. Split-advantage estimation, which normalizes advantages within the on-policy and replayed groups separately, outperforms mixing all samples into one group, preserving the benefit of diversified learning signals.
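
The difference between split and mixed normalization can be seen on a toy example. The sketch below assumes population standard deviation and zero advantages for a constant group. Both choices, and the reward values, are illustrative assumptions.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-8):
    """Group-normalized advantages; a constant group yields all zeros."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < eps:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def split_advantages(on_rewards, off_rewards):
    """Normalize the on-policy and replayed groups separately."""
    return normalize(on_rewards), normalize(off_rewards)


def mixed_advantages(on_rewards, off_rewards):
    """Normalize all rollouts of the prompt as one group."""
    joint = normalize(list(on_rewards) + list(off_rewards))
    return joint[:len(on_rewards)], joint[len(on_rewards):]


# Hypothetical case: the fresh rollouts have collapsed to the same reward,
# while the replay buffer still holds one success and one failure.
on_rewards, off_rewards = [0.0, 0.0], [1.0, 0.0]
split_on, split_off = split_advantages(on_rewards, off_rewards)
mixed_on, mixed_off = mixed_advantages(on_rewards, off_rewards)
```

Under split normalization the collapsed on-policy group stays at zero advantage and the replayed group supplies a symmetric signal of +1 and -1. Under mixed normalization the same rewards give the on-policy rollouts a negative advantage and rescale the replayed ones, because all four rewards share one mean and standard deviation.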

Importance weighting is achieved via the standard policy ratio,

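$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_\mathrm{b}(o_{i,t} \mid q, o_{i,<t})},$

where $\pi_\mathrm{b}$ denotes the behavior policy that generated the rollout (its probabilities are stored in $\mathcal B$), and clipped within $[1-\epsilon, 1+\epsilon]$. Off-policy samples with extremely low likelihood under the current policy are thus downweighted, but not discarded, which stabilizes updates.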
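
A minimal sketch of the ratio computation from stored log-probabilities is given below. The helper names and the value `eps=0.2` are assumptions for illustration. `gradient_is_clipped` encodes the usual property of the PPO clipped surrogate: the gradient vanishes only when the ratio has already moved past the clip bound in the direction the advantage favours.

```python
import math


def token_ratios(current_logprobs, behavior_logprobs):
    """Per-token ratio pi_theta / pi_behavior, computed in log space."""
    return [math.exp(c - b) for c, b in zip(current_logprobs, behavior_logprobs)]


def clip_ratio(ratio, eps):
    """Clamp a ratio to the interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))


def gradient_is_clipped(ratio, advantage, eps):
    """True when the clipped surrogate is constant in the ratio."""
    return (advantage > 0 and ratio > 1.0 + eps) or (
        advantage < 0 and ratio < 1.0 - eps
    )


# Hypothetical replayed rollout with two tokens.
current = [math.log(0.5), math.log(0.1)]    # log-probs under pi_theta
behavior = [math.log(0.25), math.log(0.4)]  # log-probs stored in the buffer
ratios = token_ratios(current, behavior)    # approximately [2.0, 0.25]
clipped = [clip_ratio(r, 0.2) for r in ratios]
```

With a positive advantage, the second token (ratio about 0.25) still contributes `ratio * advantage` to the objective, so it is downweighted rather than dropped. Only the first token, whose ratio already exceeds $1+\epsilon$, has its gradient cut off.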

4. Empirical Results and Computational Cost

RePO delivers substantial improvements in both sample efficiency and benchmark performance across mathematical reasoning tasks and LLM architectures (Li et al., 11 Jun 2025). Experimental highlights include:

Model	GRPO Acc.	RePO Acc.	$\Delta$ (pts.)
Qwen2.5-Math-1.5B	17.4	35.8	+18.4
Qwen2.5-Math-7B	47.0	49.0	+2.0
Qwen3-1.7B	39.5	43.6	+4.1

In terms of training dynamics, RePO raises the share of effective optimization steps (steps with nonzero advantage) from 31.2% under GRPO to 46.1%, a gain of 14.9 percentage points (roughly 48% in relative terms). This reflects the broader and more informative gradient updates achieved via replay, particularly when rewards collapse in the current minibatch.
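
One way to measure the share of effective steps is sketched below. It assumes that a step counts as effective when at least one reward group in it has non-identical rewards. That reading of "nonzero advantage" is an assumption, as are the function names and the toy data.

```python
def group_is_informative(group_rewards, tol=1e-12):
    """A group gives nonzero advantages only if its rewards differ."""
    return max(group_rewards) - min(group_rewards) > tol


def effective_step_fraction(steps):
    """steps[s] is a list of reward groups used in optimization step s."""
    effective = sum(
        1 for groups in steps if any(group_is_informative(g) for g in groups)
    )
    return effective / len(steps)


# Hypothetical log of four optimization steps.
steps = [
    [[1.0, 1.0], [0.0, 0.0]],  # all groups collapsed
    [[1.0, 0.0], [0.0, 0.0]],  # one informative group
    [[0.0, 0.0]],              # collapsed
    [[1.0, 0.0]],              # informative
]
fraction = effective_step_fraction(steps)
```

Adding replayed groups to a step can only turn a collapsed step into an effective one under this definition, never the reverse, which is consistent with the direction of the reported change.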

The computational overhead of including off-policy gradient computation is modest relative to a GRPO baseline with equal batch sizes, as measured by normalized training time on Qwen3-1.7B (see Li et al., 11 Jun 2025, for the figures). The off-policy batch size should be kept moderate, because excess off-policy samples increase noise.

5. Relationship to Other Methods

RePO occupies an intermediate point between purely on-policy PPO/GRPO and more advanced variance-regularized methods such as $R^2$VPO. Unlike hard policy ratio clipping, which discards all high-divergence samples, RePO retains off-policy data when available and downweights it according to the clipped ratio.

$R^2$VPO, or Ratio-Variance Regularized Policy Optimization (Luo et al., 6 Jan 2026), generalizes and smooths the trust-region constraint by penalizing the policy ratio variance with a quadratic Lagrangian penalty. This can be viewed as a principled extension of the replay mechanism in RePO, allowing unlimited off-policy reuse modulated by a dynamic quadratic penalty. Empirically, $R^2$VPO matches or surpasses RePO in sample efficiency and asymptotic accuracy, consistently achieving 15–20% relative gains over clipping-based PPO and converging in roughly half as many rollouts.

Other contemporaneous replay-enhanced approaches include Retrospective Replay-based RL (RRL) (Dou et al., 19 Apr 2025), which dynamically caches high-value intermediate states, and EFRame (Wang et al., 27 Jun 2025), which integrates exploration, filtering, and replay buffers with prioritized sampling and importance weighting. These explorations highlight the broader trend toward integrating experience replay and robust off-policy correction in LLM RL.

6. Limitations and Future Directions

Key limitations of RePO include its static replay hyperparameters (with no adaptive or learning-based buffer management) and an experimental focus on models up to 7B parameters. The use of hard ratio clipping for off-policy correction, while stabilizing, can create “dead zones” for informative but rare trajectory gradients, especially in long-horizon reasoning. A plausible implication is that future variants should integrate variance-based soft regularization, prioritized sampling, and adaptive replay coefficients.

Proposed future directions involve extending RePO to larger LLMs, optimizing replay-weight coefficients, and incorporating techniques from variance-regularized optimization for more seamless blending of trust region enforcement and off-policy benefit.

7. Practical Implementation and Hyperparameters

Key implementation details for RePO (Li et al., 11 Jun 2025):

On-policy/off-policy samples per prompt: $G_\mathrm{on}$ and $G_\mathrm{off}$ (values as reported by Li et al.).
Buffer: all on-policy rollouts per prompt, unbounded size.
Optimization: learning rate with cosine annealing, batch size 32 prompts, maximum 1024 completion tokens.
Clipping threshold: $\epsilon$, the clip range of the surrogate objective.
Replay start: after an initial number of on-policy epochs.
Hardware: 8 A100 GPUs (split for sampling and optimization).
Effective sampling: Split-advantage estimation, reward- or recency-based replay sampling.

Practical guidelines stress the need to balance on- and off-policy samples, maintain split-normalization of advantages, and avoid excessive replay-induced noise.

Replay-Enhanced PPO (RePO) advances LLM fine-tuning by combining efficient off-policy data reuse with robust advantage estimation, overcoming key limitations of purely on-policy PPO/GRPO. It lays a foundation for policy optimization methods that blend the statistical efficiency of replay with theoretically grounded trust region constraints and dynamic weighting (Li et al., 11 Jun 2025, Luo et al., 6 Jan 2026).