RePO: Replay-Enhanced Policy Optimization

Published 11 Jun 2025 in cs.CL, cs.AI, and cs.LG | (2506.09340v1)

Abstract: Reinforcement learning (RL) is vital for optimizing LLMs. Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces RePO, which integrates off-policy replay data to enhance policy optimization in reinforcement learning for large language models.
Experiments on benchmarks like GSM8K and AMC23 show that RePO consistently outperforms GRPO, delivering notable accuracy gains.
RePO boosts data efficiency by increasing effective optimization steps by 48% with only a 15% rise in computational overhead.

Replay-Enhanced Policy Optimization: An Essay

Introduction

The paper "RePO: Replay-Enhanced Policy Optimization" introduces a novel approach in reinforcement learning (RL) to enhance the performance of LLMs on complex reasoning tasks. The Replay-Enhanced Policy Optimization (RePO) methodology aims to overcome the limitations of Group Relative Policy Optimization (GRPO) by incorporating off-policy data to improve data efficiency and diversity in training samples. This essay will explore the theoretical underpinnings, implementation details, and empirical evaluations presented in the paper.

Methods

The core idea of RePO is to extend GRPO by leveraging a replay buffer to integrate off-policy data into the RL framework. Unlike GRPO, which solely relies on on-policy samples for advantage estimation, RePO utilizes previously sampled outputs to enrich the data diversity. The replay buffer plays a critical role by storing outputs from past iterations, allowing the model to optimize its policy using a combination of on-policy and off-policy samples.

Figure 1: Demonstration of RePO. The policy is updated using both on-policy samples and off-policy samples retrieved from a replay buffer. Advantage estimation is performed separately for on-policy and off-policy updates.

The RePO objective function is formulated by combining both on-policy and off-policy terms, which allows for a more extensive set of outputs per prompt and better data efficiency. By integrating different replay strategies, such as recency-based or reward-oriented strategies, RePO can selectively retrieve samples that maximize the policy's learning potential.

Experiments and Results

RePO was evaluated against GRPO across multiple open-source LLMs, specifically targeting mathematical reasoning benchmarks such as GSM8K, Minerva, and AMC23. The experimental results demonstrate that RePO consistently outperforms GRPO, achieving significant absolute average performance gains across different model scales and tasks (Figure 2).

Figure 2: Comparison of average accuracy between GRPO and RePO under varying numbers of off-policy (Replay) samples across seven math reasoning benchmarks: GSM8K, MATH-500, Olympiad, Minerva, AIME24, AIME25, and AMC23, using Qwen2.5-Math-1.5B (left) and Qwen3-1.7B (right) with 8 on-policy samples.

In addition to performance improvement, RePO offers a computationally efficient solution. For instance, when applied to Qwen3-1.7B, RePO increased computational overhead modestly by 15% but significantly raised the number of effective optimization steps by 48%. This demonstrates not only its effectiveness but also its efficiency relative to the computational resources required.

Analysis and Discussion

Several factors contribute to the success of RePO. The diversity in replay strategies ensures that the most informative samples, particularly those aligning closely with the current policy or yielding high rewards, are prioritized. This strategic sampling mitigates common RL issues such as vanishing gradients, which are prevalent in models solely relying on on-policy data.

Furthermore, RePO's robust advantage estimation approach, which splits the calculation for on-policy and off-policy experiences, minimizes potential interference.

The paper also identifies limitations related to RePO's implementation on larger models and the exploration of hyperparameters. These limitations point to opportunities for further research, potentially broadening RePO's applicability and effectiveness.

Conclusion

The proposed Replay-Enhanced Policy Optimization significantly advances the RL optimization framework by addressing GRPO's limitations through the innovative use of off-policy samples. RePO effectively enhances the reasoning capabilities of LLMs, validating its potential for improving data efficiency and reducing overfitting. As LLMs continue to evolve, methods like RePO will be instrumental in driving future innovations in RL optimization.

Markdown Report Issue