RLoop Iteration: Enhancing RL Cycles
- RLoop iteration is a cyclical reinforcement learning framework that alternates between trajectory exploration and expert-driven exploitation.
- It employs rejection-sampling fine-tuning to update policies, achieving measurable accuracy gains and improved pass@N metrics per cycle.
- The approach mitigates overfitting and catastrophic forgetting by focusing on hard, diverse trajectories to ensure robust policy generalization.
RLoop iteration refers to a self-improving, cyclical framework for reinforcement learning (RL) that alternates between trajectory exploration and expert-driven exploitation, designed to mitigate RL overfitting and catastrophic forgetting in large reasoning models. Each iteration initializes RL from a refined policy, collects diverse solution trajectories, filters successful outcomes, and executes Rejection-sampling Fine-Tuning (RFT) to update the policy, with empirical evidence showing robust generalization and accumulated performance gains relative to conventional RL (Zhiyuan et al., 6 Nov 2025).
1. Iterative Structure and Motivation
The RLoop framework is built upon iterative policy initialization, systematically addressing policy over-specialization and solution diversity loss associated with standard RL fine-tuning. A single RLoop iteration consists of two linked phases:
- Exploration (RL phase): A policy is fine-tuned via on-policy RL methods (e.g., PPO, REINFORCE), generating a pool of diverse trajectories $D_{RL}$ over $N_{RL}$ update steps.
- Exploitation (RFT phase): Successful trajectories on "hard" problems are filtered into an expert set $D_{\text{expert}}$, then used for supervised fine-tuning of a fresh policy copy via RFT, optionally regularized by a KL-divergence penalty.
This cycle leverages transient policy diversity from exploration and converts it into durable performance improvements, contrasting standard RL which typically discards such intermediary variations.
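At the top level the cycle is a short outer loop. The Python sketch below shows the control flow only; the three phase functions are passed in as callables whose names are illustrative, not the paper's, and Section 2 expands each phase into detailed steps and pseudocode.

```python
def rloop(initial_policy, prompts, num_iterations,
          run_rl_phase, build_expert_set, rejection_sampling_finetune):
    """Outer RLoop control flow (sketch).

    The three phase functions are supplied as callables; their names and
    signatures are illustrative assumptions, not the paper's implementation.
    """
    policy = initial_policy
    for _ in range(num_iterations):
        # Exploration: on-policy RL from the current initialization yields a
        # diverse trajectory pool; the RL-tuned weights themselves are discarded.
        trajectory_pool = run_rl_phase(policy, prompts)
        # Filtering: successful rollouts from hard prompts form the expert set.
        expert_set = build_expert_set(trajectory_pool)
        # Exploitation: RFT from a fresh copy of `policy` yields the next initialization.
        policy = rejection_sampling_finetune(policy, expert_set)
    return policy
```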
2. Iteration Workflow and Pseudocode
A typical RLoop iteration is formally decomposed as follows:
| Phase | Step | Key Output |
|---|---|---|
| RL (Exploration) | Sample & RL update | Diversity-rich pool $D_{RL}$ |
| Filtering | Select successful, hard cases | Expert set $D_{\text{expert}}$ |
| RFT (Exploitation) | Supervised update, KL reg. | New initial policy $\theta_{k+1}$ |
Detailed steps:
- From $\theta_k$, the policy that opens iteration $k$, initialize RL fine-tuning. For $t$ from $1$ to $N_{RL}$, sample batches of prompts $\{x_i\}$, roll out trajectories $\tau_i \sim \pi_{\theta_{RL}}(\cdot \mid x_i)$, compute rewards $R(\tau_i)$, and aggregate $D_{RL} \leftarrow D_{RL} \cup \{(\tau_i, R(\tau_i))\}$.
- Apply the filter $\phi(\tau) = 1$ if $R(\tau) > 0$, $0$ otherwise. Retain only trajectories from prompts whose empirical success rate falls below a "hard" threshold.
- Re-initialize policy parameters from the iteration start ($\theta_{\text{new}} \leftarrow \theta_k$), then perform supervised fine-tuning over $D_{\text{expert}}$ for $E$ epochs, applying the RFT gradient
$$\nabla_\theta \mathcal{L}_{\text{RFT}}(\theta) = \mathbb{E}_{(x,\tau) \sim D_{\text{expert}}}\big[\nabla_\theta \log \pi_\theta(\tau \mid x)\big].$$
The updated parameters $\theta_{\text{new}}$ define $\theta_{k+1}$ for the subsequent iteration.
Algorithmic pseudocode:
```
θ ← θ₀
for k in 0 … I−1 do
    D_RL ← ∅
    θ_RL ← θ
    for t in 1 … N_RL do
        Sample batch {x_i}, generate {τ_i ~ π_{θ_RL}}
        Compute rewards R(τ_i)
        Store D_RL ← D_RL ∪ {(τ_i, R(τ_i))}
        Compute ∇_θ J_RL(θ_RL)
        θ_RL ← θ_RL + lr_RL · ∇_θ J_RL(θ_RL)
    end for
    D_expert ← {τ ∈ D_RL : R(τ) > 0 and prompt_success_rate(τ.prompt) < hard_threshold}
    θ_new ← θ
    for epoch in 1 … E do
        for minibatch in batches of D_expert of size B do
            Compute ∇_θ L_RFT(θ_new)
            θ_new ← θ_new + lr_RFT · ∇_θ L_RFT(θ_new)
        end for
    end for
    θ ← θ_new
end for
return π_θ
```
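The single filtering line in the pseudocode can be expanded into a concrete routine. The following is a sketch assuming trajectories are stored as dicts with "prompt" and "reward" keys; the schema is illustrative, not the paper's.

```python
from collections import defaultdict

def build_expert_set(rl_pool, hard_threshold):
    """Select successful trajectories that come from 'hard' prompts.

    rl_pool: list of dicts with keys 'prompt', 'trajectory', 'reward'
             (illustrative schema, not prescribed by the paper).
    hard_threshold: keep prompts whose empirical success rate is below this value.
    """
    # Empirical success rate per prompt over all rollouts collected in the RL phase.
    counts, successes = defaultdict(int), defaultdict(int)
    for item in rl_pool:
        counts[item["prompt"]] += 1
        successes[item["prompt"]] += int(item["reward"] > 0)
    success_rate = {p: successes[p] / counts[p] for p in counts}

    # Keep only successful rollouts (reward > 0) from prompts below the hard threshold.
    return [
        item for item in rl_pool
        if item["reward"] > 0 and success_rate[item["prompt"]] < hard_threshold
    ]
```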
3. Mathematical Objectives
RLoop iteration formalizes both RL and RFT phases:
- Trajectory Sampling (RL): prompts $x_i$ are drawn from the training distribution and trajectories are rolled out on-policy,
$$\tau_i \sim \pi_{\theta_{RL}}(\cdot \mid x_i), \qquad J_{RL}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_\theta(\cdot \mid x)}\big[R(\tau)\big].$$
Gradients are computed via
$$\nabla_\theta J_{RL}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau \mid x)\, A(\tau)\big],$$
with advantage $A(\tau)$ (e.g., $A(\tau) = R(\tau) - b(x)$ for a baseline $b$, or a PPO-style clipped variant).
- Filter Function:
$$\phi(\tau) = \mathbb{1}\big[R(\tau) > 0\big].$$
- RFT Objective:
$$\mathcal{L}_{\text{RFT}}(\theta) = \mathbb{E}_{(x,\tau) \sim D_{\text{expert}}}\big[\log \pi_\theta(\tau \mid x)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta_k}\big),$$
where the KL term with coefficient $\beta$ is the optional regularizer toward the iteration's initial policy. Gradient update:
$$\theta_{\text{new}} \leftarrow \theta_{\text{new}} + \eta_{\text{RFT}}\, \nabla_\theta \mathcal{L}_{\text{RFT}}(\theta_{\text{new}}).$$
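As a concrete illustration of the RFT update, the PyTorch-style sketch below computes the token-level log-likelihood loss over an expert minibatch, with an optional KL penalty against the iteration's initial policy. The model interface (Hugging Face-style causal LMs returning `.logits`) and tensor names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rft_loss(policy, ref_policy, input_ids, labels, kl_coef=0.0):
    """Rejection-sampling fine-tuning loss on expert trajectories (sketch).

    policy / ref_policy: causal LMs returning logits of shape (B, T, V);
    input_ids, labels: token ids, with labels = -100 on prompt/padding positions.
    The interface mirrors Hugging Face causal LMs; this is an assumption here.
    """
    logits = policy(input_ids).logits  # (B, T, V)
    # Next-token cross-entropy restricted to response tokens; minimizing this
    # implements the ascent step on the log-likelihood objective above.
    nll = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    kl = torch.tensor(0.0, device=nll.device)
    if kl_coef > 0.0:
        with torch.no_grad():
            ref_logits = ref_policy(input_ids).logits
        # KL(policy || reference), summed over tokens and averaged over the batch.
        kl = F.kl_div(
            F.log_softmax(ref_logits[:, :-1], dim=-1),
            F.log_softmax(logits[:, :-1], dim=-1),
            log_target=True,
            reduction="batchmean",
        )
    return nll + kl_coef * kl
```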
4. Performance Accumulation and Empirical Findings
Evaluation employs validation accuracy and pass@$N$ metrics, tracked across iterations.
Gains accumulate approximately linearly until saturation. On math benchmarks, each RLoop iteration (200 RL steps, 1 epoch of RFT) yields consistent accuracy and pass@32 gains, and over multiple cycles average accuracy and pass@32 improve markedly relative to baseline RL (Zhiyuan et al., 6 Nov 2025).
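For reference, per-prompt pass@$N$ can be estimated with the standard unbiased estimator over the rollouts collected for each prompt; the snippet below is a generic sketch, not code from the paper.

```python
from math import comb

def pass_at_n(num_samples, num_correct, n):
    """Unbiased pass@N estimator for a single prompt.

    num_samples: total rollouts generated for the prompt (>= n).
    num_correct: how many of those rollouts were verified correct.
    Returns the estimated probability that at least one of n sampled rollouts is correct.
    """
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset contains at least one correct sample
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Benchmark-level pass@N is the mean of this quantity over prompts, e.g.:
# mean(pass_at_n(s, c, 32) for s, c in per_prompt_counts)
```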
This suggests that iterative initialization and exploitation of inter-step policy diversity convert transient trajectory successes into generalizable policy improvements.
5. Implementation Considerations
Implementation of RLoop utilizes:
- RL Algorithm: DAPO (a PPO-like on-policy policy-gradient method), group size $16$, maximum token length $2048$.
- Trajectory Budget: $200$ RL updates per iteration, yielding ∼$12.8$K total trajectories/iteration.
- Filtering: accept only successful trajectories ($R(\tau) > 0$) from prompts whose success rate falls below the "hard" threshold.
- Learning Rates: a conservative RL learning rate (mitigates collapse) paired with a larger RFT learning rate (promotes supervised convergence).
- KL Regularization: coefficient up to $1.0$ to control divergence from the initialization.
Shorter RL phases empirically reduce catastrophic forgetting, while strict filtering on "hard" prompts concentrates RFT on frontier tasks, accelerating convergence and preventing overfitting to simple cases. A plausible implication is that RLoop's parameterization allows fine-grained control over the exploration-exploitation tradeoff, tailoring generalization gains to the requirements of complex reasoning benchmarks.
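The reported settings can be gathered into a single configuration object along the following lines; values the text does not state (learning rates, batch size, the hard-prompt threshold) are placeholders and explicitly assumptions, not the paper's numbers.

```python
from dataclasses import dataclass

@dataclass
class RLoopConfig:
    # Values reported in the text.
    rl_algorithm: str = "DAPO"                 # PPO-like on-policy policy gradient
    group_size: int = 16
    max_token_length: int = 2048
    rl_steps_per_iteration: int = 200
    rft_epochs: int = 1
    trajectories_per_iteration: int = 12_800   # ~12.8K rollouts per cycle
    kl_coef_max: float = 1.0                   # upper end of the reported range

    # Placeholders: NOT from the paper; tune for your setup.
    rl_learning_rate: float = 1e-6             # assumption: conservative RL lr
    rft_learning_rate: float = 1e-5            # assumption: larger supervised lr
    hard_threshold: float = 0.5                # assumption: prompt success-rate cutoff
```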
6. Context and Significance
RLoop iteration addresses core challenges in reinforcement learning for verifiable rewards (RLVR) — notably, the tendency of large models to overfit to training rewards and degrade in generalization. By leveraging iterative expert sets and rejection-sampling, RLoop preserves diversity and converts latent policy improvements into robust starting points for future RL cycles. This framework is particularly significant for domains where “hard” problems are rare and trajectory diversity is essential for generalization.
The design principles underlying RLoop are extensible, suggesting applications to broader self-improving agent settings beyond mathematical reasoning tasks. Its schematic separation of exploration and exploitation phases, together with rigorous filtering and regularization, provides a methodological basis for future work on reinforcement learning with verifiable, sparse, or frontier-type rewards (Zhiyuan et al., 6 Nov 2025).