
RLoop Iteration: Enhancing RL Cycles

Updated 29 November 2025
  • RLoop iteration is a cyclical reinforcement learning framework that alternates between trajectory exploration and expert-driven exploitation.
  • It employs rejection-sampling fine-tuning to update policies, achieving measurable accuracy gains and improved pass@N metrics per cycle.
  • The approach mitigates overfitting and catastrophic forgetting by focusing on hard, diverse trajectories to ensure robust policy generalization.

RLoop iteration refers to a self-improving, cyclical framework for reinforcement learning (RL) that alternates between trajectory exploration and expert-driven exploitation, designed to mitigate RL overfitting and catastrophic forgetting in large reasoning models. Each iteration initializes RL from a refined policy, collects diverse solution trajectories, filters successful outcomes, and executes Rejection-sampling Fine-Tuning (RFT) to update the policy, with empirical evidence showing robust generalization and accumulated performance gains relative to conventional RL (Zhiyuan et al., 6 Nov 2025).

1. Iterative Structure and Motivation

The RLoop framework is built upon iterative policy initialization, systematically addressing policy over-specialization and solution diversity loss associated with standard RL fine-tuning. A single RLoop iteration consists of two linked phases:

  • Exploration (RL phase): A policy $\pi^{(k)}$ is fine-tuned via on-policy RL methods (e.g., PPO, REINFORCE), generating a pool of trajectories $D_{RL}^{(k)}$ over $N_{RL}$ update steps.
  • Exploitation (RFT phase): Successful trajectories from “hard” problems are filtered into an expert set $D_{expert}^{(k)}$, then used for supervised fine-tuning of a fresh policy copy via RFT, optionally regularized by a KL-divergence penalty.

This cycle captures the transient policy diversity produced during exploration and converts it into durable performance improvements, in contrast to standard RL, which typically discards such intermediate variation.

2. Iteration Workflow and Pseudocode

A typical RLoop iteration is formally decomposed as follows:

| Phase | Step | Key Output |
|---|---|---|
| RL (Exploration) | Sample & RL update | Diversity-rich pool $D_{RL}^{(k)}$ |
| Filtering | Select successful, hard cases | Expert set $D_{expert}^{(k)}$ |
| RFT (Exploitation) | Supervised update, KL reg. | New initial policy $\pi^{(k+1)}$ |

Detailed steps:

  1. From $\pi^{(k)}$, initialize RL fine-tuning. For $t$ from $1$ to $N_{RL}$, sample batches of prompts, roll out trajectories $\{\tau_i\}$, compute rewards $R(\tau_i)$, and aggregate $D_{RL}^{(k)}$.
  2. Apply the filter $F(\tau) = 1$ if $R(\tau) > 0$, $0$ otherwise. Retain only trajectories from prompts with success rate below a “hard” threshold (e.g., $<10\%$).
  3. Re-initialize policy parameters from the iteration start, then perform supervised fine-tuning over $D_{expert}^{(k)}$ for $E$ epochs, applying the RFT gradient:

$$\nabla_\theta L_{RFT}(\theta) = \frac{1}{B}\sum_{\tau \in \text{batch}} \nabla_\theta \log \pi_\theta(\tau) - \lambda\, \nabla_\theta D_{KL}\left[\pi_\theta \,\|\, \pi^{(k)}\right]$$

The updated parameters $\theta_{k+1}$ define $\pi^{(k+1)}$ for the subsequent iteration.

Algorithmic pseudocode:

θ ← θ_0
for k in 0 … I-1 do
    D_RL ← ∅
    θ_RL ← θ
    for t in 1 … N_RL do
        Sample batch {x_i}, generate {τ_i ~ π_{θ_RL}}
        Compute rewards R(τ_i)
        Store D_RL ← D_RL ∪ { (τ_i, R(τ_i)) }
        Compute ∇_θ J_RL(θ_RL)
        θ_RL ← θ_RL + lr_RL · ∇_θ J_RL(θ_RL)
    end for
    D_expert ← { τ ∈ D_RL : R(τ) > 0 AND prompt_success_rate(τ.prompt) < hard_threshold }
    θ_new ← θ
    for epoch in 1 … E do
        for minibatch in batches of D_expert of size B do
            Compute ∇_θ L_RFT(θ_new)
            θ_new ← θ_new + lr_RFT · ∇_θ L_RFT(θ_new)
        end for
    end for
    θ ← θ_new
end for
return π_θ
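
To make the control flow concrete, the following Python sketch mirrors the pseudocode above. The callables `rl_step` and `rft_step`, the trajectory representation, and all parameter names are placeholders introduced here for illustration (e.g., `rl_step` would wrap a DAPO/PPO update and `rft_step` a supervised RFT epoch); this is a structural sketch, not the source implementation.

```python
import copy
import random
from collections import defaultdict
from typing import Callable, List, Tuple

def rloop(init_params,
          prompts: List[str],
          rl_step: Callable,    # (params, prompt_batch) -> (params, [(prompt, traj, reward)])
          rft_step: Callable,   # (params, expert_pairs, ref_params) -> params
          num_iterations: int = 4,
          n_rl_steps: int = 200,
          rft_epochs: int = 1,
          batch_size: int = 64,
          hard_threshold: float = 0.10):
    """Sketch of RLoop: explore with on-policy RL, filter successful trajectories
    from hard prompts, then exploit them via RFT to seed the next iteration."""
    params = copy.deepcopy(init_params)                      # pi^(0)
    for k in range(num_iterations):
        # --- Exploration: RL fine-tuning starting from pi^(k) ---
        rl_params = copy.deepcopy(params)
        pool: List[Tuple[str, str, float]] = []              # D_RL^(k): (prompt, trajectory, reward)
        for _ in range(n_rl_steps):
            batch = random.sample(prompts, min(batch_size, len(prompts)))
            rl_params, rollouts = rl_step(rl_params, batch)  # one on-policy update plus its rollouts
            pool.extend(rollouts)

        # --- Filtering: keep R(tau) > 0 trajectories whose prompt success rate is below threshold ---
        rewards_by_prompt = defaultdict(list)
        for prompt, _, reward in pool:
            rewards_by_prompt[prompt].append(reward)
        success_rate = {p: sum(r > 0 for r in rs) / len(rs)
                        for p, rs in rewards_by_prompt.items()}
        expert = [(p, t) for p, t, r in pool if r > 0 and success_rate[p] < hard_threshold]

        # --- Exploitation: re-initialize from pi^(k) and run supervised RFT epochs ---
        new_params = copy.deepcopy(params)
        for _ in range(rft_epochs):
            new_params = rft_step(new_params, expert, params)  # optional KL penalty toward pi^(k)
        params = new_params                                    # becomes pi^(k+1)
    return params
```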

3. Mathematical Objectives

RLoop iteration formalizes both RL and RFT phases:

  • Trajectory Sampling (RL):

$$J_{RL}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$$

Gradients are computed via:

$$\nabla_{\theta} J_{RL}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[A(\tau)\, \nabla_{\theta} \log \pi_\theta(\tau)\right],$$

with $A(\tau)$ the advantage (e.g., $R(\tau) - b$ or a PPO-variant).

  • Filter Function:

$$F(\tau) = \begin{cases} 1, & R(\tau) > 0 \\ 0, & \text{otherwise} \end{cases}$$

$$D_{expert}^{(k)} = \{\tau \in D_{RL}^{(k)} \mid F(\tau) = 1,\ \text{“hard”}\}$$

  • RFT Objective:

$$L_{RFT}(\theta) = -\,\mathbb{E}_{\tau \sim \pi_{\theta_{RL}}}\left[R(\tau)\log \pi_\theta(\tau)\right]$$

Gradient update:

$$\nabla_\theta L_{RFT}(\theta) = -\,\mathbb{E}_{\tau \sim D_{expert}}[\nabla_\theta \log \pi_\theta(\tau)] + \lambda\, \nabla_\theta D_{KL}[\pi_\theta \,\|\, \pi^{(k)}]$$
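
As a concrete, deliberately simplified illustration of this objective, the PyTorch snippet below computes the RFT loss with a KL penalty toward the frozen reference policy $\pi^{(k)}$, assuming per-token log-probabilities for the expert trajectories have already been gathered; the KL term is a Monte Carlo approximation evaluated on those expert tokens, a common shortcut that is not spelled out in the source.

```python
import torch

def rft_loss(policy_logprobs: torch.Tensor,
             ref_logprobs: torch.Tensor,
             kl_coeff: float = 0.1) -> torch.Tensor:
    """RFT loss over expert trajectories with a KL penalty to the reference policy.

    policy_logprobs: (B, T) token log-probs under the trainable policy pi_theta
    ref_logprobs:    (B, T) token log-probs under the frozen reference pi^(k)
    """
    # Negative log-likelihood of the filtered (all-successful) expert trajectories
    nll = -policy_logprobs.sum(dim=-1).mean()
    # Monte Carlo estimate of KL(pi_theta || pi^(k)) on the expert tokens
    kl = (policy_logprobs - ref_logprobs.detach()).sum(dim=-1).mean()
    return nll + kl_coeff * kl
```

Minimizing this loss corresponds to the gradient above: the first term ascends the log-likelihood of expert trajectories, while the second discourages drift away from $\pi^{(k)}$.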

4. Performance Accumulation and Empirical Findings

Evaluation employs validation accuracy $\text{Acc}(\pi^{(k)})$ and pass@$N$, with the following behavior observed across iterations:

  • For $k = 0, \dots, I-1$,

$$\Delta \text{Acc}_k = \text{Acc}(\pi^{(k+1)}) - \text{Acc}(\pi^{(k)}) > 0$$

$$\text{Pass@}N(\pi^{(k+1)}) - \text{Pass@}N(\pi^{(k)}) > 0$$

Gains accumulate approximately linearly until saturation. On math benchmarks, each RLoop iteration (200 RL steps, 1 RFT epoch) commonly yields a $2$–$4\%$ accuracy gain and a $3$–$5\%$ pass@32 gain. Over multiple cycles, average accuracy increases by $9\%$ and pass@32 by over $15\%$ relative to baseline RL (Zhiyuan et al., 6 Nov 2025).

This suggests that iterative initialization and exploitation of inter-step policy diversity convert transient trajectory successes into generalizable policy improvements.
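
For completeness, pass@$N$ is typically estimated per problem from $n$ sampled generations, of which $c$ are correct, using the standard unbiased estimator $1 - \binom{n-c}{N}/\binom{n}{N}$; this is the conventional definition, assumed here rather than stated explicitly in the source. A minimal sketch:

```python
from math import comb
from typing import Iterable

def pass_at_n(num_samples: int, num_correct: int, N: int) -> float:
    """Unbiased pass@N for one problem: probability that at least one of N
    generations drawn (without replacement) from num_samples is correct."""
    if num_samples - num_correct < N:
        return 1.0  # too few incorrect samples to form an all-incorrect draw of size N
    return 1.0 - comb(num_samples - num_correct, N) / comb(num_samples, N)

def mean_pass_at_n(correct_counts: Iterable[int], num_samples: int, N: int) -> float:
    """Benchmark-level pass@N, averaging over per-problem correct counts."""
    counts = list(correct_counts)
    return sum(pass_at_n(num_samples, c, N) for c in counts) / len(counts)
```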

5. Implementation Considerations

Implementation of RLoop utilizes:

  • RL Algorithm: DAPO (PPO-like on-policy policy gradient), group size $16$, max token length $2048$.
  • Trajectory Budget: $N_{RL} = 200$ RL updates, batch size $B = 64$ (∼$12.8$K total trajectories per iteration).
  • Filtering: Accept only $R(\tau) > 0$ and prompt success rate $< 10\%$ for “hard” cases.
  • Learning Rates: $lr_{RL} = 10^{-6}$ (conservative, mitigates collapse), $lr_{RFT} = 10^{-5}$ (promotes supervised convergence).
  • KL Regularization: $\lambda = 0.1$–$1.0$ to control divergence from initialization.

Shorter RL phases empirically reduce catastrophic forgetting, while strict filtering on “hard” prompts concentrates RFT on frontier tasks, accelerating convergence and preventing overfitting to simple cases. A plausible implication is that RLoop’s parameterization allows fine-grained control over the exploration-exploitation tradeoff, tailoring generalization gains to the requirements of complex reasoning benchmarks.
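
These settings can be gathered into a single configuration object for experiment scripts; the dataclass below is a hypothetical convenience whose field names are chosen here, with values taken from the list above.

```python
from dataclasses import dataclass

@dataclass
class RLoopConfig:
    """Reported RLoop hyperparameters (field names are illustrative, not from the source)."""
    rl_algorithm: str = "DAPO"     # PPO-like on-policy policy gradient
    group_size: int = 16           # rollouts per prompt
    max_tokens: int = 2048         # maximum generation length
    n_rl_steps: int = 200          # RL updates per iteration (N_RL)
    batch_size: int = 64           # prompts per RL update (B)
    hard_threshold: float = 0.10   # prompt success rate defining "hard" cases
    lr_rl: float = 1e-6            # conservative RL learning rate
    lr_rft: float = 1e-5           # supervised RFT learning rate
    kl_coeff: float = 0.1          # lambda in [0.1, 1.0] for the KL penalty
    rft_epochs: int = 1            # RFT epochs per iteration (E)
```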

6. Context and Significance

RLoop iteration addresses core challenges in reinforcement learning with verifiable rewards (RLVR): notably, the tendency of large models to overfit to training rewards and degrade in generalization. By leveraging iterative expert sets and rejection sampling, RLoop preserves diversity and converts latent policy improvements into robust starting points for future RL cycles. This framework is particularly significant for domains where “hard” problems are rare and trajectory diversity is essential for generalization.

The design principles underlying RLoop are extensible, suggesting applications to broader self-improving agent settings beyond mathematical reasoning tasks. Its schematic separation of exploration and exploitation phases, together with rigorous filtering and regularization, provides a methodological basis for future work on reinforcement learning with verifiable, sparse, or frontier-type rewards (Zhiyuan et al., 6 Nov 2025).
