RLoop Iteration: Enhancing RL Cycles
- RLoop iteration is a cyclical reinforcement learning framework that alternates between trajectory exploration and expert-driven exploitation.
- It employs rejection-sampling fine-tuning to update policies, achieving measurable accuracy gains and improved pass@N metrics per cycle.
- The approach mitigates overfitting and catastrophic forgetting by focusing on hard, diverse trajectories to ensure robust policy generalization.
RLoop iteration refers to a self-improving, cyclical framework for reinforcement learning (RL) that alternates between trajectory exploration and expert-driven exploitation, designed to mitigate RL overfitting and catastrophic forgetting in large reasoning models. Each iteration initializes RL from a refined policy, collects diverse solution trajectories, filters successful outcomes, and executes Rejection-sampling Fine-Tuning (RFT) to update the policy, with empirical evidence showing robust generalization and accumulated performance gains relative to conventional RL (Zhiyuan et al., 6 Nov 2025).
1. Iterative Structure and Motivation
The RLoop framework is built upon iterative policy initialization, systematically addressing policy over-specialization and solution diversity loss associated with standard RL fine-tuning. A single RLoop iteration consists of two linked phases:
- Exploration (RL phase): A policy is fine-tuned via on-policy RL methods (e.g., PPO, REINFORCE), generating a pool of diverse trajectories $D_{RL}$ over $N_{RL}$ update steps.
- Exploitation (RFT phase): Successful trajectories on "hard" problems are filtered into an expert set $D_{\text{expert}}$, then used for supervised fine-tuning of a fresh policy copy via RFT, optionally regularized by a KL-divergence penalty.
This cycle leverages transient policy diversity from exploration and converts it into durable performance improvements, contrasting standard RL which typically discards such intermediary variations.
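At the top level the cycle is a short outer loop. The Python sketch below shows the control flow only; the three phase functions are passed in as callables whose names are illustrative, not the paper's, and Section 2 expands each phase into detailed steps and pseudocode.

```python
def rloop(initial_policy, prompts, num_iterations,
          run_rl_phase, build_expert_set, rejection_sampling_finetune):
    """Outer RLoop control flow (sketch).

    The three phase functions are supplied as callables; their names and
    signatures are illustrative assumptions, not the paper's implementation.
    """
    policy = initial_policy
    for _ in range(num_iterations):
        # Exploration: on-policy RL from the current initialization yields a
        # diverse trajectory pool; the RL-tuned weights themselves are discarded.
        trajectory_pool = run_rl_phase(policy, prompts)
        # Filtering: successful rollouts from hard prompts form the expert set.
        expert_set = build_expert_set(trajectory_pool)
        # Exploitation: RFT from a fresh copy of `policy` yields the next initialization.
        policy = rejection_sampling_finetune(policy, expert_set)
    return policy
```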
2. Iteration Workflow and Pseudocode
A typical RLoop iteration is formally decomposed as follows:
| Phase | Step | Key Output |
|---|---|---|
| RL (Exploration) | Sample & RL update | Diversity-rich pool $D_{RL}$ |
| Filtering | Select successful, hard cases | Expert set $D_{\text{expert}}$ |
| RFT (Exploitation) | Supervised update, KL reg. | New initial policy $\theta_{k+1}$ |
Detailed steps:
- From $\theta_k$, the policy that opens iteration $k$, initialize RL fine-tuning. For $t$ from $1$ to $N_{RL}$, sample batches of prompts $\{x_i\}$, roll out trajectories $\tau_i \sim \pi_{\theta_{RL}}(\cdot \mid x_i)$, compute rewards $R(\tau_i)$, and aggregate $D_{RL} \leftarrow D_{RL} \cup \{(\tau_i, R(\tau_i))\}$.
- Apply the filter $\phi(\tau) = 1$ if $R(\tau) > 0$, $0$ otherwise. Retain only trajectories from prompts whose empirical success rate falls below a "hard" threshold.
- Re-initialize policy parameters from the iteration start ($\theta_{\text{new}} \leftarrow \theta_k$), then perform supervised fine-tuning over $D_{\text{expert}}$ for $E$ epochs, applying the RFT gradient
$$\nabla_\theta \mathcal{L}_{\text{RFT}}(\theta) = \mathbb{E}_{(x,\tau) \sim D_{\text{expert}}}\big[\nabla_\theta \log \pi_\theta(\tau \mid x)\big].$$
The updated parameters $\theta_{\text{new}}$ define $\theta_{k+1}$ for the subsequent iteration.
Algorithmic pseudocode:
```
θ ← θ₀
for k in 0 … I−1 do
    D_RL ← ∅
    θ_RL ← θ
    for t in 1 … N_RL do
        Sample batch {x_i}, generate {τ_i ~ π_{θ_RL}}
        Compute rewards R(τ_i)
        Store D_RL ← D_RL ∪ {(τ_i, R(τ_i))}
        Compute ∇_θ J_RL(θ_RL)
        θ_RL ← θ_RL + lr_RL · ∇_θ J_RL(θ_RL)
    end for
    D_expert ← {τ ∈ D_RL : R(τ) > 0 and prompt_success_rate(τ.prompt) < hard_threshold}
    θ_new ← θ
    for epoch in 1 … E do
        for minibatch in batches of D_expert of size B do
            Compute ∇_θ L_RFT(θ_new)
            θ_new ← θ_new + lr_RFT · ∇_θ L_RFT(θ_new)
        end for
    end for
    θ ← θ_new
end for
return π_θ
```
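The single filtering line in the pseudocode can be expanded into a concrete routine. The following is a sketch assuming trajectories are stored as dicts with "prompt" and "reward" keys; the schema is illustrative, not the paper's.

```python
from collections import defaultdict

def build_expert_set(rl_pool, hard_threshold):
    """Select successful trajectories that come from 'hard' prompts.

    rl_pool: list of dicts with keys 'prompt', 'trajectory', 'reward'
             (illustrative schema, not prescribed by the paper).
    hard_threshold: keep prompts whose empirical success rate is below this value.
    """
    # Empirical success rate per prompt over all rollouts collected in the RL phase.
    counts, successes = defaultdict(int), defaultdict(int)
    for item in rl_pool:
        counts[item["prompt"]] += 1
        successes[item["prompt"]] += int(item["reward"] > 0)
    success_rate = {p: successes[p] / counts[p] for p in counts}

    # Keep only successful rollouts (reward > 0) from prompts below the hard threshold.
    return [
        item for item in rl_pool
        if item["reward"] > 0 and success_rate[item["prompt"]] < hard_threshold
    ]
```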
3. Mathematical Objectives
RLoop iteration formalizes both RL and RFT phases:
- Trajectory Sampling (RL): prompts $x_i$ are drawn from the training distribution and trajectories are rolled out on-policy,
$$\tau_i \sim \pi_{\theta_{RL}}(\cdot \mid x_i), \qquad J_{RL}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_\theta(\cdot \mid x)}\big[R(\tau)\big].$$
Gradients are computed via
$$\nabla_\theta J_{RL}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau \mid x)\, A(\tau)\big],$$
with advantage $A(\tau)$ (e.g., $A(\tau) = R(\tau) - b(x)$ for a baseline $b$, or a PPO-style clipped variant).
- Filter Function:
$$\phi(\tau) = \mathbb{1}\big[R(\tau) > 0\big].$$
- RFT Objective:
$$\mathcal{L}_{\text{RFT}}(\theta) = \mathbb{E}_{(x,\tau) \sim D_{\text{expert}}}\big[\log \pi_\theta(\tau \mid x)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta_k}\big),$$
where the KL term with coefficient $\beta$ is the optional regularizer toward the iteration's initial policy. Gradient update:
$$\theta_{\text{new}} \leftarrow \theta_{\text{new}} + \eta_{\text{RFT}}\, \nabla_\theta \mathcal{L}_{\text{RFT}}(\theta_{\text{new}}).$$
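As a concrete illustration of the RFT update, the PyTorch-style sketch below computes the token-level log-likelihood loss over an expert minibatch, with an optional KL penalty against the iteration's initial policy. The model interface (Hugging Face-style causal LMs returning `.logits`) and tensor names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rft_loss(policy, ref_policy, input_ids, labels, kl_coef=0.0):
    """Rejection-sampling fine-tuning loss on expert trajectories (sketch).

    policy / ref_policy: causal LMs returning logits of shape (B, T, V);
    input_ids, labels: token ids, with labels = -100 on prompt/padding positions.
    The interface mirrors Hugging Face causal LMs; this is an assumption here.
    """
    logits = policy(input_ids).logits  # (B, T, V)
    # Next-token cross-entropy restricted to response tokens; minimizing this
    # implements the ascent step on the log-likelihood objective above.
    nll = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    kl = torch.tensor(0.0, device=nll.device)
    if kl_coef > 0.0:
        with torch.no_grad():
            ref_logits = ref_policy(input_ids).logits
        # KL(policy || reference), summed over tokens and averaged over the batch.
        kl = F.kl_div(
            F.log_softmax(ref_logits[:, :-1], dim=-1),
            F.log_softmax(logits[:, :-1], dim=-1),
            log_target=True,
            reduction="batchmean",
        )
    return nll + kl_coef * kl
```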
4. Performance Accumulation and Empirical Findings
Evaluation employs validation accuracy and pass@$N$ metrics, tracked across iterations.
Gains accumulate approximately linearly until saturation. On math benchmarks, each RLoop iteration (200 RL steps, 1 epoch of RFT) yields consistent accuracy and pass@32 gains, and over multiple cycles average accuracy and pass@32 improve markedly relative to baseline RL (Zhiyuan et al., 6 Nov 2025).
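For reference, per-prompt pass@$N$ can be estimated with the standard unbiased estimator over the rollouts collected for each prompt; the snippet below is a generic sketch, not code from the paper.

```python
from math import comb

def pass_at_n(num_samples, num_correct, n):
    """Unbiased pass@N estimator for a single prompt.

    num_samples: total rollouts generated for the prompt (>= n).
    num_correct: how many of those rollouts were verified correct.
    Returns the estimated probability that at least one of n sampled rollouts is correct.
    """
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset contains at least one correct sample
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Benchmark-level pass@N is the mean of this quantity over prompts, e.g.:
# mean(pass_at_n(s, c, 32) for s, c in per_prompt_counts)
```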
This suggests that iterative initialization and exploitation of inter-step policy diversity convert transient trajectory successes into generalizable policy improvements.
5. Implementation Considerations
Implementation of RLoop utilizes:
- RL Algorithm: DAPO (a PPO-like on-policy policy-gradient method), group size $16$, maximum token length $2048$.
- Trajectory Budget: $200$ RL updates per iteration, yielding ∼$12.8$K total trajectories/iteration.
- Filtering: accept only successful trajectories ($R(\tau) > 0$) from prompts whose success rate falls below the "hard" threshold.
- Learning Rates: a conservative RL learning rate (mitigates collapse) paired with a larger RFT learning rate (promotes supervised convergence).
- KL Regularization: coefficient up to $1.0$ to control divergence from the initialization.
Shorter RL phases empirically reduce catastrophic forgetting, while strict filtering on "hard" prompts concentrates RFT on frontier tasks, accelerating convergence and preventing overfitting to simple cases. A plausible implication is that RLoop's parameterization allows fine-grained control over the exploration-exploitation tradeoff, tailoring generalization gains to the requirements of complex reasoning benchmarks.
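The reported settings can be gathered into a single configuration object along the following lines; values the text does not state (learning rates, batch size, the hard-prompt threshold) are placeholders and explicitly assumptions, not the paper's numbers.

```python
from dataclasses import dataclass

@dataclass
class RLoopConfig:
    # Values reported in the text.
    rl_algorithm: str = "DAPO"                 # PPO-like on-policy policy gradient
    group_size: int = 16
    max_token_length: int = 2048
    rl_steps_per_iteration: int = 200
    rft_epochs: int = 1
    trajectories_per_iteration: int = 12_800   # ~12.8K rollouts per cycle
    kl_coef_max: float = 1.0                   # upper end of the reported range

    # Placeholders: NOT from the paper; tune for your setup.
    rl_learning_rate: float = 1e-6             # assumption: conservative RL lr
    rft_learning_rate: float = 1e-5            # assumption: larger supervised lr
    hard_threshold: float = 0.5                # assumption: prompt success-rate cutoff
```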
6. Context and Significance
RLoop iteration addresses core challenges in reinforcement learning for verifiable rewards (RLVR) — notably, the tendency of large models to overfit to training rewards and degrade in generalization. By leveraging iterative expert sets and rejection-sampling, RLoop preserves diversity and converts latent policy improvements into robust starting points for future RL cycles. This framework is particularly significant for domains where “hard” problems are rare and trajectory diversity is essential for generalization.
The design principles underlying RLoop are extensible, suggesting applications to broader self-improving agent settings beyond mathematical reasoning tasks. Its schematic separation of exploration and exploitation phases, together with rigorous filtering and regularization, provides a methodological basis for future work on reinforcement learning with verifiable, sparse, or frontier-type rewards (Zhiyuan et al., 6 Nov 2025).