Iterative RL/SFT: Bridging Supervised & RL

Updated 14 October 2025
  • Iterative RL/SFT is a framework that alternates supervised fine-tuning with reinforcement learning to iteratively improve model alignment and performance across complex tasks.
  • It leverages techniques like SuperHF, inverse RL, and plug-and-play methods to address challenges such as catastrophic forgetting, reward hacking, and limited exploration.
  • By refining training signals through iterative label improvement and adaptive scheduling, this paradigm enhances stability, data efficiency, and model robustness.

Iterative Reinforcement Learning/SFT encompasses algorithmic frameworks and methodologies that alternate or dynamically combine supervised fine-tuning (SFT) with reinforcement learning (RL) or preference optimization, often over multiple rounds, to improve the performance, alignment, and robustness of large language models (including multimodal LLMs) across reasoning, safety, and capability domains. Recent research demonstrates a spectrum of designs, ranging from explicit alternating SFT and RL phases and adaptive interleaving to plug-and-play insertion of one paradigm into the other and meta-iterative defense–attack loops, each tailored to specific data regimes, supervision qualities, and downstream tasks. The iterative aspect is leveraged both to refine training signals (e.g., through label improvement or RL-driven trajectory generation) and to mitigate shortcomings such as overfitting, catastrophic forgetting, reward hacking, and limited exploration.

1. Core Mechanisms in Iterative RL/SFT

Iterative RL/SFT strategically alternates between supervised updating and reward-driven optimization, typically to combine their respective advantages and compensate for their individual limitations. In the canonical setting, SFT initializes the model on human-annotated demonstrations, ensuring fidelity to grounded responses, while RL (or a proxy such as preference optimization) exploits reward signals to push performance on complex, high-reward, or dynamically-defined objectives. Sophisticated variants extend this paradigm:

  • Supervised Iterative Learning from Human Feedback (SuperHF) iteratively generates candidate completions, filters them with a reward model, and retrains via SFT under a KL-divergence prior to prevent distributional collapse. This recasts the RLHF pipeline as iterated supervised learning while stabilizing convergence and mitigating reward hacking through KL regularization and empirical surrogate posteriors (Mukobi et al., 2023).
  • Inverse Reinforcement Learning in SFT treats SFT as a bilevel problem, jointly learning a reward function and a policy maximizing regularized expected reward; the updates contrast expert demonstrations with synthetic model samples, unifying self-play and IRL approaches (Li et al., 28 May 2024).
  • Plug-and-Play Frameworks such as MIFO decouple RL and SFT intervals, storing challenging RL examples in a buffer to be used for SFT on uncertain/high-entropy tokens, while freezing RL-critical parameters to prevent catastrophic forgetting (Yuan et al., 6 Oct 2025).
  • Step-Adaptive Integration (SASR) guides the SFT–RL trade-off at each optimization step by monitoring gradient norms and the KL divergence from the initial SFT distribution, which set the adaptive schedule for the RL and SFT loss terms (Chen et al., 19 May 2025).

These frameworks can operate via (i) explicit, fixed-phase alternation; (ii) dynamically scheduled or interleaved updates; or (iii) meta-iterative loops in which outputs or weaknesses surfaced at each round refine the training signal for subsequent rounds. A minimal skeleton of the alternating pattern is sketched below.
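As a concrete illustration of case (i), the following minimal sketch shows a generic fixed-phase alternation. The callables sft_phase, rl_phase, and refine_dataset are illustrative placeholders rather than any cited paper's API: they stand in for the supervised update, the reward-driven update, and the signal-refinement step of a given method.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Dataset = List[dict]  # e.g., prompt/response records


def iterative_rl_sft(
    model: Model,
    dataset: Dataset,
    sft_phase: Callable[[Model, Dataset], Model],
    rl_phase: Callable[[Model, Dataset], Model],
    refine_dataset: Callable[[Model, Dataset], Dataset],
    rounds: int = 3,
) -> Model:
    """Alternate SFT and RL phases, refreshing the training signal each round.

    SFT grounds the model on (possibly refined) demonstrations, RL pushes
    reward-driven objectives, and the dataset is regenerated or filtered from
    the current model before the next round begins.
    """
    for _ in range(rounds):
        model = sft_phase(model, dataset)         # supervised grounding
        model = rl_phase(model, dataset)          # reward-driven optimization
        dataset = refine_dataset(model, dataset)  # e.g., reward-filtered rollouts
    return model
```

The methods listed above differ mainly in how refine_dataset is realized (reward-model filtering in SuperHF, a buffer of hard RL examples in MIFO) and in whether the phase boundary is fixed or adaptively scheduled per step (SASR).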

2. Reward Learning, Data Refinement, and Preference Optimization

A central innovation in modern iterative RL/SFT is the treatment of reward learning and data refinement as online, iterative processes, addressing the problems of limited or unreliable supervision:

  • Iterative Label Refinement (ILR) replaces error-prone SFT training examples with high-quality model-generated alternatives selected through human (or simulated) preferences. The process iterates, cross-labeling across data splits and updating only a controlled fraction α of labels at each round (Ye et al., 14 Jan 2025). Compared to RLHF-style preference optimization (e.g., DPO), this method is particularly robust to weak supervision, avoids overoptimization under noisy feedback, and permits large effective model updates via improved data.
  • Direct Preference Optimization: Methods such as DPO allow the policy to be updated in closed form with respect to win–lose pairs, and are employed in step-wise iterative settings to calibrate tool-use segments, multimodal generation, or hybrid visual-text tasks (Zeng et al., 15 Jan 2025, Zhuang et al., 3 Apr 2025); the underlying loss is stated after this list.
  • Reward Models and Filtering: Reward models can be iteratively retrained or serve as filtering agents (as in SuperHF or iTool), defining surrogate training distributions, preventing mode collapse, and enabling interpretable trade-offs (e.g., between high reward and diversity measured via METEOR similarity or entropy penalties).
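For reference, the closed-form preference objective behind DPO-style methods trains the policy $\pi_\theta$ directly on win–lose pairs $(y_w, y_l)$ against a frozen reference policy $\pi_{\mathrm{ref}}$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit KL constraint toward the reference policy. In iterative settings, $\pi_{\mathrm{ref}}$ is commonly reset to the most recent SFT checkpoint at each round so that the preference update stays anchored to the refreshed supervised distribution.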

Empirical results indicate that iterative data refinement and reward-informed filtering perform favorably under unreliable supervision and lead to greater robustness and generalization in complex domains (e.g., mathematical reasoning, code synthesis, safe instruction following).

3. Stability, Forgetting, and Data Efficiency

Iterative RL/SFT often targets improved stability, minimal catastrophic forgetting, and efficiency with respect to both data and computation:

  • Catastrophic Forgetting is explicitly mitigated via entropy-aware loss computation (focusing SFT on high-uncertainty tokens) and by freezing the parameters most affected by RL updates (Yuan et al., 6 Oct 2025). This approach achieves state-of-the-art reasoning performance while using only 1.5% of the SFT data and 20.4% of the RL data required by prior methods; an illustrative token-masking sketch follows this list.
  • Online vs. Offline Iterative RL: RoiRL demonstrates that weighted log-likelihood optimization using majority-vote self-labels in an offline iterative loop achieves greater stability and training speed than online RL methods, eliminating the need for a reference model and enabling up to 2.5× faster training (Arzhantsev et al., 3 Oct 2025).
  • Task/Knowledge Retention Trade-off: Research shows that SFT enables rapid adaptation to novel tasks but may cause severe forgetting, while reinforcement fine-tuning (RFT) enables slower but more stable task learning without interfering with prior knowledge. Periodic SFT on RFT-generated correct rollouts can combine rapid acquisition with preserved performance on previous tasks (Zhang et al., 30 Jun 2025).
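As a minimal sketch of the entropy-aware idea referenced above, the snippet below restricts the SFT cross-entropy loss to the most uncertain token positions. The function name, array shapes, and keep_fraction threshold are illustrative assumptions, not the cited method's actual implementation.

```python
import numpy as np


def entropy_masked_sft_loss(
    logits: np.ndarray,          # (seq_len, vocab_size) per-token logits
    target_ids: np.ndarray,      # (seq_len,) gold token ids from the SFT example
    keep_fraction: float = 0.2,  # assumed: train on the top 20% most uncertain tokens
) -> float:
    """Token-level cross-entropy averaged over only the highest-entropy positions.

    Focusing supervised updates on uncertain tokens is one way to limit
    interference with behavior already consolidated by the RL phase.
    """
    # Numerically stable softmax and per-token predictive entropy.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)

    # Select the most uncertain positions.
    k = max(1, int(keep_fraction * len(entropy)))
    keep = np.argsort(entropy)[-k:]

    # Standard negative log-likelihood, averaged over the selected positions only.
    token_nll = -np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-12)
    return float(token_nll[keep].mean())
```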

The empirical and theoretical consensus is that the data distribution and update schedule, rather than algorithmic differences alone, play the critical role in balancing forgetting against sample efficiency.

4. Evaluation, Generalization, and Proxy Metrics

Rigorous evaluation protocols and proxy metrics for iterative RL/SFT pipeline tuning are increasingly necessary to ensure progress is both real and transferable to downstream RL or alignment goals:

  • SFT Score Limitations: High supervised fine-tuning scores alone do not reliably predict subsequent RL stage gains. Overfitting to homogeneous or short examples can lead to misleading SFT metrics and poor RL outcomes (Kang et al., 2 Oct 2025).
  • Alternative Predictors: The point at which generalization loss on held-out validation data begins to rise, and Pass@k metrics computed over large sample budgets (indicating the presence of at least one correct answer among many samples), are shown to be far better predictors of RL potential, improving R² correlations by up to 0.5 over naive SFT-based metrics (Kang et al., 2 Oct 2025); a generic Pass@k estimator is sketched after this list.
  • Iterative Adversarial Training and Safety: In security-critical settings, such as the SecTOW framework, attacker and defender models are iteratively optimized with RL (GRPO) to identify and patch vulnerabilities in model reasoning while monitoring over-refusal and diversity via built-in quality metrics (Dai et al., 29 Jul 2025).
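The Pass@k proxy cited above has a standard unbiased estimator in the code-generation literature; the version below is that generic formulation, not necessarily the exact variant used in the cited study. Given n generations per problem, c of which are correct, it estimates the probability that at least one of k drawn samples is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples with c correct solutions."""
    if n - c < k:  # every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 64 rollouts per problem, 3 correct -> estimated Pass@8 of roughly 0.33
print(pass_at_k(n=64, c=3, k=8))
```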

These findings underscore the importance of proxy metrics, ablation of training schedules, and careful selection of evaluation protocols as integral components of successful iterative RL/SFT deployment.

5. Applications and Extensions: Multimodal, Tool Use, and Domain-Specific Reasoning

Iterative RL/SFT methods have been extended and validated in a range of application contexts:

  • Vision-Language and Multimodal Models: Alternating SFT and RL phases, as in OpenVLThinker and Metis-RISE, lead to iterative self-improvement in multimodal reasoning tasks. SFT distills chain-of-thought traces (potentially from text-only expert models) while RL (typically using GRPO or mixed reward models) enhances adaptive reasoning and enables transfer across visual and textual domains (Deng et al., 21 Mar 2025, Qiu et al., 16 Jun 2025).
  • Tool Use and Code Generation: For advanced tool-driven LLM tasks (as in iTool and AutoTriton), step-wise MCTS and RL-driven preference optimization iteratively address local deficiencies and adapt to hard, fragmentary errors, yielding measurable improvements over SFT-only or purely synthetic-data approaches (Zeng et al., 15 Jan 2025, Li et al., 8 Jul 2025).
  • Legal and Domain-Specific Reasoning: Iterative SFT–RL pipelines with domain-specific rewards or assessor/reviser agents support robust, interpretable legal reasoning, outperforming much larger general-purpose models by combining LoRA SFT warmup with domain-informed RL reward structures (Cai et al., 11 Oct 2025).

Notably, applications systematically leverage the iterative framework to address context-specific deficits—be it complex multi-step tool use, safety via adversarial training, adaptive domain reasoning, or efficient alignment of multimodal models.

6. Theoretical Underpinnings and Mathematical Foundations

Several lines of work present theoretical analysis, convergence guarantees, and closed-form objectives for iterative RL/SFT algorithms:

  • Bilevel and Minimax Formulations: The IRL formulation of SFT exemplifies a bilevel framework in which the lower-level RL problem admits a closed-form softmax policy and the upper-level SFT problem maximizes demonstration log-likelihood subject to regularized policy alignment (Li et al., 28 May 2024); the closed form is stated after this list.
  • Learning Dynamics: The effect of fine-tuning updates on the probability landscape (and catastrophic forgetting) is analytically explored using the empirical neural tangent kernel, revealing that reinforcement on on-distribution rollouts preserves prior knowledge via symmetric gradient interactions (Zhang et al., 30 Jun 2025).
  • Gradient Regularization, KL, and Adaptive Losses: Theoretical results connect gradient norm with divergence from initial SFT distribution, justifying adaptive schedule designs (as in SASR) and the use of KL penalties to stabilize policy updates and prevent overfitting or collapse (Chen et al., 19 May 2025, Mukobi et al., 2023).
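To make the closed-form and adaptive-loss claims above concrete, the KL-regularized lower-level problem admits the standard softmax solution, and step-adaptive schemes combine the two losses with a data-dependent weight. The schedule for $\lambda_t$ shown here is an illustrative placeholder rather than the exact rule from the cited papers:

$$
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)}{\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\!\big(r(x, y')/\beta\big)},
\qquad
\mathcal{L}_t = (1 - \lambda_t)\, \mathcal{L}_{\mathrm{SFT}} + \lambda_t\, \mathcal{L}_{\mathrm{RL}},
$$

where $\beta$ is the regularization temperature (with $\pi_{\mathrm{ref}}$ reducing to a uniform prior in the purely entropy-regularized case) and $\lambda_t \in [0, 1]$ is adapted per step, for example as a function of the current gradient norm and the KL divergence from the initial SFT distribution.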

These mathematical foundations inform practical algorithm design, providing convergence guarantees, efficiency justifications, and insight into the trade-offs between exploration (RL-driven) and exploitation (SFT-driven).

7. Future Directions

The iterative RL/SFT paradigm continues to evolve, and several key trajectories are visible:

  • Cyclic Multi-Round Training: Forward iterations alternating SFT and RL (potentially in an adaptive or performance-triggered schedule) are proposed to consolidate gains, reactivate reasoning, and enable continual improvement, especially in large-scale domain- or multimodal models (Qiu et al., 16 Jun 2025).
  • Data-Centric Refinement: Emphasis is shifting towards iterative dataset improvement (ILR, data cleaning via feedback) as an alternative or supplement to direct iterative model optimization, with special benefits under unreliable or noisy supervision (Ye et al., 14 Jan 2025).
  • Algorithm Agnosticism and Meta-Frameworks: Modular, plug-and-play architectures allow for domain, data type, and supervision noise adaptation, decoupling the iterative pipeline from any one RL or SFT algorithm (Yuan et al., 6 Oct 2025).
  • Quality Control and Interventional Monitoring: Systematic use of intermediate validation loss, diversity/entropy measures, and task-appropriate adversarial challenges are emerging as best practices for controlling runaway optimization, catastrophic forgetting, or overrefusal in sensitive applications (Dai et al., 29 Jul 2025, Kang et al., 2 Oct 2025).

A plausible implication is that iterative RL/SFT will increasingly serve as a foundational protocol for continual model improvement, robust safety assurance, and unified alignment across emergent domains, as scaling, data availability, and real-world complexity intensify.
