Cooperative SFT & RL for Model Alignment
- Cooperative SFT and RL is an innovative framework that integrates supervised fine-tuning with reinforcement learning to enhance model generalization and robustness.
- It leverages bilevel optimization, meta-gradients, and adaptive scheduling to mitigate issues like gradient interference and catastrophic forgetting.
- Empirical studies demonstrate significant gains in sample efficiency and reasoning performance across diverse language and multimodal tasks.
Cooperative Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is an advanced paradigm for aligning large models, especially language and multimodal systems, by leveraging their complementary optimization dynamics. Instead of relying on a rigid two-stage SFT→RL dichotomy, recent research proposes frameworks in which SFT and RL interact—sometimes adaptively, sometimes via explicit meta-optimization—to improve generalization, sample efficiency, and robustness, and to address limitations intrinsic to each methodology when used in isolation. This article synthesizes recent algorithmic, theoretical, and empirical developments in cooperative SFT and RL.
1. Theoretical Motivation: From Sequential to Cooperative Paradigms
Traditional post-training pipelines decouple SFT (typically minimizing next-token loss on curated demonstrations) and RL (usually policy-gradient updates on a scalar reward from a learned or rule-based proxy). However, isolation of these stages often leads to suboptimal interactions or even negative transfer, such as catastrophic forgetting or ineffective exploration. Recent works demonstrate that:
- SFT can overfit or destroy model entropy, impairing subsequent RL exploration or generalization (Jin et al., 8 Sep 2025, Matsutani et al., 25 Sep 2025).
- RL, when naively run after SFT, is often inefficient, especially in sparse reward settings, and can erase beneficial SFT-induced behaviors (Chen et al., 8 Sep 2025, Zhang et al., 1 Feb 2026).
- The distribution mismatch between supervised data (behavior policy) and on-policy RL trajectories (target policy) diminishes the effectiveness of naive SFT as a warm-start for RL unless the objectives are explicitly linked (Zhang et al., 1 Feb 2026, Liu et al., 1 Jun 2025).
Cooperative frameworks explicitly tie SFT and RL—via shared parameter trajectories, bilevel optimization, or meta-losses—to ensure information flows bidirectionally and to optimize for effectiveness, efficiency, and robustness.
2. Explicit Cooperative Algorithms: Bilevel, Meta-Gradients, and Data Arbitration
A central development is the use of bilevel or meta-gradient formulations. In (Chen et al., 8 Sep 2025), the SFT objective is conditioned on the optimal RL policy: $\max_{w}\;\mathbb{E}_{(x,r,y)\sim\mathcal{D}_{\mathrm{SFT}}}\left[\log\,\pi_{(\theta^*(w),w)}(r,y\mid x)\right] \quad \text{s.t.} \quad \theta^*(w) = \arg\max_{\theta} J_{\mathrm{RL}}(\theta, w),$ with “cooperative gain” meta-gradients that explicitly maximize the improvement of the joint SFT+RL update over an RL-only update on the RL objective.
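The sketch below illustrates a first-order, one-step look-ahead version of such a cooperative-gain meta-gradient; the toy quadratic losses, the sigmoid-parameterized mixing weight, and the truncated inner loop are illustrative assumptions, not the published BRIDGE implementation.

```python
# Minimal sketch of a truncated bilevel / cooperative-gain meta-gradient:
# the meta-parameter w controls how strongly SFT is mixed into each policy step,
# and is updated to maximize how much that mixing improves the RL objective.
import torch

torch.manual_seed(0)
theta = torch.randn(8, requires_grad=True)   # policy parameters (toy)
w = torch.zeros(1, requires_grad=True)       # meta-parameter: SFT mixing weight

def sft_loss(p):   # stand-in for -E[log pi(y|x)] on demonstrations
    return ((p - 1.0) ** 2).mean()

def rl_loss(p):    # stand-in for -J_RL(theta): negative expected reward
    return ((p + 0.5) ** 2).mean()

inner_lr, meta_lr = 0.1, 0.05
for step in range(200):
    # Inner step WITH the SFT term (weighted by sigmoid(w)); keep the graph
    # so gradients can flow back to w (first-order truncated bilevel).
    lam = torch.sigmoid(w)
    joint = rl_loss(theta) + lam * sft_loss(theta)
    g_joint = torch.autograd.grad(joint, theta, create_graph=True)[0]
    theta_joint = theta - inner_lr * g_joint

    # Inner step WITHOUT the SFT term (RL-only baseline).
    g_rl = torch.autograd.grad(rl_loss(theta), theta, create_graph=True)[0]
    theta_rl = theta - inner_lr * g_rl

    # Cooperative gain: how much the SFT-augmented step improves the RL
    # objective relative to the RL-only step; maximize it w.r.t. w.
    coop_gain = rl_loss(theta_rl) - rl_loss(theta_joint)
    (-coop_gain).backward()
    with torch.no_grad():
        w -= meta_lr * w.grad
        w.grad = None
        theta.grad = None
        theta.copy_(theta_joint.detach())    # commit the joint update
```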
Data arbitration via internal signals further enhances cooperation. PRISM (Zhao et al., 12 Jan 2026) partitions training data according to gradient concentration metrics (e.g., Gini coefficient or kurtosis of layerwise parameter gradients under a frozen base model). High-conflict (gradient-concentrated) examples require RL restructuring, while diffuse ones are consolidated by SFT, yielding substantial improvements in sample efficiency and final reward (e.g., +5.46% absolute Success Rate over SFT→RL on WebShop, with up to 3.22× faster training).
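A schematic of this kind of gradient-concentration arbitration is sketched below; the toy classifier, the Gini threshold, and the per-tensor grouping of gradients are assumptions for illustration rather than PRISM's exact criterion.

```python
# Route each example to RL or SFT based on how concentrated its gradients are
# across parameter tensors of a frozen scoring model (Gini coefficient).
import torch
import torch.nn as nn

def gini(values: torch.Tensor) -> float:
    """Gini coefficient of a 1-D tensor of non-negative values."""
    v, _ = torch.sort(values)
    n = v.numel()
    idx = torch.arange(1, n + 1, dtype=v.dtype)
    return ((2 * idx - n - 1) * v).sum().item() / (n * v.sum().item() + 1e-12)

def concentration_score(model: nn.Module, loss: torch.Tensor) -> float:
    """Gini coefficient of per-parameter-tensor gradient norms for one example."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    norms = torch.stack([g.norm() for g in grads])
    return gini(norms)

# Toy usage: high-concentration (conflicting) examples go to the RL pool,
# diffuse ones go to the SFT pool.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
GINI_THRESHOLD = 0.5   # illustrative; the published criterion is tuned separately
rl_pool, sft_pool = [], []
for _ in range(8):
    x, y = torch.randn(1, 16), torch.randint(0, 4, (1,))
    loss = nn.functional.cross_entropy(model(x), y)
    if concentration_score(model, loss) > GINI_THRESHOLD:
        rl_pool.append((x, y))    # high conflict: restructure via RL
    else:
        sft_pool.append((x, y))   # diffuse gradients: consolidate via SFT
```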
Inverse Reinforcement Learning-inspired formulations (Li et al., 2024) unify SFT and RL into a maximum-entropy IRL objective, where human demonstration data simultaneously shapes the reward and the policy model. Notably, the closed-form solution ties the optimal policy directly to the learned reward, blending supervised imitation and reward-based updates within the same gradient flow.
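For orientation, a standard KL-regularized maximum-entropy identity (not necessarily the exact parameterization of Li et al., 2024) makes this coupling explicit: the optimal policy satisfies $\pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)$ with $Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y'\mid x)\,\exp\!\big(r(x,y')/\beta\big)$, so maximizing demonstration log-likelihood under $\pi^*$ simultaneously fits the reward $r$ and the policy it induces.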
3. Adaptive Scheduling, Dynamic Gating, and Curriculum Mechanisms
Static switching between SFT and RL fails to adapt to task difficulty, training progression, or reward sparsity. State-of-the-art adaptive schemes include:
- SASR (Chen et al., 19 May 2025): Step-wise Adaptive SFT+RL integrates RL and SFT through a dynamic mixture coefficient modulated by the supervised gradient norm and the KL divergence to the starting policy. SASR outperforms both static hybrids and pure SFT/RL across reasoning benchmarks, achieving GSM8K accuracy of 80.3% versus 75.2% for SFT.
- SuperRL (Liu et al., 1 Jun 2025): A simple yet effective per-instance reward-gated strategy in which the training objective defaults to SFT if all RL rollouts for an instance fail and otherwise uses policy gradients (a minimal gating-and-mixing sketch follows this list). This structure avoids wasted updates in sparse-reward regimes and preserves supervised knowledge, enabling SuperRL to deliver +6.6 points over SFT+RL and +16.5 points over SFT alone on GSM8K.
- BRIDGE (Chen et al., 8 Sep 2025): Bilevel optimization, as outlined above, adaptively maximizes the cooperative gain, interpolating between SFT and RL as the model (and LoRA adapters) evolve. This method consistently produces faster convergence and higher in-domain and out-of-domain accuracy.
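The sketch below combines the two adaptive mechanisms above in miniature: per-instance reward gating that falls back to SFT when every rollout fails, plus a drift-dependent mixing coefficient. The tabular toy policy, sparse reward, and KL-based schedule are illustrative assumptions, not the published SASR or SuperRL implementations.

```python
# Reward-gated SFT fallback plus an adaptive SFT/RL mixing coefficient,
# sketched on a tiny tabular policy with a sparse toy reward.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, PROMPTS = 10, 5
logits_table = torch.zeros(PROMPTS, VOCAB, requires_grad=True)  # toy "policy"
ref_logits = logits_table.detach().clone()                      # frozen starting policy
demos = torch.randint(0, VOCAB, (PROMPTS,))                     # expert actions
optimizer = torch.optim.Adam([logits_table], lr=0.1)

def reward(prompt: int, action: torch.Tensor) -> float:
    return 1.0 if action.item() == demos[prompt].item() else 0.0  # sparse toy reward

for step in range(100):
    prompt = step % PROMPTS
    log_probs = F.log_softmax(logits_table[prompt], dim=-1)

    # Sample a small group of rollouts and score them.
    actions = torch.multinomial(log_probs.exp(), num_samples=4, replacement=True)
    rewards = torch.tensor([reward(prompt, a) for a in actions])

    sft_loss = -log_probs[demos[prompt]]                          # imitation term
    if rewards.sum() == 0:
        # All rollouts failed: fall back to pure SFT (reward gating).
        loss = sft_loss
    else:
        # Policy gradient with a mean-reward baseline, plus an SFT weight that
        # grows with KL drift from the starting policy (illustrative schedule).
        advantages = rewards - rewards.mean()
        pg_loss = -(advantages * log_probs[actions]).mean()
        kl = F.kl_div(log_probs, F.log_softmax(ref_logits[prompt], dim=-1),
                      log_target=True, reduction="sum")
        alpha = torch.clamp(kl.detach(), 0.0, 1.0)
        loss = pg_loss + alpha * sft_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```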
Several works highlight the limits of static mixtures or model merging for combining SFT and RL, especially in multimodal reasoning models, where conflicting gradient directions and the rigidity of SFT-induced reasoning traces can impede RL's ability to overwrite or exploit them (Chen et al., 10 Jul 2025).
4. Empirical Mechanisms: Interactions, Restoration, and Synergy Boundaries
Recent fine-grained analyses reveal the nuanced dynamics of cooperative SFT and RL:
- Function Space Dynamics: SVD-based studies (Jin et al., 8 Sep 2025) show that OOD forgetting during SFT is associated with rotations (rather than rescalings) of the singular vectors of attention and MLP matrices, and that RL can partially reverse these rotations, restoring OOD generalization lost during SFT. RL's restorative ability is bounded: it recovers OOD performance only if SFT has not collapsed entropy or underfit (a small diagnostic sketch follows this list).
- Reasoning Path Geometry: SFT expands the diversity of correct reasoning trajectories and flattens the distribution over reasoning steps, while RL concentrates probability mass onto robust, high-success trajectories, steepening the decay of graph-theoretic centrality and visitation metrics. The two-stage SFT→RL ordering is therefore complementary rather than redundant: SFT alone cannot prune incorrect modes, and RL alone cannot invent new strategies (Matsutani et al., 25 Sep 2025).
- Partial Supervision and Branched Rollouts: BREAD (Zhang et al., 20 Jun 2025) demonstrates that, for small language models (SLMs), partial expert guidance (episodic hints anchored within RL rollouts) circumvents both SFT inexpressivity and RL reward sparsity, reducing expert trace requirements by >60% and tripling learning speed.
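A small diagnostic in this spirit checks whether fine-tuning moved a weight matrix mainly by rotating its singular vectors (principal angles between top-k singular subspaces) rather than rescaling its spectrum; the function below is an illustrative sketch, not the cited analysis code.

```python
# Compare a weight matrix before and after fine-tuning: mean principal angle
# between top-k left singular subspaces ("rotation") vs. relative change in the
# top-k singular values ("scale").
import torch

def svd_drift(W_before: torch.Tensor, W_after: torch.Tensor, k: int = 8):
    U0, S0, _ = torch.linalg.svd(W_before, full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W_after, full_matrices=False)
    # Singular values of U0[:, :k]^T U1[:, :k] are the principal-angle cosines.
    cos_left = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k])
    rotation = torch.arccos(cos_left.clamp(-1.0, 1.0)).mean()        # radians
    scale_shift = ((S1[:k] - S0[:k]).abs() / (S0[:k] + 1e-12)).mean()
    return rotation.item(), scale_shift.item()

# Toy usage: a pure rotation keeps the spectrum but moves the singular vectors.
W = torch.randn(64, 64)
Q, _ = torch.linalg.qr(torch.randn(64, 64))
rot, scale = svd_drift(W, Q @ W)
print(f"mean principal angle: {rot:.3f} rad, relative spectrum shift: {scale:.3f}")
```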
A consistent theme is that SFT provides model coverage and stabilizes representations, while RL prunes or refines them, often concentrating reasoning capacity into fewer, more reliable modes.
5. Use Cases and Task-Specific Insights
The cooperative paradigm extends beyond classic text reasoning:
- Sim-Real Vision-Language-Action Training: RLinf-Co (Shi et al., 13 Feb 2026) applies SFT as a warm start on mixed sim+real data, then regularizes RL policy updates in simulation with a persistent real-data supervised loss (a schematic of this co-training loss follows this list). The approach yields large real-world success gains (+24% absolute on OpenVLA), stronger sim-to-real generalization, and an order-of-magnitude improvement in data efficiency.
- Instruction Following and Classification: RLSR (Wang et al., 16 Oct 2025) replaces SFT with RL on the SFT dataset using a sentence embedding-based reward function, or combines the two in sequence. SFT+RLSR substantially exceeds pure SFT on AlpacaEval (30.73% vs 21.01%) and boosts open-ended writing metrics.
- Safety Alignment: Direct SFT on safety-oriented CoT data can degrade reasoning depth and generalization; RL from safety reward models yields stronger and more consistent gains while preserving reflective entropy where appropriate (Jia et al., 1 Dec 2025).
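The schematic below illustrates the co-training pattern described for RLinf-Co, i.e., every simulated policy-gradient step also carries a supervised term on real demonstrations; the toy policy network, placeholder rollout, and regularization weight are assumptions for illustration.

```python
# Anchor simulated RL updates with a persistent supervised loss on real demos:
# total loss = policy-gradient loss (sim) + lambda * cross-entropy (real data).
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(12, 64), nn.Tanh(), nn.Linear(64, 6))  # obs -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
LAMBDA_REAL = 0.5   # weight of the real-data supervised regularizer (assumed)

def sim_rollout(batch: int = 32):
    """Stand-in for a simulator rollout: observations, sampled actions, returns."""
    obs = torch.randn(batch, 12)
    actions = torch.distributions.Categorical(logits=policy(obs)).sample()
    returns = torch.randn(batch)             # placeholder for episode returns
    return obs, actions, returns

real_obs = torch.randn(256, 12)              # stand-in for real-robot demonstrations
real_actions = torch.randint(0, 6, (256,))

for step in range(1000):
    # REINFORCE-style term on simulated experience (baseline-subtracted).
    obs, actions, returns = sim_rollout()
    log_probs = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
    rl_loss = -((returns - returns.mean()) * log_probs).mean()

    # Persistent supervised term on real demonstrations, applied at every RL step.
    idx = torch.randint(0, real_obs.shape[0], (32,))
    sft_loss = F.cross_entropy(policy(real_obs[idx]), real_actions[idx])

    optimizer.zero_grad()
    (rl_loss + LAMBDA_REAL * sft_loss).backward()
    optimizer.step()
```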
The table below organizes representative findings.
| Method | Domain/Task | Key Result |
|---|---|---|
| PRISM (Zhao et al., 12 Jan 2026) | WebShop/ALFWorld | +7–10 points SR, >1.7–3.2× faster |
| BRIDGE (Chen et al., 8 Sep 2025) | Reasoning LLMs | +3.8–12 points avg, faster conv. |
| SASR (Chen et al., 19 May 2025) | Math/Logic LLMs | Outperforms static/hybrid SFT+RL |
| SuperRL (Liu et al., 1 Jun 2025) | Math, Table QA | +6–16 points over SFT, generalizes |
| RLinf-Co (Shi et al., 13 Feb 2026) | VLA, Robotics | +24% RL success, >10× data eff. |
| RLSR (Wang et al., 16 Oct 2025) | Instruction Tune | +5–10 points AlpacaEval win-rate |
6. Limitations, Challenges, and Prospective Advances
While cooperative SFT and RL frameworks represent a significant advance, several challenges remain:
- Gradient Interference and Catastrophic Forgetting: Additive or naive mixing of static SFT and RL signals often leads to destructive interference or suboptimal exploration, particularly when data-model mismatch or over-specialized CoT traces inhibit RL updates (Chen et al., 10 Jul 2025, Zhang et al., 1 Feb 2026).
- Difficulty-Aware Scheduling: Static curricula are outperformed by adaptive instance-wise routing (as in PRISM and SuperRL). A promising direction is the development of more granular online, difficulty-driven schedulers (Chen et al., 10 Jul 2025).
- Scaling and Generalization: Extensions of gradient-based data arbitration and bilevel optimization to 70B+ parameter agents, as well as domains such as tool use and long-horizon planning, are open areas of inquiry (Zhao et al., 12 Jan 2026).
- Efficient Data Utilization: Methods that minimize expert trace usage or exploit partial self-distillation, such as BREAD, are important for SLMs or low-resource regimes (Zhang et al., 20 Jun 2025).
A plausible implication is that future agent alignment pipelines will adopt not just global stagewise cooperation but also per-instance or per-gradient adaptive arbitration between imitation and exploration signals, leveraging algorithmic advances in meta-optimization and internal conflict detection.
7. Best-Practice Recipes and Guidelines
Derived from empirical convergence and ablation studies, the following practices are recommended:
- Warm-start from SFT, but calibrate entropy: Avoid SFT over-specialization or entropy collapse; stop at OOD performance peak (Jin et al., 8 Sep 2025, Matsutani et al., 25 Sep 2025).
- Apply block- or token-level SFT loss reweighting (PEAR): Reweight tokens by their occupancy under the target policy during SFT to reduce the behavior/target-policy mismatch (Zhang et al., 1 Feb 2026); a minimal sketch follows this list.
- Couple RL with instance-wise SFT fallback or auxiliary supervision: E.g., SuperRL’s reward-gated switching, RLinf-Co’s real-data regularizer, or Metis-RISE’s RL→SFT sequence (Liu et al., 1 Jun 2025, Shi et al., 13 Feb 2026, Qiu et al., 16 Jun 2025).
- Exploit metric-driven data arbitration and cooperative meta-gradients: PRISM’s gradient concentration scoring and BRIDGE’s meta-optimization are notably robust (Zhao et al., 12 Jan 2026, Chen et al., 8 Sep 2025).
- Monitor OOD and reasoning metrics throughout training: Early termination, adaptive scheduling, and entropy regularization should be guided by these curves.
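As one plausible instantiation of the occupancy-based reweighting recipe above, the sketch below scales each token's SFT loss by the current (target) policy's own detached probability of that token; the exact PEAR weighting scheme may differ.

```python
# Token-level SFT loss reweighting by the target policy's own probabilities,
# pulling the supervised gradient toward tokens the RL policy actually visits.
import torch
import torch.nn.functional as F

def reweighted_sft_loss(logits: torch.Tensor, targets: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """
    logits:  (batch, seq_len, vocab) from the policy being trained
    targets: (batch, seq_len) demonstration token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = (tok_logp.detach() / temperature).exp()                   # occupancy proxy
    weights = weights / (weights.mean() + 1e-12)                        # normalize scale
    return -(weights * tok_logp).mean()

# Toy usage with random tensors standing in for model outputs and demo tokens.
logits = torch.randn(2, 16, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 16))
loss = reweighted_sft_loss(logits, targets)
loss.backward()
```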
In sum, the contemporary state-of-the-art supports cooperative SFT and RL as a unifying and effective framework for alignment, generalization, and safety in large-scale language and multimodal agent models. The field now pursues both theoretical analysis of cooperation (bilevel and IRL formulations) and practical algorithms for adaptive routing, meta-gradients, and reward shaping, driving sustained empirical advances across diverse reasoning and interaction domains.