Thinking-Fusion: SFT-RL Hybrid Paradigm
- The Thinking-Fusion SFT-RL Hybrid Paradigm is a family of training strategies that combine supervised fine-tuning (SFT) and reinforcement learning (RL) for enhanced reasoning in language and vision-language models.
- It employs dynamic balancing, meta-gradient adaptation, and reward rectification to address issues like catastrophic forgetting and to improve exploration and generalization.
- Empirical results demonstrate significant accuracy improvements and increased sample efficiency, outperforming traditional sequential SFT and RL pipelines on complex tasks.
The Thinking-Fusion SFT-RL Hybrid Paradigm denotes a class of training strategies and theoretical frameworks that tightly integrate supervised fine-tuning (SFT) and reinforcement learning (RL) to endow models—especially LLMs and vision-LLMs—with advanced, multi-turn reasoning and decision-making abilities. Unlike rigid two-stage procedures, recent research establishes that seamless, dynamic, or cooperative interactions between SFT and RL can mediate the trade-off between memorization, generalization, and efficient exploration, thereby unlocking enhanced reasoning skills.
1. Principal Problem and Motivation
The development of LLMs and vision-LLMs for complex reasoning tasks demands policies that retain beneficial prior knowledge (from pretraining or behavioral cloning), can plan over long horizons, and generalize effectively to out-of-distribution (OOD) scenarios. Standard SFT excels at imitation and format alignment but often leads to over-specialization and OOD forgetting, while RL provides explicit credit assignment (via environmental rewards) but is prone to sample inefficiency, mode collapse, and difficulties in sparse-reward settings (Liu et al., 1 Jun 2025, Zhang et al., 20 Jun 2025, Chen et al., 10 Jul 2025). The traditional sequential pipeline—SFT as warm-up, then RL—has been observed to cause catastrophic forgetting, pseudo-reasoning path lock-in, or suboptimal transfer (Chen et al., 10 Apr 2025, Chen et al., 10 Jul 2025). The Thinking-Fusion paradigm aims to fuse both paradigms, leveraging theoretical dualities and practical algorithmic innovations for robust and efficient reasoning.
2. Theoretical Unification of SFT and RL
Several recent works reinterpret SFT and RL as complementary forms of reward optimization or policy regularization. For example, SFT is mathematically framed as a policy gradient with an indicator reward and implicit importance weighting, while RL explicitly optimizes episodic or token-level reward (Wu et al., 7 Aug 2025, He et al., 9 Aug 2025). The gradient of the SFT loss is

$$\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y^*)\sim\mathcal{D}}\big[\nabla_\theta \log \pi_\theta(y^* \mid x)\big],$$

but rewritten as an expectation under the model’s own distribution this becomes

$$\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[\frac{\mathbb{1}[y=y^*]}{\pi_\theta(y\mid x)}\,\nabla_\theta \log \pi_\theta(y\mid x)\right],$$

revealing an implicit inverse-probability weighting $1/\pi_\theta(y\mid x)$ that induces high-variance updates and generalization degradation (Wu et al., 7 Aug 2025). Dynamic Fine-Tuning (DFT) corrects this by reweighting each token’s contribution by its (stop-gradient) probability, unifying the update with the policy-gradient form and greatly improving generalization.
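As a concrete illustration, here is a minimal sketch of a DFT-style token reweighting in Python/PyTorch, assuming token-level logits and gold token ids as inputs; the function and variable names are illustrative rather than taken from the DFT reference implementation:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """DFT-style rectified SFT loss (sketch).

    Viewed as a policy gradient, standard token-level SFT carries an implicit
    1/pi_theta(y_t) weight. Multiplying each token's negative log-likelihood
    by its detached probability cancels that weight, giving lower-variance,
    policy-gradient-consistent updates.

    logits: [batch, seq, vocab]; targets: [batch, seq] gold token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log pi(y_t)
    tok_prob = tok_logp.detach().exp()  # stop-gradient pi(y_t): the reweighting factor
    return -(tok_prob * tok_logp).mean()
```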
Furthermore, hybrid objectives have been formalized as meta-optimization problems or dynamic convex combinations. The AMFT algorithm, for instance, posits the overall loss as

$$\mathcal{L}_{\mathrm{AMFT}}(\theta) = \lambda\,\mathcal{L}_{\mathrm{SFT}}(\theta) + (1-\lambda)\,\mathcal{L}_{\mathrm{RL}}(\theta),$$

with $\lambda$ meta-learned by gradient-based adaptation to best balance imitation (stability, path-level reward) against exploration (task reward), regulated through entropy and validation utility (He et al., 9 Aug 2025).
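Below is a simplified sketch of such a dynamic convex combination. The coefficient is the sigmoid of a scalar nudged by a policy-entropy signal, a crude stand-in for AMFT's meta-gradient adaptation on validation utility; the update heuristic, its direction, and all names are assumptions rather than the published algorithm.

```python
import torch

class AdaptiveMixer:
    """Convex SFT/RL mixing with an adaptively tuned coefficient (sketch).

    lambda = sigmoid(logit) weights the SFT term; (1 - lambda) weights the RL
    term. The logit is nudged by a policy-entropy signal: when entropy drops
    below a target, weight shifts toward imitation to re-anchor the policy,
    and vice versa. This heuristic is an illustrative assumption.
    """

    def __init__(self, init_logit: float = 0.0, step: float = 0.05,
                 target_entropy: float = 1.0):
        self.logit = init_logit
        self.step = step
        self.target_entropy = target_entropy

    @property
    def lam(self) -> float:
        return float(torch.sigmoid(torch.tensor(self.logit)))

    def update(self, policy_entropy: float) -> None:
        # Entropy below target -> push lambda up (more SFT); above -> push down (more RL).
        self.logit += self.step * (self.target_entropy - policy_entropy)

    def combined_loss(self, loss_sft: torch.Tensor, loss_rl: torch.Tensor) -> torch.Tensor:
        lam = self.lam  # treated as a constant w.r.t. theta at each step
        return lam * loss_sft + (1.0 - lam) * loss_rl
```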
3. Methodological Realizations
Thinking-Fusion manifests in several algorithmic and practical forms, broadly categorized as:
| Approach | Core Mechanism | Representative Papers |
|---|---|---|
| Weighted or Dynamic Loss | Adaptive, instance-based, or meta-learned weight balancing SFT and RL objectives | (He et al., 9 Aug 2025, Fu et al., 24 Jun 2025, Liu et al., 1 Jun 2025) |
| Cooperative/Bilevel | Conditioning SFT on the RL-optimal solution for cooperative gain via bilevel optimization | (Chen et al., 8 Sep 2025) |
| Single-Stage/Entropy-Aware | Joint SFT+RL updates with entropy- or uncertainty-based weighting of per-loss contributions | (Fu et al., 24 Jun 2025) |
| Reward Rectification | Gradient reweighting (DFT) to harmonize SFT with RL’s policy update structure | (Wu et al., 7 Aug 2025) |
| Progressive/Hybrid Curricula | Warmup SFT, then RL, possibly with staged/interleaved progression or curriculum over tasks | (Chen et al., 19 May 2025, Yoshihara et al., 11 Jul 2025) |
The SRFT method, for example, unifies SFT and RL in a single-stage loss of the form

$$\mathcal{L}_{\mathrm{SRFT}}(\theta) = w_{\mathrm{SFT}}\,\mathcal{L}_{\mathrm{SFT}}(\theta) + w_{\mathrm{RL}}\,\mathcal{L}_{\mathrm{RL}}(\theta),$$

with entropy-aware coefficients $w_{\mathrm{SFT}}$ and $w_{\mathrm{RL}}$ modulating the influence of each component (Fu et al., 24 Jun 2025). The SASR framework instead uses the gradient norm of the SFT loss as a dynamic proxy for proximity to the demonstration distribution, probabilistically mixing SFT and RL updates at every step (Chen et al., 19 May 2025).
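A minimal sketch of SASR-style probabilistic mixing is given below, assuming the per-step SFT gradient norm and a warmup-phase reference norm are available; the mapping from gradient norms to a mixing probability is an illustrative choice, not the published schedule.

```python
import random

def choose_update(sft_grad_norm: float, ref_grad_norm: float) -> str:
    """SASR-style probabilistic step selection (sketch).

    The current SFT gradient norm is compared against a reference norm
    (e.g., a running average from the warmup phase) as a proxy for distance
    from the demonstration distribution: relatively large norms make an SFT
    step more likely; otherwise an RL step (e.g., GRPO/PPO) is taken.
    """
    p_sft = sft_grad_norm / (sft_grad_norm + ref_grad_norm + 1e-8)
    return "sft" if random.random() < p_sft else "rl"

# Usage inside a training loop (hypothetical helpers):
#   g = grad_norm(loss_sft)                      # current-batch SFT gradient norm
#   mode = choose_update(g, warmup_avg_norm)
#   loss = loss_sft if mode == "sft" else loss_rl
#   loss.backward(); optimizer.step()
```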
4. Empirical Outcomes and Performance Characteristics
Extensive empirical validation across domains demonstrates the utility of Thinking-Fusion strategies:
- Accuracy and sample efficiency: Hybrid approaches such as SuperRL and SASR yield higher test accuracy and faster convergence than either pure SFT or pure RL on benchmarks requiring stepwise reasoning, e.g., accuracy improvements of up to 20 percentage points in sparse-reward settings (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025).
- Generalization: Dynamic adaptation and meta-learned balancing mitigate OOD forgetting and reinforce transfer, with AMFT and DFT models reporting high OOD accuracy even as SFT alone over-specializes (Wu et al., 7 Aug 2025, He et al., 9 Aug 2025).
- Model-specific gains: In small-parameter or capacity-limited models, dynamic interleaving (e.g., DyME, BREAD) overcomes mode collapse and pseudo-reasoning artifacts, enabling reliable stepwise inference that would otherwise be unattainable for such models (Liu et al., 29 Jun 2025, Zhang et al., 20 Jun 2025).
- Response format and efficiency: Two-stage SFT-RL pipelines in math LLMs (e.g., Yoshihara et al., 11 Jul 2025) produce state-of-the-art accuracy, with the subsequent RL stage optimizing brevity and avoiding the verbosity characteristic of extended SFT.
Salient limitations are also observed: In vision-LLMs, naive combinations of SFT and RL often induce irreducible trade-offs between depth (SFT-induced detailed reasoning) and brevity (RL-induced conciseness), with “synergy dilemmas” observed in additive fusion attempts (Chen et al., 10 Jul 2025, Chen et al., 10 Apr 2025).
5. Mechanisms of Synergy, Forgetting, and Recovery
The synergy in Thinking-Fusion paradigms arises not from simple sequential stacking but from carefully coordinated or adaptively balanced objectives. Several mechanisms underpinning this synergy have been identified:
- SFT “hard-aligns” model subspaces to demonstration domains but eventually induces OOD forgetting via excessive rotation of parameter singular vectors.
- RL, when correctly timed, “restores” lost generalization by partially re-aligning these subspaces, serving as a parameter space regularizer (Jin et al., 8 Sep 2025).
- Meta-adaptation or dynamic balancing identifies stop points and optimal switching policies, addressing the failure modes of catastrophic forgetting and locked reasoning paths.
For example, SVD-based analysis shows that OOD restoration after SFT is tightly linked to the rotation of singular vectors rather than the magnitude of singular values, with the RL step effectively acting as a soft correction in this geometrically meaningful sense (Jin et al., 8 Sep 2025).
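One way to operationalize this diagnostic is sketched below: principal angles between the top-k left singular subspaces of a weight matrix before and after fine-tuning, reported alongside relative singular-value drift. The choice of k and the overlap metric are assumptions, not the cited paper's exact protocol.

```python
import torch

def subspace_rotation(w_before: torch.Tensor, w_after: torch.Tensor, k: int = 16):
    """Principal angles between top-k left singular subspaces (sketch).

    Small angles mean the dominant directions of the weight matrix were
    preserved during fine-tuning; large angles indicate the singular-vector
    rotation associated with OOD forgetting. Relative singular-value drift is
    returned for comparison, since the analyses tie forgetting to rotation
    rather than magnitude change.
    """
    u0, s0, _ = torch.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = torch.linalg.svd(w_after, full_matrices=False)
    overlap = u0[:, :k].T @ u1[:, :k]                      # [k, k] basis overlap
    cos_angles = torch.linalg.svdvals(overlap).clamp(max=1.0)
    angles = torch.arccos(cos_angles)                      # principal angles (radians)
    sv_drift = (s1[:k] - s0[:k]).abs() / s0[:k].clamp(min=1e-8)
    return angles, sv_drift
```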
6. Design Principles and Implementation Guidance
Best practices and design implications established by this body of work include:
- Dynamic, rather than static, scheduling of SFT and RL guidance—based on uncertainty, task difficulty, gradient-based proxies, or explicit meta-optimization—is critical for robust performance.
- Entropy-aware or meta-gradient controllers provide principled, data-driven mechanisms for adjusting the imitation-exploration trade-off during training (He et al., 9 Aug 2025, Fu et al., 24 Jun 2025).
- Reward rectification, self-distillation, and branched rollouts with expert anchors allow hybridization to succeed even in regimes where SFT or RL alone fails due to limited data, sparse rewards, or small model capacity (Zhang et al., 20 Jun 2025).
- Instance-level or token-level adaptation further tightens the feedback loop, leading to higher sample efficiency and finer control of model policy shifts (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025).
The table below summarizes the dominant mechanisms:
| Mechanism | SFT Benefit | RL Benefit | Fusion Benefit |
|---|---|---|---|
| Curriculum/progression | Format/structure learning | Exploration, rationality | SFT guides, RL refines |
| Meta-gradient balancing | Stability, avoids forgetting | Adaptable, pushes OOD generalization | Maximal utility achieved |
| Reward rectification | Low-variance updates | Consistent reward propagation | Improved generalization |
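To make the curriculum/progression mechanism concrete, the sketch below schedules the SFT weight of a convex SFT/RL loss from warmup through interleaved annealing to RL-dominated training; the phase boundaries and weights are illustrative defaults, not values from any cited paper.

```python
def sft_weight_schedule(step: int, warmup_steps: int = 1000,
                        interleave_steps: int = 4000) -> float:
    """SFT weight in [0, 1] for a convex SFT/RL loss combination (sketch).

    Phase 1 (warmup): pure SFT to learn response format and structure.
    Phase 2 (interleave): anneal linearly toward RL so exploration begins
    before imitation is frozen in.
    Phase 3: RL-dominated, with a small SFT floor kept as a regularizer
    against catastrophic forgetting.
    """
    if step < warmup_steps:
        return 1.0
    if step < warmup_steps + interleave_steps:
        frac = (step - warmup_steps) / interleave_steps
        return 1.0 - 0.9 * frac  # anneal from 1.0 down to 0.1
    return 0.1

# loss = w * loss_sft + (1.0 - w) * loss_rl, with w = sft_weight_schedule(step)
```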
7. Implications, Open Problems, and Future Directions
Thinking-Fusion SFT-RL Hybrid Paradigms are foundational for current and future advances in reasoning-capable AI. Notable implications and frontiers include:
- Multi-modal and open-domain applications: Iterative or cyclic combinations of RL and SFT (as in Metis-RISE) are effective in domains extending beyond math/code to visual and interactive environments.
- Task curriculum and sampling: Hybrid reward systems (deterministic for structured tasks, preference-based for generative/subjective tasks), combined with curriculum design (e.g., Omni-Thinker; Li et al., 20 Jul 2025), yield improved generalization and robustness to forgetting.
- Representation geometry as a control and diagnostic tool: Monitoring and optimizing the rotation of singular vectors in parameter space may serve as a generalized regularization or early stopping signal, particularly for OOD robustness (Jin et al., 8 Sep 2025).
- Adaptive fusion at inference: Pursuit of architectures or controllers that can select, at inference or per-instance, the desired blend of SFT-derived structure versus RL-derived brevity and adaptability (Chen et al., 10 Jul 2025).
Significant open issues remain—in particular, resolving the “synergy dilemma” of reasoning style and response trade-offs in high-dimensional or multi-modal settings, understanding the ultimate limits of cooperative or meta-learned controllers, and extending dynamic fusion to complex decision domains beyond existing benchmarks. Techniques such as self-distillation, model-driven verifier construction, and finer-grained curriculum adaptation represent promising avenues for next-generation hybrid paradigms.
In summary, the Thinking-Fusion SFT-RL Hybrid Paradigm constitutes a rigorously justified, empirically validated strategy for reasoning model training. By unifying, dynamically balancing, or tightly coupling demonstration learning and reward-driven exploration, it achieves superior trade-offs in accuracy, generalization, sample efficiency, and robustness, underpinning the transition from static imitation systems to reliably “thinking” AI agents across language, vision, and interactive domains.