SFT-RL Multi-domain Training Method
- The paper introduces an SFT-RL method that unifies supervised fine-tuning and reinforcement learning to effectively address multi-domain challenges such as reward sparsity and domain imbalance.
- Dynamic switching and adaptive mixing strategies, including reward-gated and meta-learned controls, are used to balance domain-specific data and mitigate issues like catastrophic forgetting.
- Empirical results demonstrate significant improvements in sample efficiency, robustness, and cross-domain generalization compared to static SFT or pure RL approaches.
A Supervised Fine-Tuning plus Reinforcement Learning (SFT-RL) multi-domain training method refers to any systematic framework that integrates supervised learning from curated data with reward-driven reinforcement learning to optimize large models across diverse, often heterogeneous domains. This paradigm leverages the stability and data efficiency of supervised fine-tuning while exploiting the exploration and generalization capacity of reinforcement learning. Modern SFT-RL frameworks address a range of challenges: reward sparsity, catastrophic forgetting, domain imbalance, transfer efficiency, and curriculum design. This article surveys the key algorithmic foundations, design heuristics, theoretical analyses, and empirical results that define the state of the art in SFT-RL multi-domain training.
1. Core Objectives and Algorithmic Foundations
The SFT-RL training objective is to optimize a model (typically a large language or multimodal foundation model) to perform well across a union of domains $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_K$, each with its own data distribution, reward scheme, and evaluation metric. Formally, the multi-domain loss comprises two principal components:
- SFT loss: $\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log \pi_\theta(y \mid x)\big]$, corresponding to next-token prediction on gold demonstrations.
- RL loss: $\mathcal{L}_{\text{RL}}(\theta) = -\,\mathbb{E}_{\tau\sim\pi_\theta}\big[R(\tau)\big]$, corresponding to maximizing the expected trajectory reward in an interaction-based environment.
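The two components above can be combined into a single per-domain training signal. The following is a minimal pure-Python sketch under simplifying assumptions: log-probabilities and rewards are given as plain lists, the RL surrogate is a REINFORCE-style estimate, and the fixed mixing weight `lam`, the `weight` field, and all function names are illustrative rather than taken from any of the cited frameworks.

```python
def sft_loss(token_logprobs):
    """Negative mean log-likelihood over gold demonstration tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def rl_loss(rollout_logprobs, rewards, baseline=0.0):
    """REINFORCE-style surrogate: -(1/N) * sum_i (R_i - b) * log pi(tau_i)."""
    n = len(rewards)
    return -sum((r - baseline) * lp
                for lp, r in zip(rollout_logprobs, rewards)) / n

def multi_domain_loss(domains, lam=0.5):
    """Weighted sum of per-domain losses; lam in [0, 1] mixes SFT vs. RL."""
    total = 0.0
    for d in domains:
        total += d["weight"] * (
            lam * sft_loss(d["demo_logprobs"])
            + (1.0 - lam) * rl_loss(d["rollout_logprobs"], d["rewards"])
        )
    return total
```

In practice the mixing weight is not fixed: the frameworks discussed below replace `lam` with a reward gate, a learned controller, or a gradient-norm schedule.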
To robustly exploit multi-domain data and task diversity, recent frameworks implement either dynamic switching between SFT and RL (SuperRL (Liu et al., 1 Jun 2025)), joint optimization with adaptive mixing (AMFT (He et al., 9 Aug 2025), SASR (Chen et al., 19 May 2025)), or bilevel cooperative optimization (BRIDGE (Chen et al., 8 Sep 2025)). A range of curriculum and domain-balancing strategies further support cross-domain generalization and sample efficiency.
2. Instantiations and Switching Policies
Approaches to SFT-RL integration vary along the axis of switching policy and mixture rule:
- Reward-gated switching: For each instance $x_i$, SuperRL (Liu et al., 1 Jun 2025) samples a group of on-policy rollouts $y_{i,1}, \dots, y_{i,k}$. If any rollout obtains nonzero reward, a policy-gradient RL update (e.g., PPO, GRPO) is applied; otherwise, the model is updated by SFT on high-quality demonstration traces for $x_i$. This reward gate operates on an instance level: $\mathcal{L}(x_i) = g_i\,\mathcal{L}_{\text{RL}}(x_i) + (1 - g_i)\,\mathcal{L}_{\text{SFT}}(x_i)$, where $g_i = \mathbb{1}\big[\max_j R(y_{i,j}) > 0\big]$.
- Meta-learned mixture: AMFT (He et al., 9 Aug 2025) unifies SFT and RL in a single objective $\mathcal{L}(\theta) = \lambda\,\mathcal{L}_{\text{SFT}}(\theta) + (1-\lambda)\,\mathcal{L}_{\text{RL}}(\theta)$, where $\lambda \in [0,1]$ is a learnable mixing weight updated by a meta-gradient controller to maximize long-term utility, regularized for entropy stability.
- Gradient-criterion adaptation: SASR (Chen et al., 19 May 2025) sets the SFT vs. RL schedule by monitoring the supervised gradient norm $\|\nabla_\theta \mathcal{L}_{\text{SFT}}\|$, adjusting the stepwise probability of selecting SFT in proportion to the relative magnitude of this norm (schematically, $p_t \propto \|\nabla_\theta \mathcal{L}_{\text{SFT}}\| \,/\, (\|\nabla_\theta \mathcal{L}_{\text{SFT}}\| + \|\nabla_\theta \mathcal{L}_{\text{RL}}\|)$).
- Bilevel cooperative optimization: BRIDGE (Chen et al., 8 Sep 2025) leverages a bilevel formulation, with the upper level maximizing SFT reward contingent on the lower-level RL-optimized policy. It tracks cooperative gain to ensure SFT continues to benefit RL rather than being forgotten.
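The first and third switching policies above reduce to small decision rules. The sketch below is illustrative only: the function names are invented, the SuperRL gate is simplified to "any nonzero reward", and the SASR schedule is approximated by a normalized gradient-norm ratio rather than the paper's exact criterion.

```python
def reward_gated_choice(rollout_rewards):
    """SuperRL-style instance-level gate (sketch): take an RL step if any
    on-policy rollout earned nonzero reward, otherwise fall back to SFT
    on demonstration traces for this instance."""
    return "RL" if any(r != 0 for r in rollout_rewards) else "SFT"

def sft_probability(sft_grad_norm, rl_grad_norm, eps=1e-8):
    """SASR-style adaptive mix (schematic): the probability of taking an
    SFT step grows with the relative supervised gradient norm."""
    return sft_grad_norm / (sft_grad_norm + rl_grad_norm + eps)
```

The design intuition in both cases is the same: when the policy cannot yet earn reward (all-zero rollouts, or a dominant supervised gradient), demonstrations carry the learning signal; once reward becomes reachable, policy-gradient updates take over.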
3. Multi-Domain Data, Curriculum, and Reward Design
Modern SFT-RL frameworks explicitly target heterogeneity in data source, task type, and evaluation. Common dimensions:
- Domain Sampling and Mixing: SuperRL (Liu et al., 1 Jun 2025) alternates update steps by sampling domains uniformly or following a curriculum. Curriculum-based or complexity-weighted mixing (e.g., (Liu et al., 26 Jan 2026, Li et al., 20 Jul 2025)) increases data coverage and helps mitigate data skew.
- Task-Aligned Reward Schemes: Cross-domain optimization requires reward normalization or hybridization, especially in settings involving both verifiable correctness and preference signals (e.g., code and creative writing). Omni-Thinker (Li et al., 20 Jul 2025) decomposes reward into verifiable and generative preference-based components per domain.
- Stage-wise and Curriculum RL: Curriculum learning sequences domains/tasks progressively from least to most forgettable, thereby minimizing catastrophic forgetting and maximizing transfer (see (Li et al., 20 Jul 2025, Liu et al., 16 Jun 2025)).
- Demonstration Budget and Reward Sparsity: SFT budgets and per-domain batch sizes are tuned to stabilize learning in data-scarce or ultra-sparse reward settings (Liu et al., 1 Jun 2025).
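Two of these dimensions, curriculum ordering and cross-domain reward alignment, can be sketched concretely. The snippet below is a toy illustration, not any cited framework's implementation: the per-domain `forgettability` scalar is an assumed measurement (e.g., accuracy drop after subsequent training), and plain standardization stands in for the hybrid reward schemes described above.

```python
def curriculum_order(domains):
    """Forgetting-aware curriculum (sketch): sequence training phases
    from the least to the most forgettable domain, so fragile skills
    are acquired last and overwritten least."""
    return sorted(domains, key=lambda d: d["forgettability"])

def normalize_rewards(rewards):
    """Per-domain reward standardization, so verifiable (e.g., code) and
    preference-based (e.g., writing) reward signals share one scale."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]  # degenerate: all rewards equal
    return [(r - mean) / std for r in rewards]
```

Without per-domain normalization, a domain whose verifier emits rewards in {0, 1} and a domain whose preference model emits scores in [0, 10] would contribute wildly different gradient magnitudes to a joint update.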
4. Optimization Procedures and Theoretical Properties
The optimization landscape of SFT-RL multi-domain training features both practical techniques and theoretical insights:
- Policy Gradient Algorithms: KL-regularized policy gradient methods (PPO, GRPO) dominate, with group-relative advantages and clipping for stability (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025, Li et al., 20 Jul 2025).
- Orthogonality of SFT and RL Updates: Empirical findings indicate that SFT and RL induce nearly orthogonal parameter updates, motivating modular skill transfer and layer-wise skill injection (PaST, (Tang et al., 16 Jan 2026)).
- Robustness and Generalization: Multi-domain SFT-RL reduces variance in policy entropy, improves robustness to reward sparsity, and generalizes better to OOD tasks vs. either SFT or RL alone. Explicit balancing controllers (AMFT (He et al., 9 Aug 2025), SASR (Chen et al., 19 May 2025)) help maintain stability across long training horizons.
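The group-relative advantage and clipping mentioned above are the core of GRPO/PPO-style updates. A minimal sketch, assuming per-rollout scalar rewards and a single importance ratio per step (function names illustrative):

```python
def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages (sketch): each rollout's reward minus the
    group mean, scaled by the group standard deviation. No learned
    value function is needed; the group itself is the baseline."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO clipped surrogate for one step: the pessimistic minimum of
    the unclipped and ratio-clipped objectives, which bounds how far a
    single update can move the policy."""
    clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return min(ratio * advantage, clipped_ratio * advantage)
```

The group baseline explains why reward sparsity is so punishing for pure RL: if every rollout in a group scores zero, all advantages vanish and the update carries no signal, which is exactly the case the reward-gated SFT fallback of Section 2 is designed to catch.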
5. Empirical Results, Ablation Studies, and Scaling Laws
SFT-RL multi-domain methods empirically outperform both static SFT and pure RL on a wide spectrum of benchmarks and diagnostic splits:
- Generalization and Sample Efficiency: On sparse-reward reasoning (OpenR1, PRM12K), SuperRL (Liu et al., 1 Jun 2025) delivers up to +30 percentage points over pure RL and outperforms vanilla RL in 24/35 multi-domain train–test settings.
- Catastrophic Forgetting Mitigation: Bilevel optimization (BRIDGE (Chen et al., 8 Sep 2025)) improves accuracy by 11.8% over cold-start RL while training 44% faster, and yields stronger OOD robustness (avg. 38.7% vs. 24.8% for cold-start RL).
- Curriculum and Hybrid-Reward Effectiveness: Curriculum-driven RL in Omni-Thinker (Li et al., 20 Jul 2025) yields +5.2% over joint training and +9.1% over model merging across math, code, QA, and writing.
- Scaling and Transferability: Modular skill transfer (PaST (Tang et al., 16 Jan 2026)) enables zero-shot transfer of procedural vectorized RL skills to unseen domains, with up to +10.3 point success-rate gains on tool-use benchmarks and persistent improvement on long-context QA.
- Ablative Insights: Across frameworks, removing meta-controllers (AMFT), dynamic mixing (SASR), or reward gating (SuperRL) degrades OOD accuracy and increases training instability.
6. Limitations and Future Research Directions
Despite demonstrated success, SFT-RL multi-domain methods exhibit open challenges:
- Data Curation and Annotation: Approaches such as BRIDGE (Chen et al., 8 Sep 2025) and SuperRL (Liu et al., 1 Jun 2025) require curated demonstration datasets. Extending these algorithms to settings with limited or noisy supervision, or fully zero-shot regimes, remains a challenge.
- Hyperparameter Sensitivity and Controller Design: While dynamic controllers improve robustness, sensitivity to entropy targets, switching frequency, and mixing schedules still affects practical deployment (see AMFT (He et al., 9 Aug 2025)).
- Domain Adaptation and Transfer: Efficient adaptation to new domains without access to on-policy RL or task-specific rewards is addressed by modular skill transfer (PaST (Tang et al., 16 Jan 2026)), but further research is warranted for continual learning scenarios.
- Extension to Multi-modal and Open-ended Tasks: Most published studies target reasoning and code, with some progress in vision-language (V-IRL, General Points). Extending SFT-RL to multi-modal or non-reasoning tasks remains an active area of research.
7. SFT-RL Method Variants and Representative Implementations
Below is a comparative summary of influential SFT-RL multi-domain frameworks:
| Method | Mix Policy | Multi-domain Handling | Key Empirical Finding |
|---|---|---|---|
| SuperRL (Liu et al., 1 Jun 2025) | Instance-level reward-gate | Domain sampling, per-domain SFT budget | +30pp on sparse tasks; stable cross-domain gen. |
| BRIDGE (Chen et al., 8 Sep 2025) | Bilevel, cooperative gain | Benchmark-aligned CoT & verifier adaptation | +11.8% over cold-start RL, improved OOD |
| SASR (Chen et al., 19 May 2025) | Gradient-norm adaptive mix | Adaptive stepwise task mix | Robust, outperforms static mixes across logic/math |
| AMFT (He et al., 9 Aug 2025) | Meta-gradient controller | Dynamic meta-learned mixture | SOTA on math, vision, OOD; stability, efficiency |
| Omni-Thinker (Li et al., 20 Jul 2025) | Hybrid reward + curriculum | Forgetting-aware, curriculum-based phases | +5.2% over joint, superior QA/creative transfer |
| PaST (Tang et al., 16 Jan 2026) | Modular skill vector | Skill transfer, domain-agnostic injection | Zero-shot tool QA/agent gains, high efficiency |
Each of these frameworks is distinguished by its approach to switching/mixing SFT and RL, its domain balancing strategy, and its empirical robustness across widely differing task and reward landscapes. These methods collectively define the modern standards and transferable lessons in SFT-RL multi-domain training research.