SFT-RL Multi-domain Training Method
- The paper introduces an SFT-RL method that unifies supervised fine-tuning and reinforcement learning to effectively address multi-domain challenges such as reward sparsity and domain imbalance.
- Dynamic switching and adaptive mixing strategies, including reward-gated and meta-learned controls, are used to balance domain-specific data and mitigate issues like catastrophic forgetting.
- Empirical results demonstrate significant improvements in sample efficiency, robustness, and cross-domain generalization compared to static SFT or pure RL approaches.
A Supervised Fine-Tuning plus Reinforcement Learning (SFT-RL) multi-domain training method refers to any systematic framework that integrates supervised learning from curated data with reward-driven reinforcement learning to optimize large models across diverse, often heterogeneous domains. This paradigm leverages the stability and data efficiency of supervised fine-tuning while exploiting the exploration and generalization capacity of reinforcement learning. Modern SFT-RL frameworks address a range of challenges: reward sparsity, catastrophic forgetting, domain imbalance, transfer efficiency, and curriculum design. This article surveys the key algorithmic foundations, design heuristics, theoretical analyses, and empirical results that define the state of the art in SFT-RL multi-domain training.
1. Core Objectives and Algorithmic Foundations
The SFT-RL training objective is to optimize a model (typically a large language or multimodal foundation model) to perform well across a union of domains $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_K$, each with its own data distribution, reward scheme, and evaluation metric. Formally, the multi-domain loss comprises two principal components:
- SFT loss: $\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log \pi_\theta(y \mid x)\big]$, corresponding to next-token prediction on gold demonstrations.
- RL loss: $\mathcal{L}_{\text{RL}}(\theta) = -\,\mathbb{E}_{\tau\sim\pi_\theta}\big[R(\tau)\big]$, corresponding to maximizing the expected trajectory reward in an interaction-based environment.
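The two components above can be combined into a single per-domain training signal. The following is a minimal pure-Python sketch under simplifying assumptions: log-probabilities and rewards are given as plain lists, the RL surrogate is a REINFORCE-style estimate, and the fixed mixing weight `lam`, the `weight` field, and all function names are illustrative rather than taken from any of the cited frameworks.

```python
def sft_loss(token_logprobs):
    """Negative mean log-likelihood over gold demonstration tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def rl_loss(rollout_logprobs, rewards, baseline=0.0):
    """REINFORCE-style surrogate: -(1/N) * sum_i (R_i - b) * log pi(tau_i)."""
    n = len(rewards)
    return -sum((r - baseline) * lp
                for lp, r in zip(rollout_logprobs, rewards)) / n

def multi_domain_loss(domains, lam=0.5):
    """Weighted sum of per-domain losses; lam in [0, 1] mixes SFT vs. RL."""
    total = 0.0
    for d in domains:
        total += d["weight"] * (
            lam * sft_loss(d["demo_logprobs"])
            + (1.0 - lam) * rl_loss(d["rollout_logprobs"], d["rewards"])
        )
    return total
```

In practice the mixing weight is not fixed: the frameworks discussed below replace `lam` with a reward gate, a learned controller, or a gradient-norm schedule.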
To robustly exploit multi-domain data and task diversity, recent frameworks implement either dynamic switching between SFT and RL (SuperRL (Liu et al., 1 Jun 2025)), joint optimization with adaptive mixing (AMFT (He et al., 9 Aug 2025), SASR (Chen et al., 19 May 2025)), or bilevel cooperative optimization (BRIDGE (Chen et al., 8 Sep 2025)). A range of curriculum and domain-balancing strategies further support cross-domain generalization and sample efficiency.
2. Instantiations and Switching Policies
Approaches to SFT-RL integration vary along the axis of switching policy and mixture rule:
- Reward-gated switching: For each instance $x_i$, SuperRL (Liu et al., 1 Jun 2025) samples a group of on-policy rollouts $y_{i,1}, \dots, y_{i,k}$. If any rollout obtains nonzero reward, a policy-gradient RL update (e.g., PPO, GRPO) is applied; otherwise, the model is updated by SFT on high-quality demonstration traces for $x_i$. This reward gate operates on an instance level: $\mathcal{L}(x_i) = g_i\,\mathcal{L}_{\text{RL}}(x_i) + (1 - g_i)\,\mathcal{L}_{\text{SFT}}(x_i)$, where $g_i = \mathbb{1}\big[\max_j R(y_{i,j}) > 0\big]$.
- Meta-learned mixture: AMFT (He et al., 9 Aug 2025) unifies SFT and RL in a single objective $\mathcal{L}(\theta) = \lambda\,\mathcal{L}_{\text{SFT}}(\theta) + (1-\lambda)\,\mathcal{L}_{\text{RL}}(\theta)$, where $\lambda \in [0,1]$ is a learnable mixing weight updated by a meta-gradient controller to maximize long-term utility, regularized for entropy stability.
- Gradient-criterion adaptation: SASR (Chen et al., 19 May 2025) sets the SFT vs. RL schedule by monitoring the supervised gradient norm $\|\nabla_\theta \mathcal{L}_{\text{SFT}}\|$, adjusting the stepwise probability of selecting SFT in proportion to the relative magnitude of this norm (schematically, $p_t \propto \|\nabla_\theta \mathcal{L}_{\text{SFT}}\| \,/\, (\|\nabla_\theta \mathcal{L}_{\text{SFT}}\| + \|\nabla_\theta \mathcal{L}_{\text{RL}}\|)$).
- Bilevel cooperative optimization: BRIDGE (Chen et al., 8 Sep 2025) leverages a bilevel formulation, with the upper level maximizing SFT reward contingent on the lower-level RL-optimized policy. It tracks cooperative gain to ensure SFT continues to benefit RL rather than being forgotten.
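The first and third switching policies above reduce to small decision rules. The sketch below is illustrative only: the function names are invented, the SuperRL gate is simplified to "any nonzero reward", and the SASR schedule is approximated by a normalized gradient-norm ratio rather than the paper's exact criterion.

```python
def reward_gated_choice(rollout_rewards):
    """SuperRL-style instance-level gate (sketch): take an RL step if any
    on-policy rollout earned nonzero reward, otherwise fall back to SFT
    on demonstration traces for this instance."""
    return "RL" if any(r != 0 for r in rollout_rewards) else "SFT"

def sft_probability(sft_grad_norm, rl_grad_norm, eps=1e-8):
    """SASR-style adaptive mix (schematic): the probability of taking an
    SFT step grows with the relative supervised gradient norm."""
    return sft_grad_norm / (sft_grad_norm + rl_grad_norm + eps)
```

The design intuition in both cases is the same: when the policy cannot yet earn reward (all-zero rollouts, or a dominant supervised gradient), demonstrations carry the learning signal; once reward becomes reachable, policy-gradient updates take over.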
3. Multi-Domain Data, Curriculum, and Reward Design
Modern SFT-RL frameworks explicitly target heterogeneity in data source, task type, and evaluation. Common dimensions:
- Domain Sampling and Mixing: SuperRL (Liu et al., 1 Jun 2025) alternates update steps by sampling domains uniformly or following a curriculum. Curriculum-based or complexity-weighted mixing (e.g., (Liu et al., 26 Jan 2026, Li et al., 20 Jul 2025)) increases data coverage and helps mitigate data skew.
- Task-Aligned Reward Schemes: Cross-domain optimization requires reward normalization or hybridization, especially in settings involving both verifiable correctness and preference signals (e.g., code and creative writing). Omni-Thinker (Li et al., 20 Jul 2025) decomposes reward into verifiable and generative preference-based components per domain.
- Stage-wise and Curriculum RL: Curriculum learning sequences domains/tasks progressively from least to most forgettable, thereby minimizing catastrophic forgetting and maximizing transfer (see (Li et al., 20 Jul 2025, Liu et al., 16 Jun 2025)).
- Demonstration Budget and Reward Sparsity: SFT budgets and per-domain batch sizes are tuned to stabilize learning in data-scarce or ultra-sparse reward settings (Liu et al., 1 Jun 2025).
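Two of these dimensions, curriculum ordering and cross-domain reward alignment, can be sketched concretely. The snippet below is a toy illustration, not any cited framework's implementation: the per-domain `forgettability` scalar is an assumed measurement (e.g., accuracy drop after subsequent training), and plain standardization stands in for the hybrid reward schemes described above.

```python
def curriculum_order(domains):
    """Forgetting-aware curriculum (sketch): sequence training phases
    from the least to the most forgettable domain, so fragile skills
    are acquired last and overwritten least."""
    return sorted(domains, key=lambda d: d["forgettability"])

def normalize_rewards(rewards):
    """Per-domain reward standardization, so verifiable (e.g., code) and
    preference-based (e.g., writing) reward signals share one scale."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]  # degenerate: all rewards equal
    return [(r - mean) / std for r in rewards]
```

Without per-domain normalization, a domain whose verifier emits rewards in {0, 1} and a domain whose preference model emits scores in [0, 10] would contribute wildly different gradient magnitudes to a joint update.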
4. Optimization Procedures and Theoretical Properties
The optimization landscape of SFT-RL multi-domain training features both practical techniques and theoretical insights:
- Policy Gradient Algorithms: KL-regularized policy gradient methods (PPO, GRPO) dominate, with group-relative advantages and clipping for stability (Liu et al., 1 Jun 2025, Chen et al., 19 May 2025, Li et al., 20 Jul 2025).
- Orthogonality of SFT and RL Updates: Empirical findings indicate that SFT and RL induce nearly orthogonal parameter updates, motivating modular skill transfer and layer-wise skill injection (PaST, (Tang et al., 16 Jan 2026)).
- Robustness and Generalization: Multi-domain SFT-RL reduces variance in policy entropy, improves robustness to reward sparsity, and generalizes better to OOD tasks vs. either SFT or RL alone. Explicit balancing controllers (AMFT (He et al., 9 Aug 2025), SASR (Chen et al., 19 May 2025)) help maintain stability across long training horizons.
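The group-relative advantage and clipping mentioned above are the core of GRPO/PPO-style updates. A minimal sketch, assuming per-rollout scalar rewards and a single importance ratio per step (function names illustrative):

```python
def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages (sketch): each rollout's reward minus the
    group mean, scaled by the group standard deviation. No learned
    value function is needed; the group itself is the baseline."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO clipped surrogate for one step: the pessimistic minimum of
    the unclipped and ratio-clipped objectives, which bounds how far a
    single update can move the policy."""
    clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return min(ratio * advantage, clipped_ratio * advantage)
```

The group baseline explains why reward sparsity is so punishing for pure RL: if every rollout in a group scores zero, all advantages vanish and the update carries no signal, which is exactly the case the reward-gated SFT fallback of Section 2 is designed to catch.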
5. Empirical Results, Ablation Studies, and Scaling Laws
SFT-RL multi-domain methods empirically outperform both static SFT and pure RL on a wide spectrum of benchmarks and diagnostic splits:
- Generalization and Sample Efficiency: On sparse-reward reasoning (OpenR1, PRM12K), SuperRL (Liu et al., 1 Jun 2025) delivers up to +30 percentage points over pure RL and outperforms vanilla RL in 24/35 multi-domain train–test settings.
- Catastrophic Forgetting Mitigation: Bilevel optimization (BRIDGE (Chen et al., 8 Sep 2025)) improves accuracy by 11.8% over cold-start RL while training 44% faster, and yields stronger OOD robustness (avg. 38.7% vs. 24.8% for cold-start RL).
- Curriculum and Hybrid-Reward Effectiveness: Curriculum-driven RL in Omni-Thinker (Li et al., 20 Jul 2025) yields +5.2% over joint training and +9.1% over model merging across math, code, QA, and writing.
- Scaling and Transferability: Modular skill transfer (PaST (Tang et al., 16 Jan 2026)) enables zero-shot transfer of procedural vectorized RL skills to unseen domains, with up to +10.3 point success-rate gains on tool-use benchmarks and persistent improvement on long-context QA.
- Ablative Insights: Across frameworks, removing meta-controllers (AMFT), dynamic mixing (SASR), or reward gating (SuperRL) degrades OOD accuracy and increases training instability.
6. Limitations and Future Research Directions
Despite demonstrated success, SFT-RL multi-domain methods exhibit open challenges:
- Data Curation and Annotation: Approaches such as BRIDGE (Chen et al., 8 Sep 2025) and SuperRL (Liu et al., 1 Jun 2025) require curated demonstration datasets. Extending these algorithms to settings with limited or noisy supervision, or fully zero-shot regimes, remains a challenge.
- Hyperparameter Sensitivity and Controller Design: While dynamic controllers improve robustness, sensitivity to entropy targets, switching frequency, and mixing schedules still affects practical deployment (see AMFT (He et al., 9 Aug 2025)).
- Domain Adaptation and Transfer: Efficient adaptation to new domains without access to on-policy RL or task-specific rewards is addressed by modular skill transfer (PaST (Tang et al., 16 Jan 2026)), but further research is warranted for continual learning scenarios.
- Extension to Multi-modal and Open-ended Tasks: Most published studies target reasoning and code, with some progress in vision-language (V-IRL, General Points). Extending SFT-RL to multi-modal or non-reasoning tasks remains an active area of research.
7. SFT-RL Method Variants and Representative Implementations
Below is a comparative summary of influential SFT-RL multi-domain frameworks:
| Method | Mix Policy | Multi-domain Handling | Key Empirical Finding |
|---|---|---|---|
| SuperRL (Liu et al., 1 Jun 2025) | Instance-level reward-gate | Domain sampling, per-domain SFT budget | +30pp on sparse tasks; stable cross-domain gen. |
| BRIDGE (Chen et al., 8 Sep 2025) | Bilevel, cooperative gain | Benchmark-aligned CoT & verifier adaptation | +11.8% over cold-start RL, improved OOD |
| SASR (Chen et al., 19 May 2025) | Gradient-norm adaptive mix | Adaptive stepwise task mix | Robust, outperforms static mixes across logic/math |
| AMFT (He et al., 9 Aug 2025) | Meta-gradient controller | Dynamic meta-learned mixture | SOTA on math, vision, OOD; stability, efficiency |
| Omni-Thinker (Li et al., 20 Jul 2025) | Hybrid reward + curriculum | Forgetting-aware, curriculum-based phases | +5.2% over joint, superior QA/creative transfer |
| PaST (Tang et al., 16 Jan 2026) | Modular skill vector | Skill transfer, domain-agnostic injection | Zero-shot tool QA/agent gains, high efficiency |
Each of these frameworks is distinguished by its approach to switching/mixing SFT and RL, its domain balancing strategy, and its empirical robustness across widely differing task and reward landscapes. These methods collectively define the modern standards and transferable lessons in SFT-RL multi-domain training research.