
SFT-RL Multi-domain Training Method

Updated 30 January 2026
  • The paper introduces an SFT-RL method that unifies supervised fine-tuning and reinforcement learning to effectively address multi-domain challenges such as reward sparsity and domain imbalance.
  • Dynamic switching and adaptive mixing strategies, including reward-gated and meta-learned controls, are used to balance domain-specific data and mitigate issues like catastrophic forgetting.
  • Empirical results demonstrate significant improvements in sample efficiency, robustness, and cross-domain generalization compared to static SFT or pure RL approaches.

A Supervised Fine-Tuning plus Reinforcement Learning (SFT-RL) multi-domain training method refers to any systematic framework that integrates supervised learning from curated data with reward-driven reinforcement learning to optimize large models across diverse, often heterogeneous domains. This paradigm leverages the stability and data efficiency of supervised fine-tuning while exploiting the exploration and generalization capacity of reinforcement learning. Modern SFT-RL frameworks address a range of challenges: reward sparsity, catastrophic forgetting, domain imbalance, transfer efficiency, and curriculum design. This article surveys the key algorithmic foundations, design heuristics, theoretical analyses, and empirical results defining the state of the art in SFT-RL multi-domain training.

1. Core Objectives and Algorithmic Foundations

The SFT-RL training objective is to optimize a model $\pi_\theta$ (typically a large language or multimodal foundation model) to perform well across a union of domains $\mathcal{D}_1, \dots, \mathcal{D}_M$, each with its own data distribution, reward scheme, and evaluation metric. Formally, the multi-domain loss comprises two principal components:

  • SFT loss: $L_\mathrm{SFT}(\theta) = -\sum_{(x, y) \in D_\mathrm{SFT}} \log p_\theta(y \mid x)$, corresponding to next-token prediction on gold demonstrations.
  • RL loss: $L_\mathrm{RL}(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t R(\tau_t)\right]$, corresponding to maximizing the expected trajectory reward in an interaction-based environment.
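The two loss components can be made concrete with a toy Monte Carlo sketch (pure NumPy; the probabilities and rewards below are illustrative placeholders, not the output of any real model):

```python
import numpy as np

def sft_loss(token_log_probs):
    """Negative log-likelihood of gold demonstration tokens:
    L_SFT = -sum log p_theta(y | x)."""
    return -np.sum(token_log_probs)

def rl_loss(trajectory_rewards):
    """Negative expected trajectory return, estimated by Monte Carlo
    over sampled rollouts: L_RL = -E_tau [ sum_t R(tau_t) ]."""
    returns = [np.sum(r) for r in trajectory_rewards]
    return -np.mean(returns)

# Toy values: per-token log-probs for one demonstration,
# and per-step rewards for three sampled rollouts.
demo_log_probs = np.log([0.9, 0.8, 0.7])
rollouts = [np.array([0.0, 1.0]), np.array([0.0, 0.0]), np.array([1.0, 1.0])]

# Equal-weight combination; real frameworks replace the 0.5 weights
# with the dynamic switching or mixing rules discussed below.
total = 0.5 * sft_loss(demo_log_probs) + 0.5 * rl_loss(rollouts)
```

A perfect demonstration fit drives the SFT term to zero, while higher average rollout returns drive the RL term more negative; the frameworks surveyed here differ mainly in how the two terms are weighted over training.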

To robustly exploit multi-domain data and task diversity, recent frameworks implement either dynamic switching between SFT and RL (SuperRL (Liu et al., 1 Jun 2025)), joint optimization with adaptive mixing (AMFT (He et al., 9 Aug 2025), SASR (Chen et al., 19 May 2025)), or bilevel cooperative optimization (BRIDGE (Chen et al., 8 Sep 2025)). A range of curriculum and domain-balancing strategies further support cross-domain generalization and sample efficiency.

2. Instantiations and Switching Policies

Approaches to SFT-RL integration vary along the axis of switching policy and mixture rule:

  • Reward-gated switching: For each instance $x$, SuperRL (Liu et al., 1 Jun 2025) samples $K$ on-policy rollouts. If any rollout obtains nonzero reward, a policy-gradient RL update (e.g. PPO, GRPO) is applied; otherwise, the model is updated by SFT on high-quality demonstration traces for $x$. This reward gate operates at the instance level:

$$L_\text{SuperRL}(\theta; x) = (1 - c(x))\,L_\mathrm{SFT}(\theta; x) + c(x)\,L_\mathrm{PG}(\theta; x)$$

where $c(x) = \mathbf{1}\left[\max_k R(x, \tilde y_k) > 0\right]$.

  • Meta-learned mixture: AMFT (He et al., 9 Aug 2025) unifies SFT and RL in a single objective $L_\mathrm{total}(\theta; \mu_t) = (1-\mu_t)L_\mathrm{RL}(\theta) + \mu_t L_\mathrm{SFT}(\theta)$, where $\mu_t$ is a learnable parameter updated by a meta-gradient controller to maximize long-term utility, regularized for entropy stability.
  • Gradient-criterion adaptation: SASR (Chen et al., 19 May 2025) sets the SFT vs. RL schedule by monitoring the supervised gradient norm $\|\nabla_\theta L_\mathrm{SFT}(\theta)\|$, adjusting the stepwise probability of selecting SFT as $p_t = G_\text{last SFT} / (G_\text{last SFT} + \gamma\, G_\text{warmup})$.
  • Bilevel cooperative optimization: BRIDGE (Chen et al., 8 Sep 2025) leverages a bilevel formulation, with the upper level maximizing SFT reward contingent on the lower-level RL-optimized policy. It tracks cooperative gain to ensure SFT continues to benefit RL rather than being forgotten.
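As a minimal sketch, the instance-level reward gate of SuperRL and the gradient-norm schedule of SASR each reduce to a few lines (function names and defaults are illustrative assumptions, not the papers' reference implementations):

```python
def reward_gate(rollout_rewards):
    """SuperRL-style gate c(x): use the policy-gradient loss if any
    of the K on-policy rollouts earned nonzero reward; otherwise
    fall back to SFT on demonstration traces for this instance."""
    return 1 if max(rollout_rewards) > 0 else 0

def sasr_sft_probability(g_last_sft, g_warmup, gamma=1.0):
    """SASR-style schedule: probability of taking an SFT step,
    p_t = G_last_SFT / (G_last_SFT + gamma * G_warmup),
    where the G terms are supervised gradient norms."""
    return g_last_sft / (g_last_sft + gamma * g_warmup)
```

In this framing, the gate returns a hard 0/1 decision per instance, while the SASR probability decays smoothly toward RL-dominated training as the supervised gradient norm shrinks relative to its warmup reference.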

3. Multi-Domain Data, Curriculum, and Reward Design

Modern SFT-RL frameworks explicitly target heterogeneity in data source, task type, and evaluation. Common dimensions:

  • Domain Sampling and Mixing: SuperRL (Liu et al., 1 Jun 2025) alternates update steps by sampling domains uniformly or following a curriculum. Curriculum-based or complexity-weighted mixing (e.g., (Liu et al., 26 Jan 2026, Li et al., 20 Jul 2025)) increases data coverage and helps mitigate data skew.
  • Task-Aligned Reward Schemes: Cross-domain optimization requires reward normalization or hybridization, especially in settings involving both verifiable correctness and preference signals (e.g., code and creative writing). Omni-Thinker (Li et al., 20 Jul 2025) decomposes reward into verifiable and generative preference-based components per domain.
  • Stage-wise and Curriculum RL: Curriculum learning sequences domains/tasks progressively from least to most forgettable, thereby minimizing catastrophic forgetting and maximizing transfer (see (Li et al., 20 Jul 2025, Liu et al., 16 Jun 2025)).
  • Demonstration Budget and Reward Sparsity: SFT budgets $M_d$ and per-domain batch sizes are tuned to stabilize learning in data-scarce or ultra-sparse reward settings (Liu et al., 1 Jun 2025).
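Two of these ingredients, per-domain reward normalization and an Omni-Thinker-style hybrid reward, can be sketched as follows (the z-score normalization and the blending weight `alpha` are illustrative assumptions, not the published recipes):

```python
import numpy as np

def normalize_per_domain(rewards, domains):
    """Z-score rewards within each domain so that heterogeneous
    reward scales (e.g. exact-match correctness vs. preference
    scores) are comparable before a joint policy update."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    for d in set(domains):
        mask = np.array([dom == d for dom in domains])
        mu, sigma = rewards[mask].mean(), rewards[mask].std()
        out[mask] = (rewards[mask] - mu) / (sigma + 1e-8)
    return out

def hybrid_reward(verifiable, preference, alpha=0.5):
    """Blend a verifiable correctness signal (e.g. unit tests for
    code) with a generative preference score (e.g. for writing);
    the weight alpha is a tunable illustration."""
    return alpha * verifiable + (1 - alpha) * preference
```

Normalizing within each domain prevents a domain with a naturally wide reward scale from dominating the gradient signal, which is one simple way to operationalize the domain-balancing concerns above.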

4. Optimization Procedures and Theoretical Properties

The optimization landscape of SFT-RL multi-domain training features both practical techniques and theoretical insights; the switching, mixing, and bilevel formulations of Section 2 each come with their own stability considerations.

5. Empirical Results, Ablation Studies, and Scaling Laws

SFT-RL multi-domain methods empirically outperform both static SFT and pure RL on a wide spectrum of benchmarks and diagnostic splits:

  • Generalization and Sample Efficiency: On sparse-reward reasoning (OpenR1, PRM12K), SuperRL (Liu et al., 1 Jun 2025) delivers up to +30 percentage points over pure RL and outperforms vanilla RL in 24/35 multi-domain train–test settings.
  • Catastrophic Forgetting Mitigation: Bilevel optimization (BRIDGE (Chen et al., 8 Sep 2025)) improves accuracy by +11.8% over cold-start RL while training 44% faster, with stronger OOD robustness (avg. 38.7% vs. 24.8% for cold-start RL).
  • Curriculum and Hybrid-Reward Effectiveness: Curriculum-driven RL in Omni-Thinker (Li et al., 20 Jul 2025) yields +5.2% over joint training and +9.1% over model merging across math, code, QA, and writing.
  • Scaling and Transferability: Modular skill transfer (PaST (Tang et al., 16 Jan 2026)) enables zero-shot transfer of procedural vectorized RL skills to unseen domains, with up to +10.3 point success-rate gains on tool-use benchmarks and persistent improvement on long-context QA.
  • Ablative Insights: Across frameworks, removing meta-controllers (AMFT), dynamic mixing (SASR), or reward gating (SuperRL) degrades OOD accuracy and increases training instability.

6. Limitations and Future Research Directions

Despite demonstrated success, SFT-RL multi-domain methods exhibit open challenges:

  • Data Curation and Annotation: Approaches such as BRIDGE (Chen et al., 8 Sep 2025) and SuperRL (Liu et al., 1 Jun 2025) require curated demonstration datasets. Extending these algorithms to settings with limited or noisy supervision, or fully zero-shot regimes, remains a challenge.
  • Hyperparameter Sensitivity and Controller Design: While dynamic controllers improve robustness, sensitivity to entropy targets, switching frequency, and mixing schedules still affects practical deployment (see AMFT (He et al., 9 Aug 2025)).
  • Domain Adaptation and Transfer: Efficient adaptation to new domains without access to on-policy RL or task-specific rewards is addressed by modular skill transfer (PaST (Tang et al., 16 Jan 2026)), but further research is warranted for continual learning scenarios.
  • Extension to Multi-modal and Open-ended Tasks: Most published studies target reasoning and code, with some progress in vision-language (V-IRL, General Points). Extending SFT-RL to multi-modal or non-reasoning tasks remains an active area of research.

7. SFT-RL Method Variants and Representative Implementations

Below is a comparative summary of influential SFT-RL multi-domain frameworks:

| Method | Mix Policy | Multi-domain Handling | Key Empirical Finding |
|---|---|---|---|
| SuperRL (Liu et al., 1 Jun 2025) | Instance-level reward gate | Domain sampling, per-domain SFT budget | +30 pp on sparse tasks; stable cross-domain generalization |
| BRIDGE (Chen et al., 8 Sep 2025) | Bilevel, cooperative gain | Benchmark-aligned CoT & verifier adaptation | +11.8% over cold-start RL, improved OOD |
| SASR (Chen et al., 19 May 2025) | Gradient-norm adaptive mix | Adaptive stepwise task mix | Robust; outperforms static mixes across logic/math |
| AMFT (He et al., 9 Aug 2025) | Meta-gradient controller | Dynamic meta-learned mixture | SOTA on math, vision, OOD; stability and efficiency |
| Omni-Thinker (Li et al., 20 Jul 2025) | Hybrid reward + curriculum | Forgetting-aware, curriculum-based phases | +5.2% over joint training; superior QA/creative transfer |
| PaST (Tang et al., 16 Jan 2026) | Modular skill vector | Skill transfer, domain-agnostic injection | Zero-shot tool QA/agent gains, high efficiency |

Each of these frameworks is distinguished by its approach to switching/mixing SFT and RL, its domain balancing strategy, and its empirical robustness across widely differing task and reward landscapes. These methods collectively define the modern standards and transferable lessons in SFT-RL multi-domain training research.
