
Importance-Weighted SFT (iw-SFT)

Updated 6 January 2026
  • The paper introduces iw-SFT, which applies importance weighting to correct distribution mismatches between training data and the target policy, thereby providing a tighter lower bound on sparse-reward RL objectives.
  • It leverages auxiliary proposal distributions, robust variance control techniques like KL anchoring and data rewriting, and adaptive fine-tuning strategies to maintain training stability.
  • Empirical results demonstrate substantial performance gains in language modeling, mathematical reasoning, and continuous control, establishing iw-SFT as an efficient alternative to conventional RL methods.

Importance-Weighted Supervised Fine-Tuning (iw-SFT) generalizes conventional supervised fine-tuning (SFT) by explicitly correcting for distributional mismatch between training data and the target model policy through importance weighting. Recent work establishes iw-SFT as a principled bridge between SFT and reinforcement learning (RL), demonstrating that it both tightens the lower bound SFT provides on sparse-reward RL objectives and enables substantial performance gains in language modeling, mathematical reasoning, and continuous control domains. By leveraging auxiliary proposal distributions and robust variance-control mechanisms, iw-SFT achieves more faithful policy optimization with minimal computational overhead, and is easily adaptable to diverse data curation schemes and downstream applications.

1. Theoretical Foundations and Derivation

The canonical SFT objective is maximum likelihood estimation (MLE) of a model on filtered or high-quality data, which can be construed as a loose lower bound on the expected RL return in sparse-reward settings. For an agent producing a trajectory \tau, the RL objective is:

J(\theta) = \mathbb{E}_{\tau \sim p(\cdot;\theta)}[R(\tau)],

where R(\tau) is a terminal sparse reward indicator, and p(\tau;\theta) represents the trajectory distribution under policy \pi_\theta. If only "successful" (i.e., R(\tau) = 1) rollouts drawn from a reference policy \pi_{\text{ref}} are available, the RL objective via importance sampling is:

J(\theta) = \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[ \frac{p(\tau;\theta)}{\pi_{\text{ref}}(\tau)} R(\tau) \right].

Using the bound x \geq 1 + \log x for x > 0, SFT arises as:

J(\theta) \geq \mathbb{E}_{\tau \sim \pi_{\text{ref}}}[R(\tau) \log p(\tau;\theta)] + \text{const.}

This connects SFT to the RL surrogate, but the bound loosens as the policy \pi_\theta diverges from \pi_{\text{ref}}.

The iw-SFT surrogate objective obtains a tighter lower bound by introducing a proposal q(\tau) and rewriting:

J(\theta) \geq \mathbb{E}_{\tau \sim \pi_{\text{ref}}}\left[ \frac{q(\tau)}{\pi_{\text{ref}}(\tau)} \log p(\tau;\theta) \right] + \text{const.},

so that the importance weight is w(\tau) = q(\tau)/\pi_{\text{ref}}(\tau). This construction ensures the surrogate approaches the exact RL objective as q \rightarrow \pi_\theta, albeit with increased estimator variance. In practice, q is selected as a lagged or smoothed copy of \pi_\theta to control the bias-variance trade-off (Qin et al., 17 Jul 2025).
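The weighted surrogate can be sketched numerically. The following is a minimal, framework-free illustration (the function name is ours, not the paper's) of the objective above, where the weights w(\tau) = q(\tau)/\pi_{\text{ref}}(\tau) are treated as constants with no gradient flowing through them:

```python
import math

def iw_sft_loss(logp_theta, logp_q, logp_ref):
    """Batch-mean importance-weighted SFT loss (negative surrogate).

    logp_theta: log p(tau; theta) under the current policy,
    logp_q:     log q(tau) under the (lagged) proposal,
    logp_ref:   log pi_ref(tau) under the data-generating policy.
    """
    losses = []
    for lt, lq, lr in zip(logp_theta, logp_q, logp_ref):
        w = math.exp(lq - lr)   # importance weight q(tau)/pi_ref(tau)
        losses.append(-w * lt)  # maximize w*log p == minimize -w*log p
    return sum(losses) / len(losses)
```

When q coincides with \pi_{\text{ref}}, every weight is 1 and the loss reduces to plain SFT's negative log-likelihood.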

2. Variance Control and Implementation Strategies

Naive importance weighting incurs high estimator variance, particularly when the policy gap D(\pi_b \| \pi_\theta) is large, where \pi_b denotes the behavior (data-generating) policy; weights then explode rapidly and optimization becomes unstable. Conventional mitigations include:

  • KL penalties/trust-region methods (e.g., PPO, TRPO): constrain \pi_\theta to stay close to \pi_b, without changing the data distribution.
  • Clipping: enforces w_i \in [c^{-1}, c], trading bias for bounded variance but again leaving the data distribution unchanged.
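The clipping rule above can be sketched in one line (the symmetric interval [c^{-1}, c] follows the bullet; the helper name is ours):

```python
def clip_weight(w, c):
    """Clip an importance weight into [1/c, c] (c > 1), bounding variance at
    the cost of bias; the underlying data distribution is unchanged."""
    return max(1.0 / c, min(c, w))
```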

A more direct reduction of the policy gap at the data level is achieved by data rewriting. In this paradigm, supervised examples are partitioned into self-aligned (on-policy), guided-retell (near-policy), and fallback (off-policy) instances. For self-alignment, K samples from \pi_\theta are generated and, if any meets the criterion for correctness, it is included as on-policy data. If self-alignment fails, a digest-and-retell prompt guides the model to restate the expert demonstration, and the restatement is retained as near-policy data. Only if both mechanisms fail does the example fall back to the original expert label. This process constructs a new mixture distribution \pi_{\text{mix}} satisfying D(\pi_{\text{mix}} \| \pi_\theta) < D(\pi_b \| \pi_\theta), directly reducing variance in subsequent iw-SFT (Zhao et al., 18 Sep 2025).
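The three-tier partition can be sketched as follows. The hook functions (`sample_k`, `is_correct`, `guided_retell`) are hypothetical interfaces standing in for model sampling, answer checking, and the digest-and-retell prompt; they are not APIs from the cited work:

```python
def rewrite_dataset(examples, sample_k, is_correct, guided_retell, k=4):
    """Partition supervised (prompt, expert_answer) pairs into self-aligned /
    guided-retell / fallback tiers, per the data-rewriting scheme above."""
    mixture = []
    for prompt, expert_answer in examples:
        tier, completion = "fallback", expert_answer
        for candidate in sample_k(prompt, k):        # K on-policy samples
            if is_correct(prompt, candidate):
                tier, completion = "self_aligned", candidate
                break
        if tier == "fallback":
            retold = guided_retell(prompt, expert_answer)  # digest-and-retell
            if retold is not None:
                tier, completion = "guided_retell", retold
        mixture.append((prompt, completion, tier))
    return mixture
```

The returned mixture is then used as the training set for the subsequent importance-weighted fine-tuning pass.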

3. Generalizations and Extensions

Quality-Scored Data

When trajectories are assigned scalar quality scores S(\tau), iw-SFT naturally generalizes by forming datasets D_{c_i}^+ = \{\tau : S(\tau) > c_i\} and sampling from their union D_Q^+. The importance-weighted objective extends accordingly. This allows graded supervision to be integrated into iw-SFT, refining alignment between the surrogate and the true RL return (Qin et al., 17 Jul 2025).
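The threshold-union construction of D_Q^+ can be sketched as below; tagging each trajectory with the number of thresholds it exceeds is our illustrative stand-in for graded supervision, not the paper's exact weighting:

```python
def quality_union(trajectories, scores, thresholds):
    """Build D_Q^+ as the union of D_{c_i}^+ = {tau : S(tau) > c_i}.
    Each kept trajectory is tagged with how many thresholds it exceeds."""
    union = []
    for tau, s in zip(trajectories, scores):
        passed = sum(1 for c in thresholds if s > c)
        if passed > 0:
            union.append((tau, passed))
    return union
```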

Token- and Group-Level Weighting

Variants such as SFT-GO assign importance weights on a per-token basis by segmenting sequences into groups (e.g., via TF-IDF, semantic, or excess-loss metrics). The group-based optimization objective is

L_{\text{GO}}(w; \theta) = (1 - \lambda)\, L_{\text{CE}}(w; \theta) + \lambda\, L_{\text{worst}}(w; \theta, g),

where L_{\text{worst}} is the maximum cross-entropy loss among token groups and \lambda anneals over training (Kim et al., 17 Jun 2025). This construction emphasizes challenging or salient regions of each sequence, boosting worst-group performance and yielding O(1/\sqrt{T}) convergence under convexity assumptions.
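The group objective can be computed directly from the formula above. This plain-Python sketch takes precomputed per-token cross-entropy values; in practice these come from the model's forward pass:

```python
def sft_go_loss(token_losses, groups, lam):
    """L_GO = (1 - lambda) * mean CE + lambda * worst-group mean CE.

    token_losses: per-token cross-entropy values,
    groups:       group id per token (e.g., from TF-IDF or excess-loss splits),
    lam:          the annealed mixing coefficient lambda.
    """
    mean_ce = sum(token_losses) / len(token_losses)
    group_losses = {}
    for loss, g in zip(token_losses, groups):
        group_losses.setdefault(g, []).append(loss)
    worst = max(sum(v) / len(v) for v in group_losses.values())
    return (1 - lam) * mean_ce + lam * worst
```

At \lambda = 0 this is ordinary token-averaged cross-entropy; at \lambda = 1 it is pure worst-group optimization.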

Anchored iw-SFT (ASFT) and Dynamic Fine-Tuning (DFT)

Reward-weighted regression places iw-SFT and DFT on a unified spectrum, distinguished by the choice of auxiliary distribution q. In DFT, q_{\text{DFT}}(y|x) \propto \pi_{\text{ref}}(y|x) \cdot p_\theta(y|x), so the weighting is proportional to current model probabilities; this dynamically focuses training on likely outputs but can incur instability as the model drifts away from \pi_{\text{ref}}. ASFT introduces a reverse-KL regularizer that penalizes divergence from a frozen base model, retaining both the tightness of the RL lower bound and training stability (Zhu et al., 28 Sep 2025).
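One illustrative rendering of the anchored idea, combining a DFT-style probability weight with a reverse-KL-style penalty toward the frozen base; this is a schematic of the mechanism, not the paper's exact loss:

```python
import math

def asft_objective(logp_theta, logp_ref, beta):
    """Per-example anchored surrogate (illustrative form).

    w = p_theta(y|x) is the DFT-style weight (treated as a constant),
    beta * (log p_theta - log p_ref) is a reverse-KL-style drift penalty
    anchoring the policy to the frozen base model pi_ref.
    """
    w = math.exp(logp_theta)                 # weight by current model prob
    surrogate = w * logp_theta               # weighted log-likelihood term
    anchor = beta * (logp_theta - logp_ref)  # penalize drift from pi_ref
    return surrogate - anchor
```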

4. Algorithmic Details and Practical Considerations

A summary of the iw-SFT workflow:

  • Proposal q construction: lagged, EMA, or smoothed copy of \pi_\theta; in DFT, q \propto \pi_{\text{ref}}\, p_\theta (Qin et al., 17 Jul 2025; Zhu et al., 28 Sep 2025).
  • Importance ratio evaluation: trajectory-level or token-level log-ratios, possibly clipped or smoothed (Qin et al., 17 Jul 2025).
  • Weighted loss: J_{\text{iw-SFT}}(\theta) = \mathbb{E}_{D^+}[w(\tau) \log p(\tau;\theta)] (Qin et al., 17 Jul 2025).
  • Data rewriting: preprocess to align distributions, reducing variance (Zhao et al., 18 Sep 2025).
  • Regularization: KL anchoring (ASFT), temperature scaling, group DRO, or clipping as needed (Zhu et al., 28 Sep 2025; Kim et al., 17 Jun 2025).
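The proposal-construction step above can be sketched as a simple exponential moving average over parameters (the decay value is a free hyperparameter, not taken from the papers):

```python
def ema_proposal(q_params, theta_params, decay=0.99):
    """Maintain the proposal q as a lagged EMA copy of pi_theta's parameters;
    larger decay means a slower, lower-variance (but more biased) proposal."""
    return [decay * q + (1.0 - decay) * t
            for q, t in zip(q_params, theta_params)]
```

Called once per optimizer step, this keeps q trailing \pi_\theta, which keeps the importance ratios q/\pi_{\text{ref}} from tracking the current policy too aggressively.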

Empirically, incorporating proposal updates, bounded ratios, and data rewriting yields more robust and stable training. The additional computational cost is typically a single reference/proposal forward pass and minimal memory overhead due to lagged copies.

5. Empirical Performance and Benchmarks

iw-SFT and its variants show substantial performance improvements across reasoning, language modeling, and offline control:

  • Language modeling/reasoning (AIME, MATH 500, GPQA): For Qwen2.5-32B-Instruct on the S1.1K benchmark, SFT achieved (AIME: 56.7%, MATH 500: 95.4%, GPQA: 63.6%) while iw-SFT attained (AIME: 66.7%, MATH 500: 94.8%, GPQA: 64.1%). Full-sequence weighting outperformed per-step weighting (Qin et al., 17 Jul 2025).
  • Mathematical reasoning (Math500, Minerva Math, OlympiadBench, etc.): Data rewriting + DFT increased accuracy from SFT 23.23% to 42.03% on Qwen2.5-Math-7B (Zhao et al., 18 Sep 2025).
  • Continuous control (D4RL): On Hopper-Medium-Replay, SFT achieved 79.0, iw-SFT(Q) 85.0. On Walker2d-Medium-Replay, SFT 58.8, iw-SFT(Q) 75.8 (Qin et al., 17 Jul 2025).
  • Instruction tuning: SFT-GO increased Llama-3.1-8B’s average performance from 45.12 to 47.33 points using LLMLingua-2 grouping and from 50.02 to 51.21 on Alpaca data (Kim et al., 17 Jun 2025).
  • Anchored SFT (ASFT): on 10k medical data, SFT 33.37% → ASFT 42.03%; on math (100k), SFT 19.15% → ASFT 30.50% (Zhu et al., 28 Sep 2025).

These results collectively demonstrate that iw-SFT closes much of the performance gap to more complex RL-based post-training while remaining efficient and stable.

6. Limitations and Stability Challenges

The principal technical challenge in iw-SFT is estimator variance due to large policy gaps. Data rewriting and KL-based anchoring directly address variance and drift instability by reshaping or constraining the data/model distribution. Alternative methods that rely solely on clipping or passive regularization are less effective, as they do not reduce divergence between data-generating and target policies. The effectiveness of each approach is highly sensitive to the choice of proposal distribution q, the method for computing or approximating \pi_b on off-policy data, the group assignments in DRO-based variants, and the annealing schedule for regularization hyperparameters.

7. Outlook and Practical Recommendations

Importance-weighted SFT provides a theoretically justified, practically robust enhancement to classical supervised fine-tuning, offering a continuum between pure SFT and off-policy RL. Variants such as data-rewritten iw-SFT and anchored DFT (ASFT) combine the sample efficiency and stability of SFT with the performance benefits of RL. Practical guidelines recommend using lagged proposal policies for qq, KL-anchoring for drift control, and aggressive data rewriting to minimize variance in off-policy settings. Temperature annealing and group-based weighting further enhance convergence and robustness. These techniques render iw-SFT highly adaptable to domains where reward supervision is sparse, label distributions are highly non-uniform, and performance on challenging or rare subsets is critical (Qin et al., 17 Jul 2025, Zhao et al., 18 Sep 2025, Kim et al., 17 Jun 2025, Zhu et al., 28 Sep 2025).
