Importance-Weighted SFT (iw-SFT)

Updated 8 December 2025
  • iw-SFT is a framework that reweights supervised fine-tuning using importance sampling to derive sample-specific weights from rewards, uncertainty, or distribution shifts.
  • The method strengthens the theoretical connection to reinforcement learning by maximizing a tighter lower bound on the RL objective than standard SFT, matching or exceeding RLHF performance.
  • Empirical results across LLMs, diffusion models, and control tasks demonstrate significant gains in efficiency, robustness, and data effectiveness.

Importance-Weighted Supervised Fine-Tuning (iw-SFT) is a principled modification of standard supervised fine-tuning (SFT) that leverages sample-specific weights derived from importance sampling, reward estimation, prediction uncertainty, or distribution shift. This framework strengthens the theoretical connection between SFT and reinforcement learning (RL), providing a tighter lower bound on the RL objective than classical SFT. Empirically, iw-SFT matches or exceeds the performance of advanced RL or RLHF methods in both language modeling and control while requiring only supervised updates. Recent research demonstrates multiple instantiations of iw-SFT across LLMs, diffusion models, and imitation-learning domains, each with formal derivations, explicit algorithms, and practical empirical gains.

1. Theoretical Foundations

The central insight underpinning iw-SFT is that standard SFT, when performed on curated demonstration data, can be viewed as maximizing a lower bound on the expected RL return in a sparse reward regime. Consider a trajectory $\tau$ (e.g., a token sequence $x$) with reward $R(\tau)$. For a policy $\pi_\theta$, the RL objective is

$$J(\theta) = \mathbb{E}_{\pi_\theta}[R(\tau)] = \int p(\tau;\theta)\, R(\tau)\, d\tau.$$

If only trajectories sampled from a reference policy $\pi_\text{ref}$ are available, importance sampling yields

$$J(\theta) = \int \pi_\text{ref}(\tau)\, \frac{p(\tau;\theta)}{\pi_\text{ref}(\tau)}\, R(\tau)\, d\tau.$$

A Jensen-type lower bound (using $x \geq 1 + \log x$) applied to the importance ratio gives

$$J(\theta) \geq \int \pi_\text{ref}(\tau)\, R(\tau) \left[1 + \log \frac{p(\tau;\theta)}{\pi_\text{ref}(\tau)}\right] d\tau = \text{const} + \mathbb{E}_{\pi_\text{ref}}[R(\tau) \log p(\tau;\theta)].$$

When $R(\tau)$ is a binary indicator over a curated set $D^+$, this reduces to the familiar SFT loss (up to an additive constant): $L_\text{SFT}(\theta) = \mathbb{E}_{\tau \in D^+}[\log p(\tau; \theta)]$. Introducing an auxiliary distribution $q(\tau)$ and reapplying the bound leads to a generalized importance-weighted loss: $L_\text{iw-SFT}(\theta) = \mathbb{E}_{\tau \in D^+}[w(\tau) \log p(\tau; \theta)]$, where $w(\tau) = q(\tau) / \pi_\text{ref}(\tau)$. As $q \to \pi_\theta$, this bound approaches the true RL return $J(\theta)$ (Qin et al., 17 Jul 2025).
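
In practice, the only change relative to standard SFT is that each sequence's log-likelihood is scaled by its weight before averaging. Below is a minimal PyTorch-style sketch with illustrative tensor shapes; it is not taken from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def per_sequence_logp(logits, targets):
    """Sum of per-token log-probabilities log p(tau; theta) for each sequence.

    logits: (batch, seq, vocab); targets: (batch, seq). Padding handling is
    omitted for brevity.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    return token_logp.sum(dim=-1)                                    # (batch,)

def sft_loss(logits, targets):
    """Standard SFT: uniform weights, i.e. w(tau) = 1 for every sequence."""
    return -per_sequence_logp(logits, targets).mean()

def iw_sft_loss(logits, targets, weights):
    """iw-SFT: scale each sequence's log-likelihood by w(tau) ~ q(tau)/pi_ref(tau)."""
    return -(weights * per_sequence_logp(logits, targets)).mean()
```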

2. Algorithmic Instantiations

Several implementations of iw-SFT have been proposed, each adapted to a different domain and source of signal for the importance weights; brief code sketches for each variant follow the list below:

  1. Sequence-level iw-SFT (standard RL/SFT connection):
    • Maintain a slow-moving copy $\theta_q$ of the model parameters for computing $q$.
    • For each trajectory $\tau_j$ in a batch, compute $\Delta \ell_j = \sum_t [\log \pi_{\theta_q}(a_t|s_t) - \log \pi_\text{ref}(a_t|s_t)]$.
    • Weight: $w_j = \exp\big(\sum_t g(\log \pi_{\theta_q}/\pi_\text{ref})\big)$, with smoothing/clipping $g(\cdot)$.
    • Update the main model via the weighted log-likelihood gradient.
    • Optionally update $\theta_q \leftarrow \theta$ periodically (Qin et al., 17 Jul 2025).
  2. Reward-based iw-SFT via Inverse RL:
    • Learn a reward model $r_\phi(x, y)$ from demonstrations through a maximum-entropy IRL process.
    • Compute for each example

    $$w(x,y) = \frac{\exp(r_\phi(x, y)/\beta)}{\mathbb{E}_{y'}[\exp(r_\phi(x, y')/\beta)]}.$$

  3. Token-level iw-SFT for diffusion LLMs (WeFT):

    • For each token $i$, compute the entropy $H(p_i)$ of the model's predictive distribution.
    • Assign per-token importance $\beta_i = \sqrt{H(p_i)}$.
    • Masking probability $t_i = 1 - (1 - t)^{\beta_i / \beta_\text{ref}}$, where $t \sim \text{Uniform}[0,1]$.
    • Weight the loss on each token by $w_i = 1/t_i$ (Xu et al., 25 Sep 2025).
  4. Distribution-shift-based weighting for self-generated data:
    • Define the "DS weight" $w_i^\text{DS}$ using a held-out validation set and model loss statistics (Jiang et al., 19 Aug 2024):

    $$w_i' = \frac{\sum_{j} \mathcal{L}(M_L(x_j))}{N_v \cdot \mathcal{L}(M_L(x_i))},$$

    $$w_i^\text{DS} = \begin{cases} w_i', & w_i' \geq 1 \\ 1/w_i', & w_i' < 1. \end{cases}$$
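
To make the variants above concrete, the following sketches show how each weight might be computed. They are illustrative PyTorch snippets under stated assumptions (tensor shapes, placeholder hyperparameters), not reference implementations. First, the sequence-level weights of variant 1; the resulting weights multiply the per-sequence log-likelihood as in the `iw_sft_loss` sketch in Section 1.

```python
import torch

@torch.no_grad()
def sequence_level_weights(logp_q, logp_ref, k=1.0, clip_lo=-5.0, clip_hi=5.0, w_max=10.0):
    """Variant 1: w_j = exp(sum_t g(log pi_{theta_q} - log pi_ref)).

    logp_q, logp_ref: (batch, seq) per-token log-probs of the demonstration
    tokens under the slow copy theta_q and the reference policy pi_ref.
    g(.) = k * clip(.) is the smoothing/clipping function; all bounds here
    are illustrative.
    """
    log_ratio = logp_q - logp_ref                              # per-token log-ratio
    smoothed = k * torch.clamp(log_ratio, clip_lo, clip_hi)    # g(.) applied per token
    weights = torch.exp(smoothed.sum(dim=-1))                  # exponentiate the sum
    return torch.clamp(weights, max=w_max)                     # cap the final weights
```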
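
A sketch for variant 2, assuming several candidate responses have been scored by the reward model for the same prompt, so the expectation over $y'$ can be replaced by their empirical mean:

```python
import torch

def reward_based_weights(rewards, beta=1.0):
    """Variant 2: w(x, y) = exp(r_phi(x, y)/beta) / E_{y'}[exp(r_phi(x, y')/beta)].

    rewards: (num_candidates,) reward-model scores r_phi(x, y') for candidate
    responses to one prompt x; the denominator is approximated by the mean
    over these candidates.
    """
    exp_r = torch.exp(rewards / beta)
    return exp_r / exp_r.mean()
```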
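
For variant 3, a sketch of the entropy-driven token weights; `beta_ref` is treated here simply as a scalar hyperparameter:

```python
import torch

def weft_token_weights(token_probs, t, beta_ref=1.0, eps=1e-8):
    """Variant 3 (WeFT-style): per-token weights w_i = 1 / t_i.

    token_probs: (seq, vocab) predictive distribution p_i at each position;
    t: a scalar drawn from Uniform[0, 1] for the whole sequence.
    """
    entropy = -(token_probs * torch.log(token_probs + eps)).sum(dim=-1)  # H(p_i)
    beta_i = torch.sqrt(entropy)                                         # beta_i = sqrt(H(p_i))
    t_i = 1.0 - (1.0 - t) ** (beta_i / beta_ref)                         # masking probability
    return 1.0 / (t_i + eps)                                             # w_i = 1 / t_i
```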
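
And for variant 4, a sketch of the DS weights computed from held-out and per-sample losses:

```python
import torch

def ds_weights(val_losses, sample_losses):
    """Variant 4: distribution-shift weights for self-generated samples.

    val_losses: (N_v,) losses of M_L on a held-out validation set;
    sample_losses: (N,) losses of M_L on the candidate samples x_i.
    w'_i = mean(val_losses) / loss(x_i); weights below 1 are inverted.
    """
    w_prime = val_losses.mean() / sample_losses
    return torch.where(w_prime >= 1.0, w_prime, 1.0 / w_prime)
```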

3. Generalization to Quality-Scored Data

iw-SFT readily extends to situations where data points are associated with real-valued or ordinal quality scores $s(\tau) > 0$:

  • Sampling-based ("SFT(Q)"): Sample examples with probability proportional to $s(\tau)$, optimizing an expected log-likelihood objective.

  • Weighted loss: Attach the normalized weight $w(\tau) = s(\tau) / \mathbb{E}_D[s]$ and optimize

$$L_\text{iw-SFT(Q)}(\theta) = \mathbb{E}_{\tau \sim D}[w(\tau) \log p(\tau;\theta)].$$

  • Both mechanisms can be combined for doubly weighted objectives. In discrete quality settings, scores can be stratified and used as rewards in the iw-SFT formulation (Qin et al., 17 Jul 2025).
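
A brief sketch of the two mechanisms above, assuming a tensor of positive quality scores; both reduce to simple operations on $s(\tau)$:

```python
import torch

def quality_weights(scores):
    """iw-SFT(Q): normalized weights w(tau) = s(tau) / E_D[s]."""
    return scores / scores.mean()

def quality_sample_indices(scores, batch_size, generator=None):
    """SFT(Q): draw example indices with probability proportional to s(tau)."""
    probs = scores / scores.sum()
    return torch.multinomial(probs, batch_size, replacement=True, generator=generator)
```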

4. Implementation Considerations

iw-SFT introduces specific considerations distinct from standard SFT:

  • Reference policy ($\pi_\text{ref}$):

    • Typically set to the initial model checkpoint if the data-generating policy is unavailable.
  • Numerical stability:
    • Apply clipping or smoothing to log-ratio computations or weights, e.g., $g(x) = k \cdot \text{clip}(x, x_\text{min}, x_\text{max})$.
    • Normalize or cap the final weights $w_j$ within a specified range (see the sketch after this list).
    • Introduce optional KL constraints between $\theta_q$ and $\pi_\text{ref}$ to mitigate excessive variance.
  • Batching and computation:
    • Process full sequences to accumulate log-ratios before exponentiating.
    • Sequence-level weighting typically outperforms token-level weighting in autoregressive LLMs, whereas in diffusion LLMs (WeFT) token-level, entropy-driven weighting is preferred (Xu et al., 25 Sep 2025).
  • Hyperparameters:
    • Update frequency for the importance model ($\theta_q$).
    • Clipping parameters ($\alpha_\text{min}$, $\alpha_\text{max}$) and smoothing temperature $k$.
    • For reward-based methods, the temperature $\beta$ and clipping bounds for reward-induced weights (Li et al., 28 May 2024).
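
A small sketch of the weight-stabilization step mentioned under "Numerical stability", with illustrative bounds; clamping keeps outlier weights from dominating a batch, and renormalizing to mean one keeps the weighted loss on the same scale as standard SFT:

```python
import torch

def stabilize_weights(weights, w_min=0.1, w_max=10.0, normalize=True):
    """Clamp raw importance weights to [w_min, w_max] and optionally rescale
    them to have mean 1 within the batch. The bounds are illustrative."""
    w = torch.clamp(weights, w_min, w_max)
    if normalize:
        w = w / w.mean()
    return w
```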

5. Empirical Performance Across Domains

Importance weighting in SFT consistently yields improved empirical results:

  • LLM reasoning:
    • On AIME-2024, standard SFT (Qwen2.5-32B-Instruct) achieves ~56.7% accuracy; iw-SFT achieves 66.7%, closing half the performance gap to RL-tuned (proprietary) models.
    • Similar improvements observed on MATH500 (94.4% → 94.8%) and GPQA (60.6% → 64.1%) (Qin et al., 17 Jul 2025).
  • Continuous control (D4RL):
    • SFT(Q) on top-10% trajectories already outperforms BC and matches RL methods such as AWAC, TD3+BC, CQL, and IQL; iw-SFT(Q) further improves (e.g., Walker2D Medium-Replay: 66→75) (Qin et al., 17 Jul 2025).
    • Near-expert performance is reached on "Expert" data in all settings.
  • Data efficiency and robustness:
    • In low-data regimes (e.g., Franka Kitchen), iw-SFT(Q) yields 62% task completion using only 5% of expert data, outperforming BC (29%), SFT(5%) (46%), and SFT(Q) (58%) (Qin et al., 17 Jul 2025).
  • Diffusion LLMs:
    • WeFT (token-entropy iw-SFT) yields relative gains of 39%-83% on Sudoku, Countdown, GSM8K, and MATH-500 compared to SFT on identical budgets (Xu et al., 25 Sep 2025).
  • Alignment and reward learning:
    • Reward-model-based iw-SFT increases average benchmark scores from 59.48% to 61.03% on LLMs (7B parameter scale) (Li et al., 28 May 2024).
  • LLM self-improvement:
    • Distribution shift-based iw-SFT variant matches the gains of reward-model supervision in bootstrapping LLMs, improving average task accuracy from 34.0% (LMSI) to 40.4% (IWSI filtering) (Jiang et al., 19 Aug 2024).

6. Practical Impact and Future Extensions

iw-SFT provides a minimal-complexity route to exploit RL concepts in supervised updates:

  • Empirical proximity to RLHF: iw-SFT matches or exceeds full RLHF pipelines in several LLM and control settings, under purely supervised loss modification.
  • Model-agnostic weighting: Entropy, reward, distribution shift, or auxiliary estimators can serve as sources of importance weights, allowing broad adaptation.
  • Continued research directions: Extensions include learned density ratio estimation for better distribution shift weights, adaptive schedules for entropy-weighting in multi-stage SFT→RL→distillation pipelines, and trust-region or KL-constrained variants for variance control (Qin et al., 17 Jul 2025, Xu et al., 25 Sep 2025, Jiang et al., 19 Aug 2024).
  • Robustness to quality and domain drift: Filtering and weighting by true or surrogate importance mitigate the risk of model collapse from semantically spurious, noisy, or high-shift samples.

7. Comparison Table of Core iw-SFT Variants

| Variant | Weight Signal | Primary Domain | Key Reference |
|---|---|---|---|
| RL Lower Bound | $q(\tau)/\pi_\text{ref}(\tau)$ | LLMs, control | (Qin et al., 17 Jul 2025) |
| Reward-Model | $e^{r_\phi(x, y)/\beta}$ | LLM alignment | (Li et al., 28 May 2024) |
| Token Entropy (WeFT) | $1/t_i$ for $t_i \propto \sqrt{H(p_i)}$ | Diffusion LLMs | (Xu et al., 25 Sep 2025) |
| DS-Weight Filtering | Empirical $w_i^\text{DS}$ | Self-improving LLMs | (Jiang et al., 19 Aug 2024) |

Each instantiation derives from the common principle of aligning the SFT objective more closely with the true RL objective or desired data distribution by non-uniform sample weighting. The particular signal (reward, uncertainty, distribution density) and practical weighting scheme depend on domain, learning modality, and computational considerations.
