Importance-Weighted SFT (iw-SFT)

Updated 8 December 2025
  • iw-SFT is a framework that reweights supervised fine-tuning using importance sampling to derive sample-specific weights from rewards, uncertainty, or distribution shifts.
  • The method strengthens the theoretical connection to reinforcement learning by maximizing a tighter lower bound on the RL objective, matching or exceeding RLHF performance.
  • Empirical results across LLMs, diffusion models, and control tasks demonstrate significant gains in efficiency, robustness, and data effectiveness.

Importance-Weighted Supervised Fine-Tuning (iw-SFT) is a principled modification of standard supervised fine-tuning (SFT) that applies sample-specific weights derived from importance sampling, reward estimation, prediction uncertainty, or distribution shift. The framework strengthens the theoretical connection between SFT and reinforcement learning (RL) by optimizing a tighter lower bound on the RL objective than classical SFT. Empirically, iw-SFT matches or exceeds the performance of advanced RL or RLHF methods in both language modeling and control while requiring only supervised updates. Recent research demonstrates multiple instantiations of iw-SFT across LLMs, diffusion models, and imitation-learning domains, each with formal derivations, explicit algorithms, and practical empirical gains.

1. Theoretical Foundations

The central insight underpinning iw-SFT is that standard SFT, when performed on curated demonstration data, can be viewed as maximizing a lower bound on the expected RL return in a sparse-reward regime. Consider a trajectory $\tau$ (e.g., a token sequence $x$) with reward $R(\tau)$. For a policy $\pi_\theta$, the RL objective is

J(\theta) = \mathbb{E}_{\pi_\theta}[R(\tau)] = \int p(\tau;\theta)\, R(\tau)\, d\tau.

If only trajectories sampled from a reference policy $\pi_\text{ref}$ are available, importance sampling yields

J(\theta) = \int \pi_\text{ref}(\tau)\, \frac{p(\tau;\theta)}{\pi_\text{ref}(\tau)}\, R(\tau)\, d\tau.

A Jensen-type lower bound (using $x \geq 1 + \log x$) applied to the importance ratio gives

J(θ)πref(τ)R(τ)[1+logp(τ;θ)πref(τ)]dτ=const+Eπref[R(τ)logp(τ;θ)].J(\theta) \geq \int \pi_\text{ref}(\tau) R(\tau) [1 + \log \frac{p(\tau;\theta)}{\pi_\text{ref}(\tau)}] d\tau = \text{const} + \mathbb{E}_{\pi_\text{ref}}[R(\tau) \log p(\tau;\theta)].

When $R(\tau)$ is a binary indicator over a curated set $\mathcal{D}$, this reduces to the familiar SFT loss (up to an additive constant): $\mathcal{L}_\text{SFT}(\theta) = -\mathbb{E}_{\tau \sim \mathcal{D}}[\log p(\tau;\theta)]$. Introducing an auxiliary distribution $q(\tau)$ and reapplying the bound leads to a generalized importance-weighted loss: $\mathcal{L}_\text{iw-SFT}(\theta) = -\mathbb{E}_{\tau \sim \mathcal{D}}\left[\frac{q(\tau)}{\pi_\text{ref}(\tau)} \log p(\tau;\theta)\right]$, where $q(\tau)/\pi_\text{ref}(\tau)$ is the importance weight. As $q$ approaches $p(\cdot\,;\theta)$, the bound becomes tight and approaches the true RL return $J(\theta)$ (Qin et al., 17 Jul 2025).
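
To make the origin of the importance weight explicit, the bound $x \geq 1 + \log x$ can be reapplied to the ratio $p(\tau;\theta)/q(\tau)$ instead of $p(\tau;\theta)/\pi_\text{ref}(\tau)$; the following intermediate step is a sketch consistent with the derivation above (assuming $R(\tau) \geq 0$):

J(\theta) = \int \pi_\text{ref}(\tau)\, R(\tau)\, \frac{q(\tau)}{\pi_\text{ref}(\tau)} \cdot \frac{p(\tau;\theta)}{q(\tau)}\, d\tau \;\geq\; \int \pi_\text{ref}(\tau)\, R(\tau)\, \frac{q(\tau)}{\pi_\text{ref}(\tau)} \left[1 + \log \frac{p(\tau;\theta)}{q(\tau)}\right] d\tau,

with equality when $p(\cdot\,;\theta) = q$, which is why choosing $q$ close to the current policy tightens the bound.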

2. Algorithmic Instantiations

Several implementations of iw-SFT have been proposed, each adapted to a different domain and source of signal for the importance weights:

  1. Sequence-level iw-SFT (standard RL/SFT connection; see the code sketch following this list):
    • Maintain a slow-moving copy $\theta'$ of the model parameters for computing the auxiliary distribution $q(\tau) = p(\tau;\theta')$.
    • For each trajectory $\tau$ in a batch, compute the log-ratio $\log q(\tau) - \log \pi_\text{ref}(\tau)$.
    • Weight: $w(\tau) = q(\tau)/\pi_\text{ref}(\tau)$, with smoothing/clipping to keep $w(\tau)$ in a bounded range.
    • Update the main model via the weighted log-likelihood gradient.
    • Optionally update $\theta'$ periodically (Qin et al., 17 Jul 2025).
  2. Reward-based iw-SFT via Inverse RL:
    • Learn a reward model $r_\phi$ from demonstrations through a maximum-entropy IRL procedure.
    • Compute for each example an importance weight from the learned reward, e.g., $w(x) \propto \exp\big(r_\phi(x)/\beta\big)$ with temperature $\beta$, and reweight the SFT loss accordingly (Li et al., 2024).
  3. Token-level iw-SFT for diffusion LLMs (WeFT):
    • For each token $x_t$, compute the entropy $H_t$ of the model's predictive distribution.
    • Assign a per-token importance weight $w_t$ as a function of $H_t$.
    • Set each token's masking probability during training according to its weight, normalized over the sequence.
    • Weight the loss on each token by $w_t$ (Xu et al., 25 Sep 2025).
  4. Distribution shift-based weighting for self-generated data:
    • Define a "DS weight" for each self-generated sample, approximating the shift between the target data distribution and the self-generated distribution from model loss statistics on a held-out validation set.
    • Filter or weight generated samples by their DS weight before SFT (Jiang et al., 2024).
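
The following is a minimal sketch of the sequence-level variant (item 1 above), assuming HuggingFace-style causal language models whose forward pass exposes `.logits`; the helper names, clipping bounds, and use of a per-batch mean are illustrative assumptions rather than details taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids):
    """Sum of per-token log-probabilities of each sequence under `model`."""
    logits = model(input_ids).logits[:, :-1, :]        # predict token t+1 from tokens <= t
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)                      # shape: (batch,)

def iw_sft_loss(model, slow_model, ref_model, input_ids, w_min=0.1, w_max=10.0):
    """Importance-weighted SFT loss on a batch of curated sequences (reward = 1)."""
    with torch.no_grad():
        log_q = sequence_logprob(slow_model, input_ids)      # auxiliary q(tau) = p(tau; theta')
        log_ref = sequence_logprob(ref_model, input_ids)     # reference policy pi_ref(tau)
        w = torch.exp(log_q - log_ref).clamp(w_min, w_max)   # clipped importance weights
    log_p = sequence_logprob(model, input_ids)               # current policy log p(tau; theta)
    return -(w * log_p).mean()
```

In this sketch, `slow_model` is refreshed from `model` periodically and `ref_model` is typically the initial checkpoint (see Section 4); setting the weights to 1 recovers plain SFT on the curated data.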

3. Generalization to Quality-Scored Data

iw-SFT readily extends to situations where data points are associated with real-valued or ordinal quality scores $Q(\tau)$:

  • Sampling-based ("SFT(Q)"): Sample examples with probability proportional to $Q(\tau)$, optimizing an expected log-likelihood objective.

  • Weighted loss: Attach a normalized weight $w(\tau) \propto Q(\tau)$ to each example and optimize

\mathcal{L}(\theta) = -\sum_{\tau \in \mathcal{D}} w(\tau)\, \log p(\tau;\theta).

  • Both mechanisms can be combined for doubly weighted objectives. In discrete quality settings, scores can be stratified and used as rewards in the iw-SFT formulation (Qin et al., 17 Jul 2025).
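
A brief sketch of the two mechanisms, assuming per-example quality scores and per-example negative log-likelihoods are already available; all numerical values below are placeholders rather than data from the cited work.

```python
import torch

scores = torch.tensor([0.2, 0.9, 0.5, 0.7])             # quality scores Q(tau), placeholder values

# (a) Sampling-based SFT(Q): draw training examples with probability proportional to Q.
probs = scores / scores.sum()
batch_idx = torch.multinomial(probs, num_samples=2, replacement=True)

# (b) Weighted loss: attach normalized weights w(tau) to the per-example NLL.
weights = scores / scores.sum()
per_example_nll = torch.tensor([1.3, 0.8, 1.1, 0.9])     # -log p(tau; theta), placeholder values
loss = (weights * per_example_nll).sum()
```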

4. Implementation Considerations

iw-SFT introduces specific considerations distinct from standard SFT:

  • Reference policy ($\pi_\text{ref}$):
    • Typically set to the initial model checkpoint if the data-generating policy is unavailable.
  • Numerical stability:
    • Apply clipping or smoothing to log-ratio computations or weights, e.g., clamping $\log q(\tau) - \log \pi_\text{ref}(\tau)$ before exponentiation.
    • Normalize or cap the final weights $w(\tau)$ within a specified range.
    • Introduce optional KL constraints between $\pi_\theta$ and $\pi_\text{ref}$ to mitigate excessive variance.
  • Batching and computation:
    • Process full sequences to accumulate log-ratios before exponentiating.
    • Sequence-level weighting typically outperforms token-level in LLMs, whereas in diffusion models (WeFT), token-level entropy-driven weighting is dominant (Xu et al., 25 Sep 2025).
  • Hyperparameters:
    • Update frequency for the importance model ($\theta'$).
    • Clipping parameters ($w_{\min}$, $w_{\max}$) and the smoothing temperature.
    • For reward-based methods, the temperature $\beta$ and clipping bounds for reward-induced weights (Li et al., 2024).
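
The stability measures above can be collected into a small helper; this is an illustrative sketch, and the default clipping bounds are hypothetical values that would need tuning per setup.

```python
import torch

def stable_weights(log_q, log_ref, log_ratio_clip=5.0, w_min=0.1, w_max=10.0, normalize=True):
    """Clip log-ratios, bound the resulting weights, and renormalize them to mean 1."""
    log_ratio = (log_q - log_ref).clamp(-log_ratio_clip, log_ratio_clip)
    w = torch.exp(log_ratio).clamp(w_min, w_max)
    if normalize:
        w = w * (w.numel() / w.sum())    # keep the average weight at 1
    return w
```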

5. Empirical Performance Across Domains

Importance weighting in SFT consistently yields improved empirical results:

  • LLM reasoning:
    • On AIME-2024, standard SFT (Qwen2.5-32B-Instruct) achieves 56.7% accuracy; iw-SFT achieves 66.7%, closing half the performance gap to RL-tuned (proprietary) models.
    • Similar improvements observed on MATH500 (94.4% → 94.8%) and GPQA (60.6% → 64.1%) (Qin et al., 17 Jul 2025).
  • Continuous control (D4RL):
    • SFT(Q) on top-10% trajectories already outperforms BC and matches RL methods such as AWAC, TD3+BC, CQL, and IQL; iw-SFT(Q) further improves (e.g., Walker2D Medium-Replay: 66→75) (Qin et al., 17 Jul 2025).
    • Near-expert performance is reached on "Expert" data in all settings.
  • Data efficiency and robustness:
    • In low-data regimes (e.g., Franka Kitchen), iw-SFT(Q) yields 62% task completion using only 5% of expert data, outperforming BC (29%), SFT(5%) (46%), and SFT(Q) (58%) (Qin et al., 17 Jul 2025).
  • Diffusion LLMs:
    • WeFT (token-entropy iw-SFT) yields relative gains of 39%-83% on Sudoku, Countdown, GSM8K, and MATH-500 compared to SFT on identical budgets (Xu et al., 25 Sep 2025).
  • Alignment and reward learning:
    • Reward-model-based iw-SFT increases average benchmark scores from 59.48% to 61.03% on LLMs (7B parameter scale) (Li et al., 2024).
  • LLM self-improvement:
    • Distribution shift-based iw-SFT variant matches the gains of reward-model supervision in bootstrapping LLMs, improving average task accuracy from 34.0% (LMSI) to 40.4% (IWSI filtering) (Jiang et al., 2024).

6. Practical Impact and Future Extensions

iw-SFT provides a minimal-complexity route to exploit RL concepts in supervised updates:

  • Empirical proximity to RLHF: iw-SFT matches or exceeds full RLHF pipelines in several LLM and control settings while modifying only the supervised loss.
  • Model-agnostic weighting: Entropy, reward, distribution shift, or auxiliary estimators can serve as sources of importance weights, allowing broad adaptation.
  • Continued research directions: Extensions include learned density ratio estimation for better distribution shift weights, adaptive schedules for entropy-weighting in multi-stage SFT→RL→distillation pipelines, and trust-region or KL-constrained variants for variance control (Qin et al., 17 Jul 2025, Xu et al., 25 Sep 2025, Jiang et al., 2024).
  • Robustness to quality and domain drift: Filtering and weighting by true or surrogate importance mitigate the risk of model collapse from semantically spurious, noisy, or high-shift samples.

7. Comparison Table of Core iw-SFT Variants

| Variant | Weight Signal | Primary Domain | Key Reference |
| --- | --- | --- | --- |
| RL Lower Bound | Importance ratio $q(\tau)/\pi_\text{ref}(\tau)$ | LLMs, control | (Qin et al., 17 Jul 2025) |
| Reward-Model | Learned reward $r_\phi$ (max-entropy IRL) | LLM alignment | (Li et al., 2024) |
| Token Entropy (WeFT) | Entropy $H_t$ for each token $x_t$ | Diffusion LLMs | (Xu et al., 25 Sep 2025) |
| DS-Weight Filtering | Empirical distribution-shift (DS) weight | Self-improving LLMs | (Jiang et al., 2024) |

Each instantiation derives from the common principle of aligning the SFT objective more closely with the true RL objective or desired data distribution by non-uniform sample weighting. The particular signal (reward, uncertainty, distribution density) and practical weighting scheme depend on domain, learning modality, and computational considerations.
