Importance-Weighted SFT (iw-SFT)
- iw-SFT is a framework that reweights supervised fine-tuning using importance sampling to derive sample-specific weights from rewards, uncertainty, or distribution shifts.
- The method strengthens the theoretical connection to reinforcement learning by maximizing a tighter lower bound on the RL objective than standard SFT, matching or exceeding RLHF performance.
- Empirical results across LLMs, diffusion models, and control tasks demonstrate significant gains in efficiency, robustness, and data effectiveness.
Importance-Weighted Supervised Fine-Tuning (iw-SFT) is a principled modification of standard supervised fine-tuning (SFT) that leverages sample-specific weights derived from importance sampling, reward estimation, prediction uncertainty, or distribution shift. This framework tightens the theoretical connection between SFT and reinforcement learning (RL), providing a tighter lower bound on the RL objective than classical SFT. Empirically, iw-SFT matches or exceeds the performance of advanced RL or RLHF methods in both language modeling and control, while requiring only supervised updates. Recent research demonstrates multiple instantiations of iw-SFT across LLMs, diffusion models, and imitation learning domains, each with formal derivations, explicit algorithms, and practical empirical gains.
1. Theoretical Foundations
The central insight underpinning iw-SFT is that standard SFT, when performed on curated demonstration data, can be viewed as maximizing a lower bound on the expected RL return in a sparse-reward regime. Consider a trajectory $\tau$ (e.g., a token sequence) with reward $R(\tau)$. For a policy $\pi_\theta$, the RL objective is

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right].$$

If only trajectories sampled from a reference (data-generating) policy $\pi_{\mathrm{ref}}$ are available, importance sampling yields

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\, R(\tau)\right].$$

A Jensen-type lower bound (using $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$) applied to the importance ratio gives

$$\log J(\pi_\theta) \;\geq\; \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[\log \frac{\pi_\theta(\tau)\, R(\tau)}{\pi_{\mathrm{ref}}(\tau)}\right].$$

When $R(\tau)$ is a binary indicator over a curated set $\mathcal{D}$, this reduces to the familiar SFT loss (up to an additive constant):

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{\tau \sim \mathcal{D}}\left[\log \pi_\theta(\tau)\right].$$

Introducing an auxiliary distribution $q(\tau)$ and reapplying the bound leads to a generalized importance-weighted loss:

$$\mathcal{L}_{\mathrm{iw\text{-}SFT}}(\theta) = -\,\mathbb{E}_{\tau \sim \mathcal{D}}\left[w(\tau)\, \log \pi_\theta(\tau)\right] \;+\; \mathrm{const},$$

where $w(\tau) = q(\tau)/\pi_{\mathrm{ref}}(\tau)$. As $q$ approaches the optimal (reward-weighted) policy, this bound approaches the true RL return (Qin et al., 17 Jul 2025).
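A minimal PyTorch sketch of this weighted objective, assuming per-sequence log-likelihoods under $\pi_\theta$, $q$, and $\pi_{\mathrm{ref}}$ have already been computed; the function name and the clipping range are illustrative choices, not the reference implementation.

```python
import torch

def iw_sft_loss(logp_theta, logp_q, logp_ref, w_min=0.1, w_max=10.0):
    """Generalized importance-weighted SFT loss.

    logp_theta: per-sequence log-likelihoods under the trained policy pi_theta  [B]
    logp_q:     per-sequence log-likelihoods under the auxiliary distribution q [B]
    logp_ref:   per-sequence log-likelihoods under the reference policy pi_ref  [B]
    """
    # w(tau) = q(tau) / pi_ref(tau), computed in log-space for stability
    # and clipped to a bounded range (illustrative bounds).
    with torch.no_grad():
        w = torch.exp(logp_q - logp_ref).clamp(w_min, w_max)
    # Weighted negative log-likelihood: -E_D[w(tau) log pi_theta(tau)].
    return -(w * logp_theta).mean()
```

With $w(\tau) \equiv 1$ this reduces to the standard SFT loss, mirroring the binary-reward special case above.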
2. Algorithmic Instantiations
Several implementations of iw-SFT have been proposed, each adapted for different domains and sources of signal for importance weights:
- Sequence-level iw-SFT (standard RL/SFT connection; a training-step sketch follows this list):
- Maintain a slow-moving copy $\bar\theta$ of the model parameters for computing the auxiliary distribution $q = \pi_{\bar\theta}$.
- For each trajectory $\tau$ in a batch, compute the log-ratio $\log \pi_{\bar\theta}(\tau) - \log \pi_{\mathrm{ref}}(\tau)$.
- Weight: $w(\tau) = \exp\big(\log \pi_{\bar\theta}(\tau) - \log \pi_{\mathrm{ref}}(\tau)\big)$, with smoothing/clipping applied (e.g., $w \leftarrow \mathrm{clip}(w, w_{\min}, w_{\max})$).
- Update the main model via the weighted log-likelihood gradient.
- Optionally update $\bar\theta$ periodically (Qin et al., 17 Jul 2025).
- Reward-based iw-SFT via Inverse RL:
- Learn a reward model $r_\phi$ from demonstrations through a maximum-entropy IRL procedure.
- Compute for each example an importance weight derived from its estimated reward, e.g., $w_i \propto \exp\big(r_\phi(x_i, y_i)/\beta\big)$ with temperature $\beta$.
- Minimize the weighted negative log-likelihood of the data (Li et al., 28 May 2024).
- Token-level iw-SFT for diffusion LLMs (WeFT):
- For each token $x_t$, compute the entropy $H_t$ of the model's predictive distribution.
- Assign a per-token importance $w_t$ as a function of $H_t$.
- Set the masking probability $p_t \propto w_t$, with a sequence-level normalization constant.
- Weight the loss on each token by its importance weight (Xu et al., 25 Sep 2025).
- Distribution shift-based weighting for self-generated data:
- Define a "DS weight" for each self-generated sample from model loss statistics on a small held-out validation set, approximating the shift between the generated and target data distributions.
- Filter or weight generated samples by their DS weight before SFT (Jiang et al., 19 Aug 2024).
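To make the sequence-level recipe concrete, here is a schematic training step, assuming a HuggingFace-style causal LM that exposes `.logits`; the handles `model`, `q_model` (the slow copy $\pi_{\bar\theta}$), and `ref_model`, as well as the clipping range and refresh period, are illustrative assumptions rather than the published code.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask):
    """Sum of token log-probabilities for each sequence under `model`."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tok_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (tok_logp * attention_mask[:, 1:]).sum(dim=-1)

def iw_sft_step(model, q_model, ref_model, optimizer, batch,
                w_min=0.1, w_max=10.0):
    """One sequence-level iw-SFT update on a batch of curated trajectories."""
    input_ids, attn = batch["input_ids"], batch["attention_mask"]

    with torch.no_grad():  # importance weights are treated as constants
        logp_q = sequence_logprob(q_model, input_ids, attn)
        logp_ref = sequence_logprob(ref_model, input_ids, attn)
        w = torch.exp(logp_q - logp_ref).clamp(w_min, w_max)

    logp_theta = sequence_logprob(model, input_ids, attn)
    loss = -(w * logp_theta).mean()   # weighted log-likelihood objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The slow copy q_model can be refreshed periodically, e.g. every K steps:
#   if step % K == 0:
#       q_model.load_state_dict(model.state_dict())
```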
3. Generalization to Quality-Scored Data
iw-SFT readily extends to situations where data points are associated with real-valued or ordinal quality scores $Q(\tau)$:
- Sampling-based ("SFT(Q)"): sample examples with probability proportional to $Q(\tau)$ and optimize the expected log-likelihood under this reweighted data distribution.
- Weighted loss: attach a normalized weight $w(\tau) \propto Q(\tau)$ to each example and optimize the weighted log-likelihood $\mathbb{E}_{\tau \sim \mathcal{D}}\big[w(\tau)\,\log \pi_\theta(\tau)\big]$.
- Both mechanisms can be combined for doubly weighted objectives; a minimal sketch of both follows this list. In discrete quality settings, scores can be stratified and used as rewards in the iw-SFT formulation (Qin et al., 17 Jul 2025).
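A minimal sketch of the two mechanisms, assuming per-sequence log-likelihoods and non-negative quality scores are available; the normalization and sampler choices are illustrative.

```python
import torch

# Mechanism 1: SFT(Q) -- sample training examples with probability proportional
# to their (non-negative) quality score Q(tau). torch.multinomial accepts
# unnormalized non-negative weights.
def sample_indices(scores, batch_size):
    weights = torch.as_tensor(scores, dtype=torch.float32)
    return torch.multinomial(weights, batch_size, replacement=True)

# Mechanism 2: weighted loss -- attach a normalized weight w(tau) proportional
# to Q(tau) and minimize the weighted negative log-likelihood.
def weighted_sft_loss(logp_theta, scores):
    w = torch.as_tensor(scores, dtype=torch.float32)
    w = w / w.mean()            # normalize so the mean weight is 1
    return -(w * logp_theta).mean()
```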
4. Implementation Considerations
iw-SFT introduces specific considerations distinct from standard SFT:
- Reference policy ($\pi_{\mathrm{ref}}$):
- Typically set to the initial model checkpoint if the data-generating policy is unavailable.
- Numerical stability:
- Apply clipping or smoothing to log-ratio computations or weights, e.g., $w \leftarrow \mathrm{clip}(w, w_{\min}, w_{\max})$ (a sketch follows this list).
- Normalize or cap final weights within a specified range.
- Introduce optional KL-constraints between $\pi_\theta$ and $\pi_{\mathrm{ref}}$ to mitigate excessive variance.
- Batching and computation:
- Process full sequences to accumulate log-ratios before exponentiating.
- Sequence-level weighting typically outperforms token-level in LLMs, whereas in diffusion models (WeFT), token-level entropy-driven weighting is dominant (Xu et al., 25 Sep 2025).
- Hyperparameters:
- Update frequency for the importance model ($q = \pi_{\bar\theta}$).
- Clipping parameters ($w_{\min}$, $w_{\max}$) and smoothing temperature $\beta$.
- For reward-based methods, the temperature $\beta$ and clipping bounds for reward-induced weights (Li et al., 28 May 2024).
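The stability measures above might look as follows in practice; this is a sketch under the assumption that per-sequence log-ratios are already available, with placeholder values for the temperature, clip range, and weight cap.

```python
import torch

def stabilized_weights(log_ratio, temperature=1.0, clip=5.0, w_max=10.0):
    """Turn raw log-ratios log q(tau) - log pi_ref(tau) into bounded weights."""
    # Smooth with a temperature and clip in log-space before exponentiating.
    z = (log_ratio / temperature).clamp(-clip, clip)
    w = torch.exp(z)
    # Cap, then renormalize so the batch-mean weight is 1 (keeps loss scale stable).
    w = w.clamp(max=w_max)
    return w / w.mean()
```

An optional KL penalty between $\pi_\theta$ and $\pi_{\mathrm{ref}}$ can additionally be added to the loss itself; it is omitted here for brevity.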
5. Empirical Performance Across Domains
Importance weighting in SFT consistently yields improved empirical results:
- LLM reasoning:
- On AIME-2024, standard SFT (Qwen2.5-32B-Instruct) achieves 56.7% accuracy; iw-SFT achieves 66.7%, closing half the performance gap to RL-tuned (proprietary) models.
- Similar improvements observed on MATH500 (94.4% → 94.8%) and GPQA (60.6% → 64.1%) (Qin et al., 17 Jul 2025).
- Continuous control (D4RL):
- SFT(Q) on top-10% trajectories already outperforms BC and matches RL methods such as AWAC, TD3+BC, CQL, and IQL; iw-SFT(Q) further improves (e.g., Walker2D Medium-Replay: 66→75) (Qin et al., 17 Jul 2025).
- Near-expert performance is reached on "Expert" data in all settings.
- Data efficiency and robustness:
- In low-data regimes (e.g., Franka Kitchen), iw-SFT(Q) yields 62% task completion using only 5% of expert data, outperforming BC (29%), SFT(5%) (46%), and SFT(Q) (58%) (Qin et al., 17 Jul 2025).
- Diffusion LLMs:
- WeFT (token-entropy iw-SFT) yields relative gains of 39%-83% on Sudoku, Countdown, GSM8K, and MATH-500 compared to SFT on identical budgets (Xu et al., 25 Sep 2025).
- Alignment and reward learning:
- Reward-model-based iw-SFT increases average benchmark scores from 59.48% to 61.03% on LLMs (7B parameter scale) (Li et al., 28 May 2024).
- LLM self-improvement:
- Distribution shift-based iw-SFT variant matches the gains of reward-model supervision in bootstrapping LLMs, improving average task accuracy from 34.0% (LMSI) to 40.4% (IWSI filtering) (Jiang et al., 19 Aug 2024).
6. Practical Impact and Future Extensions
iw-SFT provides a minimal-complexity route to exploit RL concepts in supervised updates:
- Empirical proximity to RLHF: iw-SFT matches or exceeds full RLHF pipelines in several LLM and control settings while requiring only a modification of the supervised loss.
- Model-agnostic weighting: Entropy, reward, distribution shift, or auxiliary estimators can serve as sources of importance weights, allowing broad adaptation.
- Continued research directions: Extensions include learned density ratio estimation for better distribution shift weights, adaptive schedules for entropy-weighting in multi-stage SFT→RL→distillation pipelines, and trust-region or KL-constrained variants for variance control (Qin et al., 17 Jul 2025, Xu et al., 25 Sep 2025, Jiang et al., 19 Aug 2024).
- Robustness to quality and domain drift: Filtering and weighting by true or surrogate importance mitigate the risk of model collapse from semantically spurious, noisy, or high-shift samples.
7. Comparison Table of Core iw-SFT Variants
| Variant | Weight Signal | Primary Domain | Key Reference |
|---|---|---|---|
| RL Lower Bound | Importance ratio $q(\tau)/\pi_{\mathrm{ref}}(\tau)$ | LLMs, control | (Qin et al., 17 Jul 2025) |
| Reward-Model | Learned reward $r_\phi(x, y)$ from IRL | LLM alignment | (Li et al., 28 May 2024) |
| Token Entropy (WeFT) | Predictive entropy $H_t$ for each token $x_t$ | Diffusion LLMs | (Xu et al., 25 Sep 2025) |
| DS-Weight Filtering | Empirical distribution-shift estimate | Self-improving LLMs | (Jiang et al., 19 Aug 2024) |
Each instantiation derives from the common principle of aligning the SFT objective more closely with the true RL objective or desired data distribution by non-uniform sample weighting. The particular signal (reward, uncertainty, distribution density) and practical weighting scheme depend on domain, learning modality, and computational considerations.