Importance-Weighted SFT (iw-SFT)
- The paper introduces iw-SFT, which applies importance weighting to correct distribution mismatches between training data and the target policy, thereby providing a tighter lower bound on sparse-reward RL objectives.
- It leverages auxiliary proposal distributions, robust variance control techniques like KL anchoring and data rewriting, and adaptive fine-tuning strategies to maintain training stability.
- Empirical results demonstrate substantial performance gains in language modeling, mathematical reasoning, and continuous control, establishing iw-SFT as an efficient alternative to conventional RL methods.
Importance-Weighted Supervised Fine-Tuning (iw-SFT) generalizes conventional supervised fine-tuning (SFT) by explicitly correcting for distributional mismatch between training data and the target model policy through importance weighting. Recent work establishes iw-SFT as a principled bridge between SFT and reinforcement learning (RL), demonstrating that it both tightens the lower bound SFT provides on sparse-reward RL objectives and enables substantial performance gains in language modeling, mathematical reasoning, and continuous control domains. By leveraging auxiliary proposal distributions and robust variance-control mechanisms, iw-SFT achieves more faithful policy optimization with minimal computational overhead, and is easily adaptable to diverse data curation schemes and downstream applications.
1. Theoretical Foundations and Derivation
The canonical SFT objective is the maximum likelihood estimation (MLE) of a model on filtered or high-quality data, which can be construed as a loose lower bound on the expected RL return in sparse-reward settings. For an agent producing a trajectory $\tau$, the RL objective is:

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi}[R(\tau)],$$

where $R(\tau) \in \{0, 1\}$ is a terminal sparse reward indicator, and $p_\pi$ represents the trajectory distribution under policy $\pi$. If only “successful” (i.e., $R(\tau) = 1$) rollouts drawn from a reference policy $\pi_{\mathrm{ref}}$ are available, the RL objective via importance sampling is:

$$J(\pi) = \mathbb{E}_{\tau \sim p_{\pi_{\mathrm{ref}}}}\!\left[\frac{p_\pi(\tau)}{p_{\pi_{\mathrm{ref}}}(\tau)}\, R(\tau)\right].$$

Using the bound $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$ (Jensen's inequality) for $X > 0$, SFT arises as:

$$\log J(\pi) \ge \mathbb{E}_{\tau \sim p_{\pi_{\mathrm{ref}}}}\!\left[R(\tau)\, \log \frac{p_\pi(\tau)}{p_{\pi_{\mathrm{ref}}}(\tau)}\right] = \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[\log p_\pi(\tau)\right] + \text{const},$$

i.e., the log-likelihood of the successful trajectories $\mathcal{D}$ up to a constant independent of $\pi$. This connects SFT to the RL surrogate, but the bound loosens as the policy $\pi$ diverges from $\pi_{\mathrm{ref}}$.

The iw-SFT surrogate objective obtains a tighter lower bound by introducing a proposal $q$ and rewriting:

$$\log J(\pi) \ge \mathbb{E}_{\tau \sim p_{\pi_{\mathrm{ref}}}}\!\left[\frac{q(\tau)}{p_{\pi_{\mathrm{ref}}}(\tau)}\, R(\tau)\, \log \frac{p_\pi(\tau)}{q(\tau)}\right],$$

so that the importance weight is $w(\tau) = q(\tau)/p_{\pi_{\mathrm{ref}}}(\tau)$. This construction ensures the surrogate approaches exact RL as $q \to p_\pi$, albeit with increased estimator variance. In practice, $q$ is selected as a lagged or smoothed copy of $\pi$ to control bias–variance trade-offs (Qin et al., 17 Jul 2025).
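The weighted objective above reduces, per batch, to an SFT loss scaled by a detached importance weight. The following is a minimal sketch under the assumption that trajectory log-probabilities under the current policy, the proposal, and the reference policy are already available as scalars; `iw_sft_loss` is an illustrative name, not an API from the cited work.

```python
import math

def iw_sft_loss(logp_pi, logp_q, logp_ref):
    """Importance-weighted SFT loss for one batch of successful trajectories.

    logp_pi  : per-trajectory log-probs under the current policy (being trained)
    logp_q   : per-trajectory log-probs under the proposal q (e.g. a lagged copy)
    logp_ref : per-trajectory log-probs under the data-generating policy

    The weight w = q(tau) / p_ref(tau) multiplies the usual SFT term log pi(tau);
    w is treated as a constant, so no gradient would flow through it.
    """
    losses = []
    for lp_pi, lp_q, lp_ref in zip(logp_pi, logp_q, logp_ref):
        w = math.exp(lp_q - lp_ref)   # importance weight, detached
        losses.append(-w * lp_pi)     # weighted negative log-likelihood
    return sum(losses) / len(losses)
```

When `q` coincides with the reference policy the weight is 1 and this collapses to plain SFT, which makes the interpolation between the two regimes explicit.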
2. Variance Control and Implementation Strategies
Naive importance weighting incurs high estimator variance, particularly when the policy gap is large, leading to rapidly exploding weights and unstable optimization. Conventional mitigation includes:
- KL penalties/trust region methods (e.g., PPO, TRPO): constrain $\pi$ to stay close to $\pi_{\mathrm{ref}}$, without changing the data distribution.
- Clipping: enforces a bound on the weight, e.g. $w(\tau) \le c$ for a fixed threshold $c$, trading bias for bounded variance but leaving the data distribution unchanged.
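A minimal sketch of the clipping mitigation, using a symmetric PPO-style interval; the function name and the default `eps` are illustrative, not taken from the cited papers.

```python
import math

def clip_weight(log_w, eps=0.2):
    """Clip an importance weight into [1 - eps, 1 + eps], PPO-style.

    Bounds the estimator variance at the price of bias; note that the
    underlying data distribution is untouched, unlike data rewriting.
    """
    w = math.exp(log_w)
    return max(1.0 - eps, min(1.0 + eps, w))
```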
A more direct reduction of the policy gap at the data level is achieved by data rewriting. In this paradigm, supervised examples are partitioned into self-aligned (on-policy), guided-retell (near-policy), and fallback (off-policy) instances. For self-alignment, $K$ samples are drawn from the current policy and, if any matches the criterion for correctness, included as on-policy data. If self-alignment fails, a digest-and-retell prompt is used to guide the model to restate expert demonstrations, retaining these as near-policy data as appropriate. Only if both mechanisms fail does the fallback to original expert labels occur. This process constructs a new mixture distribution that lies closer to the current policy than the original expert data, directly reducing variance in subsequent iw-SFT (Zhao et al., 18 Sep 2025).
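The three-tier rewriting loop can be sketched as follows. All three callables (`sample_model`, `is_correct`, `retell`) are placeholders for components of a real pipeline — a policy sampler, an answer verifier, and a digest-and-retell prompt wrapper — and their names are assumptions of this sketch.

```python
def rewrite_dataset(examples, sample_model, is_correct, retell, k=4):
    """Three-tier data rewriting: self-aligned -> guided retell -> fallback.

    examples     : list of (prompt, expert_answer) pairs
    sample_model : prompt -> candidate answer from the current policy
    is_correct   : (prompt, answer) -> bool verifier (e.g. exact-match on math)
    retell       : (prompt, expert_answer) -> model restatement of the expert demo
    """
    rewritten = []
    for prompt, expert_answer in examples:
        # Tier 1: keep an on-policy sample if any of K rollouts is correct.
        candidates = [sample_model(prompt) for _ in range(k)]
        on_policy = [c for c in candidates if is_correct(prompt, c)]
        if on_policy:
            rewritten.append((prompt, on_policy[0], "self-aligned"))
            continue
        # Tier 2: guide the model to restate the expert demonstration.
        retold = retell(prompt, expert_answer)
        if is_correct(prompt, retold):
            rewritten.append((prompt, retold, "guided-retell"))
            continue
        # Tier 3: fall back to the original expert label.
        rewritten.append((prompt, expert_answer, "fallback"))
    return rewritten
```

The tags make it easy to track, per batch, how much of the training mixture is on-policy versus fallback data.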
3. Generalizations and Extensions
Quality-Scored Data
When trajectories are assigned scalar quality scores $s(\tau)$, iw-SFT naturally generalizes by forming score-stratified datasets $\mathcal{D}_s$ and sampling from their union $\mathcal{D} = \bigcup_s \mathcal{D}_s$. The importance-weighted objective extends accordingly. This allows graded supervision to be integrated into iw-SFT, refining alignment between the surrogate and true RL return (Qin et al., 17 Jul 2025).
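One plausible minimal scheme for folding scores into the pipeline — not the exact construction of the cited work — is to replace the binary success filter with a threshold and carry the score forward as a per-trajectory weight:

```python
def quality_weighted_batch(dataset, threshold=0.5):
    """Turn scored trajectories into (trajectory, weight) training pairs.

    dataset   : list of (trajectory, score) with score in [0, 1]
    threshold : minimum score to keep, replacing the hard 0/1 success filter

    The surviving score multiplies the usual importance weight, so graded
    supervision enters the loss the same way the reward indicator did.
    """
    return [(traj, score) for traj, score in dataset if score >= threshold]
```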
Token- and Group-Level Weighting
Variants such as SFT-GO assign importance weights on a per-token basis by segmenting sequences into groups (e.g., via TF-IDF, semantic, or excess-loss metrics). The group-based optimization objective is

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\max},$$

where $\mathcal{L}_{\max}$ is the maximum cross-entropy loss among token groups, $\mathcal{L}_{\mathrm{CE}}$ is the standard token-averaged loss, and $\alpha$ anneals over training (Kim et al., 17 Jun 2025). This construction emphasizes challenging or salient regions of each sequence, boosting worst-group performance and yielding $O(1/\sqrt{T})$ convergence under convexity assumptions.
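A sketch of one plausible form of this group-based blend, assuming a convex combination of the average loss and the worst-group loss; the function name and exact mixing form are assumptions, not the verbatim SFT-GO objective.

```python
def sft_go_loss(token_losses, groups, alpha):
    """Group-based objective sketch: blend average loss with the worst group.

    token_losses : per-token cross-entropy values for one sequence
    groups       : parallel list of group ids (e.g. from TF-IDF or excess loss)
    alpha        : mixing coefficient, annealed over training
    """
    overall = sum(token_losses) / len(token_losses)
    by_group = {}
    for loss, g in zip(token_losses, groups):
        by_group.setdefault(g, []).append(loss)
    # Worst group = the token group with the highest mean cross-entropy.
    worst = max(sum(v) / len(v) for v in by_group.values())
    return (1.0 - alpha) * overall + alpha * worst
```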
Anchored iw-SFT (ASFT) and Dynamic Fine-Tuning (DFT)
Reward-weighted regression places iw-SFT and DFT on a unified spectrum, varying the auxiliary distribution $q$. In DFT, $q$ is a stop-gradient copy of the current policy $\pi_\theta$, so that weighting is proportional to current model probabilities; this dynamically focuses training on likely outputs but can incur instability due to drift away from $\pi_{\mathrm{ref}}$. ASFT introduces a reverse-KL regularization, penalizing divergence from a frozen base model, and retains both the tightness of the RL lower bound and training stability (Zhu et al., 28 Sep 2025).
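A per-token sketch of the two endpoints of this spectrum. The DFT weight is the model's own detached probability; the anchored loss adds a reverse-KL-style penalty toward a frozen base model. Names (`dft_weight`, `asft_loss`) and the `beta` hyperparameter are illustrative assumptions, not the exact objectives of the cited paper.

```python
import math

def dft_weight(logp_token):
    """DFT-style token weight: the model's own (detached) probability."""
    return math.exp(logp_token)  # treated as a constant during backprop

def asft_loss(logp_pi, logp_base, weight, beta=0.1):
    """Anchored objective sketch: weighted NLL plus a reverse-KL-style anchor.

    logp_pi   : log-prob of the target token under the current policy
    logp_base : log-prob under the frozen base model (the anchor)
    weight    : detached importance/probability weight for this token
    beta      : anchor strength (hypothetical hyperparameter name)
    """
    nll = -weight * logp_pi
    anchor = beta * (logp_pi - logp_base)  # penalizes drift above the base model
    return nll + anchor
```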
4. Algorithmic Details and Practical Considerations
A summary of the iw-SFT workflow:
| Step | Description | Reference |
|---|---|---|
| Proposal construction | Lagged, EMA, or smoothed copy of $\pi$; in DFT, a stop-gradient copy of $\pi_\theta$ | (Qin et al., 17 Jul 2025; Zhu et al., 28 Sep 2025) |
| Importance ratio eval | Trajectory-level or token-level log-ratios, possibly clipped or smoothed | (Qin et al., 17 Jul 2025) |
| Weighted loss | Negative log-likelihood scaled by the detached importance weight $w(\tau)$ | (Qin et al., 17 Jul 2025) |
| Data rewriting | Preprocess to align distribution, reducing variance | (Zhao et al., 18 Sep 2025) |
| Regularization | KL-anchoring (ASFT), temperature scaling, group DRO, or clipping as needed | (Zhu et al., 28 Sep 2025; Kim et al., 17 Jun 2025) |
Empirically, incorporating proposal updates, bounded ratios, and data rewriting yields more robust and stable training. The additional computational cost is typically a single reference/proposal forward pass and minimal memory overhead due to lagged copies.
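The lagged/EMA proposal update in the workflow above can be sketched in a few lines; parameters are represented as flat lists of floats for illustration, and `tau` is an assumed smoothing-rate name.

```python
def ema_update(proposal_params, policy_params, tau=0.01):
    """Exponential-moving-average proposal update: q <- (1 - tau) * q + tau * pi.

    Keeping the proposal a slowly moving copy of the policy bounds the
    log-ratio log(pi / q), and hence the variance of the importance weights.
    """
    return [(1.0 - tau) * q + tau * p
            for q, p in zip(proposal_params, policy_params)]
```

A smaller `tau` makes the proposal lag further behind the policy, trading lower weight variance for a looser bound.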
5. Empirical Performance and Benchmarks
iw-SFT and its variants show substantial performance improvements across reasoning, language modeling, and offline control:
- Language modeling/reasoning (AIME, MATH 500, GPQA): For Qwen2.5-32B-Instruct trained on the S1.1K dataset, SFT achieved (AIME: 56.7%, MATH 500: 95.4%, GPQA: 63.6%) while iw-SFT attained (AIME: 66.7%, MATH 500: 94.8%, GPQA: 64.1%). Full-sequence weighting outperformed per-step weighting (Qin et al., 17 Jul 2025).
- Mathematical reasoning (Math500, Minerva Math, OlympiadBench, etc.): Data rewriting combined with DFT increased average accuracy on Qwen2.5-Math-7B from 23.23% (SFT) to 42.03% (Zhao et al., 18 Sep 2025).
- Continuous control (D4RL): On Hopper-Medium-Replay, SFT achieved 79.0, iw-SFT(Q) 85.0. On Walker2d-Medium-Replay, SFT 58.8, iw-SFT(Q) 75.8 (Qin et al., 17 Jul 2025).
- Instruction tuning: SFT-GO increased Llama-3.1-8B’s average performance from 45.12 to 47.33 points using LLMLingua-2 grouping and from 50.02 to 51.21 on Alpaca data (Kim et al., 17 Jun 2025).
- Anchored SFT (ASFT): On 10k medical data, accuracy improved from 33.37% (SFT) to 42.03% (ASFT); on math (100k), from 19.15% (SFT) to 30.50% (ASFT) (Zhu et al., 28 Sep 2025).
These results collectively demonstrate that iw-SFT closes much of the performance gap to more complex RL-based post-training while remaining efficient and stable.
6. Limitations and Stability Challenges
The principal technical challenge in iw-SFT is estimator variance due to large policy gaps. Data rewriting and KL-based anchoring directly address variance and drift instability by reshaping or constraining the data/model distribution. Alternative methods that rely solely on clipping or passive regularization are less effective, as they do not reduce divergence between data-generating and target policies. The effectiveness of each approach is highly sensitive to the choice of proposal distribution $q$, the method for computing or approximating reference-policy likelihoods on off-policy data, the group assignments in DRO-based variants, and the annealing schedule for hyperparameters controlling regularization.
7. Outlook and Practical Recommendations
Importance-weighted SFT provides a theoretically justified, practically robust enhancement to classical supervised fine-tuning, offering a continuum between pure SFT and off-policy RL. Variants such as data-rewritten iw-SFT and anchored DFT (ASFT) combine the sample efficiency and stability of SFT with the performance benefits of RL. Practical guidelines recommend using lagged proposal policies for , KL-anchoring for drift control, and aggressive data rewriting to minimize variance in off-policy settings. Temperature annealing and group-based weighting further enhance convergence and robustness. These techniques render iw-SFT highly adaptable to domains where reward supervision is sparse, label distributions are highly non-uniform, and performance on challenging or rare subsets is critical (Qin et al., 17 Jul 2025, Zhao et al., 18 Sep 2025, Kim et al., 17 Jun 2025, Zhu et al., 28 Sep 2025).