- The paper introduces InfoSFT, which strategically weights tokens based on their likelihood to optimize fine-tuning and reduce overfitting and catastrophic forgetting.
- It leverages a generalized likelihood weighting framework that emphasizes medium-confidence tokens to balance new learning with retention of existing capabilities.
- Empirical results across math, code, and reasoning tasks demonstrate that InfoSFT outperforms standard SFT and DFT, achieving better pass@k metrics and stability.
Motivation and Problem Statement
Supervised fine-tuning (SFT) is the de facto approach for adapting LLMs to new offline expert demonstrations. However, standard SFT exhibits two fundamental deficiencies: (i) poor test-time generalization due to overfitting on low-likelihood tokens/samples, and (ii) catastrophic forgetting, wherein adaptation to unlikely samples substantially distorts the base policy, degrading prior capabilities. Existing remedies (filtering, down-weighting, or regenerating low-likelihood data) attenuate overfitting but inadvertently suppress essential novel behaviors encoded in the data tail—precisely those the model must learn.
InfoSFT addresses the following: How can LLMs be robustly fine-tuned to expert data, maximizing acquisition of novel behaviors while minimizing overfitting and catastrophic forgetting? The central claim is that the optimal learning signal is concentrated not on uniformly or likelihood-proportionally chosen tokens, but on an "information-aware" regime: tokens with intermediate (medium) likelihood under the model, which are simultaneously informative yet not distributionally destabilizing.
Technical Approach
Generalized Likelihood Weighting Framework
InfoSFT recasts SFT as a response/token-level weighted objective that contains standard SFT (uniform weighting) and DFT (Wu et al., 2025) as special cases. For each sample (x, y∗), the weight is a function Q(q) of the model likelihood q = π0(y∗∣x):
- SFT: Q(q)=1/q (uniform gradient)
- DFT: Q(q)=1 (likelihood-proportional gradient)
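The summary does not spell out which gradient Q(q) multiplies. One reading that matches the two labels above, stated here as an assumption, is that Q(q) scales the gradient of the likelihood q itself (the symbols $\mathcal{L}$ and $\theta$ are introduced only for this illustration):

```latex
\begin{align}
\nabla_\theta \mathcal{L} &= -\,Q(q)\,\nabla_\theta q, \\
Q(q) = \tfrac{1}{q} &\;\Rightarrow\; \nabla_\theta \mathcal{L} = -\tfrac{1}{q}\,\nabla_\theta q = -\,\nabla_\theta \log q \quad \text{(SFT)}, \\
Q(q) = 1 &\;\Rightarrow\; \nabla_\theta \mathcal{L} = -\,\nabla_\theta q = -\,q\,\nabla_\theta \log q \quad \text{(DFT)}.
\end{align}
```

Under this reading, the effective weight on the log-likelihood gradient is $w(q) = q\,Q(q)$: constant for SFT ("uniform") and equal to $q$ for DFT ("likelihood-proportional").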
InfoSFT derives the optimal weighting function from a proximal (KL-constrained) update perspective, motivating the assignment:
$w_{\text{InfoSFT}}(q) = q \left( \operatorname{logit}(p) - \operatorname{logit}(q) \right)$
where $\operatorname{logit}(u) = \log\frac{u}{1-u}$ and p is a calibration constant approximating the average "expert" probability, estimated from the student model's own output distribution.
This weighting profile inherently suppresses both trivial (high-likelihood) and excessively rare (low-likelihood/noisy) tokens, focusing updates on tokens that are sufficiently surprising to contain new information but not so rare as to introduce instability.
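The shape of this profile can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the three weights are all expressed as multipliers on the log-likelihood gradient (so SFT is 1 and DFT is q), and the function names are hypothetical. The default p = 0.93 is the value quoted in this summary.

```python
import math


def logit(u: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    return math.log(u / (1.0 - u))


def weight_sft(q: float) -> float:
    """Uniform weight on the log-likelihood gradient (assumed reading of SFT)."""
    return 1.0


def weight_dft(q: float) -> float:
    """Likelihood-proportional weight (assumed reading of DFT)."""
    return q


def weight_infosft(q: float, p: float = 0.93) -> float:
    """Unclipped InfoSFT weight q * (logit(p) - logit(q))."""
    return q * (logit(p) - logit(q))


if __name__ == "__main__":
    # Print the three profiles on a small grid of token likelihoods.
    for q in (0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.93, 0.99):
        print(
            f"q={q:<6} SFT={weight_sft(q):.3f} "
            f"DFT={weight_dft(q):.3f} InfoSFT={weight_infosft(q):+.3f}"
        )
```

On this grid the InfoSFT weight is small for very rare tokens, largest at intermediate likelihoods, exactly zero at q = p, and negative above p.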
Theoretical Underpinnings
Section 4 formalizes the InfoSFT weighting rule in the context of an oracle-guided KL-proximal policy optimization framework. It is proven that, under a fixed KL divergence (distributional shift) budget with respect to the base model, the optimum tradeoff between learning new data and preserving prior performance is achieved by emphasizing medium-confidence tokens—those not already known (high-likelihood under the model) and not outliers (extremely low-likelihood).
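To make the notion of a KL budget concrete, here is a small sketch that measures how far an updated next-token distribution has moved from the base one. It is an illustration only: the direction KL(new ‖ base), the toy distributions, the budget value, and the helper names are all assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p_new: Sequence[float], p_base: Sequence[float]) -> float:
    """KL(p_new || p_base) in nats for two categorical distributions."""
    if len(p_new) != len(p_base):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for a, b in zip(p_new, p_base):
        if a > 0.0:
            if b <= 0.0:
                return math.inf  # new mass where the base model has none
            total += a * math.log(a / b)
    return total


def within_budget(
    p_new: Sequence[float], p_base: Sequence[float], budget: float
) -> bool:
    """True if the update stays inside the (hypothetical) KL budget."""
    return kl_divergence(p_new, p_base) <= budget


# Toy next-token distributions over a three-token vocabulary.
base = [0.70, 0.20, 0.10]
small_shift = [0.65, 0.25, 0.10]
large_shift = [0.10, 0.10, 0.80]

if __name__ == "__main__":
    for name, dist in (("small", small_shift), ("large", large_shift)):
        print(name, round(kl_divergence(dist, base), 4), within_budget(dist, base, 0.05))
```

A mild reallocation of probability costs little KL, while pushing most of the mass onto a token the base model considered unlikely consumes far more of the budget.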
InfoSFT's token weighting is shown mathematically to strictly dominate both SFT and DFT in reducing the expected KL to the true expert distribution under general, practical conditions relevant to SFT. Negative weights, which arise for overconfident tokens (likelihood q above p), are clipped in practice to maintain fluency and avoid degenerate updates.
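A minimal sketch of the clipped, weighted token loss follows. Several details are assumptions rather than statements from the paper: clipping is taken to mean clamping at zero, the loss is averaged over tokens, and the weight is treated as a constant (in an autodiff framework it would be excluded from the gradient). The function names and the `eps` guard are hypothetical.

```python
import math
from typing import Sequence


def logit(u: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(u / (1.0 - u))


def clipped_infosft_weight(q: float, p: float = 0.93, eps: float = 1e-12) -> float:
    """InfoSFT weight with negative values clamped to zero (assumed clipping rule)."""
    q = min(max(q, eps), 1.0 - eps)  # keep logit(q) finite at q = 0 or q = 1
    return max(0.0, q * (logit(p) - logit(q)))


def weighted_nll(token_probs: Sequence[float], p: float = 0.93) -> float:
    """Mean of -w(q_t) * log(q_t) over target tokens, with w held constant."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    total = 0.0
    for q in token_probs:
        w = clipped_infosft_weight(q, p)
        total += -w * math.log(max(q, 1e-12))
    return total / len(token_probs)


if __name__ == "__main__":
    # Model likelihoods of the target tokens in one toy response.
    probs = [0.99, 0.60, 0.30, 0.001]
    print([round(clipped_infosft_weight(q), 4) for q in probs])
    print(round(weighted_nll(probs), 4))
```

Compared with a plain negative log-likelihood, the only change is the multiplication by `w`, which is in the spirit of the single-line modification described in this summary.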
InfoSFT is closely related to (and in part justifiable as) an entropy correction to DFT; it can be seen as injecting an information-theoretic "surprisal" signal into the fine-tuning gradient.
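One way to see this relation is to expand the weight using $\operatorname{logit}(q) = \log q - \log(1-q)$. The algebra below follows directly from the formula for $w_{\text{InfoSFT}}$; whether it matches the paper's own derivation is not stated in this summary.

```latex
\begin{align}
w_{\text{InfoSFT}}(q)
  &= q\,\operatorname{logit}(p) - q\,\operatorname{logit}(q) \\
  &= \underbrace{\operatorname{logit}(p)\,q}_{\text{DFT weight, rescaled}}
   \;+\; \underbrace{\left(-\,q \log q\right)}_{\text{surprisal } -\log q \text{, weighted by } q}
   \;+\; \underbrace{q \log(1-q)}_{\le\, 0}
\end{align}
```

The first term is the DFT weight scaled by a constant, the second is an entropy-like surprisal term that vanishes as $q \to 0$ and $q \to 1$, and the third is non-positive and pulls the weight down as $q \to 1$.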
Empirical Results
InfoSFT is comprehensively evaluated on a variety of LLMs (Qwen-2.5-Math-1.5B/7B, Llama-3.1-8B) and tasks (math, code, chain-of-thought reasoning). Several robust findings are established:
- Math and Code: InfoSFT outperforms SFT and DFT on both MATH500 and AMC for the Qwen variants (e.g., +6 points on AMC over SFT/DFT for Qwen-2.5-Math-1.5B). For code (HumanEval), InfoSFT matches or exceeds DFT; on Llama-3.1-8B, InfoSFT achieves a 3-point advantage.
- Pass@k Metrics: InfoSFT excels at higher pass@k, a critical metric for downstream RL or sampling, indicating effective learning beyond greedy decoding. Notably, DFT often suffers entropy (diversity) collapse, while SFT maintains high output entropy but weaker pass@1; InfoSFT balances both.
- Catastrophic Forgetting: Across "Science Q&A" and "Tool Use" benchmarks, InfoSFT realizes superior new-task/prior-capability tradeoff curves. It consistently achieves higher new-task performance with less regression in prior capability than SFT or DFT, especially at comparable KL budgets (controlled via learning rate and training epochs).
- Reasoning (CoT) Data: On novel reasoning formats with very low initial likelihood, InfoSFT alone is suboptimal (as with DFT) since necessary tokens are downweighted; however, a hybrid SFT→InfoSFT curriculum capitalizes on SFT's strength in initial mode acquisition and InfoSFT's strength in robust subsequent learning.
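For readers unfamiliar with pass@k, the sketch below implements the widely used unbiased estimator introduced alongside HumanEval: given n sampled completions of which c are correct, it estimates the probability that at least one of k samples is correct. This summary does not say how the paper computes pass@k, so treating this estimator as the one used is an assumption.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of sampled completions for a problem
    c: number of those completions that are correct
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Toy example: 3 correct completions out of 10 samples.
    for k in (1, 5, 10):
        print(k, round(pass_at_k(10, 3, k), 4))
```

Because pass@k for larger k rewards having at least one correct sample among many, it is sensitive to output diversity, which is why an entropy collapse can hurt it even when pass@1 looks strong.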
InfoSFT is hyperparameter-light: it requires only a single-line change to the SFT loss function, with p robustly set to $0.93$ across tasks and model scales, which makes it practically attractive.
Implications, Limitations, and Future Directions
InfoSFT recasts supervised fine-tuning as an information-targeted process, providing a theoretically justified, empirically validated alternative to both uniform and simple likelihood-proportional weighting. Its primary implication is that post-training pipelines for LLMs should critically reevaluate the default SFT objective, particularly for settings where retention of prior capabilities is essential (e.g., safety, continual learning, multi-task generalization).
Practical adoption is straightforward, and InfoSFT can be layered with RLHF, SFT, or DFT-based pipelines. The method's success highlights the leverage available in basic objective design, even prior to data curation or reinforcement alignment.
There remain limitations in extremely "format-shifting" scenarios (e.g., CoT with wholly novel structure), where curriculum learning or multi-stage adaptation (SFT→InfoSFT), or more adaptive, context-conditional weighting, may prove superior. An open theoretical direction is extending the KL-proximal analysis to richer reward structures—potentially integrating external evaluators or discriminator signals beyond per-token likelihood.
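The two-stage SFT→InfoSFT idea can be sketched as a step-dependent token weight. This is an illustration under stated assumptions: the hard switch at a fixed `switch_step`, the zero clipping, and the function names are hypothetical, and the summary does not describe how the paper schedules the transition.

```python
import math


def logit(u: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(u / (1.0 - u))


def curriculum_weight(q: float, step: int, switch_step: int, p: float = 0.93) -> float:
    """Token weight for a hypothetical two-stage SFT -> InfoSFT curriculum.

    Before `switch_step`, every token gets the uniform SFT weight of 1.
    From `switch_step` on, tokens get the zero-clipped InfoSFT weight.
    """
    if step < switch_step:
        return 1.0
    q = min(max(q, 1e-12), 1.0 - 1e-12)  # keep logit(q) finite
    return max(0.0, q * (logit(p) - logit(q)))


if __name__ == "__main__":
    # A token from a novel format that the base model finds very unlikely.
    rare_q = 1e-4
    print(curriculum_weight(rare_q, step=0, switch_step=100))    # SFT phase
    print(curriculum_weight(rare_q, step=100, switch_step=100))  # InfoSFT phase
```

For a token with likelihood near 1e-4, the InfoSFT-phase weight is roughly a thousand times smaller than the SFT-phase weight, which illustrates why an initial SFT stage helps the model acquire a wholly new format before the information-aware weighting takes over.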
Another salient avenue is developing continual learning methods based on InfoSFT's insights, explicitly leveraging information-aware control for memory and stability in sequential post-training and domain adaptation contexts.
Conclusion
InfoSFT offers a rigorous solution to supervised LLM fine-tuning, centering on an information-aware, medium-confidence gradient signal that learns efficiently and forgets minimally. Across code, math, reasoning, and alignment-style tasks, InfoSFT secures state-of-the-art tradeoffs, making token-weighting a crucial principle in scalable, robust LLM post-training (2605.14967).