InfoSFT: Information-Aware Fine-Tuning
- InfoSFT is a family of supervised fine-tuning methods that incorporate explicit information signals like token confidence and reference model deviation.
- It employs techniques such as token weighting, deviation-aware objectives, and centered log-likelihoods to control policy shift and improve generalization.
- The approach optimizes training by reallocating gradient budgets based on metrics like perplexity, Fisher information, and sparse representational changes.
InfoSFT denotes a family of information-aware supervised fine-tuning formulations in which the standard SFT objective is no longer treated as “just cross-entropy,” but is modulated by an explicit signal about what the base model already knows, how far the updated policy moves from a reference model, how informative a token or example is, or how strongly a training distribution aligns with the model’s own generative distribution. In current usage, the label appears both as a broad organizing lens over SFT research and, in one specific instance, as the title of a token-weighted objective; across these usages, the common theme is to optimize supervised adaptation under an explicit information criterion rather than under uniform likelihood maximization alone (Xie et al., 2024, Sabbaghi et al., 14 May 2026, Harada et al., 17 Jun 2025, Zhang et al., 12 Feb 2026).
1. Terminology and conceptual scope
The term is not tied to a single algorithm. In one line of work, it refers to SFT objectives that monitor or constrain deviation from a frozen reference model through a log-likelihood ratio. In another, it denotes token-wise weighting that emphasizes “maximally informative, medium-confidence tokens.” Elsewhere it names broader analyses of SFT through perplexity, Fisher information, distribution alignment, or sparse representational structure. This suggests that “InfoSFT” is best understood as a research program centered on explicit information signals inside SFT rather than as a single canonical loss (Xie et al., 2024, Sabbaghi et al., 14 May 2026, Harada et al., 17 Jun 2025, Deb et al., 20 May 2025).
| Strand | Core information signal | Representative paper |
|---|---|---|
| Deviation-aware SFT | Log-likelihood ratio to a reference model | (Xie et al., 2024) |
| Token weighting | Medium-confidence token probability | (Sabbaghi et al., 14 May 2026) |
| On-policy SFT | Centered Log-Likelihood (CLL) | (Zhang et al., 12 Feb 2026) |
| Dataset selection | Perplexity under the base model | (Harada et al., 17 Jun 2025) |
| Data-efficient selection | Fisher information / Hessian log-det | (Deb et al., 20 May 2025) |
| Mixture scheduling | Overfitting-aware per-sub-dataset peaks | (Koh et al., 23 Mar 2026) |
| Mechanistic localization | Sparse latent drift or sparse carriers | (Chopra, 12 May 2026, Lin et al., 7 May 2026) |
A common misconception is to equate InfoSFT with explicit KL regularization alone. The literature is broader. Some methods use an explicit reference model throughout training; some use only current-model token probabilities; some operate at the level of dataset choice or mixture scheduling; and some are diagnostic rather than prescriptive. The shared premise is that SFT quality depends on where learning signal is allocated and how much information is injected relative to the base policy, not only on maximizing training-set likelihood.
2. Reference-model deviation and information-aware objectives
A central early formulation treats InfoSFT as deviation-aware SFT. “Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation” introduces MinorSFT, which imports ideas from DPO and MinorDPO into supervised fine-tuning by using the sequence-level log-likelihood ratio between the optimized model $\pi_\theta$ and a frozen reference model $\pi_{\mathrm{ref}}$,
$$\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},$$
both as a discrepancy measure and as the basis for dynamic weighting (Xie et al., 2024).
The paper’s starting point is the asymmetry between RLHF-style methods and plain SFT. PPO uses an explicit KL penalty to the reference model; DPO, IPO, KTO, and MinorDPO all incorporate KL-like or reference-relative controls. By contrast, plain SFT simply maximizes the expected log-likelihood of the demonstrated answer,
$$\mathbb{E}_{(x, y)}\big[\log \pi_\theta(y \mid x)\big],$$
with no explicit term limiting distance to the reference model $\pi_{\mathrm{ref}}$. The paper argues that this creates “no explicit control of deviation from the base model,” “risk of overfitting to instruction/domain data,” “loss of generality/diversity,” and “instability / sensitivity to learning rate” (Xie et al., 2024).
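MinorSFT addresses this through a sample-wise coefficient applied to the standard SFT gradient:
$$\frac{2}{m}\,\sigma\!\left(-\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right),$$
where $m$ is the answer length in tokens, $\sigma$ is the logistic sigmoid, and $\beta$ is a scaling hyperparameter. This produces a distinctive behavior. Early in training, when $\pi_\theta \approx \pi_{\mathrm{ref}}$, the sigmoid is close to $1/2$ and the coefficient is approximately the usual token average $1/m$. If the optimized model already assigns much higher probability to a sample than the reference, the coefficient shrinks toward zero; if the optimized model still assigns lower probability than the reference, the coefficient grows toward $2/m$. The method therefore downweights already-overfit or “easy” samples and emphasizes “hard or underfit samples.”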
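The limiting behavior described above can be checked numerically. The following is a minimal sketch, assuming the coefficient has the sigmoid form $(2/m)\,\sigma(-\beta \cdot \text{log-ratio})$, which is chosen here because it matches the three regimes in the text. The function names and the default `beta=1.0` are illustrative assumptions, not values taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def minor_sft_coefficient(logp_policy, logp_ref, num_tokens, beta=1.0):
    """Sample-wise weight (2/m) * sigmoid(-beta * log-ratio).

    logp_policy / logp_ref are sequence-level log-probabilities of the same
    answer under the tuned model and the frozen reference model.
    The sigmoid form and beta=1.0 are assumptions made for illustration.
    """
    log_ratio = logp_policy - logp_ref
    return (2.0 / num_tokens) * sigmoid(-beta * log_ratio)


# Three regimes for a 10-token answer:
same = minor_sft_coefficient(-5.0, -5.0, 10)      # policy == reference
overfit = minor_sft_coefficient(-2.0, -5.0, 10)   # policy already above reference
underfit = minor_sft_coefficient(-8.0, -5.0, 10)  # policy still below reference
```

In this sketch `same` equals the plain token average $1/m$, `overfit` falls below it, and `underfit` rises toward, but never reaches, $2/m$.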
The same work introduces a normalized deviation metric, the log-likelihood ratio divided by the answer length $m$,
$$\frac{1}{m} \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},$$
explicitly normalized for answer length and independent of the coefficient’s scaling hyperparameter $\beta$. On Qwen2-7B-Instruct, with domain adaptation on a private finance-related corpus and evaluation on FinanceIQ, fineval, and ceval-exam, the best reported MinorSFT setting used a larger learning rate than the best raw SFT setting. MinorSFT achieved the best accuracy on all three datasets, while its deviation metric remained lower than raw SFT even at the larger learning rate (Xie et al., 2024).
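A short sketch of why length normalization matters for such a metric, assuming the metric is the sequence-level log-likelihood ratio divided by the number of answer tokens. The function name is hypothetical.

```python
def normalized_deviation(logp_policy, logp_ref, num_tokens):
    """Per-token log-likelihood ratio between tuned and reference model.

    Assumed form: (log pi_theta(y|x) - log pi_ref(y|x)) / m.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return (logp_policy - logp_ref) / num_tokens


# Two answers with the same per-token drift but very different lengths.
short_answer = normalized_deviation(-9.5, -10.0, 10)
long_answer = normalized_deviation(-95.0, -100.0, 100)
```

Without the division by length, the long answer would appear ten times more "deviated" than the short one even though the model has moved equally far per token.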
Conceptually, this line defines InfoSFT as implicit KL-regularized learning without an explicit KL term: the base model acts as a prior, and the posterior/prior ratio modulates gradient flow. The concrete contribution is not merely a new loss, but a change in what SFT is optimizing for: supervised fit under controlled policy shift.
3. Token-level information weighting and on-policy SFT
A later paper titled “InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting” recasts the problem at token resolution. Its premise is that uniform SFT gives every demonstrated token equal weight even though extremely low-likelihood tokens induce large updates and substantial policy shift, while very high-likelihood tokens are already “solved” and need little additional pressure. The proposed token weight depends on the probability $p$ that the current model assigns to the demonstrated token and on an estimated average expert token probability, which is fixed to a constant as a robust default (Sabbaghi et al., 14 May 2026).
The weighting is bell-shaped in confidence. As $p \to 0$, the weight also tends to zero, so extremely unlikely tokens receive vanishing weight. Once $p$ is high enough, a clipped factor in the weight becomes zero, so high-confidence tokens receive no further push. The peak lies in the middle-confidence region. The paper therefore defines the target of learning as “maximally informative, medium-confidence tokens.” In a proximal-update analysis under a fixed KL budget from the base model, it derives an oracle weighting rule and uses the proposed weight as a practical approximation (Sabbaghi et al., 14 May 2026).
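To make the bell shape concrete, here is a purely illustrative weight with the qualitative properties described above: it vanishes as the token probability goes to zero, it is clipped to zero above a cap, and it peaks in between. This is not the paper's formula. The functional form `p * max(0, cap - p)`, the default `cap=0.8`, and the normalization by total weight are all assumptions made for this sketch.

```python
import math


def bell_weight(p, cap=0.8):
    """Hypothetical bell-shaped weight; zero at p=0 and for p >= cap."""
    return p * max(0.0, cap - p)


def weighted_token_loss(token_probs, cap=0.8):
    """Weighted negative log-likelihood over demonstrated tokens.

    Weights are treated as constants (no gradient through them) and the
    loss is normalized by the total weight, both assumptions of this sketch.
    """
    weights = [bell_weight(p, cap) for p in token_probs]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    # Tokens with zero weight are skipped, which also avoids log(0).
    weighted = sum(-w * math.log(p) for w, p in zip(weights, token_probs) if w > 0.0)
    return weighted / total


probs = [0.001, 0.05, 0.4, 0.75, 0.95]
weights = [bell_weight(p) for p in probs]
```

With these toy probabilities the largest weight falls on the medium-confidence token (`p = 0.4`), while the near-zero and near-one tokens contribute little or nothing.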
Empirically, this objective improves generalization across math, code, and chain-of-thought settings. On Qwen-2.5-Math-1.5B fine-tuned on NuminaMath-CoT, MATH500 acc@1 rose from 61.6 with SFT and 59.2 with DFT to 66.2 with InfoSFT. On Qwen-2.5-Math-7B, the corresponding numbers were 53.4, 65.4, and 69.7. On Llama-3.1-8B, they were 24.0, 15.5, and 27.8. The paper also reports better pass@k behavior and a better trade-off between new-task performance and prior-capability retention in Science QA and Tool Use continual-fine-tuning experiments (Sabbaghi et al., 14 May 2026).
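A distinct but closely related formulation appears in “Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training.” That work argues that the main gap between SFT and RL is on-policy data. Its key discriminant is the Centered Log-Likelihood (CLL), the log-likelihood of the observed token minus its expected value under the model’s own next-token distribution:
$$\mathrm{CLL}_t = \log \pi_\theta(y_t \mid x, y_{<t}) - \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, y_{<t})}\big[\log \pi_\theta(y \mid x, y_{<t})\big].$$
Under in-distribution sampling, $\mathbb{E}[\mathrm{CLL}_t] = 0$, so cumulative CLL forms a martingale; under OOD data, it has negative drift driven by the mismatch between the data distribution and the model distribution. On this basis the paper defines In-Distribution Finetuning (IDFT) as a training objective and Hinted Decoding as a decoding procedure that fuses an imitator distribution and the main model distribution. The reported result is generalization performance on par with DPO and SimPO on objective tasks while keeping an SFT pipeline (Zhang et al., 12 Feb 2026).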
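The centering step can be sketched directly from next-token logits. The code below is a minimal illustration, assuming CLL is the token log-probability minus its expectation under the model's own next-token distribution (equivalently, log-probability plus entropy). Function names are hypothetical and the logits are toy values.

```python
import math


def log_softmax(logits):
    """Stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def centered_log_likelihood(logits, token_id):
    """log pi(token) - E_{y ~ pi}[log pi(y)] for one decoding step."""
    logp = log_softmax(logits)
    expected = sum(math.exp(lp) * lp for lp in logp)  # equals -entropy
    return logp[token_id] - expected


def cumulative_cll(step_logits, token_ids):
    """Sum of per-step CLL values along a sequence."""
    return sum(centered_log_likelihood(l, t) for l, t in zip(step_logits, token_ids))


logits = [2.0, 0.0, -1.0]
probs = [math.exp(lp) for lp in log_softmax(logits)]
# Expected CLL when the token is sampled from the model itself.
on_policy_mean = sum(p * centered_log_likelihood(logits, i) for i, p in enumerate(probs))
```

By construction `on_policy_mean` is zero up to floating-point error, which is the per-step property behind the martingale statement. A sequence whose tokens are systematically less likely than the model expects accumulates negative CLL.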
Taken together, these token-level methods define InfoSFT as a family of supervised objectives that allocate gradient budget according to an information signal. The signal may be medium confidence, centered likelihood, or estimated distribution mismatch, but the principle is the same: supervised data should not be fitted uniformly.
4. Data, perplexity, Fisher information, and mixture scheduling
Another strand locates information not in the token loss but in the training corpus. “Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality” trains 1,070 SFT models, with 1,059 successful runs, across 12 base models, 10 SFT datasets, and 12 benchmarks. Its strongest empirical rule is simple: lower perplexity of the training data under the base model predicts larger SFT performance gains. By contrast, semantic similarity between train and test, measured by BERTScore F1, was not a comparably strong predictor, and average token length showed only a modest correlation (Harada et al., 17 Jun 2025).
This result reorients dataset choice. The paper reports that Alpaca and UltraChat produce broad improvements, math datasets strongly boost MATH and GSM8K, Magicoder improves HumanEval and MBPP and often transfers more broadly than math corpora, while FLAN subsets are often harmful outside matched domains. It also reports that 5 principal components explain a large share of the variance in the dataset-by-benchmark gain matrices, indicating a shared low-dimensional structure, while dataset-dataset synergies remain highly model-specific (Harada et al., 17 Jun 2025). A plausible implication is that an InfoSFT pipeline should screen candidate corpora by base-model perplexity before worrying about topical similarity.
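A perplexity screen of this kind can be sketched as follows, assuming per-token log-probabilities under the base model have already been computed for each candidate corpus. The corpus names, the toy numbers, and the choice of a token-weighted corpus-level perplexity are illustrative assumptions.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability over a list of tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_datasets_by_perplexity(dataset_logprobs):
    """Rank corpora from lowest to highest base-model perplexity.

    dataset_logprobs maps a corpus name to a list of examples, each a list
    of per-token natural-log probabilities under the base model.
    """
    scores = {}
    for name, examples in dataset_logprobs.items():
        tokens = [lp for example in examples for lp in example]
        scores[name] = perplexity(tokens)
    return sorted(scores.items(), key=lambda item: item[1])


# Toy inputs: corpus_a looks familiar to the base model, corpus_b does not.
toy = {
    "corpus_b": [[math.log(0.1)], [math.log(0.1), math.log(0.1)]],
    "corpus_a": [[math.log(0.5), math.log(0.5)]],
}
ranking = rank_datasets_by_perplexity(toy)
```

The ranking is only a screening heuristic in this sketch; it says nothing about whether a low-perplexity corpus covers the target task.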
“FisherSFT: Data-Efficient Supervised Fine-Tuning of LLMs Using Information Gain” makes this point fully formal. It linearizes the LLM at the last layer as multinomial logistic regression over pre-logit features, uses the Hessian/Fisher information of the log-likelihood, and selects a subset of sentences maximizing a D-optimal log-det criterion. The ideal objective maximizes the log-determinant of the Hessian of the log-likelihood over the selected subset, and the paper replaces it with a tractable log-det surrogate.
The paper then applies a greedy design algorithm and reports that in GPT-2 Shakespeare fine-tuning, GPT-4o preferred FisherSFT generations to baseline methods more than 50% of the time at all evaluated subset budgets (Deb et al., 20 May 2025).
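A greedy D-optimal design can be sketched in a few lines. The version below is a simplification, not the paper's algorithm: it assumes each candidate example is summarized by a single feature vector, scores a subset by `log det(ridge * I + sum of x x^T)`, and uses the matrix determinant lemma with a Sherman-Morrison update so that no determinant is ever computed explicitly. The `ridge` term and all names are assumptions.

```python
import math


def greedy_logdet_selection(features, budget, ridge=1.0):
    """Greedily pick indices that maximize log det(ridge*I + sum x x^T)."""
    dim = len(features[0])
    # Inverse of the design matrix A, starting from A = ridge * I.
    a_inv = [[(1.0 / ridge) if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    remaining = list(range(len(features)))
    selected = []
    for _ in range(min(budget, len(features))):
        best_idx, best_quad, best_u = None, -1.0, None
        for idx in remaining:
            x = features[idx]
            u = [sum(a_inv[i][j] * x[j] for j in range(dim)) for i in range(dim)]
            quad = sum(x[i] * u[i] for i in range(dim))  # x^T A^{-1} x
            # Determinant lemma: gain in log det is log(1 + quad),
            # so comparing quad is equivalent to comparing gains.
            if quad > best_quad:
                best_idx, best_quad, best_u = idx, quad, u
        selected.append(best_idx)
        remaining.remove(best_idx)
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - u u^T / (1 + quad).
        denom = 1.0 + best_quad
        a_inv = [
            [a_inv[i][j] - best_u[i] * best_u[j] / denom for j in range(dim)]
            for i in range(dim)
        ]
    return selected


def logdet_gain(quad):
    """Log-det increase for a candidate with quadratic form value quad."""
    return math.log1p(quad)


candidates = [[2.0, 0.0], [1.8, 0.2], [0.0, 1.0]]
chosen = greedy_logdet_selection(candidates, budget=2)
```

In the toy input the second candidate is nearly a copy of the first, so after the first pick the criterion prefers the orthogonal third candidate. This is the sense in which a log-det objective rewards information gain over redundancy.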
Mixture scheduling extends information-aware allocation from subset choice to training time. “mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT” treats each sub-dataset as having its own optimal compute budget. It repeatedly trains on the active mixture, identifies the earliest-overfitting sub-dataset, rolls back to that checkpoint, and excludes the overfit dataset from future training. Across 6 base models and 10 benchmarks, the reported average accuracies are 61.9 for standard SFT, 62.1 for DynamixSFT, 62.5 for IES, and 63.7 for mSFT (Koh et al., 23 Mar 2026). Although this work does not use the InfoSFT label, it operationalizes an allied principle: compute should be reallocated when a sub-dataset’s marginal contribution to generalization becomes non-positive.
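The train, detect, roll back, exclude loop can be sketched on toy data. This is a heavily simplified stand-in rather than the paper's procedure: it assumes per-sub-dataset validation-loss curves are given in advance and do not change when a sub-dataset is removed, whereas a real run would retrain after every rollback. Dataset names and loss values are made up.

```python
def earliest_overfit(val_curves, active, start_step):
    """Find the active sub-dataset whose validation loss bottoms out first.

    Returns (name, step), where step is the checkpoint index at which that
    sub-dataset reaches its minimum validation loss at or after start_step.
    """
    best = None
    for name in active:
        tail = val_curves[name][start_step:]
        peak = start_step + tail.index(min(tail))
        if best is None or peak < best[1]:
            best = (name, peak)
    return best


def msft_schedule(val_curves):
    """Return [(excluded_dataset, rollback_step), ...] in exclusion order."""
    active = sorted(val_curves)
    step = 0
    schedule = []
    while active:
        name, peak = earliest_overfit(val_curves, active, step)
        schedule.append((name, peak))
        active.remove(name)  # exclude the overfit sub-dataset
        step = peak          # roll back to the checkpoint where it peaked
    return schedule


# Toy validation-loss curves indexed by checkpoint.
curves = {
    "math": [1.0, 0.8, 0.7, 0.75, 0.8],
    "code": [1.0, 0.9, 0.8, 0.7, 0.6],
    "chat": [1.0, 0.7, 0.8, 0.9, 1.0],
}
schedule = msft_schedule(curves)
```

Here the schedule drops `chat` first, then `math`, and trains longest on `code`, so each sub-dataset receives a different effective compute budget.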
5. Mechanistic and representational interpretations
InfoSFT is also a mechanistic perspective on what SFT changes inside a model. “A Mechanistic Investigation of Supervised Fine Tuning” shows that raw residual-stream cosine similarity before and after SFT remains extremely high across all tasks and layers, yet sparse latent activations obtained from frozen GemmaScope 2 SAEs diverge sharply. For MultiNLI at step 2000, raw cosine at layer 22 is 0.960, while latent cosine in a 262k-width SAE is 0.557. The paper therefore argues that dense activation geometry can remain nearly unchanged while the model substantially rewrites which sparse semantic features are active and how strongly (Chopra, 12 May 2026).
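A two-dimensional toy example shows how a high dense cosine can coexist with a low sparse-latent cosine. It assumes a hypothetical thresholded ReLU encoder whose feature directions are unit vectors spaced 0.1 radians apart; nothing here reproduces GemmaScope or the paper's measurements.

```python
import math


def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_sae_encode(h, angles, threshold=0.99):
    """ReLU(d_k . h - threshold) for unit directions d_k at the given angles."""
    return [
        max(0.0, math.cos(a) * h[0] + math.sin(a) * h[1] - threshold)
        for a in angles
    ]


angles = [0.1 * k for k in range(8)]         # hypothetical feature directions
h_before = [1.0, 0.0]                        # activation before fine-tuning
h_after = [math.cos(0.3), math.sin(0.3)]     # rotated by 0.3 rad afterwards

dense_cos = cosine(h_before, h_after)
latent_cos = cosine(toy_sae_encode(h_before, angles), toy_sae_encode(h_after, angles))
```

The small rotation leaves `dense_cos` above 0.95, but the two activations fire disjoint sets of features, so `latent_cos` collapses to zero. The threshold makes each feature selective for a narrow cone of directions, which is what amplifies a small dense change into a large sparse one.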
That same work reports task- and layer-specific structure in the update. Early layers are low-rank; late layers are more distributed. WildJailbreak safety alignment is singled out by a distinctive layer-wise profile: layer 7 to layer 22 flipped-feature ratio is 3.00, whereas the corresponding ratios for GSM8K, MultiNLI, and ToolCalling are 0.50, 1.00, and 0.20. Safety alignment therefore appears to rewrite shallower representations more heavily than the capability tasks examined (Chopra, 12 May 2026).
A different mechanistic line asks whether SFT behaviors can be localized and reversed. “Crafting Reversible SFT Behaviors in LLMs” introduces Loss-Constrained Dual Descent (LCDD), which compresses an SFT-induced behavior into a sparse “carrier,” and SFT-Eraser, a soft prompt that matches carrier activations back to base-model activations. For Fixed Response behavior, LCDD yields sparsity between 73% and 84% while preserving fixed-response rates of 97.5–100%; the trigger then drives fixed-response rate to 0–1% and shifts output distributions back toward the base model (Lin et al., 7 May 2026). The paper argues that sparse structure is the precondition for reversal, since the same trigger optimization is much less effective on standard SFT models without a crafted carrier.
“Procedural-skill SFT across capacity tiers” pushes the same idea into a controlled capability setting. On Qwen3.5 0.8B, 2B, and 4B, using 353 curated procedural-skill demonstrations, the paper reports an SFT-attributable procedural lift for each model size and describes the pre-SFT base trajectory as W-shaped, with Claude Haiku 4.5 as an additional reference point. The paper’s interpretation is “regime-asymmetric”: SFT works hardest in absolute terms where the base struggles with the procedure (Strozzi, 12 May 2026). This suggests that an information-aware reading of SFT must distinguish what the corpus adds from what the base model can already do with the prompted interface.
6. Empirical synthesis, misconceptions, and open directions
Several conclusions recur across the literature. First, InfoSFT is not identical to “downweight low-likelihood data.” The 2026 token-weighting paper explicitly criticizes schemes that suppress low-likelihood tokens indiscriminately, because such tokens often encode the novel behavior the base model has yet to learn; its solution is to emphasize medium-confidence tokens instead of either all tokens or only high-likelihood ones (Sabbaghi et al., 14 May 2026). Second, more data is not uniformly better. The large-scale 2025 study reports that 1k-example SFTs often sit near the center of the instruction-following manifold, while 20k-example variants move to the periphery and do not consistently outperform (Harada et al., 17 Jun 2025). Third, high activation cosine does not imply negligible internal change: sparse-feature analyses show large representational drift beneath small dense rotations (Chopra, 12 May 2026).
The field also has unresolved tensions. Deviation metrics such as the normalized log-likelihood ratio in MinorSFT track how far the model moves from the base, but do “not directly tell you if the model is over- or underfitting” relative to the right objective (Xie et al., 2024). Distribution-alignment methods such as DDT, IDFT, and Hinted Decoding perform best in objective-correctness domains like math, while the same paper explicitly leaves subjective value alignment and fully online on-policy loops as open directions (Zhang et al., 12 Feb 2026). Large-scale SFT analyses remain concentrated around the 7B–9B regime, and the 2025 study explicitly asks whether the perplexity law and mid-layer dominance hold at 34B, 70B, or frontier scales (Harada et al., 17 Jun 2025).
Mechanistic work likewise leaves open problems. Reversible-carrier methods currently use soft prompts rather than discrete triggers, and style behaviors are materially less compressible than fixed-response or safety behaviors (Lin et al., 7 May 2026). SAE-based drift analysis has so far been run on Gemma 3 1B IT with a fixed interpretability basis, and its authors point to RLHF, DPO, larger models, and broader layer coverage as future work (Chopra, 12 May 2026). Procedural-skill SFT results are single-seed and family-specific, with explicit calls for 8B and 14B tests to validate the predicted regime shift (Strozzi, 12 May 2026).
Across these variants, the unifying proposition is stable. Supervised fine-tuning is most effective when training signal is allocated according to information compatibility with the base model: the reference-relative log-likelihood ratio can control policy shift; medium-confidence token weights can improve the KL-efficiency of offline learning; centered log-likelihood can discriminate in-distribution from OOD data; base-model perplexity can rank datasets; Fisher information can select subsets; and sparse mechanistic analyses can reveal where the SFT transformation is actually encoded. In that sense, InfoSFT names a shift from viewing SFT as blind likelihood maximization to viewing it as controlled information reallocation within a pretrained model.