
Anchored Supervised Fine-Tuning (ASFT)

Updated 5 October 2025
  • Anchored Supervised Fine-Tuning (ASFT) is a strategy that anchors standard fine-tuning to critical distributions, ensuring greater model stability.
  • It integrates extra loss terms such as KL-divergence regularization to mitigate over-memorization, distributional drift, and unsafe outcomes.
  • Empirical results show significant gains in generalization, safety, and performance across domains like language, vision-language, and code synthesis.

Anchored Supervised Fine-Tuning (ASFT) is a family of methodologies for post-training large models—particularly language and vision-language architectures—that introduces explicit anchoring to mitigate instability, over-memorization, distributional drift, and degradation of out-of-distribution (OOD) generalization and safety. ASFT augments supervised fine-tuning (SFT), which typically maximizes agreement with demonstration data or annotated labels, by introducing additional objectives or regularization terms that "anchor" optimization to critical distributions, semantic spaces, or parameter directions. This integrated anchoring yields substantial empirical gains across domains ranging from mathematical reasoning to safety-critical dialog, code synthesis, and OOD vision recognition.

1. Theoretical Foundations: Divergence Minimization and Anchoring

Supervised fine-tuning (SFT) for generative models is inherently a divergence minimization procedure between the empirical demonstration distribution $\rho^{\text{exp}}$ and the model policy $\rho^{\pi}$ (Sun, 18 Mar 2024). Formally, standard SFT minimizes the forward Kullback–Leibler (KL) divergence:

$$\min_{\pi} \ \mathbb{E}_{(s,a) \sim \rho^{\text{exp}}}\left[-\log \pi(a|s)\right]$$

This objective induces mass-covering behavior, ensuring broad coverage of the expert distribution but potentially leading to over-generalization.
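
As a concrete reference point, a minimal PyTorch sketch of this objective (token-level negative log-likelihood, equivalent to forward-KL minimization up to a constant) might look as follows; `model`, `input_ids`, and `labels` are hypothetical placeholders for a causal language-model batch and are not taken from the cited papers.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Token-level negative log-likelihood on demonstration data (forward-KL up to a constant)."""
    logits = model(input_ids).logits                               # (batch, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each demonstrated (expert) token a given its context s.
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return -token_logp.mean()
```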

Reverse KL and Jensen–Shannon alternatives enable mode-seeking or adversarial imitation variants, which may collapse to high-density modes at the expense of diversity. Anchored SFT (ASFT) introduces correction terms or reweighting schemes, typically derived from an explicit anchoring heuristic or theoretical analysis, to mitigate error compounding and distributional drift. In recent work, this anchoring is formalized as a KL regularization to the base (pretrained) model, yielding a stabilized and distribution-preserving update (Zhu et al., 28 Sep 2025):

$$\mathcal{L}_{\text{ASFT}}(\theta) = \mathcal{L}_{\text{DFT}}(\theta) + \lambda \, \mathbb{E}_{s}\left[\text{KL}\!\left(\pi_{\theta}(\cdot|s)\,\Vert\,\pi_{\text{base}}(\cdot|s)\right)\right]$$

where $\mathcal{L}_{\text{DFT}}$ is a reweighted fine-tuning loss and $\lambda$ controls anchoring strength.
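
A hedged sketch of this anchored objective follows, assuming a simple probability-based reweighting for the DFT term (the exact weighting in Zhu et al., 28 Sep 2025 may differ) and a frozen copy of the base model as the anchor.

```python
import torch
import torch.nn.functional as F

def asft_loss(policy_logits, base_logits, labels, lam=0.1):
    """Reweighted (DFT-style) token loss plus a KL penalty toward a frozen base model."""
    logp = F.log_softmax(policy_logits, dim=-1)                      # pi_theta, (B, T, V)
    with torch.no_grad():
        base_logp = F.log_softmax(base_logits, dim=-1)               # pi_base (frozen anchor)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # log pi_theta(a|s)
    # DFT-style reweighting: scale each token's NLL by its detached probability.
    dft_term = -(token_logp.detach().exp() * token_logp).mean()
    # Anchoring term: KL(pi_theta(.|s) || pi_base(.|s)), averaged over tokens.
    kl_term = (logp.exp() * (logp - base_logp)).sum(dim=-1).mean()
    return dft_term + lam * kl_term
```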

2. Methodological Variants and Domain Applications

ASFT methodologies are instantiated in diverse contexts, each embedding the anchoring principle in loss design or supervision:

  • Reward-Weighted Regression (RWR) Anchoring: In the context of dynamic fine-tuning (DFT), ASFT augments probability-based weighting schemes with KL anchoring to yield tighter RL bounds and stability (Zhu et al., 28 Sep 2025). ASFT preserves the advantages of RWR-based DFT but forestalls progressive drift of the policy away from the base distribution—critical for both accuracy and generalization.
  • Semantic Anchoring for OOD in Vision-Language Models: Anchor-based robust fine-tuning incorporates auxiliary semantic anchors—either generated captions (text-compensated anchors) or retrieved image-text pairs from sources resembling pretraining data—to retain the open-vocabulary feature space of CLIP during fine-tuning (Han et al., 9 Apr 2024). This prevents collapse toward label-centric representations and sustains zero-shot performance alongside in-distribution accuracy.
  • Safety Anchoring via Parameter Subspace Regularization: AsFT anchors updates in parameter space along an empirically determined “alignment direction” (from base to safety-aligned model) and regularizes the orthogonal components (the harmful direction), maintaining model safety within a narrow safety basin (Yang et al., 10 Jun 2025): $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \left\| C_{\perp} \Delta W \right\|^2$, where $C_{\perp}$ projects updates onto the harmful direction and $\lambda$ is tunable.
  • Absolute Likelihood Anchoring for Preference Alignment: For aligning model outputs with human preferences, ASFT optimizes absolute likelihoods of preferred and dispreferred responses (rather than pairwise relative likelihoods), eliminating the reference model and yielding smooth, balanced gradients (Wang et al., 14 Sep 2024): $\mathcal{L}_{\text{align}} = -\log \sigma(f_\theta(x, y_w)) - \log \sigma(-f_\theta(x, y_l))$, where $y_w$ and $y_l$ denote the preferred and dispreferred responses; a minimal sketch follows this list.
  • Annotation Anchoring via Scaling Law: High-quality SFT data acquisition is anchored by empirical scaling law feedback: annotated samples are iteratively refined until larger models consistently outperform smaller ones, as predicted by the theoretical scaling relationship (Kong, 5 May 2024). This objective anchoring ensures annotation validity prior to fine-tuning.
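
As referenced in the preference-alignment item above, the following is a minimal sketch of the absolute-likelihood loss, assuming $f_\theta$ is a $\beta$-scaled sequence log-likelihood under the current policy; this parameterization is an illustrative assumption, not necessarily the exact form used by Wang et al. (14 Sep 2024).

```python
import torch.nn.functional as F

def asft_alignment_loss(logp_preferred, logp_dispreferred, beta=1.0):
    """logp_*: summed sequence log-likelihoods under the current policy (no reference model)."""
    # -log sigma(f(x, y_w)): increase the absolute likelihood of preferred responses.
    preferred_term = -F.logsigmoid(beta * logp_preferred)
    # -log sigma(-f(x, y_l)): decrease the absolute likelihood of dispreferred responses.
    dispreferred_term = -F.logsigmoid(-beta * logp_dispreferred)
    return (preferred_term + dispreferred_term).mean()
```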

3. Stability, Generalization, and Safety Guarantees

ASFT’s core theoretical contribution is the prevention of progressive drift and instability symptomatic of reweighted or dynamic fine-tuning strategies. Dynamic reweighting using token probabilities or auxiliary distributions (as in DFT) improves RL lower bounds but lacks anchoring; distributional drift degrades training stability, leading to performance collapse in certain tasks (Zhu et al., 28 Sep 2025).

By directly introducing a KL-anchoring regularization, ASFT maintains a bounded divergence between the fine-tuned and base model distributions, yielding stable training trajectories. The preservation of semantic anchors in vision-language models averts mode collapse and maintains OOD robustness (Han et al., 9 Apr 2024). Safety anchoring constrains optimization strictly within regions empirically associated with robust, aligned behavior, suppressing harmful outputs even in adversarial or poisoned fine-tuning scenarios (Yang et al., 10 Jun 2025).
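
To make the safety-anchoring mechanism concrete, a simplified per-layer sketch of the $\left\| C_{\perp} \Delta W \right\|^2$ penalty is shown below; it treats the alignment direction as a single flattened unit vector per layer, which is an illustrative simplification of the projection used by Yang et al. (10 Jun 2025).

```python
import torch

def safety_anchor_penalty(delta_w: torch.Tensor, alignment_dir: torch.Tensor, lam: float = 1.0):
    """Penalize the part of the fine-tuning update that leaves the alignment direction."""
    d = alignment_dir.flatten()
    d = d / d.norm()                         # unit alignment direction (base -> safety-aligned)
    update = delta_w.flatten()
    parallel = (update @ d) * d              # component along the alignment direction
    orthogonal = update - parallel           # component in the "harmful" subspace
    return lam * orthogonal.pow(2).sum()     # added to the task loss during fine-tuning
```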

4. Empirical Performance and Computational Considerations

Empirical studies demonstrate consistent and substantial improvements of ASFT over SFT and DFT in domains where generalization, stability, and OOD robustness are critical:

  • Mathematical Reasoning: ASFT surpasses SFT/DFT by up to +17.89 percentage points on complex problems (Zhu et al., 28 Sep 2025).
  • Medical Knowledge Grounding: ASFT yields up to +10.65 points higher accuracy and consistent stability for small data (Zhu et al., 28 Sep 2025).
  • Code Generation: ASFT leads in average benchmark scores, outperforming SFT and importance-weighted fine-tuning (Zhu et al., 28 Sep 2025).
  • Safety: AsFT reduces unsafe output rates by up to 7.60% compared to Safe LoRA and maintains or slightly increases general accuracy (by ~3.44%) across settings (Yang et al., 10 Jun 2025).
  • Vision-Language OOD Generalization: Anchor-based finetuning achieves state-of-the-art results for both domain-shift and zero-shot test cases (Han et al., 9 Apr 2024).

Computational overhead introduced by anchoring regularization is modest: full-parameter ASFT increases training time by about 23.7% and achieves memory efficiency comparable to SFT when implemented via low-rank adaptation (LoRA) (Zhu et al., 28 Sep 2025). Vision-language anchoring via auxiliary contrastive losses demonstrates minimal additional cost (Han et al., 9 Apr 2024). Safety anchoring with approximate projections enables efficient scaling (up to 250× speedup over naive projection) (Yang et al., 10 Jun 2025).

5. Implementation Strategies and Open-Source Resources

ASFT is typically implemented by augmenting the fine-tuning objective with a regularization or auxiliary supervision term; hyperparameters (e.g., anchoring strength $\lambda$, alignment scaling factor $\beta$) require empirical tuning. Reference models or base model pairs are required for KL anchoring; anchor-based vision-language fine-tuning necessitates a pretrained captioner and a retrieved image-text candidate set. Scaling-law-anchored annotation involves iterative annotation/evaluation loops, with statistical tracking across model sizes.
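
For KL-anchored variants, a typical setup keeps a frozen copy of the base model as the reference; the following sketch assumes a Hugging Face `transformers`-style interface and a hypothetical checkpoint name.

```python
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("base-model")   # trainable parameters theta
anchor = AutoModelForCausalLM.from_pretrained("base-model")   # frozen reference pi_base
anchor.eval()
for p in anchor.parameters():
    p.requires_grad_(False)

lam = 0.1   # anchoring strength lambda, tuned empirically per task
```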

Key resources and code repositories accompanying these works are made available for reproducibility and practical adoption.

6. Practical and Research Implications

Anchored Supervised Fine-Tuning provides a principled path toward stable, reliable post-training of large models. It mitigates the memorization/generalization trade-off by maintaining essential distributional properties, minimizes catastrophic drift, and ensures that task-specific adaptation does not undermine safety, OOD capability, or core alignment characteristics.

ASFT approaches are broadly applicable: they span domains (language, vision-language, code synthesis), yield tangible improvements under resource constraints, and can be integrated as modular enhancements to standard SFT pipelines. By leveraging anchoring—against reference models, semantic spaces, parameter directions, or statistical scaling laws—they promise robust, stable, and more reliable models for demanding downstream applications.

Potential limitations include the necessity of a well-specified anchor (distributional reference, semantic correlation, or parameter direction), hyperparameter sensitivity, and the assumption that base models reliably encode desired generalization or safety notions. Further research may address automated anchor selection, dynamic anchoring schedules, or extensions to multi-modal and reinforcement learning contexts.
