Anchor-Repulsive Fine-tuning (ARF)

Updated 3 July 2026

Anchor-Repulsive Fine-tuning (ARF) is a strategy that regulates a model’s relation to a fixed anchor to maintain stability, prevent forgetting, and defend against parameter-space attacks.
It employs dynamic interpolation between current and frozen distributions along with margin-based penalties on attention weights to balance adaptation and security.
Empirical evidence shows that ARF enhances domain adaptation while preserving general capabilities, offering favorable trade-offs between performance and robustness.

Anchor-Repulsive Fine-tuning (ARF) refers to a class of fine-tuning strategies that deliberately manage or modify the relationship between a trainable model and a fixed reference ("anchor") to improve stability, prevent catastrophic forgetting, or strengthen model security against parameter-space attacks. Two established variants with disjoint motivations and methodologies are currently known: (1) distributional anchoring for stability in LLM fine-tuning (Wang et al., 6 May 2026) and (2) repulsive regularization to mitigate anchor-guided attacks in model parameter merging (Guo et al., 29 Jun 2026). Both are unified by their explicit manipulation of anchor relations in parameter or distribution space.

1. Distributional Drift and Anchored Learning in LLMs

Standard supervised fine-tuning (SFT) of LLMs often induces excessive distributional drift from the pre-trained or reference model, resulting in catastrophic forgetting of previously acquired capabilities even as the model improves on targeted downstream objectives. "Anchored Learning", a distributional instantiation of ARF, explicitly regulates the distributional updates via a dynamically evolving anchor. At fine-tuning iteration $t$, an anchor distribution $q^{(t)}(y|x)$ is synthesized by interpolating between the current model distribution $P_{\theta_{t-1}}(y|x)$ and the frozen reference $P_0(y|x)$:

$q^{(t)}(y|x) = \alpha_t P_{\theta_{t-1}}(y|x) + (1-\alpha_t) P_0(y|x)$

Here, $\alpha_t$ is a trust-region parameter (either fixed or annealed). The model parameters are then optimized with respect to a combined loss:

$L(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[-\log P_\theta(y|x)] + \lambda \, \mathbb{E}_{x \sim \mathcal{D}_x} \mathrm{KL}(P_\theta(\cdot|x) \parallel q^{(t)}(\cdot|x))$

This transforms unconstrained global SFT into a sequence of local trust-region updates in distribution space, sharply bounding distributional drift at each iteration (Wang et al., 6 May 2026).
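The anchor construction and the combined loss can be illustrated on a single toy next-token distribution. This is a minimal sketch in plain Python, not the authors' implementation: the function names, the three-token vocabulary, and all numeric values are assumptions chosen for illustration, and a real system would apply the same arithmetic to batched model logits.

```python
import math

def interpolate_anchor(p_prev, p_ref, alpha):
    """q = alpha * p_prev + (1 - alpha) * p_ref, elementwise over the vocabulary."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_prev, p_ref)]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def anchored_loss(p_theta, target_index, anchor, lam):
    """Negative log-likelihood of the target plus lam * KL(p_theta || anchor)."""
    nll = -math.log(p_theta[target_index])
    return nll + lam * kl_divergence(p_theta, anchor)

# Toy distributions over a 3-token vocabulary (illustrative values only).
p_ref = [0.5, 0.3, 0.2]    # frozen reference P_0
p_prev = [0.2, 0.7, 0.1]   # previous iterate P_{theta_{t-1}}
p_theta = [0.1, 0.8, 0.1]  # current model P_theta

anchor = interpolate_anchor(p_prev, p_ref, alpha=0.5)
loss = anchored_loss(p_theta, target_index=1, anchor=anchor, lam=0.1)
print("anchor:", anchor)
print("loss:", loss)
```

Because the anchor is a convex combination of two distributions, it is itself a valid distribution, so the KL term is well defined whenever the reference assigns nonzero probability to every token.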

2. Algorithmic Framework and Theoretical Guarantees

Anchored Learning alternates between (a) constructing $q^{(t)}$ and (b) minimizing the task-plus-KL objective. The choice of $\alpha_t$ controls the radius of the trust region. Annealing $\alpha_t$ from high to low values over the course of training increases the model's plasticity over time. Theoretical analysis establishes tight KL-divergence bounds per update:

Probability-space: [bound expression not recoverable]
Logit-space: [bound expression not recoverable]
Dynamic convergence: stated for a constant parameter setting [expression not recoverable]

These results guarantee that updates remain within a bounded KL "ball," enforcing stability and mitigating catastrophic forgetting (Wang et al., 6 May 2026).
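The mixture form of $q^{(t)}$ by itself already implies two elementary bounds on how far the anchor can sit from the previous iterate. The derivation below is a sketch from the definition in Section 1, assuming $0 < \alpha_t < 1$; it is not necessarily the form of the bounds proved by Wang et al. The first line uses the pointwise inequality $q^{(t)} \ge \alpha_t P_{\theta_{t-1}}$, and the second uses convexity of the KL divergence in its second argument.

```latex
% Assumes 0 < \alpha_t < 1 and q^{(t)} = \alpha_t P_{\theta_{t-1}} + (1-\alpha_t) P_0.
\begin{align*}
\mathrm{KL}\big(P_{\theta_{t-1}} \,\|\, q^{(t)}\big)
  &= \sum_y P_{\theta_{t-1}}(y|x) \log \frac{P_{\theta_{t-1}}(y|x)}{q^{(t)}(y|x)}
  \le \sum_y P_{\theta_{t-1}}(y|x) \log \frac{1}{\alpha_t}
  = -\log \alpha_t, \\
\mathrm{KL}\big(P_{\theta_{t-1}} \,\|\, q^{(t)}\big)
  &\le \alpha_t \, \mathrm{KL}\big(P_{\theta_{t-1}} \,\|\, P_{\theta_{t-1}}\big)
     + (1-\alpha_t) \, \mathrm{KL}\big(P_{\theta_{t-1}} \,\|\, P_0\big)
  = (1-\alpha_t) \, \mathrm{KL}\big(P_{\theta_{t-1}} \,\|\, P_0\big).
\end{align*}
```

For example, with $\alpha_t = 0.7$ the first bound gives at most $-\log 0.7 \approx 0.357$ nats between the previous iterate and the anchor, regardless of how far the model has already drifted from $P_0$.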

3. Practical Implementation and Hyperparameter Tuning

Anchor-Repulsive Fine-tuning for LLMs requires maintaining both the frozen reference model and the evolving anchor logits. Computational overhead is approximately 1.5× that of SFT due to the KL and anchor-distribution calculations. The hyperparameters to tune are the KL weight $\lambda$ and the interpolation weight $\alpha_t$, which is either held fixed or annealed from 0.7 to 0.3. Empirical ablations on the MedCalc benchmark (Qwen2.5-3B) demonstrate that strong anchoring yields minimal forgetting but only moderate in-domain gains, while looser anchoring increases adaptation at the cost of stability. Monitoring the mean KL-divergence per update on held-out data enables adaptive adjustment (Wang et al., 6 May 2026).
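The annealing and monitoring steps can be sketched as two small helpers. Only the endpoints 0.7 and 0.3 come from the text. The linear shape of the schedule, the function names, and the threshold rule for adjusting $\lambda$ from held-out KL are all hypothetical choices made for illustration.

```python
def linear_alpha_schedule(step, total_steps, alpha_start=0.7, alpha_end=0.3):
    """Linearly anneal alpha from alpha_start to alpha_end (linear shape is an assumption)."""
    if total_steps <= 1:
        return alpha_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)

def adjust_lambda(lam, mean_kl, kl_target, factor=1.5):
    """Hypothetical rule: tighten the KL weight when held-out drift is well above
    the target, relax it when drift is well below, otherwise leave it unchanged."""
    if mean_kl > 2.0 * kl_target:
        return lam * factor
    if mean_kl < 0.5 * kl_target:
        return lam / factor
    return lam

# Alpha values for a 5-step run, first step to last.
schedule = [linear_alpha_schedule(t, 5) for t in range(5)]
print(schedule)

# Held-out mean KL of 0.5 against a target of 0.1 triggers a larger KL weight.
print(adjust_lambda(1.0, mean_kl=0.5, kl_target=0.1))
```

The factor-of-two dead band around the target keeps $\lambda$ from oscillating on noisy KL estimates; any such band width would need tuning in practice.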

4. Empirical Outcomes: Performance, Stability, and Trade-offs

Experimental results on Qwen2.5-3B-Instruct across iGSM (grade-school math), MedCalc (medical calculations), and IFEval (instruction following) benchmarks establish that anchoring achieves near-optimal domain-specific adaptation while drastically reducing the collapse of general abilities observed in SFT. Notably, domain accuracy on iGSM reaches 0.936 (anchored) vs. 1.000 (SFT) with the general score preserved at 0.377 (anchored) versus 0.031 (SFT). On MedCalc, anchored methods achieve a twofold reduction in forgetting relative to standard approaches, preserving both general and domain performance (Wang et al., 6 May 2026).

| Method | General Avg | iGSM Acc | MedCalc Acc | IFEval Acc |
|---|---|---|---|---|
| Base | 0.452 | 0.144 | 0.133 | 0.398 |
| SFT | 0.031 | 1.000 | 0.574 | 0.694 |
| Low-SFT | 0.374 | 0.954 | 0.544 | 0.449 |
| KL-SFT | 0.102 | 0.606 | 0.356 | 0.504 |
| Anchored/ARF | 0.377 | 0.936 | 0.563 | 0.652 |

A plausible implication is that anchored updates produce models that remain consistently closer in KL space to their initialization, permitting stable deployment for tasks requiring retention of general capabilities.
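The trade-off in the table can be made concrete by expressing each method's general score as a fraction of the base model's score. The snippet below is simple arithmetic on the reported numbers. The "retention" ratio is an illustrative summary statistic introduced here, not a metric from the source, and it assumes the General Avg column is directly comparable across rows.

```python
# (general average, iGSM accuracy) copied from the results table.
results = {
    "Base": (0.452, 0.144),
    "SFT": (0.031, 1.000),
    "Low-SFT": (0.374, 0.954),
    "KL-SFT": (0.102, 0.606),
    "Anchored/ARF": (0.377, 0.936),
}

base_general, base_igsm = results["Base"]
retention = {}
for method, (general, igsm) in results.items():
    retention[method] = general / base_general  # share of base general score kept
    gain = igsm - base_igsm                     # absolute iGSM improvement over base
    print(f"{method:>12}: retains {retention[method]:.1%} of general score, "
          f"iGSM gain {gain:+.3f}")
```

By this measure the anchored model keeps roughly 83% of the base general score, against roughly 7% for plain SFT, while giving up 0.064 in iGSM accuracy.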

5. Model Merging, Adversarial Attacks, and ARF as a Security Defense

In model merging scenarios, where parameter-level defenses employ linear transformations to obfuscate fine-tuned weights, the "Anchor-Guided Attack" (AGA) exploits situations where the protected task vector is much smaller in magnitude than the anchor. AGA analytically recovers secret transforms by aligning the protected model to the anchor. In this context, Anchor-Repulsive Fine-tuning imposes a margin-based penalty during fine-tuning that forces the attention projection weights to deviate from the anchor by at least a fixed fraction of the anchor norm. The penalty is added to the task loss to form a composite loss. After the prescribed margin is achieved for all relevant weights, post-hoc invertible (orthogonal or diagonal) transformations are applied to further obscure the parameters (Guo et al., 29 Jun 2026).
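One penalty consistent with the verbal description is a hinge on the Frobenius distance from the anchor: for each attention projection matrix $W$ with anchor $W_0$, add $\max(0,\ \gamma \|W_0\|_F - \|W - W_0\|_F)$, and weight the sum by a coefficient $\beta$ in the composite loss. The sketch below implements that form in plain Python. The hinge shape, the Frobenius norm, the symbols $\gamma$ and $\beta$, and all function names are assumptions, not the formulation of Guo et al.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def subtract(a, b):
    """Elementwise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def repulsion_penalty(weights, anchors, margin_fraction):
    """Sum over matrices of max(0, margin_fraction * ||W0||_F - ||W - W0||_F)."""
    total = 0.0
    for w, w0 in zip(weights, anchors):
        required = margin_fraction * frobenius_norm(w0)
        actual = frobenius_norm(subtract(w, w0))
        total += max(0.0, required - actual)
    return total

def composite_loss(task_loss, weights, anchors, margin_fraction, beta):
    """Task loss plus beta times the repulsion penalty (beta is hypothetical)."""
    return task_loss + beta * repulsion_penalty(weights, anchors, margin_fraction)

w0 = [[3.0, 0.0], [0.0, 4.0]]      # anchor matrix with Frobenius norm 5
w_near = [[3.0, 0.0], [0.0, 4.5]]  # deviates by 0.5, inside a margin of 1.0
w_far = [[3.0, 2.0], [0.0, 4.0]]   # deviates by 2.0, outside the margin

near_penalty = repulsion_penalty([w_near], [w0], margin_fraction=0.2)
far_penalty = repulsion_penalty([w_far], [w0], margin_fraction=0.2)
print(near_penalty, far_penalty)
```

The penalty vanishes once every matrix clears its margin, so under this form the task loss alone drives training after the prescribed separation is reached.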

6. Security Efficacy and Comparative Analysis

Empirical studies benchmark ARF against MergeGuard under model-merging attacks, measuring post-attack accuracy on defended tasks. Across all tested architectures and merging schemes (Task Arithmetic, CAT, LOT), ARF yields lower post-attack task accuracy than MergeGuard even without attack, indicating stronger protection. Standalone accuracy loss with ARF is negligible relative to vanilla fine-tuning. Example results:

| Protection method | ViT-B/32 (AGA) | ViT-L/14 (AGA) | Qwen2-7B (AGA) |
|---|---|---|---|
| MergeGuard | 50.56% | 63.60% | 46.23% |
| ARF | 27.42% | 34.19% | 28.17% |

A plausible implication is that margin-based repulsion during fine-tuning subverts the geometric premises exploited by AGA, dramatically reducing the risk of analytic transform recovery (Guo et al., 29 Jun 2026).
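The size of the gap between the two defenses can be read off the table as absolute and relative reductions in post-attack accuracy. The snippet below is plain arithmetic on the reported figures; the dictionary layout and variable names are illustrative.

```python
# Post-attack accuracy in percent under AGA: (MergeGuard, ARF), from the table.
post_attack = {
    "ViT-B/32": (50.56, 27.42),
    "ViT-L/14": (63.60, 34.19),
    "Qwen2-7B": (46.23, 28.17),
}

reductions = {}
for model, (mergeguard, arf) in post_attack.items():
    absolute = mergeguard - arf       # percentage points removed by ARF
    relative = absolute / mergeguard  # share of MergeGuard's post-attack accuracy
    reductions[model] = {"absolute": absolute, "relative": relative}
    print(f"{model}: -{absolute:.2f} points ({relative:.1%} relative)")
```

On these numbers ARF lowers post-attack accuracy by about 18 to 29 percentage points, or roughly 39% to 46% relative to MergeGuard.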

7. Limitations and Open Directions

Known limitations include the confinement of ARF's repulsion penalty solely to attention projections (with MLP layers still guarded only by post-hoc permutations), static global margin parameters rather than adaptive layerwise tuning, and an assumption of static anchors rather than accounting for continual or online fine-tuning. Interactions between ARF and alternative fine-tuning schemes (e.g., LoRA, prompt tuning) remain unexplored (Guo et al., 29 Jun 2026). In anchored learning for LLM stability, the method necessitates modest additional compute and memory, especially for very large models.

In summary, ARF realizes explicit anchor management—either through dynamically interpolated distributional objectives for stability and retention, or through margin-based repulsion for security—resulting in empirically verified improvements in LLM fine-tuning and robust defense against anchor-guided attacks. The approach is theoretically grounded via strict per-update divergence bounds and shown to offer favorable trade-offs between domain adaptation and generalization or security performance (Wang et al., 6 May 2026, Guo et al., 29 Jun 2026).


References

Wang et al. (2026). Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control.

Guo et al. (2026). On the Vulnerability of Parameter-Level Defenses to Model Merging.
