Anchor-Guided Attack (AGA)

Updated 3 July 2026

Anchor-Guided Attack is an adversarial method that uses predefined reference anchors to direct optimization through feature, logit, or parameter alignment.
It has been used to improve targeted transferability in vision-language attacks and to recover protected models in model merging, using cosine similarity, distance minimization, and linear assignment.
Practical insights show its reliance on anchor quality, vulnerability to countermeasures, and broad applicability across neural, graph, and multimodal models.

An Anchor-Guided Attack (AGA) is a class of adversarial methodology in which the attacker leverages reference “anchors”—which may be images, function mappings, internal representations, or model parameters—as optimization targets or guidance mechanisms during attack construction. The anchor explicitly steers the adversarial generation process, either by inducing feature alignment (forcing internal representations toward the anchor), logit-level calibration (combining logits from anchor states), or parameter-level inversion (aligning protected weights with anchor parameters). AGAs have been developed in diverse contexts, including model merging and parameter protection, adversarial example synthesis in vision and multimodal models, graph adversarial attacks, and optimization hyper-heuristics.

1. Anchor-Guided Attack: Core Definition and Principle

At its core, an Anchor-Guided Attack is defined by the use of one or more anchors—predefined reference points in input, function, feature, or parameter space—to steer the adversarial optimization process. The anchor can be:

a clean example’s output distribution or logit vector (Sriramanan et al., 2020)
an external or synthetic prototype image semantically tied to a target prompt (Wang et al., 2 Feb 2026)
the latent representation or parameter matrix of a public pretrained model (Guo et al., 29 Jun 2026)
a dynamically selected internal model state, such as a transformer layer chosen by attention patterns (Wu et al., 11 Apr 2026)
a node or set of nodes in a graph structure representing plausible attack pathways (Zhu et al., 2023)

The optimization objective explicitly aligns with, or calibrates against, the anchor(s), typically via cosine similarity, distance minimization, or logit manipulation. Depending on the setting, the goal is to maximize the effect of an adversarial perturbation, to maximize transferability across heterogeneous models, or to invert protective transformations that obscure model parameters.
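As a minimal sketch of the generic feature-alignment objective, the code below runs sign-gradient ascent on the cosine similarity between a perturbed input's features and a fixed anchor. It assumes a toy linear feature map $\phi(\boldsymbol{x}) = A\boldsymbol{x}$ and an $L_\infty$ perturbation budget; both are illustrative choices, not taken from any cited work, and the helper names cosine, cosine_grad, and anchor_guided_pgd are hypothetical.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_grad(A: np.ndarray, x: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Gradient of cos(A @ x, anchor) with respect to x (linear feature map A)."""
    u = A @ x
    nu, na = np.linalg.norm(u), np.linalg.norm(anchor)
    # d cos / d u = a / (|u| |a|) - (u . a) u / (|u|^3 |a|), then chain rule through A.
    d_u = anchor / (nu * na) - (u @ anchor) * u / (nu ** 3 * na)
    return A.T @ d_u


def anchor_guided_pgd(A, x_clean, anchor, eps=0.5, step=0.05, iters=50):
    """Hypothetical L-infinity PGD that pulls the features A @ x toward a fixed anchor."""
    x_adv = x_clean.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(cosine_grad(A, x_adv, anchor))  # ascend alignment
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)          # project onto eps-ball
    return x_adv


rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32))   # hypothetical linear "feature extractor"
x = rng.standard_normal(32)         # clean input
anchor = rng.standard_normal(16)    # hypothetical anchor embedding
x_adv = anchor_guided_pgd(A, x, anchor)
print("cos before:", cosine(A @ x, anchor), "cos after:", cosine(A @ x_adv, anchor))
```

In a real attack, A would be replaced by a neural feature extractor and the gradient by automatic differentiation; the anchor term is typically combined with other task losses.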

2. Anchor-Guided Attack in Parameter-Level Model Merging

A canonical, explicit realization of AGA is found in the context of model merging with parameter-level defenses (Guo et al., 29 Jun 2026). In this setting, specialized models are protected via secret invertible linear (or permutation) transformations of their parameters. The critical observation is that these protected models remain overwhelmingly dominated by their public pretrained backbone: the task vectors $\tau = W_{ft} - W_{pre}$ arising from fine-tuning are small compared to the backbone weights. The attacker therefore takes the public pretrained model as a static reference anchor and seeks an analytic transform that re-aligns the protected (obfuscated) model to this anchor. For an attention module with protected weight $W^p$ (the fine-tuned weight $W_{ft}$ after the secret transform) and anchor $W_{pre}$, the optimal recovery transform $T^*$ is obtained by:

$T^* = \operatorname*{argmin}_T \| W^p T - W_{pre} \|_F^2$

yielding, when $W^p$ has full column rank, the closed-form least-squares solution

$T^* = \big((W^p)^\top W^p\big)^{-1} (W^p)^\top W_{pre}$

and the recovered model $W^a = W^p T^* \approx W_{ft}$. For discrete protections (e.g., permutations of MLP hidden units), the attack instead solves a linear assignment problem with negative cosine similarity as the cost.
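The least-squares recovery can be sketched in NumPy. The sketch assumes the protection acts as $W^p = W_{ft} P$ for a secret invertible $P$, a tall weight matrix with full column rank, and a small random task vector; a random dense matrix stands in for $P$, and all sizes, scales, and the helper name recover_transform are illustrative. numpy.linalg.lstsq solves the same problem as the normal-equation formula without forming the inverse explicitly.

```python
import numpy as np


def recover_transform(w_protected: np.ndarray, w_anchor: np.ndarray) -> np.ndarray:
    """Hypothetical helper: T* = argmin_T ||W^p T - W_pre||_F^2 via least squares.

    np.linalg.lstsq solves the same problem as ((W^p)^T W^p)^{-1} (W^p)^T W_pre
    but avoids forming the matrix inverse explicitly.
    """
    t_star, *_ = np.linalg.lstsq(w_protected, w_anchor, rcond=None)
    return t_star


rng = np.random.default_rng(0)
n, d = 256, 64                                 # tall matrix, full column rank (assumed)
w_pre = rng.standard_normal((n, d))            # public pretrained weights: the anchor
tau = 0.05 * rng.standard_normal((n, d))       # small task vector (assumed scale)
w_ft = w_pre + tau                             # fine-tuned weights the owner wants to hide
p_secret = rng.standard_normal((d, d))         # stand-in for the secret invertible transform
w_p = w_ft @ p_secret                          # protected weights W^p = W_ft P

t_star = recover_transform(w_p, w_pre)
w_a = w_p @ t_star                             # recovered weights W^a = W^p T*

print("protected error:", np.linalg.norm(w_p - w_ft))
print("recovered error:", np.linalg.norm(w_a - w_ft))
print("||tau||_F:      ", np.linalg.norm(tau))
```

Under these assumptions $W^a$ is the orthogonal projection of $W_{pre}$ onto the column space of $W^p$, which equals that of $W_{ft}$, so $W^a - W_{ft} = -\Pi\tau$ with $\Pi$ the corresponding projector. The recovery error therefore never exceeds $\|\tau\|_F$.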

This anchor-guided inversion is universal for linear parameter-level defenses, as the geometric dominance of the anchor ensures analytically bounded recovery error ( $\|W^a - W_{ft}\|_F \leq \|\tau\|_F$ ). Empirical results show that AGA restores protected-task merging performance to near-unprotected baselines across multiple architectures and tasks, unless specific countermeasures eliminate anchor dominance (Guo et al., 29 Jun 2026).
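For the discrete case, the matching step can be sketched with SciPy's linear_sum_assignment on a negative-cosine cost matrix. The sketch assumes the protection permutes the rows (hidden units) of a single MLP weight matrix and that the task vector is small; sizes, noise scale, and the helper name recover_permutation are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def recover_permutation(w_protected: np.ndarray, w_anchor: np.ndarray) -> np.ndarray:
    """Hypothetical helper: match each protected row to an anchor row by cosine similarity.

    Returns col such that protected row i corresponds to anchor row col[i].
    """
    a = w_protected / np.linalg.norm(w_protected, axis=1, keepdims=True)
    b = w_anchor / np.linalg.norm(w_anchor, axis=1, keepdims=True)
    cost = -(a @ b.T)                          # negative cosine similarity as the cost
    _, col = linear_sum_assignment(cost)       # minimum-cost perfect matching
    return col


rng = np.random.default_rng(1)
hidden, d_in = 128, 64
w_pre = rng.standard_normal((hidden, d_in))                    # anchor: pretrained MLP rows
w_ft = w_pre + 0.05 * rng.standard_normal((hidden, d_in))      # small task vector (assumed)
perm = rng.permutation(hidden)                                 # stand-in secret permutation
w_p = w_ft[perm]                                               # protected: row i = w_ft[perm[i]]

col = recover_permutation(w_p, w_pre)
w_rec = np.empty_like(w_p)
w_rec[col] = w_p                                               # undo the permutation
print("permutation recovered:", np.array_equal(col, perm))
```

In a full MLP the same permutation would also have to be undone on the columns of the following layer; the sketch covers only the matching step.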

3. Anchor-Guided Attacks in Transferable Adversarial Example Generation

In the transfer-based adversarial attack literature for vision-language and vision models, anchor-guided principles have been operationalized by generating multiple target-like reference images (“anchors”) and using them as mixture targets for feature or output space alignment. SGHA-Attack formalizes this as follows (Wang et al., 2 Feb 2026):

Synthesize a pool of reference images $\mathcal{X}_{ref}$ from the target prompt using a frozen text-to-image model.
Select the Top- $K$ semantically relevant anchors $\{\boldsymbol{x}_{anc}^k\}_{k=1}^K$ using surrogate image-text cosine similarity:

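$s_k = \cos\big(f_I(\boldsymbol{x}_{ref}^k), f_T(t)\big)$, where $f_I$ and $f_T$ are the surrogate image and text encoders and $t$ is the target prompt

Compute temperature-scaled softmax weights $w_k = \exp(s_k/\kappa) \big/ \sum_{j=1}^{K} \exp(s_j/\kappa)$ over these anchors, with temperature $\kappa$.
Minimize a weighted anchor-guided loss for the adversarial image:

$\mathcal{L}_{anc}(\boldsymbol{x}_{adv}) = -\sum_{k=1}^{K} w_k \cos\big(f_I(\boldsymbol{x}_{adv}), f_I(\boldsymbol{x}_{anc}^k)\big)$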

Propagate this guidance into both final and intermediate feature layers, yielding hierarchical alignment (HVSA) and cross-modal synchronization (CLSS).
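The anchor selection, weighting, and loss steps can be sketched on precomputed embeddings. This is a simplified illustration, not the SGHA-Attack implementation: random vectors stand in for surrogate image and text embeddings, the temperature and $K$ values are arbitrary, the intermediate-layer terms (HVSA, CLSS) and the gradient step on the image are omitted, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors (along the last axis) to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def select_anchors(ref_img_emb, text_emb, k=5, temperature=0.1):
    """Top-K anchors by image-text cosine score s_k, plus temperature-scaled softmax weights."""
    scores = l2_normalize(ref_img_emb) @ l2_normalize(text_emb)   # s_k for every reference
    top = np.argsort(scores)[::-1][:k]                             # indices of Top-K scores
    logits = scores[top] / temperature
    weights = np.exp(logits - logits.max())                        # numerically stable softmax
    weights /= weights.sum()
    return top, weights


def anchor_guided_loss(adv_emb, anchor_embs, weights):
    """Weighted negative cosine similarity between the adversarial embedding and each anchor."""
    cos = l2_normalize(anchor_embs) @ l2_normalize(adv_emb)
    return float(-(weights * cos).sum())


rng = np.random.default_rng(0)
ref = rng.standard_normal((20, 512))   # hypothetical surrogate embeddings of 20 T2I references
txt = rng.standard_normal(512)         # hypothetical embedding of the target prompt
top, w = select_anchors(ref, txt, k=5, temperature=0.1)
adv = ref[top[0]] + 0.1 * rng.standard_normal(512)   # stand-in adversarial embedding
print(top, w.round(3), anchor_guided_loss(adv, ref[top], w))
```

Lowering the temperature concentrates the weights on the highest-scoring anchor, while raising it approaches a uniform average over the $K$ anchors.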

Ablation studies indicate that anchor-guided alignment substantially improves targeted transferability, with 20–70 percentage point increases in targeted attack success rate over text-only or single-anchor baselines, and that the choice of $K$ (typically 5) and the temperature setting are critical (Wang et al., 2 Feb 2026).

4. Anchor-Guided Attack Mechanisms Across Domains

Several domains have adopted anchor-guided mechanics, with precise domain-specific implementations:

Adversarial attack on classifiers: Guidance terms measure the deviation between the adversarial and clean sample softmax outputs; the anchor is the clean sample’s output distribution (Sriramanan et al., 2020).
Token-level decoding in MLLMs: Dual-anchor introspection uses dynamically selected transformer layers, chosen by visual attention scores, as Spotlight and Shadow anchors that additively and subtractively calibrate the next-token probability distribution (Wu et al., 11 Apr 2026).

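Schematically, $\tilde{\boldsymbol{z}}_t = \boldsymbol{z}_t + \alpha\, \boldsymbol{z}_t^{spot} - \beta\, \boldsymbol{z}_t^{shadow}$ and $p(y_t \mid y_{<t}) = \operatorname{softmax}(\tilde{\boldsymbol{z}}_t)$, where $\boldsymbol{z}_t$ are the final-layer logits, $\boldsymbol{z}_t^{spot}$ and $\boldsymbol{z}_t^{shadow}$ are the logits read out from the Spotlight and Shadow anchor layers, and $\alpha, \beta \geq 0$ are calibration weights.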

Partial graph attacks: Attack budget is allocated to vulnerable nodes, and anchor nodes for edge perturbations are selected based on class-proximity or proximity in the graph; gradient-based anchor scoring identifies high-impact perturbations (Zhu et al., 2023).
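As an illustration of clean-output guidance for classifiers, the sketch assumes a GAMA-style objective that the attacker maximizes: a margin on softmax probabilities plus a $\lambda$-weighted squared $\ell_2$ distance to the clean sample's softmax output, with $\lambda$ annealed to zero. The linear decay schedule, the toy logits, and the helper names are assumptions for illustration, and gradient computation through a real model is omitted.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def guided_objective(logits_adv, logits_clean, label, lam):
    """Margin on probabilities plus lam * squared L2 gap to the clean-output anchor.

    The attacker maximizes this value; lam weights the anchor (clean softmax) term.
    """
    p_adv, p_clean = softmax(logits_adv), softmax(logits_clean)
    others = np.delete(p_adv, label)
    margin = others.max() - p_adv[label]          # > 0 means the sample is misclassified
    guidance = np.sum((p_adv - p_clean) ** 2)     # squared distance to the anchor output
    return float(margin + lam * guidance)


def lambda_schedule(lam0, step, decay_steps):
    """Hypothetical linear decay of the guidance weight to zero."""
    return lam0 * max(0.0, 1.0 - step / decay_steps)


clean = np.array([3.0, 1.0, 0.5])   # toy clean logits, true label 0
adv = np.array([1.5, 1.4, 0.5])     # toy logits after some perturbation
for step in (0, 5, 10):
    lam = lambda_schedule(10.0, step, decay_steps=10)
    print(step, lam, guided_objective(adv, clean, label=0, lam=lam))
```

Early in the attack the guidance term dominates and rewards moving away from the clean output; as $\lambda$ decays, the objective reduces to the plain margin.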

5. Theoretical Properties and Defensive Countermeasures

The theoretical effectiveness of AGAs in parameter-level inversion has been formally established: the recovery error is bounded by the norm of the fine-tuning (task) vector and vanishes as the fine-tuned weights approach the pretrained anchor (Guo et al., 29 Jun 2026).

To counteract AGA, the Anchor-Repulsive Fine-tuning (ARF) defense enforces a lower bound on the task vector’s norm during fine-tuning, breaking the anchor-dominance assumption and rendering least-squares anchor alignment ineffective. Under ARF, AGA’s merging-recovery success falls below 30%, compared to roughly 90% or higher for prior defenses (Guo et al., 29 Jun 2026).
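The mechanism ARF relies on can be checked numerically in a toy least-squares setup (same assumptions as a dense secret transform on a tall weight matrix): scaling up the task vector makes anchor alignment recover the fine-tuned weights less accurately. The hinge penalty anchor_repulsion_penalty is a hypothetical way to impose a norm lower bound during fine-tuning, not the ARF loss from the cited work.

```python
import numpy as np


def lstsq_recovery_error(w_pre, tau, p_secret):
    """Relative error of least-squares anchor alignment for a given task vector."""
    w_ft = w_pre + tau
    w_p = w_ft @ p_secret                                  # protected weights
    t_star, *_ = np.linalg.lstsq(w_p, w_pre, rcond=None)   # anchor-guided recovery
    return np.linalg.norm(w_p @ t_star - w_ft) / np.linalg.norm(w_ft)


def anchor_repulsion_penalty(w_ft, w_pre, min_norm):
    """Hypothetical hinge penalty: zero once ||w_ft - w_pre||_F >= min_norm."""
    gap = min_norm - np.linalg.norm(w_ft - w_pre)
    return max(0.0, gap) ** 2


rng = np.random.default_rng(0)
w_pre = rng.standard_normal((256, 64))       # pretrained anchor
direction = rng.standard_normal((256, 64))   # fixed task-vector direction (assumed)
p_secret = rng.standard_normal((64, 64))     # stand-in secret transform
errors = [lstsq_recovery_error(w_pre, s * direction, p_secret) for s in (0.01, 0.1, 1.0)]
print([round(e, 4) for e in errors])
```

Adding such a penalty to the fine-tuning loss pushes the weights out of the regime where the anchor dominates, which is the condition the least-squares attack depends on.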

Other anchor-guided attacks may be defeated or degraded if the anchor is poorly chosen, if the target class is inadequately represented in the anchor set, or if the victim model’s processing flow differs significantly from the surrogate or reference model used for anchor definition.

6. Comparison With Anchor-Free and Alternative Guidance Attacks

AGA is distinct from anchor-free adversarial attacks, which are specifically designed for architectures lacking anchor or proposal structures (e.g., anchor-free object detectors). Category-wise and semantic-region attacks on anchor-free detectors rely on heatmap or keypoint semantics instead of anchor guidance, and achieve transfer across detection paradigms by attacking the model’s native decision carriers rather than explicit anchors (Xie et al., 2023; Liao et al., 2020). This contrast emphasizes that AGA is meaningful primarily where the model exposes actionable anchor structures (in parameter, feature, or attention space).

Anchor-guided search methods also appear in optimization and hyper-heuristics (Zhao et al., 13 May 2026), where anchors (reference configurations) guide local search refinement but do not correspond to adversarial intent.

7. Bibliographic Context and Limitations

Prominent research contributions to the anchor-guided attack paradigm include:

Parameter-level inversion for model merging via anchor alignment (Guo et al., 29 Jun 2026)
Semantic-guided anchor alignment for targeted transfer in VLMs (Wang et al., 2 Feb 2026)
Dual-anchor decoding for hallucination mitigation (Wu et al., 11 Apr 2026)
Clean-output-guided adversarial margin attack (Sriramanan et al., 2020)
Anchor-guided graph perturbation selection (Zhu et al., 2023)

Limitations of current AGAs include dependence on anchor quality and diversity, reduced efficacy when anchor-to-target distances are large or ill-defined, and sensitivity to architectural heterogeneity that anchor selection does not account for. In parameter-level settings, defense efficacy hinges on the size of the non-anchor (task) vector component and on explicit geometric repulsion from anchor basins.

8. Summary Table: AGA Instantiations Across Domains

Domain	Anchor Definition	Mechanism	Reference
Model merging	Pretrained parameters	Parameter alignment, inversion	(Guo et al., 29 Jun 2026)
Vision-language	T2I-generated images	Weighted consensus in feature space	(Wang et al., 2 Feb 2026)
Classifier attack	Clean output softmax	Loss relaxation, margin annealing	(Sriramanan et al., 2020)
Dual-anchor decoding	Internal transformer layers	Logit-level positive/negative anchors	(Wu et al., 11 Apr 2026)
Graph structure attack	Nearby nodes/labels	Budgeted anchor-based perturbation	(Zhu et al., 2023)

Each realization adapts the anchor-guided principle to the available structure and attack objective.

In summary, Anchor-Guided Attack (AGA) provides a principled framework for designing adversarial attacks by explicitly leveraging reference anchors as optimization targets or calibration points. The efficacy and universality of AGA methods depend on how dominant and informative the anchors are relative to target semantics and model geometry, and recent work demonstrates both their power and structural limitations across multiple high-impact ML security domains.