
Transfer Attack Success Rate: Metrics & Mechanisms

Updated 4 August 2025
  • Transfer attack success rate is a metric that quantifies how adversarial examples generated for a surrogate model mislead a target model in black-box settings.
  • Methodologies involve gradient transfer, latent space embedding, adaptive transformation ensembles, and meta-learning to improve transferable perturbations.
  • Empirical benchmarks demonstrate high success rates under controlled conditions, highlighting the importance of model similarity and feature alignment for robust adversarial evaluations.

Transfer attack success rate is a central metric in adversarial machine learning, quantifying the ability of adversarial examples crafted for one ("source" or "surrogate") model to successfully induce misclassification in another ("target" or "victim") model without requiring access to its parameters or gradients. This criterion is critical for black-box attacks, threat modeling, evaluation of model robustness, and the practical security assessment of AI/ML systems.

1. Definition and Mathematical Formulation

Transfer attack success rate (ASR) is typically defined, for a set of adversarial examples $\{\tilde{x}_i\}_{i=1}^N$ generated on a surrogate model $f_s$, as the proportion that fool a target model $f_t$:

$$\text{ASR} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\big(f_t(\tilde{x}_i) \neq y_i\big)$$

for untargeted attacks, or

$$\text{ASR}_{\text{targeted}} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\big(f_t(\tilde{x}_i) = y^*_i\big)$$

where $y_i$ is the ground-truth label and $y^*_i$ is the attacker-chosen target label. Transferability captures the extent to which adversarial examples, optimized to fool a source model under some perturbation constraint (e.g., an $L_\infty$-norm bound), also succeed in fooling unrelated models.
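As a concrete illustration, the following minimal sketch (NumPy; function and variable names are illustrative, and the target model's predicted labels are assumed to be pre-computed) evaluates both variants of the metric:

```python
import numpy as np

def untargeted_asr(target_preds, true_labels):
    """Fraction of adversarial examples the target model misclassifies."""
    return float(np.mean(np.asarray(target_preds) != np.asarray(true_labels)))

def targeted_asr(target_preds, target_labels):
    """Fraction of adversarial examples classified as the attacker-chosen label."""
    return float(np.mean(np.asarray(target_preds) == np.asarray(target_labels)))

# Toy example: labels predicted by the victim model f_t on inputs crafted against f_s
preds  = np.array([2, 0, 7, 7, 1])   # f_t predictions on adversarial inputs
truth  = np.array([3, 0, 5, 2, 1])   # ground-truth labels y_i
wanted = np.array([2, 4, 7, 7, 9])   # attacker-chosen targets y*_i
print(untargeted_asr(preds, truth))   # 0.6
print(targeted_asr(preds, wanted))    # 0.6
```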

Variants of this metric include the top-$k$ attack success rate (the fraction of samples for which the true label is not within the target's top-$k$ predictions) and other indices based on label rank (Zhang et al., 2022). For certain settings, the transferability gain is further normalized to account for distortion, as in:

$$T_{s,t} = \frac{\int_0^\infty \big(P_{s,t}(u) - P_t^{BB}(u)\big)\,du}{\int_0^\infty \big(P_t^{WB}(u) - P_t^{BB}(u)\big)\,du}$$

where $P_{s,t}(u)$ is the operating characteristic giving the fraction of samples with perturbation $< u$ for the transfer attack, $P_t^{BB}$ the corresponding curve for the black-box decision-based attack, and $P_t^{WB}$ for the white-box attack (Maho et al., 2023).
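To make the distortion-normalized variant concrete, the sketch below estimates the operating characteristics empirically (assumptions: each attack is summarized by the per-sample perturbation size at which it first succeeds, with failures recorded as infinity, and the improper integrals are truncated at a finite `u_max`; all names are illustrative):

```python
import numpy as np

def operating_characteristic(distortions, u_grid):
    """P(u): fraction of samples successfully attacked with perturbation < u."""
    d = np.asarray(distortions, dtype=float)
    return np.array([np.mean(d < u) for u in u_grid])

def _trapezoid(y, x):
    """Trapezoidal-rule integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def normalized_transferability(d_transfer, d_blackbox, d_whitebox, u_max=20.0, n=2001):
    """T_{s,t}: area between the transfer and black-box curves, normalized by the
    area between the white-box and black-box curves (integrals truncated at u_max)."""
    u = np.linspace(0.0, u_max, n)
    p_st = operating_characteristic(d_transfer, u)
    p_bb = operating_characteristic(d_blackbox, u)
    p_wb = operating_characteristic(d_whitebox, u)
    return _trapezoid(p_st - p_bb, u) / _trapezoid(p_wb - p_bb, u)

# Toy distortions (e.g., L2 norms) at which each attack first succeeds per sample
rng = np.random.default_rng(0)
d_wb = rng.uniform(0.5, 3.0, 200)    # white-box attacks need small distortion
d_bb = rng.uniform(2.0, 12.0, 200)   # decision-based black-box needs more
d_tr = rng.uniform(1.0, 6.0, 200)    # transfer attack falls in between
print(round(normalized_transferability(d_tr, d_bb, d_wb), 3))
```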

2. Determinants and Mechanisms of Transferability

Transfer attack success is shaped by both representational and procedural factors, including intrinsic model similarity, the choice of attack/perturbation generation method, and auxiliary data transformations (a baseline crafting-and-evaluation sketch follows this list):

  • Feature Representations: Transferability is high when source and target models share similar internal representations or learn comparable features and vulnerabilities. This relationship can be quantified empirically using normalized symmetric Hausdorff distances between low-dimensional manifold embeddings (Dale et al., 6 Dec 2024), Centered Kernel Alignment (CKA), or Diagonal Box Similarity (DBS) for per-layer analysis (Klause et al., 27 Jan 2025).
  • Attack Generation Algorithms:
    • Gradient Transfer and Embedding Search: Algorithms such as TREMBA (Huang et al., 2019) construct a low-dimensional embedding of semantic adversarial perturbations via an encoder–decoder generator trained on a surrogate. Searching for adversarial updates in this space (using Natural Evolution Strategies) yields highly transferable adversarial patterns due to shared semantic features.
    • Transformation Ensemble: Approaches including AITL (Yuan et al., 2021) and S⁴ST (Liu et al., 13 Oct 2024) employ adaptive or highly optimized transformation pipelines (scaling, cropping, color adjustments) to increase the overlap in gradient directions between surrogate and target models, markedly increasing transfer ASR.
    • Meta-Learning: Meta-optimization across model and data augmentations, such as in LLTA (Fang et al., 2021), causes perturbations to generalize across multiple tasks, further boosting transferability.
    • Evolution and Genetic Algorithms: Genetic optimization strategies, especially those preserving or mimicking attribution maps (as in QuScore (Abdukhamidov et al., 2023)), achieve high transferability in the context of interpretable models.
    • Co-Adaptation Disruption: DropConnect-based methods (Su et al., 24 Apr 2025) disrupt overfitted dependency patterns in the perturbation, diversifying across model variants and thus preventing overfitting to the surrogate.
    • Layer/Crop Selection: Recent works for vision-language or transformer models utilize targeted perturbation of semantically rich local regions or discovery of the most vulnerable layers (e.g., DTA (Zheng et al., 3 Aug 2024), local-aggregated perturbations (Li et al., 13 Mar 2025)), demonstrating high cross-architecture transfer rates.
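To ground the procedural side, a baseline workflow common to many of these methods is to craft perturbations with an iterative gradient attack on the surrogate and then score them on the target. The sketch below is a generic I-FGSM baseline rather than any specific cited method (PyTorch-style; models and data are assumed to be loaded and names are illustrative):

```python
import torch
import torch.nn.functional as F

def craft_ifgsm(surrogate, x, y, eps=8/255, alpha=2/255, steps=10):
    """Iterative FGSM on the surrogate: maximize its loss within an L_inf ball."""
    x0 = x.clone().detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(surrogate(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()           # ascend the surrogate's loss
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)    # project into the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                 # keep a valid pixel range
    return x_adv.detach()

@torch.no_grad()
def transfer_asr(target, x_adv, y):
    """Untargeted transfer ASR of surrogate-crafted inputs on the target model."""
    preds = target(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()

# Usage sketch (surrogate_model, target_model, images, labels assumed to exist):
# x_adv = craft_ifgsm(surrogate_model.eval(), images, labels)
# print(transfer_asr(target_model.eval(), x_adv, labels))
```

The methods surveyed above can be read as replacing or augmenting individual pieces of this loop: the gradient step (embedding search, meta-learning), the inputs fed to the surrogate (transformation ensembles), or the surrogate itself (ensembles, DropConnect variants).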

3. Empirical Benchmarks and Quantitative Results

Empirical results consistently show that carefully designed transfer attacks can achieve high ASRs under black-box constraints, with substantial variance depending on the method and scenario:

| Method/Setting | Average ASR | Context/Notes |
|---|---|---|
| TREMBA (Huang et al., 2019) | 98% (MNIST); 98.5% (ImageNet); +10% on defended models | NES-based embedding search; high semantic transferability |
| DeepPoison (Chen et al., 2021) | 91.74% (with 7% poisoning) | GAN-based stealthy poisoning, feature-level triggers |
| AITL (Yuan et al., 2021) | 90–96% (ImageNet); +15% over baselines | Adaptive transformation learner |
| LLTA (Fang et al., 2021) | 12.85% higher than SOTA | Meta-learning across model/data augmentation |
| SU Attack (Wei et al., 2022) | +12% improvement | Self-universality, feature similarity loss, targeted |
| S⁴ST (Liu et al., 13 Oct 2024) | 77.7–83% (targeted); +14% over H-Aug | Scaling/augmentation/blockwise local strategies |
| DTA (Zheng et al., 3 Aug 2024) | >90% (ViT downstream) | Token cosine similarity loss, per-sample attacks |
| MCD (Su et al., 24 Apr 2025) | +13% (CNN→Transformer) | DropConnect self-ensemble against targeted models |
| Commercial LVLMs (Li et al., 13 Mar 2025) | >90% (GPT-4.5, 4o, o1, etc.) | Local-aggregated semantic perturbations, ensemble |
| IDS/Feature Mismatch (Ennaji et al., 11 Apr 2025) | Varies (capped by TFS: α·f_align + β·A_sim + γ·D_hom) | Sensitive to feature/architecture/data divergence |

Surrogate ensemble attacks (Levy et al., 2022) and dynamic source selection (FiT) (Maho et al., 2023) can further drive ASR to near 100% for best-case selection, but random surrogate choice can underperform black-box attacks.

4. Model Similarity, Overfitting, and Predictability

The degree of representational and architectural similarity between source and target models is a leading determinant of transfer ASR:

  • Global representational similarity is typically moderate (mean CKA ≈ 0.45), with considerable variability across architectures; DenseNet and deeper networks can exhibit lower similarity yet greater vulnerability as attack targets (Klause et al., 27 Jan 2025).
  • Co-adaptation among features narrows the transferability; strategies that disrupt or diversify this co-adaptation (e.g., DropConnect, pseudo-victim bilevel feedback (Liu et al., 4 Jun 2024)) yield more transferable perturbations.
  • Predictive models (e.g., DecisionTreeRegressor) trained on metrics such as CKA/DBS + layer counts can predict transfer ASR for certain black-box/C&W attacks with >90% accuracy (Klause et al., 27 Jan 2025), but accuracy drops in more complex scenarios; a linear CKA sketch follows this list.
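For reference, linear CKA between two models' (or layers') activation matrices on the same inputs can be computed as in the short sketch below (a standard linear-CKA formulation, not code from the cited study; names are illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices
    X (n x p1) and Y (n x p2) computed on the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross  = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Sanity checks: identical representations give 1.0; independent ones are near 0
rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 64))
B = rng.normal(size=(2000, 32))
print(round(linear_cka(A, A), 3))   # 1.0
print(round(linear_cka(A, B), 3))   # close to 0 for unrelated features
```

Layer-wise CKA (or DBS) profiles of this kind are what the predictive models above consume as input features.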

Manifold alignment or cross-projection of feature embeddings (Dale et al., 6 Dec 2024) shows that lower manifold distance (Hausdorff) correlates (ρ ≈ -0.56) with increased transfer ASR, supporting a common-vulnerability hypothesis.

5. Impact of Transformations, Embeddings, and Augmentation

Transfer attack success rate improves with strategies that:

  • Employ low-dimensional/semantic embeddings: Sampling/optimizing perturbations in the learned latent space rather than pixel space increases both ASR and query efficiency (Huang et al., 2019).
  • Integrate tailored or adaptive transformation ensembles: Operations such as image resizing, cropping, color augmentation, and blockwise scaling aligned with the threat model increase gradient alignment and mitigate overfitting (Yuan et al., 2021, Liu et al., 13 Oct 2024); see the gradient-averaging sketch after this list.
  • Optimize for universality across spatial regions instead of across images: Promoting local invariance (e.g., self-universality) produces features that are robust to network and spatial variation, improving transfer for targeted and untargeted attacks (Wei et al., 2022).
  • Combine data and model augmentation: Meta-learning over a mixture of model variants (via backprop modification, dropout, or architectural stochasticity) improves generalization over possible target models (Fang et al., 2021, Su et al., 24 Apr 2025).
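As an illustration of the transformation-ensemble idea, each attack iteration can average the surrogate's input gradient over several randomly transformed copies of the current adversarial example, which tends to suppress surrogate-specific gradient components. The sketch below is a generic input-diversity step (PyTorch-style; not the specific pipelines of the cited works, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def random_resize_pad(x, low=0.85, high=1.0):
    """Randomly downscale the batch, then zero-pad back to the original size."""
    n, c, h, w = x.shape
    scale = float(torch.empty(1).uniform_(low, high))
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    x_small = F.interpolate(x, size=(nh, nw), mode="bilinear", align_corners=False)
    pad_h, pad_w = h - nh, w - nw
    top  = int(torch.randint(0, pad_h + 1, (1,)))
    left = int(torch.randint(0, pad_w + 1, (1,)))
    return F.pad(x_small, (left, pad_w - left, top, pad_h - top))

def transform_averaged_gradient(surrogate, x_adv, y, n_copies=4):
    """Gradient w.r.t. x_adv, averaged over randomly transformed copies."""
    x_in = x_adv.detach().requires_grad_(True)
    total = sum(F.cross_entropy(surrogate(random_resize_pad(x_in)), y)
                for _ in range(n_copies))
    grad, = torch.autograd.grad(total / n_copies, x_in)
    return grad

# Inside an I-FGSM-style loop, this averaged gradient replaces the plain one:
# g = transform_averaged_gradient(surrogate_model, x_adv, labels)
# x_adv = project(x_adv + alpha * g.sign())   # step and project as before
```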

6. Limitations, Realism, and Defensive Considerations

While transfer ASR can be close to perfect under ideal lab conditions (aligned features, similar data distributions, comparable architectures), real-world environments impose constraints:

  • Architectural or feature-set divergence can sharply reduce transfer viability. The Transferability Feasibility Score (TFS) (Ennaji et al., 11 Apr 2025) captures this as:

$$\text{TFS} = \alpha\, f_{\text{align}} + \beta\, A_{\text{sim}} + \gamma\, D_{\text{hom}}$$

where $f_{\text{align}}$ (Jaccard overlap), $A_{\text{sim}}$ (normalized parameter difference), and $D_{\text{hom}}$ (Wasserstein data distance) are regression-calibrated predictors of ASR; a minimal scoring sketch follows this list. Negative architectural similarity and moderate data distance often limit ASR in practical IDS scenarios.

  • Defensive strategies leveraging architectural heterogeneity, intentional data shifts, or sophisticated detection may reduce practical transfer ASR even as laboratory methods report high values.
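A minimal TFS scoring sketch, as referenced above (the weights stand in for the regression-calibrated coefficients and the component values are illustrative; only the Jaccard-overlap helper is spelled out):

```python
def transferability_feasibility_score(f_align, a_sim, d_hom,
                                       alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted combination of feature alignment (Jaccard overlap), architectural
    similarity (normalized parameter difference), and data homogeneity
    (Wasserstein distance). The weights here are placeholders, not calibrated values."""
    return alpha * f_align + beta * a_sim + gamma * d_hom

def jaccard_overlap(features_a, features_b):
    """Jaccard overlap between two feature sets, usable as f_align."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Example: two IDS models that share most, but not all, input features
f_align = jaccard_overlap({"duration", "proto", "bytes", "flags"},
                          {"duration", "proto", "bytes", "ttl"})
print(round(f_align, 2))                                          # 0.6
print(transferability_feasibility_score(f_align, a_sim=-0.2, d_hom=0.4))
```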

This creates a mismatch between theoretical and practical transferability: high ASR in controlled experiments may not always carry over to real, heterogeneous environments.

Research increasingly seeks to:

  • Develop blind predictors of transferability based on feature correspondence and manifold alignment, allowing a priori vulnerability estimation for black-box targets (Dale et al., 6 Dec 2024).
  • Enhance transferability by further diversifying perturbation pathways (e.g., novel transformations, task-level augmentations, bilevel and meta-optimization frameworks (Liu et al., 4 Jun 2024)).
  • Better quantify and balance the trade-off between attack success, perturbation imperceptibility, and computational efficiency (distortion-aware metrics (Maho et al., 2023)).
  • Adapt transferability frameworks for cross-modal and non-vision domains, building on insights from vision models to IDS (Ennaji et al., 11 Apr 2025) and vision-LLMs (Li et al., 13 Mar 2025).

Transfer attack success rate remains a multi-faceted, context-dependent measure—sensitive to model similarity, architectural choices, feature alignment, and the sophistication of attack optimization and evaluation criteria. Its ongoing refinement is central to both adversarial research and the robust design of learning systems.
