
Transferred Adversarial Attacks

Updated 14 November 2025
  • Transferred adversarial attacks are methods that generate crafted perturbations on surrogate neural networks to mislead independent target models.
  • They leverage shared low- and mid-level representations and flat loss surface strategies to boost attack success across diverse architectures.
  • Advanced techniques like input transformations, frequency-domain manipulations, and feature-space attacks further improve transferability and inform robust ML defenses.

Transferred adversarial attacks, also referred to as transferability-based adversarial attacks or black-box transfer attacks, exploit the property that adversarial perturbations crafted for a specific (“source” or surrogate) neural network model can often cause misclassification in other (“target” or victim) models with different architectures, training data, or even domains. This phenomenon, originally observed in early adversarial machine learning literature and since extensively studied, underpins practical black-box threat models where the attacker cannot access the internals of the deployed (target) model.

1. Formal Definition and Threat Models

Transferred adversarial attacks are defined as the process in which adversarial examples $x^{\text{adv}} = x + \delta$, where $x$ is a clean input and $\delta$ obeys a specified norm bound, e.g., $\|\delta\|_p \leq \epsilon$, are generated to fool a surrogate model $f_s(\cdot)$ but are evaluated for their ability to alter the predictions of an unknown target model $f_t(\cdot)$ (Qin et al., 2022, Richards et al., 2021, Mao et al., 2022, Petrov et al., 2019, Klause et al., 27 Jan 2025, Cox et al., 7 Nov 2025).

The canonical setting is:

  • White-box attack on surrogate: $\delta^* = \arg\max_{\delta \in \Delta} \mathcal{L}(f_s(x + \delta), y)$.
  • Transferability criterion: $f_t(x + \delta^*) \neq y$ (untargeted) or $f_t(x + \delta^*) = y_t$ (targeted to class $y_t$).

Transferability is quantified by the target model’s top-1 misclassification or targeted attack success rate, and is foundational in realizing query-free black-box attacks in both vision and non-vision domains (Nowroozi et al., 2023, Nowroozi et al., 2021).
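
This setting can be made concrete with a short sketch: craft $\delta$ with white-box PGD on the surrogate, then measure the untargeted transfer success rate on the target. This is a minimal illustration assuming PyTorch classifiers that return logits and an $L_\infty$ budget; the function names and hyperparameters are placeholders rather than a reference implementation from the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_on_surrogate(f_s, x, y, eps=8/255, alpha=2/255, steps=10):
    """White-box L_inf PGD on the surrogate f_s (untargeted ascent on the CE loss)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        F.cross_entropy(f_s(x + delta), y).backward()
        # Gradient-sign ascent, then projection onto the L_inf ball and valid pixel range.
        delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()

@torch.no_grad()
def transfer_success_rate(f_t, x_adv, y):
    """Top-1 misclassification rate of the target model on the transferred examples."""
    return (f_t(x_adv).argmax(dim=1) != y).float().mean().item()
```

Calling `transfer_success_rate(f_t, pgd_on_surrogate(f_s, x, y), y)` implements the untargeted criterion $f_t(x + \delta^*) \neq y$ above.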

2. Mechanistic Explanations and Theoretical Principles

The transferability of adversarial examples arises from overlapping adversarial subspaces and the structural similarities of learned representations between deep neural networks, even across models with different training or architectures (Qin et al., 2022, Klause et al., 27 Jan 2025, Petrov et al., 2019). Key mechanistic principles include:

  • Loss surface geometry: Standard attacks often yield adversarial examples stuck in “sharp minima” of the surrogate’s loss landscape. These points are highly sensitive to decision boundary shifts, causing poor transfer (Qin et al., 2022).
  • Flatness-driven transfer: Constraining adversarial examples to reside in flat regions of the surrogate’s loss surface (minimizing maximum loss in a local neighborhood) dramatically increases transfer success by making the perturbation robust to boundary shifts between models (Qin et al., 2022).
  • Shared low-/mid-level representations: Many architectures share analogous features at early and intermediate layers, making attacks targeting these representations more likely to generalize (Huang et al., 2018, Inkawhich et al., 2020).
  • Model and layer similarity: Centered Kernel Alignment (CKA) and other similarity metrics can be used to predict transferability: networks that are more similar at the representational level exhibit higher transfer rates, though the relationship is non-linear and depends on attack type and architectural specifics (Klause et al., 27 Jan 2025, Cox et al., 7 Nov 2025).
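
The similarity analysis in the last bullet can be illustrated with linear CKA between two activation matrices (one row per example). The sketch below follows the standard linear-CKA formula; the array shapes and function name are assumptions for illustration, not the cited authors' code.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2),
    with one row per example."""
    X = X - X.mean(axis=0)                        # column-center both representations
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2   # ||Y^T X||_F^2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))
```

Higher CKA indicates more similar representations, which in the cited studies correlates (non-linearly) with higher observed transfer rates.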

3. Methodological Advances in Improving Transferability

Multiple algorithmic strategies have been developed to enhance the cross-model transferability of adversarial attacks:

a) Loss Surface Manipulation

  • Reverse Adversarial Perturbation (RAP): RAP implements a min–max bi-level optimization, seeking adversarial examples in $L_\infty$ balls whose entire local $\epsilon_n$-neighborhood maintains low surrogate loss. This is solved by alternating inner maximization (worst-case local ascent) and outer descent (gradient sign update) (Qin et al., 2022); a simplified sketch follows this list.
  • Defense-guided Max–Min Optimization: Affine input transformations (translation, rotation, scaling) select the loss-minimizing (“defensive”) configuration at every iteration, forcing the adversarial candidate to be robust to diverse geometric variations (Zhang et al., 2020).
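
As referenced in the RAP bullet, a minimal min–max sketch alternates an inner ascent that finds the worst-case ("reverse") perturbation inside a small $\epsilon_n$-ball and an outer sign step that lowers the attack loss at that worst-case point. This is a simplified paraphrase assuming a negative cross-entropy attack loss, not the published implementation; all hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def attack_loss(logits, y):
    """Untargeted attack loss on the surrogate: negative cross-entropy
    (the attacker minimizes this, i.e. maximizes classification loss)."""
    return -F.cross_entropy(logits, y)

def rap_style_attack(f_s, x, y, eps=8/255, eps_n=2/255, alpha=2/255,
                     outer_steps=10, inner_steps=5):
    delta = torch.zeros_like(x)
    for _ in range(outer_steps):
        # Inner maximization: find the reverse perturbation n with the worst
        # (highest) attack loss inside the local eps_n ball around x + delta.
        n = torch.zeros_like(x, requires_grad=True)
        for _ in range(inner_steps):
            attack_loss(f_s(x + delta + n), y).backward()
            n.data = (n.data + alpha * n.grad.sign()).clamp(-eps_n, eps_n)
            n.grad.zero_()
        # Outer minimization: sign-descent on the attack loss at the worst-case
        # neighbor, then projection onto the global eps ball and valid pixel range.
        d = delta.clone().requires_grad_(True)
        attack_loss(f_s(x + d + n.detach()), y).backward()
        delta = (d.data - alpha * d.grad.sign()).clamp(-eps, eps)
        delta = (x + delta).clamp(0, 1) - x
    return (x + delta).detach()
```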

b) Input and Feature Diversification

  • Input transformations: Randomized or optimized geometric transforms (including affine, resizing, cropping, or small rotations) are applied either during attack or post-attack to avoid overfitting and inject diversity into the perturbations, boosting transfer (Wan et al., 2 Mar 2025, Zhang et al., 2020).
  • Feature-space attacks: Perturbing intermediate (task/data-shared) features, via representation maximization or feature distribution alignment, improves transfer success—most notably in the Intermediate Level Attack (ILA) paradigm and feature-distribution attacks (Huang et al., 2018, Inkawhich et al., 2020).
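
In the spirit of the randomized transformations described above, one minimal sketch of input diversification resizes the image by a random factor and pads it back to its original size before each gradient computation; the scale range below is an illustrative assumption.

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, low=0.9, high=1.0):
    """Randomly shrink a batch of images and pad it back to the original size."""
    _, _, h, w = x.shape
    scale = random.uniform(low, high)
    nh, nw = int(h * scale), int(w * scale)
    x_small = F.interpolate(x, size=(nh, nw), mode="bilinear", align_corners=False)
    pad_top = random.randint(0, h - nh)
    pad_left = random.randint(0, w - nw)
    # F.pad takes (left, right, top, bottom) for the last two dimensions.
    return F.pad(x_small, (pad_left, w - nw - pad_left, pad_top, h - nh - pad_top))
```

Inside an iterative attack, the surrogate gradient is then computed on `random_resize_pad(x + delta)` instead of `x + delta`, which reduces overfitting to the surrogate's exact input geometry.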

c) Frequency-Domain and Semantic Augmentation

  • Centralized perturbation in the frequency domain: Constraining optimization to dominant frequency coefficients with dynamic mask adaptation aligns perturbation support with model-agnostic features, reducing overfitting and improving both transferability and defense evasion (Wu et al., 2023); a DCT-based sketch follows this list.
  • Semantic injection/guided generation: Generative approaches incorporating auxiliary guiding images or curated low-frequency patterns expand the transferable region of feature space, enabling targeted, universal, and even cross-domain transfer attacks (Li et al., 2 Jan 2025, Naseer et al., 2019, Wu et al., 2019).
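
As noted in the first bullet above, one crude way to centralize a perturbation in the frequency domain is to project it onto a block of low-frequency DCT coefficients. The fixed square mask below is an illustrative simplification of the dynamically adapted mask used in the cited work.

```python
import numpy as np
from scipy.fft import dctn, idctn

def centralize_low_frequency(delta: np.ndarray, keep: int = 16) -> np.ndarray:
    """Project a (H, W) perturbation onto its lowest keep x keep DCT coefficients,
    discarding the high-frequency remainder."""
    coeffs = dctn(delta, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0          # keep only the top-left (low-frequency) block
    return idctn(coeffs * mask, norm="ortho")
```

For color images this projection is applied per channel, typically to the perturbation at every attack iteration.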

d) Directional and Output Space Aggregation

  • Direction-aggregated gradients: Smoothing update directions through aggregation across neighborhood points in input space (e.g., via random noise or transformation) stabilizes optimization and increases the likelihood of “salient” transferable directions (Huang et al., 2021, Tashiro et al., 2020).
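
A minimal form of direction aggregation averages the surrogate gradient over a few randomly jittered copies of the current candidate before taking the next sign step; the noise scale and sample count here are illustrative assumptions, not values from the cited papers.

```python
import torch
import torch.nn.functional as F

def aggregated_gradient(f_s, x_adv, y, n_samples=8, sigma=0.05):
    """Average the surrogate gradient over randomly jittered neighbors of x_adv."""
    grad_sum = torch.zeros_like(x_adv)
    for _ in range(n_samples):
        neighbor = (x_adv.detach() + sigma * torch.randn_like(x_adv)).requires_grad_(True)
        F.cross_entropy(f_s(neighbor), y).backward()
        grad_sum += neighbor.grad
    return grad_sum / n_samples   # smoothed ascent direction for the next sign step
```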

4. Empirical Findings and Benchmark Results

Extensive empirical studies document the landscape of transferability on ImageNet, CIFAR, cloud ML APIs, and security datasets (Mao et al., 2022, Qin et al., 2022, Wu et al., 2023, Petrov et al., 2019, Zhang et al., 2020, Richards et al., 2021, Nowroozi et al., 2021, Nowroozi et al., 2023). Key results:

  • Transferability baseline: Canonical I-FGSM or PGD attacks yield 15–50% untargeted transfer success between strong (ResNet, VGG, Inception) architectures under common $L_\infty$ budgets (Petrov et al., 2019, Qin et al., 2022, Mao et al., 2022).
  • State-of-the-art with advanced methods:
    • RAP and RAP-LS boost untargeted attack success by 8–17% and targeted attack success by 11–33% over matched baselines, reaching up to 99% on ensembles and major cloud APIs (Qin et al., 2022).
    • Direction-aggregated and defense-guided attacks reach 85–95% on adversarially trained and defended models, outperforming classic and momentum-based iterative methods (Huang et al., 2021, Zhang et al., 2020).
    • Simple 1° input rotations yield +6.5pp average improvement, up to +26.5pp on black-box ImageNet, in over 84% of evaluated attack/model triples (Wan et al., 2 Mar 2025).
    • Centralized frequency-domain attacks add +11.7% absolute black-box fooling rate and are more robust to compression and quantization defenses (Wu et al., 2023).
    • Output-diversified sampling strategies halve query counts for black-box attacks on ImageNet (Tashiro et al., 2020).
  • Surrogate selection: No single “best” architecture guarantees maximal transfer on real-world APIs; surrogate depth and diversity have non-monotonic, dataset-specific effects (Mao et al., 2022).
  • Posterior gap vs. logit gap: The posterior gap (softmax margin) on the surrogate is a much stronger predictor of transfer than the raw logit gap, with $R^2 \sim 0.8$ on real APIs (Mao et al., 2022); both margins are computed in the sketch after this list.
  • Metric and norm effects: $L_2$ perturbations, even those generated without gradients, sometimes outperform $L_\infty$-constrained attacks in cross-model settings (Mao et al., 2022). SSIM-aligned constraints are more reliable for perceptual similarity than $L_\infty$ (Petrov et al., 2019).
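
The posterior-gap versus logit-gap comparison above reduces to two margins computed on the surrogate's outputs. The sketch below is a straightforward reading of those definitions (softmax top-1 margin vs. raw logit top-1 margin), with names chosen here for illustration.

```python
import torch

def surrogate_margins(logits: torch.Tensor):
    """Return (logit_gap, posterior_gap): top-1 minus runner-up margins in
    logit space and in softmax (posterior) space, per example."""
    top2_logits = logits.topk(2, dim=1).values
    logit_gap = top2_logits[:, 0] - top2_logits[:, 1]
    top2_probs = logits.softmax(dim=1).topk(2, dim=1).values
    posterior_gap = top2_probs[:, 0] - top2_probs[:, 1]
    return logit_gap, posterior_gap
```

In the cloud-API study cited above, the posterior gap on the surrogate tracked target-side transfer success far more closely than the logit gap.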

5. Applications and Domain-Specific Manifestations

Vision

  • Transfer attacks have been most intensively studied on classification, segmentation, and transfer learning tasks using ImageNet, CIFAR, PASCAL VOC, and Cityscapes. SegTrans demonstrates state-of-the-art transfer rates for segmentation via local semantic fragmentation (Song et al., 10 Oct 2025).
  • Transferability is unchanged or amplified under MLaaS APIs, where model, data, and pretraining diversity is further compounded (Mao et al., 2022).

Non-Vision Domains (Cybersecurity and Network Traffic)

  • Transferability also holds in non-image domains, including network intrusion detection and flow classification, but with major caveats: success is contingent on feature domain, model architecture, and attack style (Nowroozi et al., 2023, Nowroozi et al., 2021). For CNN-based flow detectors, transfer is highest for JSMA and I-FGSM; PGD and DeepFool transfer poorly (Nowroozi et al., 2021).
  • Cross-domain generative domain-invariant attacks demonstrate that adversarial perturbations trained on non-overlapping domains (e.g., “Paintings”) retain strong fooling capability against ImageNet models, contingent on relativistic contrastive losses (Naseer et al., 2019).

6. Risk Quantification, Predictive Modeling, and Defensive Implications

  • Exhaustive risk coverage is infeasible due to the exponential size of high-dimensional input (adversarial) space (Cox et al., 7 Nov 2025).
  • Surrogate set selection via CKA: Combining surrogates with high and low CKA similarity maximizes adversarial subspace coverage, allowing institutions to estimate true risk from a small but strategically chosen model pool (Cox et al., 7 Nov 2025, Klause et al., 27 Jan 2025).
  • Regression-based risk estimation: Transfer success rates from surrogates, passed to a regression estimator, achieve $R^2 \approx 0.85$ against actual target risk, offering a pragmatic route to compliance and ML security evaluation despite the complexity of transfer (Cox et al., 7 Nov 2025); a minimal sketch follows this list.
  • Defensive recommendations:
    • Address class and data overlap carefully: adversarial training that assumes full overlap can decrease robustness when the actual overlap is only partial (Richards et al., 2021).
    • Shielding via adversarial fine-tuning on the most transferable attacks (MPAs), architectural diversity (e.g., LSTM + CNN), or CKA-targeted adversarial training shows high efficacy (Nowroozi et al., 2021, Cox et al., 7 Nov 2025).
    • Diversity-driven, non-aligned ensembles and input transformations can suppress the cross-architecture alignment that enables high transferability (Zhang et al., 2020, Song et al., 10 Oct 2025).
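
For the regression-based risk estimation mentioned earlier in this section, a minimal version fits a linear model mapping per-surrogate transfer success rates to the risk observed on held-out target models. The feature layout and the placeholder numbers below are assumptions for illustration, not the cited authors' pipeline or data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder numbers for illustration only.
# Rows: attack configurations; columns: transfer success rate against each surrogate.
surrogate_rates = np.array([
    [0.42, 0.61, 0.35],
    [0.28, 0.44, 0.22],
    [0.55, 0.71, 0.49],
    [0.33, 0.52, 0.30],
    [0.47, 0.66, 0.41],
])
# Risk (attack success rate) observed on held-out target models for the same configurations.
target_risk = np.array([0.47, 0.30, 0.58, 0.37, 0.51])

estimator = LinearRegression().fit(surrogate_rates, target_risk)
print(estimator.score(surrogate_rates, target_risk))   # in-sample R^2 of the fit
new_estimate = estimator.predict(surrogate_rates[:1])  # risk estimate for a new configuration
```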

7. Open Problems and Future Directions

  • Theoretical bounds: Tight theoretical characterizations of surrogate-to-target boundary alignment and subspace overlap under various loss landscapes (flatness, sharpness) remain underdeveloped (Qin et al., 2022, Klause et al., 27 Jan 2025).
  • Adaptive and certifiable defenses: Dynamic defenses that adapt to feature-space adversarial subspaces, or that provide certified guarantees against perturbed regions shared by diverse families, are highly sought after (Zhang et al., 2020, Naseer et al., 2019, Wu et al., 2023).
  • Attack extensions: Extending transfer attack methodologies to new norms (e.g., $L_2$ or perceptual metrics), modalities (audio, text), generative models, and other complex output tasks (detection, language) is an active direction (Qin et al., 2022, Song et al., 10 Oct 2025, Li et al., 2 Jan 2025).
  • Surrogate diversity vs. computational cost tradeoffs: Optimal balancing of surrogate pool diversity and attack coverage, especially under constraints on attack or test-time resources, requires further investigation (Cox et al., 7 Nov 2025, Mao et al., 2022).
  • Semantic and domain-invariant approaches: Plug-and-play semantic injection and unsupervised domain alignment promise new trans-domain security risks but also create new opportunities for characterizing universal adversarial directions (Li et al., 2 Jan 2025, Naseer et al., 2019, Wu et al., 2019).

Transferred adversarial attacks remain central to practical ML red teaming, robust network evaluation, and adversarial risk certification. The design of transfer-resistant models and comprehensive testing procedures continues to be both a technical and regulatory imperative for machine learning security research.
