Transfer Attacks in Adversarial ML
- Transfer attacks are adversarial strategies that generate perturbations using a surrogate model to mislead a black-box target model.
- They leverage properties like gradient alignment and loss landscape variability to achieve high attack success rates across different architectures.
- These attacks expose vulnerabilities in ML systems, inspiring countermeasures and robust defenses in domains such as computer vision, NLP, and code models.
Transfer attacks are adversarial attacks in which perturbations crafted to fool a source ("surrogate") model are deployed against a distinct, usually black-box, target model. They exploit the transferability property of adversarial examples: the empirical phenomenon that adversarial examples generated against one model's learned decision boundaries often cause misclassification in other, even architecturally unrelated, models. As the dominant practical threat vector in black-box settings, where direct access to victim model internals is unavailable, transfer attacks have become a central focus in evaluating the robustness of machine learning systems, affecting computer vision, natural language processing, and code models.
1. Formal Definitions and Theoretical Foundations
A transfer attack is defined as a two-stage process:
- Attack generation: Given a surrogate model $f_s$ (fully accessible to the attacker), an adversarial example $x' = x + \delta$ is computed such that $f_s(x') \neq y$ for original label $y$ and norm-constrained perturbation $\|\delta\|_p \leq \epsilon$.
- Attack transfer: The adversarial example is then evaluated on the black-box target model $f_t$; the attack is successful if $f_t(x') \neq y$.
The transferability of such attacks can be formalized as the target model's loss induced by the surrogate's adversarial perturbation: $T = \mathcal{L}(x + \delta^*, y; \theta_t)$, where $\theta_t$ denotes the parameters of the target and $\delta^*$ is optimal for the surrogate (Demontis et al., 2018).
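The two-stage process above can be sketched on a deliberately tiny example. The sketch below uses one-step FGSM against a linear surrogate classifier and then evaluates the resulting perturbation on a different linear target; all weights, inputs, and the epsilon value are hypothetical illustrations, not taken from any cited work.

```python
import numpy as np

def predict(w, x):
    """Binary linear classifier: label +1 if w.x > 0, else -1."""
    return 1 if float(w @ x) > 0 else -1

def fgsm_linear(w_s, x, y, eps):
    """One-step L-infinity FGSM against a linear surrogate.

    For the margin loss of a linear model, the sign of the input gradient
    is sign(y * w_s), so the loss-increasing step is -eps * y * sign(w_s).
    """
    return x - eps * y * np.sign(w_s)

w_surrogate = np.array([1.0, 2.0, -1.0])   # attacker's white-box model
w_target    = np.array([0.8, 1.5, -0.5])   # black-box victim (similar weights)
x, y = np.array([0.2, 0.2, -0.1]), 1       # clean input, true label

# Stage 1: craft on the surrogate.  Stage 2: evaluate on the target.
x_adv = fgsm_linear(w_surrogate, x, y, eps=0.5)

# Both models flip to -1 because their weight vectors (input gradients)
# are well aligned -- the transfer succeeds.
print(predict(w_surrogate, x_adv), predict(w_target, x_adv))  # -1 -1
```

The transfer succeeds here only because the two weight vectors point in similar directions, which previews the gradient-alignment condition discussed below.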
Key quantitative metrics include:
- Attack Success Rate (ASR): Fraction of adversarial examples that induce misclassification on the target.
- Transfer ASR (T-ASR): ASR aggregated across all attack/model combinations, reflecting black-box robustness (Yang et al., 25 Feb 2025).
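Both metrics reduce to simple averaging over fooling outcomes. The sketch below uses a made-up boolean matrix with one row per target model and one column per adversarial example; per-target ASR is a row mean and T-ASR is the overall mean.

```python
import numpy as np

# fooled[i, j] == True  <=>  adversarial example j (crafted on some
# surrogate) fools target model i.  All entries are illustrative.
fooled = np.array([
    [True,  True,  False, True ],   # target A
    [True,  False, False, True ],   # target B
    [False, True,  True,  True ],   # target C
])

asr_per_target = fooled.mean(axis=1)   # ASR for each target model
t_asr = fooled.mean()                  # overall transfer ASR

print(asr_per_target)    # [0.75 0.5  0.75]
print(round(t_asr, 4))   # 0.6667
```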
Transferability is fundamentally governed by:
- The intrinsic adversarial vulnerability (norm of input gradients) of the target.
- The gradient alignment (cosine similarity between surrogate and target input gradients) and the relative complexity of the two models.
- The loss-landscape variability of the surrogate, indicating attack stability (Demontis et al., 2018).
These quantities provide a basis for bounding (or explaining) the gap between white-box and black-box attack effectiveness, and they suggest practical strategies for attacker surrogate selection and defensive regularization. Notably, the L2 norm of a perturbation, rather than its L-infinity norm, proves most predictive of real-world success (Mao et al., 2022).
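The gradient-alignment diagnostic is just a cosine similarity between the two models' input gradients at the same point. The gradient vectors below are made-up stand-ins for dL/dx of each model, chosen only to illustrate the contrast between aligned and mismatched geometry.

```python
import numpy as np

def cosine_alignment(g_surrogate, g_target):
    """Cosine similarity between surrogate and target input gradients."""
    return float(g_surrogate @ g_target /
                 (np.linalg.norm(g_surrogate) * np.linalg.norm(g_target)))

g_s = np.array([ 0.9,  1.8, -1.1])  # surrogate input gradient
g_a = np.array([ 0.8,  1.5, -0.5])  # target with similar geometry
g_b = np.array([-1.0,  0.2,  1.4])  # target with mismatched geometry

print(cosine_alignment(g_s, g_a))   # near 1  -> transfer expected
print(cosine_alignment(g_s, g_b))   # negative -> transfer unlikely
```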
2. Methodologies for Transfer Attack Generation
Transfer attacks subsume a variety of families (Zhao et al., 2022, Zhang et al., 2023, Guesmi et al., 26 May 2025):
| Category | Representative Methods | Key Principles |
|---|---|---|
| Gradient Stabilization | MI-FGSM, NI-FGSM | Momentum, Nesterov steps |
| Input Augmentation | DI, TI, SI, VT, Admix | Diverse transforms |
| Feature Disruption | TAP, FIA, ILA, NAA | Intermediate-layer targeting |
| Surrogate Refinement | SGM, LinBP, RFA, IAA, DSM | Surrogate model tweaks |
| Generative Modeling | GAP, CDA, GAPF, Dual-Flow | One-shot generator-based |
Iterative methods such as MI-FGSM accumulate gradient momentum across iterations; stopping early (around 10 iterations) typically yields the best transfer. Augmentation approaches (DI, TI) diversify gradients through random resizing and translation, capturing transformations the target may encounter. Feature-disruption attacks optimize feature activations at selected layers, where the choice of layer (e.g., conv3_x for ResNet) is critical for transfer. Generative methods such as Dual-Flow (Chen et al., 4 Feb 2025) and GAP (Zhao et al., 2022) produce instance-agnostic, multi-target perturbations with explicit conditioning and distribution-shift training, exhibiting notable cross-model and cross-task generalization.
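The momentum update at the heart of MI-FGSM can be sketched on the same toy linear setting as before: a differentiable logistic loss stands in for a deep network's loss, but the normalized-gradient accumulation and signed step are the attack's actual structure. All weights, inputs, and hyperparameters here are illustrative.

```python
import numpy as np

def grad_logloss(w, x, y):
    """d/dx of log(1 + exp(-y * w.x)) for a linear model."""
    s = float(y * (w @ x))
    return -y * w / (1.0 + np.exp(s))

def mi_fgsm(w, x, y, eps=0.5, iters=10, mu=1.0):
    """Momentum iterative FGSM within an L-infinity ball of radius eps."""
    alpha = eps / iters                 # per-step size
    g, x_adv = np.zeros_like(x), x.copy()
    for _ in range(iters):
        grad = grad_logloss(w, x_adv, y)
        g = mu * g + grad / np.sum(np.abs(grad))   # momentum accumulation
        x_adv = x_adv + alpha * np.sign(g)         # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay in the eps-ball
    return x_adv

w_surrogate = np.array([1.0, 2.0, -1.0])
w_target    = np.array([0.8, 1.5, -0.5])
x, y = np.array([0.2, 0.2, -0.1]), 1

x_adv = mi_fgsm(w_surrogate, x, y)
print(float(w_target @ x_adv))   # negative score => target is fooled
```

Momentum stabilizes the update direction across iterations, which is what improves transfer relative to plain iterative FGSM.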
Specialist frameworks handle transfer in other domains:
- Transferable availability poisoning attacks optimize over both cross-entropy and contrastive (alignment/uniformity) objectives to degrade accuracy of any possible victim model trained on poisoned data (Liu et al., 2023).
- Transferable adversarial prompting is central to jailbreaking LLMs, where transfer is determined by prompt region overlap; loss-based constraints on source models often severely restrict transferability to target LLMs (Yang et al., 25 Feb 2025).
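The dual objective behind transferable availability poisoning can be sketched as a weighted sum of a supervised cross-entropy term and the alignment/uniformity terms of contrastive learning, so that victims trained under either paradigm degrade. The alignment/uniformity formulas follow their standard definitions; the embeddings, probabilities, and trade-off weight below are made-up stand-ins, not values from Liu et al. (2023).

```python
import numpy as np

def cross_entropy(probs, label):
    """Supervised loss on predicted class probabilities."""
    return -np.log(probs[label])

def alignment(z1, z2):
    """Mean squared distance between positive-pair embeddings."""
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

def uniformity(z):
    """Log-mean of the Gaussian potential over all embedding pairs."""
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return float(np.log(np.mean(np.exp(-2.0 * d2))))

# made-up unit embeddings for two augmented views of 3 samples
rng = np.random.default_rng(0)
z1 = rng.normal(size=(3, 4)); z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 = rng.normal(size=(3, 4)); z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
probs = np.array([0.7, 0.2, 0.1])

# combined poisoning objective; lam is a hypothetical trade-off weight
lam = 0.5
loss = cross_entropy(probs, 0) + lam * (alignment(z1, z2) + uniformity(z1))
print(loss)
```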
3. Architectural and Domain-Specific Transfer
Transferability varies by domain, task granularity, and model class:
- Vision Transformers (ViTs): Specialized attacks (e.g., Token Gradient Regularization (Zhang et al., 2023), TESSER (Guesmi et al., 26 May 2025)) leverage token-wise importance, spectral regularization, and variance reduction, yielding large ASR gains over classical CNN-based methods in both ViT-to-ViT and ViT-to-CNN transfer.
- Object Detection and Segmentation: Transfer attacks must account for inter-object context, spatial dependencies, and per-pixel or per-instance predictions. Context-aware attack plans and translation-invariance/ensemble approaches have yielded significant success-rate improvements for black-box detector attacks and semantic segmentation (Cai et al., 2021, He et al., 2023).
- Code Models and LLMs: Transferable code attacks use semantics-preserving mutators applied on white-box surrogates, with perturbations disrupting code understanding in GPT-4, Claude, and Llama-family models (Zhang et al., 2023). Prompt-based defenses (few-shot, explicit reverse instructions) are effective countermeasures.
Transfer learning introduces specific avenues for transfer attacks, such as headless attacks, which perturb feature extractors alone, bypassing the downstream classifier head entirely (Abdelkader et al., 2020). Downstream transfer attacks target models fine-tuned from pre-trained backbones (e.g., ViTs), using similarity-based objectives at internal layers (Zheng et al., 2024).
4. Empirical Insights, Limitations, and Defense
Empirical studies have revealed several nontrivial findings:
- Intermediate model complexity surrogates (e.g., ResNet-34, VGG-16) exhibit greater transfer than shallow or deep variants (Mao et al., 2022).
- Model family is not a universal predictor of transfer; no architecture consistently dominates (Mao et al., 2022).
- Architectural mismatch between surrogate and victim induces the largest drop in transferability (e.g., ASR falling from ~95% to ~15%) (Alecci et al., 2023).
- L2 norm dominance: L2-constrained perturbations yield higher transfer rates than L-infinity, and even random search can outperform certain gradient-based algorithms (Mao et al., 2022).
- Data characteristics influence transfer: mismatches in dataset origin, class balance, and training distribution impose additive penalties. For example, a mismatch in class balance or data source can decrease ASR by ~20%, though these factors matter less than architecture (Alecci et al., 2023).
- Stealthiness trade-offs must be addressed in practice: high-transfer attacks tend to be less perceptually subtle, with FID, SSIM, and LPIPS scores offering complementary evaluation to traditional norm-bounded metrics (Zhao et al., 2022).
Recent defenses against transfer attacks include:
- Minimax, game-theoretic training: PubDef trains models against transfer attacks from a diverse set of public surrogates, outperforming classical adversarial training by 20–26 percentage points of robust accuracy in ImageNet settings (Sitawarin et al., 2023).
- Data-centric one-shot augmentation: DRL generates a diverse pool of adversarial examples on a surrogate before target model fitting, matching or exceeding robustness of iterative AT methods with much lower computational cost (Yang et al., 2023).
- Prompt-based LLM defenses: Inclusion of adversarial code pairs and restoration instructions in prompts mitigates transferability of code attacks (Zhang et al., 2023).
5. Open Challenges and Future Directions
Open research directions include:
- Optimally choosing surrogates for black-box transfer, and formalization of capacity/similarity metrics beyond gradient alignment.
- Frequency domain and spectral approaches: Centralized perturbation attacks focus on low/mid-frequency components, aligning with shared model representations (Wu et al., 2023).
- Algorithmic frameworks such as bilevel optimization (BETAK), which coordinate initialization and hypergradient response to explicitly maximize transferability subject to surrogate adaptation (Liu et al., 2024).
- Physical-world and context-driven attacks: Extending transferability to real-world and context-dependent settings (e.g., physically realizable object detector attacks, jailbreaking LLMs under safety layers) (Cai et al., 2021, Yang et al., 25 Feb 2025).
- Defense-evasion arms race: Most current defenses are not robust to unknown, unconstrained transfer attacks, especially where non-L-infinity norms or novel domain transfer mechanisms are used.
A plausible implication is that as models, training data, and attack algorithms continue to diversify, both attackers and defenders must move towards ensemble and structure-aware strategies that reflect the practical heterogeneity of real-world deployment scenarios.
6. Comprehensive Evaluation and Best Practices
A systematic evaluation of transfer attacks should consider both transferability and stealthiness using diverse metrics (ASR, PSNR, SSIM, LPIPS, FID), sweeping hyperparameters (iteration count, augmentation multiplicity, target layers) (Zhao et al., 2022). Best practices include:
- Early stopping in iterative attacks for optimal transfer.
- Hyperparameter fairness across categories (e.g., same step-size/copy count for augmentation methods).
- Defense evaluation against unseen attack types and attack signature analysis (e.g., using diagnostic classifiers to fingerprint attack generation processes).
- Release of code, standardized benchmarks, and attack lists to promote reproducibility (Zhao et al., 2022).
References
- (Mao et al., 2022) Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings
- (Zhao et al., 2022) Towards Good Practices in Evaluating Transfer Adversarial Attacks
- (Zhang et al., 2023) Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization
- (Liu et al., 2023) Transferable Availability Poisoning Attacks
- (Yang et al., 25 Feb 2025) Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
- (Chen et al., 4 Feb 2025) Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization
- (Wu et al., 2023) Towards Transferable Adversarial Attacks with Centralized Perturbation
- (He et al., 2023) Transferable Attack for Semantic Segmentation
- (Zhang et al., 2023) Transfer Attacks and Defenses for LLMs on Coding Tasks
- (Zheng et al., 2024) Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers
- (Guesmi et al., 26 May 2025) TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization
- (Sitawarin et al., 2023) PubDef: Defending Against Transfer Attacks From Public Models
- (Alecci et al., 2023) Your Attack Is Too DUMB: Formalizing Attacker Scenarios for Adversarial Transferability
- (Yang et al., 2023) Towards Deep Learning Models Resistant to Transfer-based Adversarial Attacks via Data-centric Robust Learning
These works collectively define the state of the art and theory in transfer attacks, their practical methodologies, empirical properties, and countermeasures.