Transferred Black Box Attacks

Updated 21 November 2025
  • Transferred black-box attacks are adversarial techniques that use surrogate models to craft perturbations exploiting shared non-robust features for misclassification.
  • They combine classical methods like FGSM and ensemble attacks with Bayesian and latent-space innovations to improve query efficiency and transfer success.
  • They are applied across vision, language, speech, and security domains, posing significant challenges for defensive strategies in real-world ML systems.

Transferred black-box attacks are a central paradigm in adversarial machine learning, exploiting the empirical phenomenon that adversarial examples crafted on one (typically white-box) model frequently induce misclassification when presented to an independently trained (and entirely black-box) target. Transferability underpins practical black-box attacks across vision, language, speech, and security domains, enabling adversaries to compromise ML systems without requiring access to internal parameters or gradients. The theoretical and methodological landscape of transferred black-box attacks encompasses classical transfer-based pipelines, Bayesian and evolutionary extensions, attack efficiency improvements, as well as unified frameworks for quantifying transfer risk and evaluating attack potency across architectures and tasks.

1. Fundamental Principles and Definitions

The canonical transferred black-box attack proceeds in two stages: (1) adversarial examples are generated by optimizing against a surrogate model (or ensemble) $s$ accessible to the attacker (white-box), typically by maximizing a surrogate loss $\ell(s, x+\delta, y)$ subject to a distortion constraint; (2) the crafted perturbation $\delta$ is added to $x$ and the resulting adversarial example $x^* = x+\delta$ is presented to the black-box target $f_t$, yielding misclassification if $f_t(x^*) \neq f_t(x)$ (Papernot et al., 2016, Cox et al., 7 Nov 2025). No query information, except potentially label output, is used at generation time.

Formally, the transferability metric is

$$T(f_t, s) = \mathbb{E}_{(x,y)\sim D} \left[ \mathbf{1}\{\, f_t(x + \delta_s(x)) \neq f_t(x) \,\} \right]$$

where $\delta_s(x)$ is optimized on the surrogate $s$ (Cox et al., 7 Nov 2025).
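
The two-stage pipeline and the metric above can be summarized in a short sketch. The following Python/PyTorch code is illustrative only: `surrogate`, `target`, and `loader` are assumed stand-ins, the crafting step is single-step FGSM, and the target is treated as a label-only oracle.

```python
import torch
import torch.nn.functional as F

def fgsm_on_surrogate(surrogate, x, y, eps):
    """Stage 1: craft delta_s(x) on the white-box surrogate (single-step FGSM)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), y)   # surrogate loss l(s, x + delta, y) at delta = 0
    grad, = torch.autograd.grad(loss, x)
    return eps * grad.sign()                  # L_inf-constrained perturbation

@torch.no_grad()
def transferability(target, surrogate, loader, eps=8 / 255):
    """Stage 2 / metric: empirical T(f_t, s), the fraction of target labels that flip."""
    flipped, total = 0, 0
    for x, y in loader:
        with torch.enable_grad():
            delta = fgsm_on_surrogate(surrogate, x, y, eps)
        clean = target(x).argmax(dim=1)                        # f_t(x), label-only access
        adv = target((x + delta).clamp(0, 1)).argmax(dim=1)    # f_t(x + delta_s(x))
        flipped += (adv != clean).sum().item()
        total += x.size(0)
    return flipped / total
```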

A key property is that transferability—empirically high even across disjoint architectures and training sets—is driven by non-robust, shared features, the geometry of decision boundaries, and the (mis)alignment between surrogate and target model representations (Papernot et al., 2016, Zhang et al., 19 Jan 2024, Djilani et al., 30 Dec 2024, Jalwana et al., 11 Apr 2025).

2. Classical Pipelines and Algorithmic Frameworks

Substitute Training: The foundational recipe, sometimes termed the "substitute transfer attack," starts with a small seed set $S_0$; the black-box target is queried for labels to create an initial labeled set, then a substitute model $\hat{f}$ is trained. Augmentation (e.g., via Jacobian-based methods) and reservoir sampling can be used to minimize the number of required queries. Once a sufficiently accurate $\hat{f}$ is available, attacks such as FGSM or PGD are launched against it, and adversarial samples are transferred to the target (Papernot et al., 2016).
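
As a concrete illustration, a single substitute-training round with Jacobian-based augmentation might look like the sketch below; `query_target_labels` is an assumed stand-in for the label-only black-box oracle, and the hyperparameters are arbitrary.

```python
import torch
import torch.nn.functional as F

def jacobian_augment(substitute, xs, ys, lam=0.1):
    """New candidate points x' = x + lam * sign(d f_hat_y(x) / dx)."""
    xs = xs.clone().requires_grad_(True)
    logits = substitute(xs)
    score = logits.gather(1, ys.unsqueeze(1)).sum()   # f_hat_y(x) for each sample
    grad, = torch.autograd.grad(score, xs)
    return (xs + lam * grad.sign()).detach().clamp(0, 1)

def substitute_round(substitute, opt, xs, ys, query_target_labels, epochs=10):
    # Fit the substitute on the current target-labeled set ...
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(substitute(xs), ys).backward()
        opt.step()
    # ... then grow the set along the substitute's Jacobian directions.
    new_xs = jacobian_augment(substitute, xs, ys)
    new_ys = query_target_labels(new_xs)              # the only black-box queries
    return torch.cat([xs, new_xs]), torch.cat([ys, new_ys])
```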

FGSM and Iterative Methods: Fast Gradient Sign Method (FGSM) and its iterative extensions (BIM/I-FGSM) generate perturbations by leveraging the gradient of the surrogate loss. Stronger attacks (more iterations, higher confidence) tend to have higher white-box success and, under well-calibrated parameterization, also transfer better at moderate perceptual distortion levels (Petrov et al., 2019).
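
A minimal I-FGSM/BIM sketch on a surrogate, with each step projected back into an $L_\infty$ ball of radius `eps`; the budget, step size, and iteration count are illustrative defaults rather than values prescribed by the cited work.

```python
import torch
import torch.nn.functional as F

def i_fgsm(surrogate, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iterative FGSM: repeated signed-gradient ascent on the surrogate loss."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(surrogate(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back to ||delta||_inf <= eps
        x_adv = x_adv.clamp(0, 1)                  # keep a valid image
    return x_adv.detach()
```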

Ensemble-Based Attacks: By crafting adversarial examples against an ensemble of surrogates, one can diversify decision boundary coverage, improving transferability especially when topology differs between surrogate(s) and target. This is particularly effective when ensemble members span multiple architectures (Petrov et al., 2019, Cox et al., 7 Nov 2025).
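
A sketch of the ensemble variant, assuming a list `surrogates` of heterogeneous white-box models: the perturbation follows the averaged gradient so that no single decision boundary dominates.

```python
import torch
import torch.nn.functional as F

def ensemble_fgsm(surrogates, x, y, eps=8 / 255):
    """Single-step attack against the mean loss of an ensemble of surrogates."""
    x = x.clone().requires_grad_(True)
    loss = sum(F.cross_entropy(m(x), y) for m in surrogates) / len(surrogates)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).detach().clamp(0, 1)
```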

3. Advanced Methodological Innovations

3.1. Latent-Space and Embedding-Based Approaches

Generator-Embedding Attacks: TREMBA (Huang et al., 2019) and similar work (Zhang et al., 2022) decouple adversarial search from raw input space by learning a low-dimensional embedding (via autoencoder or Glow-style flows) to encode perturbations' "semantic" patterns. The generator is pretrained against surrogates; at attack time, query-efficient optimization (e.g., NES in latent space) seeks a perturbation likely to transfer. These attacks yield state-of-the-art query efficiency and maintain high transferability under various defenses.
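
The sketch below illustrates the general idea rather than TREMBA's exact procedure: NES estimates the gradient of a black-box loss with respect to a low-dimensional latent code, and a pretrained `generator` (an assumption here, returning a perturbation of the input's shape) maps that code to an input-space perturbation scaled into an $L_\infty$ ball. `target_loss` is assumed to return a scalar tensor computed from the target's responses.

```python
import torch

def latent_nes_attack(generator, target_loss, x, z_dim=32,
                      eps=8 / 255, sigma=0.1, lr=0.5, steps=50, pop=20):
    """Query-efficient search in the generator's latent space via NES."""
    z = torch.zeros(z_dim)
    for _ in range(steps):
        u = torch.randn(pop, z_dim)
        u = torch.cat([u, -u])                     # antithetic samples
        losses = torch.stack([
            target_loss((x + eps * torch.tanh(generator(z + sigma * ui))).clamp(0, 1))
            for ui in u])                          # one black-box query per sample
        grad_z = (losses.unsqueeze(1) * u).mean(0) / sigma   # NES gradient estimate
        z = z + lr * grad_z                        # ascend the target's loss
    return (x + eps * torch.tanh(generator(z))).clamp(0, 1)
```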

Partially-Transferred Distributions: "Partial transfer" methods (Feng et al., 2020) transfer only the high-capacity "flow" parameters of a pretrained adversarial distribution (e.g., c-Glow) from surrogate to target, adapting mean and scale with black-box queries to reduce surrogate bias and retain flexibility, yielding superior query-attack performance even in open-set scenarios.

3.2. Diversity-Guided and Bayesian Techniques

Output Diversified Sampling (ODS): ODS (Tashiro et al., 2020) prioritizes perturbation directions that maximize output (logit) diversity on the surrogate, resulting in samples that span a broader region of the target's loss landscape. Empirical evidence shows that logit-space diversity "transfers," yielding 2–3× query savings versus random or uniformly sampled directions.
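
A minimal sketch of computing an ODS direction on the surrogate, assuming a 4-D image batch: a random weighting of the logits is backpropagated to the input and normalized.

```python
import torch

def ods_direction(surrogate, x):
    """Input-space direction that maximizes a randomly weighted logit combination."""
    x = x.clone().requires_grad_(True)
    logits = surrogate(x)
    w = torch.empty_like(logits).uniform_(-1, 1)        # random logit weighting
    grad, = torch.autograd.grad((w * logits).sum(), x)
    # Normalize per sample (assumes x has shape [N, C, H, W]).
    return grad / grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
```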

Bayesian MAP/Variational Approaches: BayAtk (Fan et al., 2022) reinterprets transferability as Bayesian MAP inference over a space of neighboring (semantically-invariant, loss-preserving) inputs. MaskBlock is a transferability-promoting prior which zeros out random image regions during gradient computation, simulating local manifold uncertainty and boosting success rates against both undefended and robust targets by 5–20% compared to classical enhancement techniques.
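
The following sketch approximates the MaskBlock idea as described above: gradients are averaged over copies of the input with one random block zeroed out; the block size and number of masked copies are illustrative assumptions (inputs are assumed at least `block` pixels wide and tall).

```python
import torch
import torch.nn.functional as F

def maskblock_grad(surrogate, x, y, n_masks=8, block=32):
    """Average surrogate gradients over randomly block-masked copies of x."""
    grads = torch.zeros_like(x)
    _, _, h, w = x.shape
    for _ in range(n_masks):
        mask = torch.ones_like(x)
        top = torch.randint(0, h - block + 1, (1,)).item()
        left = torch.randint(0, w - block + 1, (1,)).item()
        mask[:, :, top:top + block, left:left + block] = 0   # drop one random block
        xm = (x * mask).clone().requires_grad_(True)
        loss = F.cross_entropy(surrogate(xm), y)
        grads += torch.autograd.grad(loss, xm)[0]
    return grads / n_masks    # averaged gradient used for the FGSM/I-FGSM step
```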

Risk Quantification: The CKA-based framework (Cox et al., 7 Nov 2025) systematically quantifies adversarial transfer risk by measuring transfer rates using surrogate models selected to span both high and low alignment (via Centered Kernel Alignment) with the target. A regression-based estimator maps these transfer rates to a single risk score, providing a repeatable basis for risk assessment even in high-dimensional settings.
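
For reference, linear CKA between two representation matrices can be computed as below; `X` and `Y` are assumed to be (n_samples, d) activation matrices of the surrogate and target collected on a shared probe set.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices."""
    X = X - X.mean(dim=0, keepdim=True)   # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2          # ||X^T Y||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()
```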

4. Empirical Results and Target Domains

4.1. Vision and Multi-Modal Domains

Empirical transfer rates remain high in vision: transfer attacks achieve up to 70–85% success across ImageNet architectures when using calibrated iterative methods and ensemble surrogates at equalized SSIM (Petrov et al., 2019). Bayesian MaskBlock increases cross-architecture success by 5–15 percentage points (Fan et al., 2022). Advanced methods—generator-based, output-diversification, or hybrid variants—demonstrate order-of-magnitude query reductions without loss of efficacy (Huang et al., 2019, Yang et al., 2020, Tashiro et al., 2020).

Defenses such as adversarial training, input detection, NULL labeling, and Detect & Reject can dramatically reduce transferred attack efficacy, sometimes to near zero for moderate perturbations (Hosseini et al., 2017, Debicha et al., 2021). However, robust surrogate selection or query-efficient latent-space search can still recover partial transfer success even against these defenses (Djilani et al., 30 Dec 2024, Feng et al., 2020).

4.2. Speech, Language, and Security

ASR and Voice ID: Transferred black-box attacks maintain high transferability in audio pipelines because preprocessing (DFT, MFCC, log-Mel) is standardized. Transform-threshold and time-domain momentum-based attacks systematically degrade transcription (WER up to 49% on targets) while remaining virtually undetectable to humans (Abdullah et al., 2019, Gao et al., 14 Nov 2024). Cross-model transfer probabilities reach 40–100% when perturbations are crafted on state-of-the-art commercial APIs.

Network Intrusion Detection: Transferability arises from the reliance of ML-based NIDSs on non-robust, highly sensitive features shared across both differentiable and non-differentiable models (e.g., RF, XGBoost). The ETA framework leverages Shapley-value-driven ISFS to amplify transfer even in mixed-architecture ensembles, with cross-model transfer rates up to 100% on critical attack classes (Zhang et al., 19 Jan 2024, Debicha et al., 2021).

LLM Jailbreaks: Transferable black-box jailbreak attacks leverage prompt optimization on surrogate LLMs (e.g., GPT-4, Qwen-Max) with ensemble and semantic-coherence–disruption steps ("stealth insertion") to achieve >90% jailbreak rates on aligned target LLMs. Adaptive resource allocation by prompt difficulty and systematic ensemble selection are crucial for transferability in this domain (Yang et al., 31 Oct 2024).

5. Key Factors Driving and Limiting Transferability

| Factor | Influence on Transferability | Context/Notes |
| --- | --- | --- |
| Non-robust/sensitive features | Strongly increase | Shared across models, especially in NIDS/vision |
| Surrogate-target alignment | Strongly increase | CKA/robustness-matched surrogates best (Djilani et al., 30 Dec 2024) |
| Dataset/label prior overlap | Strongly increase (inflated result) | Overlap in $D_S, C_S$ and $D_T, C_T$ must be avoided (Jalwana et al., 11 Apr 2025) |
| Ensemble surrogates | Increase | Diversifies boundary coverage (Petrov et al., 2019, Cox et al., 7 Nov 2025) |
| Random/semantic augmentations | Increase | MaskBlock, PATCH, context-aware attacks |
| Stronger white-box attack | Non-monotonic, but top-k ASR generally increases | Overly confident attacks can reduce transfer under the top-1 metric (Petrov et al., 2019, Djilani et al., 30 Dec 2024) |
| Defensive training/NULL label | Usually reduce (often to near-zero rates) | Particularly effective against naive transfer attacks (Hosseini et al., 2017, Djilani et al., 30 Dec 2024) |

6. Generalization, Practical Pitfalls, and Best Practices

Avoiding Inflated Results: Literature overwhelmingly overestimates transferability when overlapping training corpora and label sets are used for surrogates and targets. Realistic evaluation requires prior-free settings: strictly no overlap between surrogate and target datasets or label sets, with no access to target classes (Jalwana et al., 11 Apr 2025).

Evaluating Transferability: When reporting black-box transferability, results should be stated at matched distortion budgets (e.g., equalized SSIM or a fixed $L_\infty$ bound), against surrogates spanning both high and low alignment with the target, and under prior-free splits with no overlap in training data or label sets between surrogate and target (Petrov et al., 2019, Cox et al., 7 Nov 2025, Jalwana et al., 11 Apr 2025).

Query-Efficient Extensions: Several modern methods (TES, TREMBA, LeBA) combine transfer-based priors with limited target queries, iteratively adapting to the target's response in low-dimensional latent or feature space. These hybrid approaches achieve high rates at far lower query budgets (<100–500) than direct black-box optimization (Huang et al., 2019, Yang et al., 2020, Zhang et al., 2022).

Defensive Implications: Defenders can exploit detection/rejection mechanisms, align robustness via adversarial training, or employ NULL labeling to severely restrict transferability without materially impacting clean accuracy (Hosseini et al., 2017, Debicha et al., 2021, Djilani et al., 30 Dec 2024). Explicitly regularizing on robust or orthogonal features, or employing certified defenses, further enhances resistance.
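
As one example of robustness alignment, a single adversarial-training step might look like the minimal sketch below (single-step crafting for brevity; this is an assumed illustration, not the exact recipe of any cited defense).

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, opt, x, y, eps=8 / 255):
    # Craft single-step adversarial examples against the current model ...
    x_req = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)
    x_adv = (x + eps * grad.sign()).clamp(0, 1)
    # ... and train on them, aligning the model's robustness with the threat.
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()
```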

7. Open Challenges and Future Directions

Challenges remain in universal black-box transfer (cross-modal, cross-domain, pure prior-free); ensuring transfer against highly robust or certified models; reducing surrogate bias without sacrificing attack generality; and formally understanding when and why diversity, latent-space modeling, or Bayesian perturbation sampling guarantees transfer. There is a pressing need for standardized protocols—such as the prior-free framework (Jalwana et al., 11 Apr 2025), CKA-guided risk evaluation (Cox et al., 7 Nov 2025), and robustness-alignment–aware benchmarking (Djilani et al., 30 Dec 2024)—to ensure genuine, reproducible progress in both attack and defense research for real-world ML systems.
