Black-Box Adversarial Attacks

Updated 25 November 2025
  • Black-box adversarial attacks are techniques that craft manipulated inputs using only model outputs, without accessing internal parameters.
  • They rely on transfer from surrogate models, score-based gradient estimation, and decision-based queries to optimize perturbations within L_p-norm constraints.
  • Recent advances focus on improving query efficiency, ensemble transferability, and countering robust defenses to enhance attack success.

Black-box adversarial attacks are strategies for generating adversarial examples against machine learning models when the attacker has no access to model internals (weights, gradients, or architecture), relying solely on model outputs such as class labels or confidence scores. In this setting, the adversary must solve a constrained optimization problem to find an input x' within a specified perturbation budget (typically an L_p-norm bound) that induces misclassification. These attacks are essential in realistic threat scenarios, such as attacking deployed machine-learning APIs or proprietary systems.

1. Problem Definition and Threat Models

The black-box threat model restricts the adversary to oracle access, with two prevalent regimes:

  • Score-based (soft-label): the attacker queries the model to receive class probabilities, logits, or loss values.
  • Decision-based (hard-label): only the predicted class label is observed per query (Bhambri et al., 2019).

In both regimes, the adversary seeks a perturbed sample

x' = x + \delta, \quad \|\delta\|_p \leq \epsilon,

such that f(x') \neq f(x) (untargeted) or f(x') = t (targeted), while minimizing the number of queries and constraining perceptual similarity.
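
All of these formulations share the same feasibility step: after each update, the perturbation is projected back onto the L_p budget. A minimal sketch, assuming NumPy arrays scaled to [0, 1] and a hypothetical helper name project_perturbation:

```python
import numpy as np

def project_perturbation(delta, eps, p="inf"):
    """Project a perturbation onto the L_p ball of radius eps (sketch: p in {'inf', '2'})."""
    if p == "inf":
        # Clip each coordinate independently to [-eps, eps]
        return np.clip(delta, -eps, eps)
    if p == "2":
        # Rescale onto the L2 ball only if the budget is exceeded
        norm = np.linalg.norm(delta)
        return delta if norm <= eps else delta * (eps / norm)
    raise ValueError("only p in {'inf', '2'} handled in this sketch")
```

A typical use is x_adv = np.clip(x + project_perturbation(x_adv - x, eps), 0.0, 1.0), which enforces both the perturbation budget and the valid input range.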

Key operational paradigms are:

  • Transfer-based attacks: optimize adversarial examples on surrogate models and transfer them to the black-box target.
  • Query-based attacks: use output queries to estimate gradients (score-based) or statistically probe the decision boundary (label-only).

Considerations include loss-oracle access, perturbation norms (L_\infty, L_2, L_0), and query budget. State-of-the-art black-box attacks must balance computational efficiency, query minimization, and success rate, especially given constraints imposed by API rate limits or defensive countermeasures (Djilani et al., 30 Dec 2024, Bhambri et al., 2019).

2. Core Algorithmic Taxonomy

Black-box attack algorithms fall into several categories, each exploiting a different facet of the information available from the target model (Bhambri et al., 2019, Wang, 2022):

1. Transfer-based Attacks

  • Utilize cross-model transferability: adversarial examples found for a surrogate are often effective on the target.
  • Typically leverage white-box optimization (e.g., PGD, FGSM, MI-FGSM) on the surrogate, sometimes integrating model ensembles to improve transfer success (Liu et al., 25 Nov 2024).
  • Core idea: \min_{x'} \mathbb{E}_{f \sim \mathcal{F}} [\mathcal{L}(f(x'), y)], estimated via an empirical average over the surrogates (see the sketch below).
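
As a concrete illustration of this objective, the following PyTorch sketch averages the cross-entropy loss over a surrogate ensemble and takes momentum sign steps in the style of MI-FGSM. The surrogates argument (a list of models returning logits), the budget, and the step sizes are illustrative assumptions; the loss is ascended here because the example is untargeted.

```python
import torch
import torch.nn.functional as F

def ensemble_mifgsm(x, y, surrogates, eps=8/255, alpha=2/255, steps=10, mu=1.0):
    """Untargeted L_inf attack crafted on a surrogate ensemble (MI-FGSM style sketch)."""
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)                                    # momentum accumulator
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Empirical average of the true-label loss over the surrogate ensemble
        loss = torch.stack([F.cross_entropy(f(x_adv), y) for f in surrogates]).mean()
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / (grad.abs().mean() + 1e-12)        # momentum on normalised gradient
        x_adv = x_adv.detach() + alpha * g.sign()              # ascend the averaged loss
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project onto the L_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)                          # keep a valid pixel range
    return x_adv.detach()
```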

2. Score-based Gradient Estimation

  • Estimate \nabla_x \mathcal{L}(f(x), y) from finite differences (e.g., ZOO) or from random directions as in NES or ES (Qiu et al., 2021, Shukla et al., 2019, Ilyas et al., 2018).
  • Natural Evolution Strategies (NES): \hat{\nabla} f(x) = \frac{1}{m\sigma} \sum_{i=1}^m f(x + \sigma \epsilon_i)\,\epsilon_i, with \epsilon_i \sim \mathcal{N}(0, I) (a minimal sketch follows this list).
  • Bandit optimization variants embed temporal and data priors for improved query efficiency (Ilyas et al., 2018, Wang, 2022).
  • Square Attack applies randomized local search over L_\infty-constrained square patches, yielding state-of-the-art query efficiency for untargeted attacks (Wang, 2022).
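
The NES estimator above maps directly to code. In the sketch below, loss_fn is a hypothetical score oracle that issues one query per call and returns a scalar loss for a single input; the sample count, smoothing scale, and step sizes are illustrative.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.01, m=50, rng=np.random.default_rng(0)):
    """NES estimate: (1 / (m * sigma)) * sum_i loss(x + sigma * eps_i) * eps_i."""
    grad = np.zeros_like(x)
    for _ in range(m):
        eps_i = rng.standard_normal(x.shape)
        grad += loss_fn(x + sigma * eps_i) * eps_i   # one oracle query per sample
    return grad / (m * sigma)

def nes_linf_attack(loss_fn, x, eps=0.05, lr=0.01, iters=100):
    """PGD-style ascent on the estimated gradient under an L_inf budget."""
    x_adv = x.copy()
    for _ in range(iters):
        g = nes_gradient(loss_fn, x_adv)
        x_adv = np.clip(x_adv + lr * np.sign(g), x - eps, x + eps)  # project to budget
        x_adv = np.clip(x_adv, 0.0, 1.0)                            # valid input range
    return x_adv
```

Each outer iteration spends m queries, so the total budget is iters × m; antithetic sampling (querying x ± σε_i in pairs) is a common variance-reduction refinement.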

3. Decision-based or Label-Only Attacks

  • Boundary Attack and its variants iteratively project adversarial examples towards the decision boundary using only class-label feedback (Bhambri et al., 2019); a simplified sketch follows this list.
  • Local random search and greedy coordinate-wise approaches progressively flip labels with minimal queries.
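
For the hard-label regime, a Boundary-Attack-style random walk needs only a predict oracle returning a class label. The sketch below is heavily simplified (fixed step sizes, no adaptive scheduling) and is intended only to convey the propose/accept structure; all parameter values are assumptions.

```python
import numpy as np

def boundary_walk(predict, x, y_true, x_start, steps=1000, delta=0.1, gamma=0.05,
                  rng=np.random.default_rng(0)):
    """Shrink the distance to x while keeping the candidate misclassified (label-only)."""
    x_adv = x_start.copy()                                  # starting point, already misclassified
    for _ in range(steps):
        # Random direction scaled by the current distance to the original input
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        candidate = x_adv + delta * np.linalg.norm(x - x_adv) * d
        candidate = candidate + gamma * (x - candidate)     # small contraction toward x
        candidate = np.clip(candidate, 0.0, 1.0)
        if predict(candidate) != y_true:                    # accept only if still adversarial
            x_adv = candidate
    return x_adv
```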

4. Combinatorial and Evolutionary Optimization

  • Gradient-free optimizers (genetic algorithms and evolution strategies such as CMA-ES) search directly over perturbations or structured parameterizations; examples include the Art-Attack and Pixle methods discussed in Section 4 (Williams et al., 2022, Pomponi et al., 2022).

5. Emerging Paradigms

  • Certifiable attacks: adversarial examples constructed with provable lower bounds on success probability in the presence of randomness (Hong et al., 2023).
  • Zero-query attacks: transfer-based, requiring no interaction with the black-box at attack time by leveraging surrogate representations (Costa et al., 1 Oct 2025).

3. Major Advances, Scaling Laws, and Empirical Benchmarks

Ensemble Scaling and Transferability

A quantitative law governs the transfer success of ensemble-based black-box attacks (Liu et al., 25 Nov 2024):

\mathrm{ASR}(T) \approx \alpha \log T + C,

where ASR is the attack success rate on held-out models, T is the surrogate ensemble size, \alpha reflects alignment with the target model, and C is the base transferability. Empirical evidence shows that ASR increases logarithmically with ensemble cardinality up to saturation, across both image classifiers and large multimodal LLMs (e.g., GPT-4o), provided surrogate diversity is maintained.
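
As a worked illustration, the coefficients can be fit by least squares in log T. The ASR values below are made-up placeholders, not measurements from the cited paper.

```python
import numpy as np

T = np.array([1, 2, 4, 8, 16, 32])                        # surrogate ensemble sizes
asr = np.array([0.31, 0.40, 0.47, 0.55, 0.61, 0.68])       # hypothetical measured ASRs

alpha, C = np.polyfit(np.log(T), asr, deg=1)               # ASR(T) ~ alpha * log T + C
print(f"alpha={alpha:.3f}, C={C:.3f}, extrapolated ASR(64)={alpha*np.log(64)+C:.3f}")
```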

Transfer-based attacks benefit from model and data diversity within the ensemble, but the scaling law breaks down if surrogates are out-of-distribution or if the target employs strong adversarial training. Advanced ensemble optimizers such as the Common Weakness Attack (CWA) are needed for the scaling benefit to persist at large ensemble sizes; naive gradient averaging stagnates (Liu et al., 25 Nov 2024).

Query-Efficient Techniques and Universality

Score-based optimization with rich priors (temporal, spatial, or surrogate-driven) yields substantial query-efficiency improvements (Ilyas et al., 2018, Wang, 2022). Bandits-TD, which leverages both time and spatial priors, requires 2–5× fewer queries and is less failure-prone than vanilla NES or ZO-signSGD.

Universal (image-agnostic) meta-adversarial perturbations, trained by meta-learning over multiple surrogates, can initialize subsequent gradient-free attacks to substantially improve both success rate and query economy. These meta-perturbations transfer across architectures and even semantically distinct classes, demonstrating universality (Fu et al., 2022).

4. Physical, Structured, and Zero-Query Black-Box Attacks

Beyond the digital domain, black-box attacks extend to physical-world adversarial examples via structured manifolds and optimization in latent spaces:

  • Physical patch attacks: Leveraging GANs to constrain the search to printable, naturalistic patches achieves over 90% concealment on YOLO detectors, outperforming pixel-space or square baselines in both digital and real-world scenes (Lapid et al., 2023).
  • Structured and local attacks: Evolutionary methods search over interpretable parameterizations (transparent shapes, local pixel swaps), enabling query-efficient, imperceptible attacks (Art-Attack, Pixle) (Williams et al., 2022, Pomponi et al., 2022); a simplified sketch follows this list.
  • Zero-query transfer: Injecting feature maps extracted from surrogates into test inputs (ZQBA) can degrade target accuracy even with no test-time queries, transferring well across architectures and datasets (Costa et al., 1 Oct 2025).
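
As one illustration of structured local search, in the spirit of pixel-rearrangement attacks such as Pixle but heavily simplified and with assumed parameters, the sketch below proposes copying a small patch of the image onto another location and keeps the change only when the target's loss increases:

```python
import numpy as np

def patch_swap_attack(loss_fn, x, iters=500, patch=2, rng=np.random.default_rng(0)):
    """Greedy local search: move small pixel patches, keep changes that raise the loss."""
    x_adv = x.copy()                                  # x assumed shape (H, W, C), values in [0, 1]
    best = loss_fn(x_adv)
    h, w = x.shape[:2]                                # assumes h > patch and w > patch
    for _ in range(iters):
        r1, c1 = rng.integers(0, h - patch), rng.integers(0, w - patch)   # source patch
        r2, c2 = rng.integers(0, h - patch), rng.integers(0, w - patch)   # destination patch
        cand = x_adv.copy()
        cand[r2:r2+patch, c2:c2+patch] = x_adv[r1:r1+patch, c1:c1+patch]  # rearrange pixels
        score = loss_fn(cand)                         # one query per proposal
        if score > best:                              # greedy accept
            x_adv, best = cand, score
    return x_adv
```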

These techniques exploit the high cross-model correlation of features, semantics, and vulnerabilities identified by deep representational layers.

5. Defenses, Limitations, and Challenges

Contemporary black-box attacks face limitations against robustly trained and randomized models:

  • Impact of robust training: Defenses tuned for strong white-box attacks (e.g., AutoAttack adversarial training) provide order-of-magnitude higher resistance to both transfer and query-based black-box attacks (Djilani et al., 30 Dec 2024).
  • Boundary Defense (BD): stochastic perturbation of model outputs on low-confidence (boundary) queries drastically reduces the attack success rate to near zero with minimal accuracy loss (~1%) (Aithal et al., 2022); a minimal sketch follows this list.
  • Robustness alignment: Surrogate-target robustness alignment is crucial—robust surrogates outperform vanilla surrogates when attacking robust targets, especially for transfer-based methods (Djilani et al., 30 Dec 2024).
  • Limits: Certifiable black-box attacks can maintain roughly 90% or higher certified attack success against defenders employing standard adversarial training or randomized smoothing, suggesting current defenses cannot eliminate the threat without unacceptable accuracy losses (Hong et al., 2023).
  • Saturation and breakdown: Ensemble scaling fails for out-of-distribution or strong robust models. In decision-only black-box regimes, query complexity and required perturbation magnitude both increase sharply (Liu et al., 25 Nov 2024, Djilani et al., 30 Dec 2024).
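
To make the Boundary Defense idea concrete, the following sketch perturbs the returned probability vector only when the top-1 confidence falls below a threshold; the threshold and noise scale are illustrative assumptions rather than the values from the cited work.

```python
import numpy as np

def boundary_defended_output(probs, threshold=0.7, noise_std=0.1,
                             rng=np.random.default_rng(0)):
    """Return the (possibly noised) probability vector for one query."""
    if probs.max() < threshold:                        # low-confidence "boundary" query
        noised = probs + rng.normal(0.0, noise_std, probs.shape)
        noised = np.clip(noised, 1e-6, None)
        return noised / noised.sum()                   # renormalise to a valid distribution
    return probs                                       # confident queries pass through unchanged
```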

6. Practical Methodologies and Recommendations

For practical black-box security evaluation, the following recommendations emerge (Wang, 2022, Bhambri et al., 2019, Liu et al., 25 Nov 2024):

| Scenario | Method / Protocol | Typical Query or Transfer Efficiency |
| --- | --- | --- |
| Untargeted L_\infty, score access | Square Attack / Bandits-TD | 32–200 queries, >99% ASR |
| Transfer to similar (non-robust) models | Ensemble transfer (LGV, CWA, SSA) | >90% ASR at T ≥ 16 surrogates |
| Robust or SOTA defenses | Robust surrogate-based transfer + query-based fallback | <5% ASR typical; critical |
| Resource-constrained, no test queries | ZQBA (zero-query attack) | 20–40% accuracy drop, no queries |
| Targeted attacks | CMA-ES (ES), TREMBA meta-embedding | 500–5,000 queries, 80–98% ASR |

  • Tune ensemble size up to computational limits; increase surrogate-model robustness and distributional similarity for robust targets (Djilani et al., 30 Dec 2024).
  • For defended or decision-only models, stochastic defenses, boundary noise, and certified defenses should be incorporated into evaluation pipelines to ensure adversarial robustness holds under adaptive black-box queries (Aithal et al., 2022, Hong et al., 2023).
  • Structured and semantic attacks (GAN manifolds, evolutionary art) are essential for evaluating model resilience to imperceptible and physically realizable perturbations (Lapid et al., 2023, Williams et al., 2022).

7. Open Directions and Theoretical Implications

Key unresolved challenges include:

  • Formal guarantees: Query lower bounds and information-theoretic optimality results (e.g., "NES ≈ least squares") establish that, absent strong priors, gradient estimation cannot, in expectation, be improved for a given number of queries (Ilyas et al., 2018).
  • Certified attack and defense: Certifiable black-box attacks invert the paradigm of randomized smoothing by offering theoretical lower bounds on attack success, even under randomization-based defenses (Hong et al., 2023).
  • Robust transfer: Developing transfer-based attacks that optimize explicitly on robust model families, possibly within a certified robustness framework or adversarial manifold, remains an open problem (Djilani et al., 30 Dec 2024).
  • Modality expansion: Extending scaling laws and query-efficient frameworks to video (V-BAD), audio, multimodal, and foundation models is a rapidly advancing area (Jiang et al., 2019, Liu et al., 25 Nov 2024).
  • Adaptive defenses: Model-released robust surrogates may themselves serve as attack vectors; opponent-adaptive surrogate selection becomes critical in security-sensitive deployments (Djilani et al., 30 Dec 2024).

Black-box adversarial attack research has matured into a spectrum of algorithmic, statistical, and transfer-based methodologies, with theoretical and practical implications for both attackers and defenders in deployed AI systems. Comprehensive defense requires not only robust training and input randomization but also adaptive evaluation against surrogates mirroring the real-world deployment scenario.
