Decision-Based Black-Box Adversarial Attacks
- Decision-based black-box adversarial attacks use only a model's discrete output labels, with no scores or gradients, to find minimal perturbations that cause misclassification.
- Techniques such as boundary, evolutionary, and RL-based methods enable efficient exploration of adversarial directions under strict query constraints.
- Empirical results demonstrate high success rates and sparsity improvements, questioning the presumed robustness of output-restricted models.
Decision-based black-box adversarial attacks are a class of adversarial machine learning attacks in which the adversary is limited to querying a model and observing only the model’s final output (typically, the top-1 predicted label), without access to logits, confidence scores, or model internals. These attacks pose significant threats to deployed systems such as ML-as-a-service APIs and vision, trajectory-prediction, and text models, where exposing only a discrete decision is often assumed to improve robustness. Contrary to this assumption, decision-based attacks have demonstrated the ability to efficiently craft imperceptible or highly sparse adversarial perturbations under strict information constraints, raising serious concerns for real-world AI security.
1. Formal Problem Definition and Threat Model
A decision-based black-box adversarial attack operates under a strictly limited information regime: the adversary can query an input and observe only the top-1 label of a classifier $f$, with no access to scores, probabilities, or gradients. The canonical task is, given a clean input $x$ with true label $y$, to find a minimally perturbed $x' = x + \delta$ such that $f(x') \neq y$ (untargeted) or $f(x') = y_t$ (targeted), typically under a norm constraint (e.g., $\ell_0$, $\ell_2$, or $\ell_\infty$):

$$\min_{\delta} \|\delta\| \quad \text{subject to} \quad f(x + \delta) \neq y,$$

or in the targeted case $f(x + \delta) = y_t$, where $\|\cdot\|$ is the norm of interest.
This setting is strictly harder than white-box or score-based black-box attacks, as the adversary cannot exploit gradient information or continuous outputs to efficiently find adversarial directions. Instead, the search is combinatorial and often NP-hard (especially for $\ell_0$ sparse attacks), requiring black-box optimization or heuristic search strategies (Vo et al., 2022).
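To make the threat model concrete, the following minimal Python sketch frames the attacker's only interface: a hard-label oracle and a per-query success predicate. The `model.predict_label` call is a hypothetical API standing in for whatever deployed endpoint returns the top-1 class.

```python
def hard_label_oracle(model, x):
    """The only interface available to a decision-based attacker:
    one query in, one discrete top-1 label out."""
    return model.predict_label(x)  # hypothetical endpoint returning a class id

def is_adversarial(model, x_adv, y_true, y_target=None):
    """Untargeted: success iff the predicted label differs from y_true.
    Targeted: success iff the predicted label equals y_target."""
    label = hard_label_oracle(model, x_adv)
    return (label == y_target) if y_target is not None else (label != y_true)
```

Every attack below reduces to sequencing such boolean queries as economically as possible.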
2. Core Methodologies and Algorithmic Strategies
2.1 Boundary Attack and Sampling-based Methods
The prototypical decision-based attack is the Boundary Attack (Brendel et al., 2017): it starts from a large adversarial perturbation, then performs a random walk along the decision boundary, iteratively reducing the perturbation while keeping the input misclassified. At each step:
- An orthogonal step explores new directions along the boundary.
- A small step is taken toward the clean input to minimize distance.
- Proposals that cross back to the original class are rejected, keeping the trajectory on the adversarial side.
Algorithmic proposals are drawn from i.i.d. normal or domain-informed distributions (see biased sampling below). The attack dynamically adapts step sizes to maintain efficiency near the boundary.
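The sketch below illustrates one such iteration, assuming a fixed orthogonal step size `delta` and contraction step `eps` (both adapted dynamically in the actual attack) and an `is_adv` predicate like the one in Section 1; it is a simplified rendering of the Brendel et al. (2017) update, not a faithful reimplementation.

```python
import numpy as np

def boundary_attack_step(x_clean, x_adv, is_adv, delta=0.1, eps=0.01):
    """One random-walk step along the decision boundary.
    x_adv is assumed adversarial on entry."""
    direction = x_clean - x_adv
    dist = np.linalg.norm(direction)

    # 1) Orthogonal step: Gaussian proposal projected onto the hyperplane
    #    orthogonal to the clean-to-adversarial direction, then rescaled.
    eta = np.random.randn(*x_adv.shape)
    eta -= (eta.ravel() @ direction.ravel()) / dist**2 * direction
    eta *= delta * dist / np.linalg.norm(eta)
    candidate = x_adv + eta
    # Project back onto the sphere of radius `dist` around the clean input.
    candidate = x_clean + (candidate - x_clean) * dist / np.linalg.norm(candidate - x_clean)

    # 2) Small contraction toward the clean input to shrink the distance.
    candidate = candidate + eps * (x_clean - candidate)

    # Reject proposals that cross back to the original class.
    return candidate if is_adv(candidate) else x_adv
```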
2.2 Evolutionary and Heuristic Search
For discrete or combinatorial constraints (notably sparse $\ell_0$ attacks), evolutionary algorithms have been applied. For example, SpaEvoAtt (Vo et al., 2022) models a candidate perturbation as a binary mask over pixel locations and uses binary differential recombination, targeted mutation, and a fitness function that incorporates attack success and perturbation size. This approach transforms the NP-hard $\ell_0$ search into a tractable binary vector optimization, yielding state-of-the-art sparsity and query efficiency on both CNNs and Vision Transformers.
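A toy generation of such a search might look as follows; the recombination and mutation operators are simplified stand-ins for SpaEvoAtt's binary differential recombination and targeted mutation, and the `fitness` callable (combining attack success and sparsity) is left abstract.

```python
import numpy as np

def evolve_masks(population, fitness, mutation_rate=0.01, rng=None):
    """One generation of a toy binary-mask evolutionary search.
    population: (P, D) boolean array; each row marks which pixels are perturbed.
    fitness:    callable scoring a mask (higher is better)."""
    rng = rng or np.random.default_rng()
    scores = np.array([fitness(m) for m in population])
    # Keep the better half as parents.
    parents = population[np.argsort(scores)[-len(population) // 2:]]
    children = []
    while len(children) < len(population):
        a, b = parents[rng.integers(len(parents), size=2)]
        # Binary recombination: inherit each bit from a random parent.
        child = np.where(rng.random(a.shape) < 0.5, a, b)
        # Mutation: flip a small fraction of bits.
        child ^= rng.random(child.shape) < mutation_rate
        children.append(child)
    return np.stack(children)
```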
Similar population-based heuristics have been adopted in other domains, including face recognition (Dong et al., 2019) and NLP, where attacks proceed via synonym substitution guided by semantic similarity and genetic optimization (Maheshwary et al., 2020).
2.3 Biased Sampling and Priors
Query efficiency can be significantly increased by injecting structured priors into the proposal distribution (Brunner et al., 2018):
- Image-frequency prior (e.g., Perlin noise): Focuses proposals on low-frequency, “image-like” directions, improving naturalness and transferability.
- Regional masking: Concentrates perturbations in spatial regions where source and adversarial images differ most.
- Surrogate-gradient prior: Leverages gradients from a surrogate model to guide sampling, even in the label-only regime.
These biased sampling techniques can reduce query complexity substantially and, when combined, outperform vanilla random-walk methods by a wide margin (Brunner et al., 2018).
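True Perlin noise takes more code, so the sketch below approximates the low-frequency prior by upsampling coarse Gaussian noise, and combines it with a regional mask; it illustrates the biasing idea rather than the exact proposal distribution of Brunner et al. (2018).

```python
import numpy as np

def low_frequency_proposal(shape, coarse=8, rng=None):
    """Low-frequency, 'image-like' proposal direction (a simple stand-in
    for a Perlin-noise prior). Assumes shape dims divisible by `coarse`."""
    rng = rng or np.random.default_rng()
    h, w = shape
    coarse_noise = rng.standard_normal((coarse, coarse))
    # Nearest-neighbour upsampling turns coarse noise into a smooth pattern.
    noise = np.kron(coarse_noise, np.ones((h // coarse, w // coarse)))
    return noise / np.linalg.norm(noise)

def masked_proposal(x_clean, x_adv, rng=None):
    """Regional masking: concentrate the proposal where the source and
    adversarial images currently differ most."""
    noise = low_frequency_proposal(x_clean.shape, rng=rng)
    weight = np.abs(x_clean - x_adv)
    return noise * (weight / (weight.max() + 1e-12))
```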
2.4 Gradient Estimation and Distributional Approaches
Recent advances employ zeroth-order or distributional optimization, often reframing attack generation as policy learning:
- RL-based methods such as DBAR (Huang et al., 2022) treat the attack as learning a distribution over perturbations, maximizing the joint objective of attack success minus perturbation norm, via policy gradient in a PPO framework. This yields high attack success and improved transferability.
- The use of batch-level loss (e.g., batch accuracy loss in Decision-BADGE (Yu et al., 2023)) and simultaneous perturbation stochastic approximation (SPSA) enables efficient universal (image-agnostic) perturbation generation in the hard-label setting (Hogan et al., 2018).
DBA-GP (Liu et al., 2023) demonstrates that leveraging data-dependent priors (bilateral smoothing for edge preservation) and time-dependent priors (re-using correlated historical gradient estimates) can further accelerate convergence, particularly near the decision boundary.
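As a concrete illustration of the SPSA idea in the hard-label setting, the sketch below updates a universal perturbation against a batch-accuracy loss using two oracle passes per step; the parameter names and the `oracle` callable are illustrative rather than taken from any of the cited implementations.

```python
import numpy as np

def spsa_step(delta, batch_x, batch_y, oracle, lr=0.01, c=0.05, rng=None):
    """One SPSA update of a universal perturbation `delta`.
    The loss is the (non-differentiable) batch accuracy under the
    hard-label oracle; SPSA needs only two loss evaluations per step."""
    rng = rng or np.random.default_rng()

    def batch_accuracy(d):
        preds = np.array([oracle(x + d) for x in batch_x])
        return np.mean(preds == batch_y)  # the attacker wants this low

    # Rademacher (+/-1) perturbation direction.
    v = rng.choice([-1.0, 1.0], size=delta.shape)
    g_hat = (batch_accuracy(delta + c * v) - batch_accuracy(delta - c * v)) / (2 * c) * v
    return delta - lr * g_hat  # descend the estimated accuracy gradient
```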
2.5 Specialized and Automated Strategies
Automated program synthesis approaches (AutoDA (Fu et al., 2021)) search a space of low-level geometric vector operations to discover update rules, showing that even simple linear combinations of projected Gaussian noise and boundary-normal directions can recover or outperform expert-crafted strategies. Other domain extensions include attacks on trajectory prediction (Li et al., 2026) and semantic segmentation using proxy-guided, discrete structured perturbation search (Chen et al., 2024).
3. Empirical Evaluation and Query Efficiency
Decision-based attacks are empirically evaluated on benchmarks spanning standard vision (CIFAR-10, ImageNet), face recognition (LFW, MegaFace), trajectory prediction, NLP, and segmentation targets. The primary metrics are:
- Attack Success Rate (ASR): Fraction of adversarial examples that cause misclassification.
- Query Budget: Number of queries needed to reach a target distortion or ASR.
- Perturbation Magnitude: Measured by the $\ell_0$, $\ell_2$, or $\ell_\infty$ norm; sparsity is especially notable for SpaEvoAtt, which achieves a median $\ell_0$ of roughly 0.0008 (fraction of pixels perturbed) on untargeted ImageNet within 5,000 queries (Vo et al., 2022).
- Runtime: Computation per query is typically negligible relative to model evaluation.
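For concreteness, a minimal evaluation harness matching these metrics might look as follows (the "@ threshold" notation in the table below counts a success only if it also satisfies the sparsity cap); the result-dictionary keys are illustrative.

```python
import numpy as np

def evaluate_attack(results, sparsity_threshold=0.001):
    """results: one dict per attacked image with keys 'success' (bool),
    'queries' (int), and 'l0_fraction' (perturbed-pixel fraction)."""
    success = np.array([r["success"] for r in results])
    l0 = np.array([r["l0_fraction"] for r in results])
    queries = np.array([r["queries"] for r in results])
    return {
        "ASR": success.mean(),
        "ASR@threshold": np.mean(success & (l0 <= sparsity_threshold)),
        "median_sparsity": np.median(l0[success]) if success.any() else 1.0,
        "median_queries": np.median(queries[success]) if success.any() else float("inf"),
    }
```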
Table: Empirical comparison for SpaEvoAtt vs. Pointwise (Vo et al., 2022):
| Setting | Attack | Median sparsity (pixel fraction) | ASR @ sparsity threshold | Query budget |
|---|---|---|---|---|
| Untargeted/ImageNet | Pointwise | 0.0012 | 77% @ 0.001 | 5,000 |
| Untargeted/ImageNet | SpaEvoAtt | 0.0008 | 99% @ 0.001 | 5,000 |
| Targeted/ImageNet | Pointwise | 1.0 (fail) | -- | 20,000 |
| Targeted/ImageNet | SpaEvoAtt | 0.0076 | 99% @ 0.01 | 20,000 |
Population-size and mutation-rate ablations show that SpaEvoAtt is robust to parameter choices, converging well across the settings tested on ImageNet (Vo et al., 2022).
Other studies compare against white-box baselines (e.g., PGD, C&W) and hard-label optimization methods (Brunner et al., 2018, Hogan et al., 2018), finding that decision-based attacks can approach or sometimes surpass white-box performance under similar query budgets.
Query efficiency is highly sensitive to the attack algorithm, the use of domain priors, and the presence of system-level invariances (e.g., input preprocessing), with preprocessor-unaware attacks suffering severalfold degraded efficiency (Sitawarin et al., 2022).
4. Domain Extensions and Vulnerabilities
4.1 Structured Output and Specialized Domains
Decision-based attacks have been adapted to domains beyond standard image classification:
- Vision Transformers: Patch-wise Adversarial Removal (PAR (Shi et al., 2021)) exploits the non-overlapping patch structure of ViTs to compress adversarial noise more efficiently, achieving a lower median perturbation magnitude at a fixed query budget, especially when used as initialization for other attacks (see the sketch after this list).
- Trajectory Prediction: DTP-Attack (Li et al., 2026) applies a boundary-walking algorithm in trajectory space, perturbing historical agent positions to cause intention misclassification or trajectory deviation, achieving high ASR with sub-meter perturbations.
- Semantic Segmentation: Discrete Linear Attack (DLA (Chen et al., 2024)) attacks per-pixel decision maps, using proxy-guided search over discrete, structured noise patterns to dramatically reduce mIoU (e.g., on PSPNet/Cityscapes within a small query budget).
- Natural Language Processing: Hard-label black-box attacks combine synonym substitution, search space reduction, and genetic optimization to produce adversarial texts with minimal semantic drift and high success rates (>90%) (Maheshwary et al., 2020).
Universal (image-agnostic) perturbations have also been demonstrated in this regime (Hogan et al., 2018, Yu et al., 2023), raising the risk of attacks that generalize across samples and systems.
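The patch-wise removal idea referenced above can be reduced to a greedy loop: starting from an adversarial example, restore the clean content one ViT-sized patch at a time and keep each restoration only if misclassification survives. This is a simplified rendering of PAR (which additionally uses noise-sensitivity estimates to order patches), not the published algorithm.

```python
import numpy as np

def patchwise_removal(x_clean, x_adv, is_adv, patch=16):
    """Greedy patch-wise noise removal in the spirit of PAR (Shi et al., 2021).
    One hard-label query per patch; x_adv is assumed adversarial on entry."""
    x = x_adv.copy()
    h, w = x.shape[:2]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            candidate = x.copy()
            # Restore this patch to the clean image (remove its noise).
            candidate[i:i + patch, j:j + patch] = x_clean[i:i + patch, j:j + patch]
            if is_adv(candidate):
                x = candidate  # keep the sparser adversarial example
    return x
```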
4.2 System-level Obstacles
Preprocessing modules preceding the classifier (e.g., cropping, resizing, quantizing) introduce invariances that can degrade the effectiveness of decision-based attacks by several fold if unaccounted for (Sitawarin et al., 2022). Preprocessor-aware strategies can recover full efficacy, emphasizing the need for adversaries to model the entire computation pipeline.
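A minimal sketch of the preprocessor-aware idea, assuming the attacker has identified (here, hard-coded) an 8-bit quantization stage: snapping candidates onto the preprocessor's invariance class before querying ensures no query is spent on inputs the pipeline cannot distinguish.

```python
import numpy as np

def quantize(x, levels=256):
    """Example preprocessor: 8-bit quantization of inputs in [0, 1]."""
    return np.round(x * (levels - 1)) / (levels - 1)

def aware_query(oracle, x_candidate, preprocess=quantize):
    """Preprocessor-aware querying (cf. Sitawarin et al., 2022): evaluate
    the candidate exactly as the model will see it, so proposals that are
    invariant under preprocessing are never wasted."""
    z = preprocess(x_candidate)  # what the classifier actually receives
    return z, oracle(z)
```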
Cost-sensitive threat models, such as in content moderation or malware detection, penalize “flagged” queries more heavily. Stealthy attack variants minimize such flagged queries by leveraging egg-dropping line search and early stopping, trading more non-flagged queries for reduced detection risk (Debenedetti et al., 2023).
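The trade-off can be seen in a one-dimensional boundary search: plain binary search spends roughly half its queries on the flagged side, whereas scanning upward from the safe side costs more queries overall but incurs exactly one flagged query. The sketch below shows this one-flag extreme; egg-dropping search generalizes it to a budget of k allowed flags (e.g., probing at √N intervals when two flags are allowed). The `is_flagged` callable, which queries the system at interpolation parameter t, is an assumed interface.

```python
def stealthy_boundary_search(is_flagged, lo=0.0, hi=1.0, step=0.01):
    """Find the flagged/non-flagged boundary along a 1-D line while
    touching the flagged side as rarely as possible
    (cf. Debenedetti et al., 2023)."""
    t = lo
    while t < hi:
        if is_flagged(t):   # first (and only) flagged query
            return t
        t += step           # every other query stays on the safe side
    return hi
```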
5. Limitations, Defensive Implications, and Open Questions
Despite significant progress, decision-based black-box attacks remain constrained by:
- Query Budget: Crafting imperceptible or highly sparse adversarial examples can require tens of thousands to millions of queries for difficult targets or robust models.
- Optimization Hardness: Many objectives (e.g., $\ell_0$ minimization under hard-label constraints) are NP-hard even in the white-box setting (Vo et al., 2022).
- System Invariances: Unmodeled preprocessing can waste queries and obscure gradients, substantially reducing attack efficiency unless reversed (Sitawarin et al., 2022).
- Robustness to Defenses: Defenses relying solely on output discretization are inadequate; adaptive, stateful, or randomized strategies may improve resilience but ultimately can be circumvented by adaptive adversarial policies (Tsingenopoulos et al., 2023).
Viable defense approaches now include:
- Certified $\ell_p$-robust architectures, input filtering, and query-rate limiting, each with their own cost and coverage tradeoffs.
- Adversarial training augmented with decision-based or universal perturbations, which can partially restore accuracy (Maheshwary et al., 2020).
Emerging research areas concern:
- Automated algorithm discovery (e.g., program synthesis of optimal update rules (Fu et al., 2021)).
- Better querying strategies for cost-sensitive or stealth-critical systems (Debenedetti et al., 2023).
- Extending structured attacks to new modalities (e.g., video, multi-agent, instance segmentation).
- Mechanisms for model-user co-adaptation in adversarial arms races (Tsingenopoulos et al., 2023).
6. Summary Table: Core Advances and Empirical Benchmarks
| Class of Attack | Key Reference | Principal Mechanism | Empirical Gains |
|---|---|---|---|
| Boundary Attack | (Brendel et al., 2017) | Random-walk, orthogonal steps | Scalable to ImageNet |
| Biased Sampling | (Brunner et al., 2018) | Perlin, mask, surrogate priors | Large query speedup |
| Evolutionary (Sparse, Faces, NLP) | (Vo et al., 2022, Dong et al., 2019, Maheshwary et al., 2020) | Binary vector, CMA-ES, GA | Far fewer queries |
| Automated Search (AutoDA) | (Fu et al., 2021) | Program synthesis | Best-in-class under budget |
| RL-based Distributional (DBAR) | (Huang et al., 2022) | Policy gradient PPO | High ASR, transferability |
| Universal Perturbations | (Hogan et al., 2018, Yu et al., 2023) | Zeroth-order (RGF, SPSA) | 90%+ fooling rates; 10⁶ queries |
| Patch-wise/Proxy-guided (ViT, Seg) | (Shi et al., 2021, Chen et al., 2024) | Patch removal, mIoU proxy | Order-of-mag. noise reduction |
| Preprocessor-aware | (Sitawarin et al., 2022) | Reverse-engineer preprocessing | Severalfold speedup |
| Stealthy/Cost-sensitive | (Debenedetti et al., 2023) | Minimize flagged queries | Far fewer flagged queries |
Each of these works is closely tailored to the technical aspects of decision-based black-box optimization and demonstrates that, even in the strictest information setting, deep models can be exploited with practical query budgets.
7. Broader Security and Research Implications
Decision-based black-box adversarial attacks fundamentally challenge the assumption that restricting model outputs to discrete decisions is a sufficient defense. Even when only label information is exposed, adversaries can efficiently craft imperceptible or highly sparse adversarial inputs, and these attacks extend to complex, structured-output domains and sequential models.
This realization exposes vulnerabilities in a wide range of deployed machine learning systems and shifts the security paradigm toward:
- Comprehensive modeling of system-level invariances and preprocessing.
- Defensive strategies that incorporate adaptive (possibly RL-based) responses.
- Theoretical characterization of query complexity and worst-case evaluation under adaptive threat models.
- Practical evaluation frameworks that reflect realistic cost models (including flagged queries and system responses).
Open questions remain regarding optimally query-efficient heuristics in the decision-only regime, extension to further domains, formal transferability characterizations, robust and certified defense mechanisms, and the co-evolution (“arms-race”) of attack and defense policies in adversarial settings (Tsingenopoulos et al., 2023).