Closed-Box Adversarial Attacks

Updated 2 July 2025
  • Closed-box adversarial attacks craft adversarial examples using only limited feedback from the target model, such as class labels.
  • They employ gradient-free optimization methods—such as NES, SignHunter, and CMA-ES—to efficiently search for perturbations without internal model details.
  • These methods expose critical weaknesses in ML systems, driving the development of robust models and advanced defense strategies.

Closed-box adversarial attacks are a class of methods for crafting adversarial examples specifically designed to circumvent machine learning systems when the attacker has no access to the model’s internals—parameters, gradients, or sometimes even soft outputs. In these scenarios, only the most restricted forms of feedback (such as class labels, top-k predictions, or, in the most extreme cases, no queries at all) are available. Closed-box attacks form a critical subset of black-box adversarial attacks and have been pivotal in exposing the resilience (or lack thereof) of deployed learning systems under the most plausible security assumptions.

1. Definitions and Taxonomy

Closed-box adversarial attacks are defined by extremely limited access to the target model:

  • No model internals available: The model’s architecture and parameters are not accessible.
  • No (or minimal) queries: Only very restricted feedback is obtainable; this might be just a label or, in the “no-box” variant, not even that.
  • No surrogate model with full knowledge: The attacker cannot guarantee that a high-fidelity substitute with similar boundaries can be constructed from data.

A hierarchy emerges within non-white-box attacks:

  • Score-based black-box attacks: Access to confidence scores, logits, or probability vectors.
  • Hard-label black-box (“decision-based” or “closed-box”) attacks: Access to only class labels or discrete feedback.
  • No-box attacks: No access to the model or training data, only a handful of domain-relevant examples.

Within this taxonomy, closed-box attacks deal with practical adversarial scenarios in both digital and physical settings, broadening the threat landscape previously dominated by white-box theory.

2. Methodologies and Core Techniques

Closed-box attacks have necessitated new techniques to overcome the lack of gradient or rich feedback. Several paradigms have emerged:

Stochastic Gradient-Free Optimization

Attacks often leverage population-based, gradient-free optimization strategies tailored to the restrictions on feedback:

  • Natural Evolution Strategies (NES) estimate gradients of the loss by querying the model on randomly sampled perturbations and aggregating feedback, e.g.:

$$\nabla_{x}\mathbb{E}[J(\theta)] \approx \frac{1}{\sigma n} \sum_{k=1}^{n/2} \left[ \delta_k J(\theta+\sigma\delta_k) - \delta_k J(\theta-\sigma\delta_k)\right]$$

NES has demonstrated strong query efficiency and robustness, especially with dimensionality-reducing tricks such as tiling small regions over images (1809.02918); a minimal sketch of the estimator follows this list.

  • Sign-based Estimation, as in SignHunter (1902.06894), posits that estimating the sign of the gradient (rather than its magnitude) is often sufficient. This reframes the optimization as a binary search over sign bits:

$$x' = \Pi_{B_p(x, \epsilon)}\left(x + \delta \cdot \text{sign}(\nabla_{x} L(x, y))\right)$$

This approach can drastically lower query counts (e.g., roughly 12 queries per MNIST image).

  • Consensus-Based Optimization (CBO) and Variants (2506.24048): These maintain a particle swarm that is iteratively drawn toward a consensus point (weighted by exponentials of the particles' objective values), combining exploitation and exploration. In favorable cases, CBO demonstrates mean-field convergence properties and superior query efficiency over NES in easier attack regimes.
  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and other evolutionary methods provide robust search in high-dimensional input spaces, efficiently balancing exploration and exploitation even under hard-label constraints (2104.15064).
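The NES estimator above requires only query access. Below is a minimal sketch using antithetic Gaussian sampling; `loss_fn` is an assumed query-only objective (e.g., a loss computed from returned scores), and the projection step is a generic $\ell_\infty$ PGD-style update rather than any specific paper's schedule.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.01, n=50):
    """Antithetic NES estimate of the gradient of a query-only objective at x.

    loss_fn: callable returning a scalar loss (assumption); sigma: smoothing
    radius; n: total number of queries (n // 2 antithetic pairs).
    """
    grad = np.zeros_like(x)
    for _ in range(n // 2):
        delta = np.random.randn(*x.shape)            # Gaussian search direction
        grad += delta * loss_fn(x + sigma * delta)   # forward sample
        grad -= delta * loss_fn(x - sigma * delta)   # antithetic sample
    return grad / (sigma * n)

def nes_linf_step(loss_fn, x, x_orig, eps, lr=0.01, **kw):
    """One sign-gradient ascent step, projected back into the l_inf ball around x_orig."""
    g = nes_gradient(loss_fn, x, **kw)
    x_adv = x + lr * np.sign(g)                      # ascend the estimated gradient
    return np.clip(np.clip(x_adv, x_orig - eps, x_orig + eps), 0.0, 1.0)
```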

Input-Free, Region-Limited, and Sparse Attacks

When stealthiness is not vital, attackers can dispense with similarity constraints:

  • Input-Free Attacks (1809.02918): Initialized from an arbitrary (e.g., gray) image, the attack need only achieve targeted output, not preserve semantics, greatly reducing attack dimensionality and queries.
  • Sparse-RS Framework (2006.12834): Random search is used to generate $\ell_0$-bounded, patch, or frame attacks, iterating over only a small set of dimensions per step; this makes it well suited to closed-box settings where input structure and physical realizability (as with patches) are priorities. A minimal sketch of the underlying random-search loop follows.
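The sketch below illustrates the greedy random-search loop behind such $\ell_0$ attacks; `margin_loss` is an assumed query-only objective (lower is better, negative once the input is misclassified), and the fixed resampling fraction stands in for Sparse-RS's decaying schedule and patch/frame variants.

```python
import numpy as np

def sparse_random_search(margin_loss, x, k=50, iters=1000, seed=0):
    """Greedy random search for an l0-bounded attack on a flat image x in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    d = x.size
    support = rng.choice(d, size=k, replace=False)      # pixels allowed to change
    values = rng.choice([0.0, 1.0], size=k)             # extreme values maximize effect
    best = x.copy(); best[support] = values
    best_loss = margin_loss(best)
    for _ in range(iters):
        if best_loss < 0:                               # negative margin: misclassified
            break
        m = max(1, k // 10)                             # resample a small part of the support
        idx = rng.choice(k, size=m, replace=False)
        cand_support, cand_values = support.copy(), values.copy()
        cand_support[idx] = rng.choice(d, size=m, replace=False)
        cand_values[idx] = rng.choice([0.0, 1.0], size=m)
        cand = x.copy(); cand[cand_support] = cand_values
        cand_loss = margin_loss(cand)
        if cand_loss < best_loss:                       # keep only improving candidates
            support, values, best, best_loss = cand_support, cand_values, cand, cand_loss
    return best
```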

Hard-Label Decision-Based Optimization

When only labels are available (no confidence scores), direct gradient estimation is infeasible. Representative methods include:

  • Boundary Attack: The attack starts from an already adversarial point and walks along the decision boundary toward the original input, minimizing the perturbation under label-only feedback (a minimal sketch follows this list).
  • Spectrum-Aware Decision Boundary for Point Clouds: For 3D data, adversaries can use spectrum-fusion to interpolate in the frequency domain, preserving shape and yielding imperceptible adversarial examples while relying on a discriminator to maintain realism (2412.00404).
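For intuition, here is a minimal sketch of such a label-only boundary walk; `is_adversarial` is an assumed hard-label oracle (True when the returned label differs from the true class), `x_adv_init` must already be adversarial, and the fixed step sizes replace the adaptive schedule of the published Boundary Attack.

```python
import numpy as np

def boundary_walk(is_adversarial, x_orig, x_adv_init, steps=1000,
                  spherical_step=0.01, source_step=0.01, seed=0):
    """Shrink an adversarial perturbation under label-only feedback."""
    rng = np.random.default_rng(seed)
    x_adv = x_adv_init.copy()
    for _ in range(steps):
        direction = x_orig - x_adv
        dist = np.linalg.norm(direction)
        # 1) Perturb orthogonally to the direction toward the original image.
        noise = rng.normal(size=x_adv.shape)
        noise -= direction * (noise.ravel() @ direction.ravel()) / (dist ** 2 + 1e-12)
        candidate = x_adv + spherical_step * dist * noise / (np.linalg.norm(noise) + 1e-12)
        # 2) Take a small step toward the original image to reduce the perturbation.
        candidate = np.clip(candidate + source_step * (x_orig - candidate), 0.0, 1.0)
        if is_adversarial(candidate):                   # accept only if still adversarial
            x_adv = candidate
    return x_adv
```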

Substitute and Imitation Models

Even with few queries, attackers may construct low-fidelity substitute models or even imitation models via GAN-like setups (2003.12760), or generative boundary-learning surrogates that focus on modeling only boundary data distributions critical for effective attack transferability (2402.02732).

No-box Approaches

Attackers can generate transferable adversarial examples by training auto-encoding substitutes (e.g., via prototypical reconstruction) on a handful of in-domain examples without ever querying the victim model (2012.02525).

3. Experimental Results and Comparative Performance

A broad range of studies have benchmarked these attack strategies under closed-box conditions:

  • Query Efficient Attacks: Input-free, region-based attacks have achieved 100% success rates on ImageNet with as few as 1,700 queries (InceptionV3), compared to 100,000+ for earlier finite-difference black-box attacks (1809.02918).
  • SignHunter: Achieves 0% failure rate on MNIST with ~12 queries and remains competitive on CIFAR10 and ImageNet, outperforming NES and other methods on robustness and efficiency (1902.06894).
  • Video Domain: V-BAD achieves >93% targeted attack success rate on video models with 34k–84k queries, orders of magnitude fewer than per-pixel gradient estimation would require (1904.05181).
  • Sparsity/Physical Constraints: Sparse-RS notably outperforms both black- and white-box baselines for patch/frame threats: e.g., 98.2% success for 50-pixel ImageNet attacks (2006.12834). For physical object detector attacks, model-agnostic GAN-based patches attain <20% mAP for TinyYOLO, outstripping pixel-space and square-attack baselines (2303.04238).
  • Hard-Label 3D Point Clouds: Spectrum-aware attacks achieve 100% success with lower $D_h$ and $D_{norm}$ than white-box or prior black-box baselines, even under label-only feedback (2412.00404).

A table summarizing recent empirical highlights:

| Method/Study | Setting | Success Rate | Typical Queries | Domain |
|---|---|---|---|---|
| NES, region-based (1809.02918) | Score/hard-label, input-free, ImageNet | 100% | ~1,700 | Images |
| SignHunter (1902.06894) | Score-based, $\ell_\infty$, MNIST | 100% | 12 | Images |
| Sparse-RS (2006.12834) | $\ell_0$/patch/frame, MNIST/ImageNet | >90% | 25–737 | Images |
| V-BAD (1904.05181) | Black-box, videos, targeted/untargeted | >97% | 34k–84k | Videos |
| GAN-patch (2303.04238) | Model-agnostic, physical detector attacks | N/A (mAP ↓) | ~4,000 | Object detection |
| Spectrum 3D (2412.00404) | Hard-label, point cloud, ModelNet40 | 100% | N/A | 3D point clouds |

4. Defensive Strategies Against Closed-Box Attacks

Defenses for closed-box scenarios must address the challenge that internal gradients and outputs are not available to attackers:

  • Boundary Defense (BD) (2201.13444): Adds noise to the logits of low-confidence (“boundary”) samples, exploiting the fact that attacks must query these regions to optimize effectively. Empirical evaluation shows attack success rates reduced to near zero with negligible accuracy loss (≤1%) on benign samples; a minimal sketch follows this list.
  • Data-free Defenses (DBMA) (2211.01579): Wavelet-based denoising (WNR) and regenerator modules, trained solely on synthetic data, are prepended to the black-box classifier. This approach boosts adversarial robustness by up to 39% (CIFAR-10, Auto Attack) without access to original data or model weights.
  • Input Transformations and Adversarial Training: Traditional defenses (e.g., randomization, compression, adversarial training) persist, but empirical studies show that only the latter provides consistently strong closed-box robustness on contemporary models (2412.20987).
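To make the Boundary Defense idea above concrete, the following sketch filters the logits returned to a querier; the confidence threshold `tau` and noise scale `sigma` are assumed knobs rather than the values used in the paper.

```python
import numpy as np

def boundary_defense_filter(logits, tau=0.3, sigma=0.1, rng=None):
    """Add Gaussian noise to the logits of low-confidence ("boundary") queries."""
    rng = rng or np.random.default_rng()
    z = logits - logits.max()                 # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    if probs.max() < tau:                     # boundary region: attackers must query here
        return logits + rng.normal(scale=sigma, size=logits.shape)
    return logits                             # high-confidence queries pass through unchanged
```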

5. Impact on Robust Models and State-of-the-Art Defenses

Recent analyses show that even the most advanced closed-box/black-box attacks are significantly less effective against modern robust models:

  • Robust models evaluated on RobustBench (e.g., Swin-L, ConvNext-L) consistently reduce advanced black-box attack success rates to ≤4%, compared to ≥80% on standard (vanilla) models (2412.20987).
  • Surrogate-Target Robustness Alignment: Attack transferability from surrogate models to robust targets collapses unless the surrogate has matching robustness characteristics. Robust surrogates enable higher attack success against robust targets, while vanilla surrogates are effective only against vanilla targets (2412.20987).

This suggests that adversarial training and model selection can significantly blunt the effectiveness of even sophisticated closed-box attacks, and that attack design must now consider surrogate-target alignment explicitly.

6. Theoretical Advances and Limitations

Several theoretical contributions underpin current closed-box methods:

  • Mean-Field Convergence: CBO possesses provable convergence to global minimizers of non-convex losses as the number of particles grows, providing confidence in its practical stability (2506.24048).
  • Gradient Sign Sufficiency: Binary search over sign bits (SignHunter) minimizes query complexity compared to full gradient estimation, especially for $\ell_\infty$-constrained adversarial examples (1902.06894); see the sketch after this list.
  • Limitations: In challenging high-dimensional and targeted settings, advanced evolutionary methods like CMA-ES or tuned NES retain an advantage. Parameter sensitivity and the curse of dimensionality remain practical hurdles for all population-based approaches (2104.15064, 2506.24048).
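A divide-and-conquer sign search in the spirit of SignHunter, as referenced above, can be sketched in a few lines; `loss_fn` is an assumed query-only objective to be maximized, and the flip schedule is a simplified version of the recursive halving described in the paper.

```python
import numpy as np

def sign_search(loss_fn, x, eps, rounds=3):
    """Recursively flip blocks of the sign vector, keeping flips that raise the loss."""
    d = x.size
    s = np.ones(d)                                     # start from the all-ones sign vector
    best = loss_fn(x + eps * s)
    for r in range(rounds):
        chunk = max(1, d // (2 ** r))                  # halve the block size each round
        for start in range(0, d, chunk):
            s[start:start + chunk] *= -1               # tentatively flip this block
            cand = loss_fn(x + eps * s)
            if cand > best:
                best = cand                            # keep the flip
            else:
                s[start:start + chunk] *= -1           # revert the flip
    return np.clip(x + eps * s, 0.0, 1.0)
```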

7. Future Directions and Open Problems

Closed-box adversarial research continues to develop across several axes:

  • Stronger and Label-efficient Attacks: Further reductions in query budget, success in ultra-low feedback environments (even single-label or thresholded outputs).
  • Non-Image Modalities: Expansion to videos, point clouds, tabular, and multi-modal systems, each introducing unique structural constraints (e.g., spectrum-aware approaches for 3D geometry).
  • Defense Evaluation: Systematic evaluation of new and existing defenses under score- and label-only attacks, including benchmarking on SOTA models and deploying data-free, plug-and-play methods.
  • Better Theoretical Foundations: Tightening nonasymptotic query complexity lower bounds, understanding transferability dynamics, and formalizing surrogate robustness matching remain central.

Closed-box adversarial attacks now encompass a spectrum of powerful methods effective in severely restricted scenarios. The evolution from query-hungry and data-dependent approaches to input-free, gradient-free, and even data-free attacks reshapes both the adversarial risk profile for deployed models and the evaluation criteria for robust learning. These advancements demand equally innovative, practical, and theoretically supported defense mechanisms for machine learning systems operating in adversarially sensitive contexts.