White-Box & Black-Box Attack Models
- White-box attack models are characterized by complete model transparency, enabling direct gradient-based adversarial example generation with methods like FGSM and PGD.
- Black-box attack models function with limited observability, using gradient estimation, transferability, and query-intensive strategies to craft adversarial examples.
- Comparative analysis of both models reveals trade-offs in query complexity, transferability, and robustness, guiding improvements in defense mechanisms.
White-box and black-box attack models are the principal paradigms for analyzing adversarial vulnerability in machine learning systems. White-box attacks assume complete transparency of the model architecture and parameters, enabling efficient gradient-based adversarial example generation. Black-box attacks operate with limited observability, typically only model queries, and must rely on gradient estimation, transferability, or more query-intensive strategies. These models provide a spectrum of threat scenarios relevant for both defense evaluation and practical system security analysis (Li et al., 2018, Hayes, 2018, Zhou et al., 2020, Xiang et al., 2019, Głuch et al., 2020, Park et al., 9 Mar 2024, Mahmood et al., 2020).
1. Formal Definitions and Threat Assumptions
White-box attacks are predicated on the adversary having full access to the target model $f_\theta$:
- Architecture: all layer types, depths, activation functions.
- Parameters: all weights and biases.
- Gradients: $\nabla_x \mathcal{L}(f_\theta(x), y)$ for a loss function $\mathcal{L}$.
The canonical adversarial objective is
$$\max_{\|\delta\|_p \le \epsilon} \; \mathcal{L}\big(f_\theta(x+\delta),\, y\big),$$
enabling algorithms such as FGSM, PGD, Carlini-Wagner, and AutoAttack to compute potent perturbations directly and efficiently (Li et al., 2018, Hayes, 2018, Zhang et al., 2020, Wang et al., 2021).
Black-box attacks restrict adversary access to model outputs:
- Score-based: the full output vector of softmax probabilities or logits, $f_\theta(x) \in \mathbb{R}^K$.
- Decision-based: only the top-1 class label, $\hat{y} = \arg\max_k f_\theta(x)_k$.
No architectural or gradient information is visible. Attack methods include:
- Gradient estimation via queries (e.g., finite differences, NES, SPSA).
- Substitute-model training with transfer of white-box attacks.
- Purely gradient-free search (e.g., evolutionary or one-pixel attacks).
- Hard-label attacks operating in the most restricted setting (Hayes, 2018, Li et al., 2018, Williams et al., 2022, Park et al., 9 Mar 2024).
Many deployed ML systems (e.g., cloud APIs, edge devices) are only accessible in the black-box regime, making these attacks highly relevant to real-world security (Xiang et al., 2019, Mahmood et al., 2020).
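To make these access assumptions concrete, the following minimal Python sketch contrasts the three query interfaces; the wrapper classes and the `loss_gradient`/`predict_proba` methods are illustrative assumptions, not an API from the cited works.

```python
import numpy as np

class WhiteBoxAccess:
    """Full access: architecture, parameters, and exact input gradients."""
    def __init__(self, model):
        self.model = model                          # adversary may inspect everything
    def gradient(self, x, y):
        return self.model.loss_gradient(x, y)       # exact dL/dx (hypothetical method)

class ScoreBasedOracle:
    """Black-box, score-based: only the output probability vector is returned."""
    def __init__(self, model):
        self._model = model                         # hidden from the adversary
    def __call__(self, x):
        return self._model.predict_proba(x)         # softmax scores only

class DecisionBasedOracle:
    """Black-box, decision-based: only the top-1 label is returned."""
    def __init__(self, model):
        self._model = model
    def __call__(self, x):
        return int(np.argmax(self._model.predict_proba(x)))  # hard label only
```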
2. Principal Attack Methodologies
| Attack Model | Adversary Knowledge | Core Algorithms |
|---|---|---|
| White-box | Full model, gradients | FGSM, PGD, CW, DI-AA, AutoAttack |
| Black-box, score-based | Output vector only | NES, SPSA, ZOO, Art-Attack |
| Black-box, decision-based | Hard-label only | HSJA, Boundary, Sign-OPT, SQBA |
| Transfer-based black-box | Surrogate white-box model | FGSM/PGD on substitute, EigenBA |
White-box algorithms leverage exact gradients $\nabla_x \mathcal{L}(f_\theta(x), y)$ for efficient, direct search (a minimal sketch follows this list):
- FGSM: $x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f_\theta(x), y)\big)$.
- PGD: iterative projected gradient steps constrained to $\ell_p$-balls of radius $\epsilon$.
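A minimal PyTorch-style sketch of these two white-box attacks, assuming a differentiable classifier `model`, cross-entropy loss, and inputs normalized to $[0,1]$; hyperparameter defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: x_adv = x + eps * sign(grad_x L(f(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha=None, steps=10):
    """Iterated FGSM-style steps projected back onto the L-infinity eps-ball around x."""
    alpha = alpha if alpha is not None else eps / 4
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                           # stay in the valid input range
    return x_adv.detach()
```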
Black-box algorithms require different strategies:
- Estimation: NES/SPSA sample random directions and compute finite differences; two-sided schemes are robust to the choice of finite-difference step size (Hayes, 2018).
- Substitute: the attacker trains a local substitute model $g$ by querying the target $f_\theta$, then generates adversarial examples using white-box attacks on $g$. Active sampling (entropy- or margin-based) reduces query intensity (Li et al., 2018, Mahmood et al., 2020).
- Evolutionary: mutation-selection loops over highly parametric but low-dimensional perturbation spaces (as in Art-Attack's RGBA shapes) (Williams et al., 2022).
Decision-based methods, e.g., SQBA, operate in the most restricted regime (hard-label only), combining surrogate assistance and zero-order Monte Carlo updates to achieve extremely low query complexity (Park et al., 9 Mar 2024).
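To illustrate the decision-based regime at its simplest, the sketch below shrinks an adversarial example toward the clean input using only hard-label queries (a bisection along the connecting segment); it is a didactic simplification of boundary-walking attacks, not the SQBA or HopSkipJump algorithm.

```python
import numpy as np

def boundary_bisection(oracle, x_clean, x_adv_start, y_true, steps=25):
    """Shrink a hard-label adversarial example toward x_clean by bisection.

    `oracle(x)` returns only the predicted class label (decision-based access);
    `x_adv_start` must already be misclassified, e.g., a sample from another class.
    """
    lo, hi = 0.0, 1.0                        # interpolation weight toward x_adv_start
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        x_mid = (1.0 - mid) * x_clean + mid * x_adv_start
        if oracle(x_mid) != y_true:          # still adversarial: move closer to x_clean
            hi = mid
        else:                                # crossed back to the correct class
            lo = mid
    return (1.0 - hi) * x_clean + hi * x_adv_start   # smallest adversarial blend found
```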
3. Transferability and Surrogate-Guided Black-box Attacks
Transferability is the phenomenon that adversarial examples crafted on one model (often with similar architecture or input domain) are surprisingly likely to also fool a different, unseen model. This property underlies surrogate-based attacks, which:
- Train or acquire a local white-box model $g$ (the surrogate).
- Generate adversarial examples via gradient-based methods on $g$.
- Deploy the resulting examples against the unknown target $f$ and measure misclassification success (see the sketch after this list).
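A minimal sketch of this transfer pipeline, assuming a locally trained differentiable PyTorch surrogate and label-only access to the target via `target_oracle`; it reuses the `fgsm` helper sketched earlier, and all names are illustrative.

```python
import torch

def transfer_attack_success(surrogate, target_oracle, x, y, eps):
    """Craft adversarial examples in white-box mode on g, then evaluate them on f.

    `surrogate` is the local differentiable model g; `target_oracle(x)` returns only
    the target model's predicted labels (black-box access to f).
    """
    x_adv = fgsm(surrogate, x, y, eps)          # white-box step on the surrogate g
    with torch.no_grad():
        target_pred = target_oracle(x_adv)      # one batch of black-box queries to f
    transfer_rate = (target_pred != y).float().mean().item()
    return x_adv, transfer_rate                 # fraction of examples that transferred
```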
Empirical studies in transfer learning have shown that fine-tuned target models exhibit high transferability of adversarial examples from closely related source models, especially when model initialization is shared (Zhang et al., 2020). More advanced black-box algorithms exploit surrogate models to restrict the search space to highly effective directions (e.g., EigenBA uses principal Jacobian singular vectors from a fixed white-box model, significantly improving query efficiency) (Zhou et al., 2020).
Advanced methods such as SQBA (Small Query Black-box Attack) combine surrogate gradient directions with hard-label black-box probing, yielding substantial query reductions over classical decision-based attacks at fixed perturbation budgets (Park et al., 9 Mar 2024).
4. Query Complexity and the Model Access Spectrum
Query complexity forms a continuum between white-box and black-box extremes (Głuch et al., 2020):
- White-box: Effective query complexity is zero (arbitrary access).
- Unconstrained black-box: No upper bound on adaptively chosen queries; effectively white-box in the limit.
- Query-bounded black-box: a fixed maximum number of queries $Q$; security guarantees relate to the entropy of random decision boundaries.
Lower bounds demonstrate that, for models with high boundary entropy (e.g., randomly oriented quadratic nets in high dimensions), exponentially many queries are required by any $Q$-bounded adversary to compete with white-box attacks. This framework suggests that certain learning algorithms (e.g., 1-NN, high-dimensional quadratic nets) can be provably robust against query-limited adversaries if decision boundaries are sufficiently complex (Głuch et al., 2020).
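The query-bounded setting can be made explicit with a thin wrapper that enforces the budget $Q$ on any oracle; this is only an illustration of the threat model, not a construction from (Głuch et al., 2020).

```python
class QueryBudgetExceeded(RuntimeError):
    """Raised once the adversary has exhausted its query budget Q."""

class QueryBoundedOracle:
    """Wrap any black-box oracle and enforce a hard cap of Q adaptive queries."""
    def __init__(self, oracle, budget_q):
        self.oracle = oracle
        self.budget_q = budget_q
        self.used = 0
    def __call__(self, x):
        if self.used >= self.budget_q:
            raise QueryBudgetExceeded(f"budget of {self.budget_q} queries exhausted")
        self.used += 1
        return self.oracle(x)
```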
5. Gradient Estimation, Hyperparameters, and Empirical Results
Black-box attacks often rely on stochastic gradient estimation of the form
$$\hat{\nabla}_x \mathcal{L}(x) \approx \frac{1}{2\sigma n} \sum_{i=1}^{n} \big[\mathcal{L}(x + \sigma u_i) - \mathcal{L}(x - \sigma u_i)\big]\, u_i,$$
where the $u_i$ are random directions, $\sigma$ is a step size, and $n$ is the number of probes. Empirical studies indicate that two-sided estimators (e.g., NES, RDSA, SPSA) are robust to the choice of $\sigma$ but incur greater query costs per direction (Hayes, 2018). Query counts required to reach near-white-box attack success remain substantial for high-dimensional image classifiers, with proper tuning reducing failure rates.
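A NumPy sketch of the two-sided estimator above, where `loss_fn` stands in for the score-based query of $\mathcal{L}$ at a point; the probe count and step size are assumed values.

```python
import numpy as np

def two_sided_gradient_estimate(loss_fn, x, sigma=1e-3, n_probes=50, rng=None):
    """Two-sided (antithetic) finite-difference gradient estimate, NES/SPSA style.

    Spends 2 * n_probes score-based queries: loss_fn(x + sigma*u_i) and loss_fn(x - sigma*u_i).
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_probes):
        u = rng.standard_normal(size=x.shape)               # random probe direction u_i
        delta = loss_fn(x + sigma * u) - loss_fn(x - sigma * u)
        grad += delta * u
    return grad / (2.0 * sigma * n_probes)
```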
Transfer- and surrogate-based attacks can dramatically reduce queries by constraining search spaces (principal singular vectors, shape-parameter subspaces), with Art-Attack's evolutionary shapes requiring fewer queries than baseline black-box methods for equal norm constraints (Williams et al., 2022, Zhou et al., 2020).
6. Defensive Considerations under Both Attack Models
Defenses focusing solely on gradient masking or on the white-box threat model often offer only marginal black-box robustness; adaptive black-box adversaries using substitute models and query strategies can circumvent these measures (Mahmood et al., 2020). Defenses such as output randomization are specifically constructed to disrupt gradient estimation by adding Gaussian noise to each output vector at test time, provably reducing black-box attack success rates to zero for ZOO and related methods without significant clean-accuracy loss (Park et al., 2021).
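A sketch of output randomization in the spirit of (Park et al., 2021): Gaussian noise is added to the returned score vector at query time so that finite-difference gradient estimates are dominated by noise; the noise scale shown is an assumed value.

```python
import numpy as np

def randomized_output(model_scores, x, noise_std=0.1, rng=None):
    """Return the model's score vector with i.i.d. Gaussian noise added at test time.

    `model_scores(x)` is the clean probability/logit vector. For modest noise_std the
    argmax (and hence clean accuracy) is largely preserved, while query-based gradient
    estimators such as ZOO mostly measure noise.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(model_scores(x), dtype=float)
    return scores + rng.normal(scale=noise_std, size=scores.shape)
```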
The enhancement of boundary entropy through both structural model design and randomized defensive wrappers provides a provable mechanism to force higher query costs on black-box adversaries; however, practical trade-offs with model accuracy and computational overhead remain significant (Głuch et al., 2020, Park et al., 2021).
7. Empirical Trends and Future Directions
Key results established in the literature include:
- Superior attack efficacy and efficiency in white-box settings due to direct gradient access.
- Dramatic improvements in black-box query efficiency via transfer, surrogate-guided attacks, and dimensionality reduction (principal subspaces, shape encodings).
- Existence of an accuracy–robustness–query-complexity triad, wherein raising clean accuracy often increases the number of queries needed for a successful black-box attack (Głuch et al., 2020, Zhang et al., 2020).
- Emerging hybrid threat models (gray-box), wherein adversaries escalate attack power by leveraging side-channel information (e.g., power-trace leakage) to partially reconstruct model internals, approaching near-white-box efficacy (Xiang et al., 2019).
Continued research aims to further close the gap between black-box and white-box attack performance, while defenses focus on entropy augmentation, randomized responses, certified smoothing, and robust training protocols validated against adaptive adversaries (Zhou et al., 2020, Park et al., 9 Mar 2024, Park et al., 2021).
References: (Li et al., 2018, Hayes, 2018, Zhang et al., 2020, Zhou et al., 2020, Xiang et al., 2019, Mahmood et al., 2020, Głuch et al., 2020, Williams et al., 2022, Park et al., 2021, Wang et al., 2021, Park et al., 9 Mar 2024)