White, Gray & Black-Box Attacks Overview
- White-, gray-, and black-box attacks are classifications based on the level of attacker access to a model’s internals, crucial for adversarial ML security.
- Reverse engineering techniques like kennen-o and input crafting enable adversaries to extract internal model details from minimal output data.
- Robust defenses must consider a continuum of attacker capabilities since even restricted output access can significantly elevate attack potency.
White-box, gray-box, and black-box attacks are central concepts in adversarial machine learning, characterizing the degree of attacker access to a neural network’s internal information and shaping both theoretical definitions and practical attack/defense strategies. Rigorous recent work challenges the traditional boundaries between these categories, demonstrating that even models intended to be strictly black-box can “leak” substantial internal information—complicating the established taxonomy and security guarantees.
1. Formal Definitions and Distinctions
The classical taxonomy is based on the attacker's level of access:
- White-box attack: The adversary has complete access to the target network’s internal details—architecture, weights, gradients, and often the training procedure (S. et al., 2018, Oh et al., 2017, Bhambri et al., 2019).
- Black-box attack: The attacker can query the model and observe outputs (labels or probabilities), but lacks internal information such as weights, architecture, or gradients (Bhagoji et al., 2017, Sablayrolles et al., 2019, Bhambri et al., 2019).
- Gray-box attack: The attacker has partial internal knowledge, e.g., the architecture, some aspects of the training data, or historical access to intermediate model checkpoints, but no direct access to current weights or gradients (S. et al., 2018, Costa et al., 27 Feb 2025, Wang et al., 2022).
The practical significance of these definitions—especially the “gray-box” regime—has been emphasized by work demonstrating that there is a continuum of attack strategies interpolating between the black- and white-box extremes, with security properties varying smoothly as attacker knowledge increases (Oh et al., 2017, Costa et al., 27 Feb 2025, S. et al., 2018).
2. Methodologies for Attribute Extraction and Knowledge Elevation
The seminal work on “reverse engineering” black-box networks (Oh et al., 2017) shows how a black-box model can, in effect, be upgraded to gray- or near-white-box status: a meta-model (kennen) is trained to predict the internal attributes of black-box targets solely from input–output query pairs.
Key approaches include:
- kennen-o (output-only): Learns a mapping from the vector of model outputs on a fixed set of queries to a set of internal attributes, optimized as
  $$\min_{\theta}\; \mathbb{E}_{f}\!\left[\sum_{a} \mathcal{L}\!\left(m_{\theta}^{a}\big([f(x^{i})]_{i=1}^{n}\big),\, y^{a}\right)\right],$$
  where $f$ ranges over the meta-training models, $[f(x^{i})]_{i=1}^{n}$ concatenates the outputs on $n$ query inputs, $m_{\theta}^{a}$ is the metamodel head for attribute $a$, and $y^{a}$ is that attribute’s ground-truth value.
- kennen-i (input-crafting): Directly optimizes a single query input so that the model’s output itself leaks information about a specific internal attribute.
- kennen-io (joint): Jointly optimizes queries and meta-model parameters to maximize extracted attribute information.
Through these techniques, attackers can infer details such as the activation functions, normalization layers, kernel sizes, optimizer, and even elements of the training data, from black-box queries—demonstrating that minimal output (even a single label) suffices for significant information leakage.
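As an illustration of the kennen-o setup, the following is a minimal PyTorch sketch (not the authors’ implementation): a metamodel takes the concatenated softmax outputs of a fixed set of queries and is trained, on a collection of models with known internals, to classify one attribute. All class names, layer sizes, and the attribute choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 100 fixed queries, 10-class outputs, 3 possible values
# for the attribute being inferred (e.g., ReLU / tanh / ELU activation).
N_QUERIES, N_CLASSES, N_ATTR_VALUES = 100, 10, 3

class MetaModel(nn.Module):
    """Maps concatenated black-box outputs to a prediction of one attribute."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_QUERIES * N_CLASSES, 512), nn.ReLU(),
            nn.Linear(512, N_ATTR_VALUES),
        )

    def forward(self, outputs):            # outputs: (batch, N_QUERIES, N_CLASSES)
        return self.net(outputs.flatten(1))

def meta_train_step(meta, optimizer, outputs, attr_labels):
    """One meta-training step on query outputs collected from models whose
    internal attributes are known; at attack time the trained metamodel is
    applied to the black-box target's outputs on the same queries."""
    loss = nn.functional.cross_entropy(meta(outputs), attr_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```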
3. Attack Implementation Strategies and Effectiveness
Different knowledge regimes enable distinct attack strategies, summarized in the table below:
| Attack Type | Access | Example Methods | Effectiveness |
|---|---|---|---|
| White-box | All internals | FGSM, PGD, C&W (gradient-based) | High (optimal perturbations) |
| Gray-box | Partial internals | Attacks with surrogate/partial gradients; SGADV | Mid-to-high (context-specific) |
| Black-box | Output only | Transferability, query-based gradient estimation | Mid (improved via queries) |
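For reference, the gradient-based methods in the white-box row reduce to a few lines when gradients are available. Below is a minimal FGSM sketch, assuming a differentiable PyTorch classifier; it is illustrative rather than a reproduction of any cited implementation.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """White-box FGSM: a single signed-gradient step, possible only because
    the attacker can backpropagate through the target model itself."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
```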
Informed by reverse engineering, black-box attacks can approach white-box performance when enough queries are available (Bhagoji et al., 2017, Oh et al., 2017). For example, gradient estimation via two-sided finite differences,
$$\widehat{\nabla}_{x} f(x)_{i} = \frac{f(x+\delta e_{i}) - f(x-\delta e_{i})}{2\delta}, \qquad i = 1,\dots,d,$$
where $f(x)$ is a scalar score exposed by the model (e.g., a class probability), $e_{i}$ the $i$-th standard basis vector, and $\delta$ a small step size, allows an adversary to approximate true gradients and synthesize adversarial examples with success rates close to those of white-box counterparts. Query-efficiency techniques such as random or PCA-based grouping further reduce the cost of such attacks (Bhagoji et al., 2017, Bhambri et al., 2019).
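A black-box counterpart of the FGSM step above needs only query access. The sketch below estimates the gradient coordinate-wise with two-sided finite differences and then applies an FGSM-style step; the `query_fn` interface, the step sizes, and the 2·d query budget are assumptions for illustration.

```python
import torch

def fd_gradient(query_fn, x, delta=1e-2):
    """Two-sided finite-difference estimate of the gradient of a scalar
    black-box score (e.g., a loss or class probability) w.r.t. the input.
    `query_fn` takes a flat tensor and returns a float; one full estimate
    costs 2 * x.numel() queries."""
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    for i in range(flat.numel()):
        e = torch.zeros_like(flat)
        e[i] = delta
        grad[i] = (query_fn(flat + e) - query_fn(flat - e)) / (2 * delta)
    return grad.view_as(x)

def fgsm_with_estimated_gradient(query_fn, x, eps=0.03):
    """FGSM-style perturbation built purely from estimated gradients."""
    return (x + eps * fd_gradient(query_fn, x).sign()).clamp(0, 1)
```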
Experimentally, attacks using meta-inferred network family information (e.g., that a model is a ResNet) increase transfer-based adversarial success from 82.2% to 85.7%—approaching the white-box “oracle” of 86.2% (Oh et al., 2017).
4. Consequences for Security and the Blurring of Attack Categories
The ability to elevate a black-box setting to gray- or near-white-box status via systematic interrogation has several critical implications:
- Model internal knowledge is not reliably hidden. Even with only top-1 or minimal output, metamodels can recover internals at rates far above chance.
- Security cannot rely on obscurity. Once an attacker has collected sufficient input–output pairs, defenses based only on restricting model access become ineffective.
- Boundaries are not discrete. Attack settings form a continuum, and modest increases in query resources or information can substantially elevate attack potency (Oh et al., 2017, S. et al., 2018, Costa et al., 27 Feb 2025, Bhambri et al., 2019).
- Adversarial vulnerability is often underestimated. Evaluations that omit gray-box or reverse-engineering attacks risk painting an overly optimistic picture of a model’s robustness.
5. Challenges and Practical Considerations
Several practical challenges arise in distinguishing and defending against attacks under these regimes:
- Generalization Limits: If the meta-training data used to train attribute-extracting metamodels does not span the architectural space of the deployed model (“extrapolation”), performance degrades, but remains above chance (Oh et al., 2017).
- Output Granularity: Richer outputs (full probability vectors, top-k rankings) greatly increase risk—systems that return only top-1 (or even bottom-1) are still vulnerable, but to a lesser degree.
- Efficiency: Query-efficient black-box attacks using random or PCA-based grouping remain effective with limited queries (a minimal grouping sketch follows this list), and iterative methods (analogous to FGSM/PGD) further “whiten” the black box, increasing the risk to deployed models (Bhagoji et al., 2017).
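The following is a minimal sketch of the random-grouping idea under the same query-only assumptions as the earlier finite-difference example: coordinates are partitioned into k random groups and one two-sided difference is spent per group, so a full estimate costs 2k rather than 2d queries. Function names and defaults are illustrative.

```python
import torch

def grouped_fd_gradient(query_fn, x, k=64, delta=1e-2):
    """Query-efficient gradient estimate via random grouping: all coordinates
    in a group share a single finite-difference estimate taken along the
    group's indicator direction."""
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    for group in torch.randperm(flat.numel()).chunk(k):
        e = torch.zeros_like(flat)
        e[group] = delta
        grad[group] = (query_fn(flat + e) - query_fn(flat - e)) / (2 * delta)
    return grad.view_as(x)
```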
6. Impact on Design of Defenses and Deployment Strategies
The reverse engineering of internal characteristics from black-box access necessitates new defense paradigms:
- Robustness must be evaluated under gray-box conditions—including scenarios where the attacker has inferred substantial internal information via queries or side channels.
- Minimizing output informativeness may slow attackers but is insufficient as a standalone strategy; even highly restricted outputs (top-1 only) leak information.
- Adversarial training and other advanced defenses must guard against a spectrum of attacker capabilities, not only full-gradient white-box adversaries but also query-based or gray-box adversaries with partial (but critical) model knowledge (S. et al., 2018, Bhambri et al., 2019); a minimal adversarial-training sketch follows this list.
- Evaluations for new defenses should move beyond white-box to include adaptive, attribute-inferencing black-box and gray-box scenarios (Oh et al., 2017, Bhagoji et al., 2017, Costa et al., 27 Feb 2025).
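As a concrete baseline for the adversarial-training item above, the following is a minimal sketch of training on PGD-perturbed inputs, assuming a differentiable PyTorch classifier; the perturbation budget, step size, and iteration count are illustrative defaults rather than values taken from the cited works.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.007, steps=10):
    """Iterative projected gradient attack used to generate training inputs."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on worst-case (PGD-perturbed) inputs."""
    x_adv = pgd_attack(model, x, y)
    loss = F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```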
7. Synthesis and Open Issues
The traditional classification into white-box, gray-box, and black-box is increasingly recognized as a sliding scale, with real-world security properties depending on attacker persistence and the system’s output leakage. The findings highlight that effective defenses must account for the capacity of attackers to reconstruct internal knowledge and that even modest query access can suffice to “whiten” a black box. Thus, robust deployment and evaluation must consider not only full transparency and total obscurity but also the diverse nuances of partial knowledge settings.