
White, Gray & Black-Box Attacks Overview

Updated 12 September 2025
  • White-, gray-, and black-box attacks are classifications based on the level of attacker access to a model’s internals, crucial for adversarial ML security.
  • Reverse engineering techniques like kennen-o and input crafting enable adversaries to extract internal model details from minimal output data.
  • Robust defenses must consider a continuum of attacker capabilities since even restricted output access can significantly elevate attack potency.

White-box, gray-box, and black-box attacks are central concepts in adversarial machine learning, characterizing the degree of attacker access to a neural network’s internal information and shaping both theoretical definitions and practical attack/defense strategies. Rigorous recent work challenges the traditional boundaries between these categories, demonstrating that even models intended to be strictly black-box can “leak” substantial internal information—complicating the established taxonomy and security guarantees.

1. Formal Definitions and Distinctions

The classical taxonomy is based on the attacker's level of access:

  • White-box: full access to the model's internals, including architecture, parameters, and gradients.
  • Gray-box: partial knowledge of internals, e.g., the architecture family or training procedure, without full parameter access.
  • Black-box: access only to the model's input–output behavior via queries.

The practical significance of these definitions—especially the “gray-box” regime—has been emphasized by work demonstrating that there is a continuum of attack strategies interpolating between the black- and white-box extremes, with security properties varying smoothly as attacker knowledge increases (Oh et al., 2017, Costa et al., 27 Feb 2025, S. et al., 2018).

2. Methodologies for Attribute Extraction and Knowledge Elevation

The seminal work on “reverse engineering” black-box networks (Oh et al., 2017) shows how a black-box model can effectively be upgraded to gray- or near-white-box status. This is achieved by training a meta-model (kennen) to predict the internal attributes of black-box targets solely from input–output query pairs.

Key approaches include:

  • kennen-o (output-only): Learns a mapping $m_\theta$ from a vector of outputs $[f(x^i)]_{i=1}^n$ to a set of internal attributes $y^a$ (see the code sketch after this list), optimized as

$$\min_\theta\ \mathbb{E}_{f\sim\mathcal{F}} \left[ \sum_{a=1}^{K} \mathcal{L}\left(m_\theta^a\left([f(x^i)]_{i=1}^n\right),\, y^a\right) \right]$$

  • kennen-i (input-crafting): Directly optimizes the input $x$ so that $f(x)$ “leaks” specific model attribute information through its output.
  • kennen-io (joint): Jointly optimizes queries and meta-model parameters to maximize extracted attribute information.
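A minimal sketch of the kennen-o setup, assuming PyTorch; the query count, the two attributes, and the MLP metamodel below are illustrative placeholders rather than the exact configuration of Oh et al. (2017):

```python
import torch
import torch.nn as nn

# kennen-o sketch: predict hidden attributes of a black-box model from the
# concatenated outputs it returns on a fixed set of n query inputs.
n_queries, n_classes = 100, 10                   # n queries, 10-way softmax outputs
attr_sizes = {"activation": 3, "optimizer": 2}   # hypothetical attributes y^a

class KennenO(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_queries * n_classes, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # one classification head m_theta^a per attribute a
        self.heads = nn.ModuleDict(
            {a: nn.Linear(512, k) for a, k in attr_sizes.items()})

    def forward(self, outputs):                  # outputs: [batch, n_queries, n_classes]
        h = self.trunk(outputs.flatten(1))
        return {a: head(h) for a, head in self.heads.items()}

meta = KennenO()
opt = torch.optim.Adam(meta.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Meta-training over a family F of models with known internals. The random
# tensors below are placeholders: in practice each row holds the outputs
# [f(x^i)]_{i=1..n} of one training model f, paired with its attribute labels.
for step in range(100):
    outputs = torch.rand(32, n_queries, n_classes)
    labels = {a: torch.randint(0, k, (32,)) for a, k in attr_sizes.items()}
    preds = meta(outputs)
    loss = sum(loss_fn(preds[a], labels[a]) for a in attr_sizes)  # sum of L over attributes
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At test time, the same $n$ queries are sent to the deployed black box and the trained heads read off its likely attributes.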

Through these techniques, attackers can infer details such as the activation functions, normalization layers, kernel sizes, optimizer, and even elements of the training data, from black-box queries—demonstrating that minimal output (even a single label) suffices for significant information leakage.

3. Attack Implementation Strategies and Effectiveness

Different knowledge regimes enable distinct attack strategies, summarized in the table below:

| Attack Type | Access | Example Methods | Effectiveness |
|---|---|---|---|
| White-box | All internals | FGSM, PGD, C&W (gradient-based) | High (optimal perturbations) |
| Gray-box | Partial internals | Attacks with surrogate/partial gradients; SGADV | Mid-to-high (context-specific) |
| Black-box | Output only | Transferability, query-based gradient estimation | Mid (improved via queries) |
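As a concrete reference point for the white-box row, FGSM can be written in a few lines of PyTorch; this is a generic sketch (the `model`, inputs in $[0,1]$, and `eps` are assumptions, not specifics from the cited papers):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: perturb x by eps along the sign of the loss gradient.

    White-box only: the attacker must backpropagate through `model`.
    Assumes inputs live in [0, 1].
    """
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = x_adv + eps * x_adv.grad.sign()   # move each pixel to increase the loss
    return x_adv.clamp(0, 1).detach()
```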

Informed by reverse engineering, black-box attacks can approach white-box performance when enough queries are available (Bhagoji et al., 2017, Oh et al., 2017). For example, gradient estimation methods using finite differences:

$$\operatorname{FD}_x(g(x), \delta) = \left[ \frac{g(x+\delta e_1)-g(x-\delta e_1)}{2\delta},\ \ldots,\ \frac{g(x+\delta e_d)-g(x-\delta e_d)}{2\delta} \right]^T$$

allow an adversary to approximate the true gradient using two queries per input coordinate (here $e_i$ is the $i$-th standard basis vector and $d$ the input dimension) and synthesize adversarial examples with success rates close to those of white-box counterparts. Query-efficiency techniques such as random or PCA-based grouping further reduce the cost of such attacks (Bhagoji et al., 2017, Bhambri et al., 2019).
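A sketch of this estimator, assuming only query access to a scalar loss (the callable `g` is a stand-in for whatever loss the attacker can evaluate through the model's outputs):

```python
import numpy as np

def fd_gradient(g, x, delta=1e-3):
    """Two-sided finite-difference estimate of the gradient of g at x.

    Costs 2 * x.size queries, two per coordinate, exactly as in
    FD_x(g(x), delta) above. `g` maps an array to a scalar loss.
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i.flat[i] = 1.0                        # i-th standard basis vector
        grad.flat[i] = (g(x + delta * e_i) - g(x - delta * e_i)) / (2 * delta)
    return grad

def black_box_fgsm(g, x, eps=0.03, delta=1e-3):
    """FGSM-style step built purely from queries, with no model internals."""
    return np.clip(x + eps * np.sign(fd_gradient(g, x, delta)), 0.0, 1.0)
```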

Experimentally, attacks using meta-inferred network family information (e.g., that a model is a ResNet) increase transfer-based adversarial success from 82.2% to 85.7%—approaching the white-box “oracle” of 86.2% (Oh et al., 2017).

4. Consequences for Security and the Blurring of Attack Categories

The ability to elevate a black-box setting to gray- or near-white-box status via systematic interrogation has several critical implications:

  • Model internal knowledge is not reliably hidden. Even with only top-1 or minimal output, metamodels can recover internals at rates far above chance.
  • Security cannot rely on obscurity. Once an attacker has collected sufficient input–output pairs, defenses based only on restricting model access become ineffective.
  • Boundaries are not discrete. Attack settings form a continuum, and modest increases in query resources or information can substantially elevate attack potency (Oh et al., 2017, S. et al., 2018, Costa et al., 27 Feb 2025, Bhambri et al., 2019).
  • Adversarial vulnerability is often underestimated. Evaluations that omit gray-box or reverse-engineering attacks risk painting an overly optimistic picture of a model’s robustness.

5. Challenges and Practical Considerations

Several practical challenges arise in distinguishing and defending against attacks under these regimes:

  • Generalization Limits: If the meta-training data used to train attribute-extracting metamodels does not span the architectural space of the deployed model (“extrapolation”), performance degrades, but remains above chance (Oh et al., 2017).
  • Output Granularity: Richer outputs (full probability vectors, top-k rankings) greatly increase risk—systems that return only top-1 (or even bottom-1) are still vulnerable, but to a lesser degree.
  • Efficiency: Query-efficient black-box attacks using random or PCA grouping remain effective with limited queries (a grouped estimator is sketched below), and iterative methods (similar to FGSM/PGD) further “whiten” the black box, raising the risk for deployed models (Bhagoji et al., 2017).
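A sketch of the random-grouping idea, shown as a simplified stand-in for the estimators in Bhagoji et al. (2017): coordinates are partitioned into $k$ random groups, and one two-sided finite difference is spent per group instead of per coordinate.

```python
import numpy as np

def grouped_fd_gradient(g, x, k=64, delta=1e-3):
    """Query-efficient gradient estimate: 2*k queries instead of 2*d.

    Coordinates are split into k random groups; every coordinate in a
    group inherits one shared directional finite-difference estimate.
    """
    perm = np.random.permutation(x.size)
    grad = np.zeros_like(x)
    for group in np.array_split(perm, k):
        v = np.zeros_like(x)
        v.flat[group] = 1.0 / np.sqrt(group.size)  # unit direction over the group
        deriv = (g(x + delta * v) - g(x - delta * v)) / (2 * delta)
        grad.flat[group] = deriv * v.flat[group]   # spread estimate back per coordinate
    return grad
```

PCA-based grouping replaces the random partition with directions spanned by principal components of the data, concentrating queries where inputs actually vary.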

6. Impact on Design of Defenses and Deployment Strategies

The reverse engineering of internal characteristics from black-box access necessitates new defense paradigms:

  • Robustness must be evaluated under gray-box conditions—including scenarios where the attacker has inferred substantial internal information via queries or side channels.
  • Minimizing output informativeness may slow attackers but is insufficient as a standalone strategy; even highly restricted outputs (top-1 only) leak information.
  • Adversarial training and advanced defense techniques must guard against a spectrum of attacker capabilities, not only full-gradient white-box but also query-based or gray-box adversaries who might have partial (but critical) model knowledge (S. et al., 2018, Bhambri et al., 2019); a minimal adversarial-training sketch follows this list.
  • Evaluations for new defenses should move beyond white-box to include adaptive, attribute-inferencing black-box and gray-box scenarios (Oh et al., 2017, Bhagoji et al., 2017, Costa et al., 27 Feb 2025).
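To make the adversarial-training point concrete, here is a minimal PGD adversarial-training step in PyTorch; the $\ell_\infty$ budget, step size, and iteration count are illustrative defaults, not values from the cited works:

```python
import torch
import torch.nn.functional as F

def pgd_adv_training_step(model, x, y, opt, eps=8 / 255, alpha=2 / 255, steps=10):
    """One adversarial-training step: craft PGD examples, then train on them."""
    # Inner maximization: projected gradient ascent on the loss in the eps-ball.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)  # project and re-box
    # Outer minimization: a standard optimizer step on the adversarial batch.
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()
```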

7. Synthesis and Open Issues

The traditional classification into white-box, gray-box, and black-box is increasingly recognized as a sliding scale, with real-world security properties depending on attacker persistence and the system’s output leakage. The findings highlight that effective defenses must account for the capacity of attackers to reconstruct internal knowledge and that even modest query access can suffice to “whiten” a black box. Thus, robust deployment and evaluation must consider not only full transparency and total obscurity but also the diverse nuances of partial knowledge settings.
