Black-Box Evasion Attacks

Updated 14 November 2025
  • Black-box evasion attacks are adversarial methods that craft inputs using only a model's outputs without accessing its internals, commonly targeting image, malware, and web classifiers.
  • They employ optimization techniques such as derivative-free methods, surrogate modeling, and reinforcement learning to achieve efficient, stealthy perturbations under strict query constraints.
  • Empirical evaluations show high evasion rates with low query counts, underscoring the practical challenges in defending modern ML systems against these adaptive adversarial strategies.

Black-box evasion attacks constitute a broad and intensively studied class of adversarial attacks against machine learning classifiers and detectors, distinguished by the constraint that the attacker has no access to the training data, model parameters, or internal structure of the target model, but can interact with it only via its output (typically a hard-label decision, or possibly a confidence score). These attacks are both theoretically rich and practically impactful, targeting domains as diverse as image classification, malware detection, link prediction in dynamic graphs, and phishing detection. Advanced variants address not only the fundamental attack-construction problem but also integrate practical constraints such as limited query budgets, stealth imperatives, and strict constraints on perturbations or payload growth.

1. Formal Problem Setting and Attack Models

In the black-box evasion setting, the adversary wishes to craft an adversarial input $x_{\mathrm{adv}}$ from an initial instance $x$, such that a (possibly unknown) classifier $f$ misclassifies $x_{\mathrm{adv}}$. A canonical example is the $\ell_p$-constrained image attack $x_{\mathrm{adv}} = x + \delta$, $\|\delta\|_p \leq \epsilon$, with $f(x_{\mathrm{adv}}) \ne f(x)$. The attacker is limited to submitting queries $x'$ to $f$ and observing either a discrete label (hard label) or a confidence score (soft label), with neither model weights nor gradient information disclosed (Sethi et al., 2018, Juuti et al., 2019, Debenedetti et al., 2023).
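This query-only interaction can be made concrete with a minimal decision-based baseline: a hard-label oracle plus blind random search inside the $\ell_\infty$ ball. The function name and hyperparameters below are illustrative, not drawn from any of the cited papers.

```python
import numpy as np

def evade_random(x, oracle, eps=0.1, max_queries=1000, rng=None):
    """Naive decision-based evasion: sample perturbations with
    ||delta||_inf <= eps and query the hard-label oracle until the
    predicted label flips. Returns (x_adv, queries) or (None, queries)."""
    rng = rng or np.random.default_rng(0)
    y0 = oracle(x)                                    # one query: clean label
    queries = 1
    while queries < max_queries:
        delta = rng.uniform(-eps, eps, size=x.shape)  # ell_inf-bounded noise
        x_adv = np.clip(x + delta, 0.0, 1.0)          # stay in valid input range
        queries += 1
        if oracle(x_adv) != y0:                       # hard-label feedback only
            return x_adv, queries
    return None, queries
```

Practical attacks replace this blind sampling with the structured strategies of Sections 2.1–2.3, which is what brings query counts down by orders of magnitude.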

Attack models can be further classified by:

  • Type of feedback: decision-based (returns only class label), score-based (returns real-valued confidence or softmax/probability), or partial-information (returns top-k labels) (Juuti et al., 2019).
  • Adversary's capabilities: permitted types of input perturbations (additive, structural, functional), number and nature of queries, ability to leverage domain-specific transformations (e.g., APK manipulations (Bostani et al., 2021), code insertion in PE files (Demetrio et al., 2020)), or functional or semantic preservation constraints (e.g., preserving malicious payload, visual fidelity in phishing sites (Lei et al., 2020)).
  • Attack objectives: untargeted (change the prediction), targeted (force a specific label), or degrade an application-specific metric (e.g., $F_1$ in link prediction (Li et al., 17 Dec 2024)).

2. Fundamental Methodologies

2.1 Zero-Order and Surrogate-based Optimization

The foundational approaches for black-box evasion attacks fall into two broad categories:

  • Derivative-free (zero-order) optimization: The adversary estimates pseudo-gradients through query-driven finite differences, evolutionary strategies, MCMC, or sign-based queries (Al-Dujaili et al., 2019, Qiu et al., 2021). For example, sign-based gradient estimation as in (Al-Dujaili et al., 2019) exploits stochastic directional derivatives, yielding superior query efficiency and requiring only loss-oracle feedback. Evolution strategies such as NES and CMA-ES iteratively evolve perturbations, optimizing attack objectives by sampling and updating search distributions (Qiu et al., 2021).
  • Transfer-based and surrogate modeling: Adversaries train local substitute models gg to approximate the decision boundary of ff, generating adversarial examples on gg and transferring them to ff (Cox et al., 7 Nov 2025, Juuti et al., 2019). Methodologies exploit transferability between high-CKA and low-CKA surrogate models to cover diverse adversarial subspaces, with surrogate selection and risk estimation formalized via Centered Kernel Alignment (CKA) statistics and regression-based aggregators (Cox et al., 7 Nov 2025).
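The zero-order branch can be illustrated with an antithetic NES-style gradient estimator, which needs only loss-value queries. This is a simplified sketch (hyperparameters are placeholders), not the exact estimator of any cited paper.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.01, n_samples=50, rng=None):
    """Antithetic NES estimate of grad_x loss_fn(x) from loss-oracle queries
    alone: sample Gaussian directions u and combine the symmetric finite
    differences (loss(x + sigma*u) - loss(x - sigma*u)) along each u."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```

Each iteration costs two queries, so the query budget directly bounds `n_samples`; the estimated gradient then feeds a standard ascent/descent step on the attack objective.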

2.2 Reinforcement Learning-based Attacks

Recent work rigorously frames black-box evasion as sequential decision-making, deploying RL agents in Markov Decision Processes to learn adaptive attack policies (Domico et al., 3 Mar 2025, Liu et al., 16 Oct 2025, Li et al., 17 Dec 2024, Fan et al., 2020). These approaches maintain state representations encoding current input/perturbation context and optimize reward signals encoding evasion success, perturbation minimality, and stealth. RL-based agents have been shown to increase success rates and decrease median query budgets over baseline methods by up to 19.4% and 53.2% respectively (Domico et al., 3 Mar 2025).
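The MDP framing can be sketched as a small environment whose reward is the per-step reduction of the classifier's confidence margin; the state, action, and reward choices here are illustrative, not those of the cited papers.

```python
import numpy as np

class EvasionEnv:
    """Toy MDP for black-box evasion: state = current input, action = one
    signed step on one pixel, reward = reduction of the score-oracle's
    margin for the true class. An RL agent (e.g., PPO/SAC) would learn a
    policy over these actions; here the mechanics alone are shown."""
    def __init__(self, x0, score_fn, true_label, step=0.05, budget=200):
        self.x0, self.score_fn, self.y = x0, score_fn, true_label
        self.step, self.budget = step, budget

    def reset(self):
        self.x = self.x0.copy()
        self.queries = 0
        self.margin = self._margin()
        return self.x

    def _margin(self):
        self.queries += 1
        s = self.score_fn(self.x)            # soft-label feedback (scores)
        other = np.delete(s, self.y).max()
        return s[self.y] - other             # > 0 while still classified y

    def step_action(self, pixel, sign):
        self.x[pixel] = np.clip(self.x[pixel] + sign * self.step, 0.0, 1.0)
        new_margin = self._margin()
        reward = self.margin - new_margin    # reward = margin reduction
        self.margin = new_margin
        done = new_margin < 0 or self.queries >= self.budget
        return self.x, reward, done
```

The episode terminates either on successful evasion (margin below zero) or on query-budget exhaustion, which is how query minimality enters the reward structure.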

2.3 Domain-aware and Structure-preserving Transformations

Effective black-box evasion frequently depends on constructing transformations in the “problem-space” of the target domain:

  • Malware and traffic: Functionality-preserving manipulations include benign-content injection (PE padding/section injection (Demetrio et al., 2020)), call-based code redivision (MalGuise (Ling et al., 3 Jul 2024)), or RL-guided traffic mutation leveraging deep pre-trained models for traffic tokenization and filling (Traffic-BERT in NetMasquerade (Liu et al., 16 Oct 2025)).
  • Graph-structured domains: Additions and deletions of edges in dynamic graph sequences can be optimized via graph sequential embeddings and multi-environment DDPG training, targeting prediction $F_1$ degradation under strict perturbation and query constraints (Li et al., 17 Dec 2024).
  • Text and webpages: Phishing and adversarial web attacks exploit DOM-level mutation and addition (preserving visual/functional fidelity), guided by classifier feedback (Lei et al., 2020).
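As a toy illustration of problem-space search on graphs, a greedy attacker can toggle the single edge that most reduces the black-box score, up to a flip budget. This brute-force variant spends $O(n^2)$ queries per step, which is exactly the cost the embedding- and RL-based methods above are designed to avoid; all names here are illustrative.

```python
import numpy as np

def greedy_edge_flips(adj, score_fn, budget=5):
    """Greedy edge perturbation on a symmetric 0/1 adjacency matrix:
    at each step, try every edge toggle, keep the one that most reduces
    the black-box score, and stop after `budget` flips (or when no
    single toggle helps)."""
    A = adj.copy()
    n = A.shape[0]
    for _ in range(budget):
        best, best_score = None, score_fn(A)
        for i in range(n):
            for j in range(i + 1, n):
                A[i, j] ^= 1; A[j, i] ^= 1     # toggle edge (i, j)
                s = score_fn(A)
                if s < best_score:
                    best, best_score = (i, j), s
                A[i, j] ^= 1; A[j, i] ^= 1     # undo the toggle
        if best is None:
            break                               # no flip improves; stop early
        i, j = best
        A[i, j] ^= 1; A[j, i] ^= 1              # commit the best flip
    return A
```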

3. Query Efficiency, Stealth, and Constraints

3.1 Query Budget and Stealth

Attack efficiency is a primary research focus, as defenders increasingly monitor for suspicious query patterns, enforce query budgets, or throttle abusive behavior. Query-efficient approaches include:

  • Randomized and sign-based optimization: (Al-Dujaili et al., 2019) achieves 3.8× fewer failures and 2.5× fewer queries than prior art on datasets including MNIST, CIFAR10, and ImageNet, often succeeding in as few as 12 queries.
  • Stealthy decision-based attacks: (Debenedetti et al., 2023) analyzes the cost asymmetry between “bad” and “benign” queries and introduces stealthy line-search and early-stopping techniques, reducing “bad” queries by 1.5–7.3× at the cost of increased benign queries (a tradeoff that is advantageous under realistic cost models where bad queries incur heavy penalties).
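The stealthy line-search idea can be sketched as a binary search along the segment between a benign anchor and a known adversarial point, stopping early once the interval is small. Counting only "bad" (adversarially classified) queries below reflects the cost asymmetry; the names and tolerances are illustrative.

```python
import numpy as np

def stealthy_binary_search(x_benign, x_adv, oracle, y0, tol=1e-3, max_steps=20):
    """Binary-search the decision boundary on the segment between a benign
    anchor and a known adversarial point. Early stopping (interval < tol)
    limits further boundary queries; only adversarial ('bad') queries are
    counted, since those are the ones a defender is likely to flag."""
    lo, hi = 0.0, 1.0            # lo: benign side, hi: adversarial side
    bad_queries = 0
    for _ in range(max_steps):
        if hi - lo < tol:
            break                # early stopping saves boundary queries
        mid = (lo + hi) / 2
        x_mid = (1 - mid) * x_benign + mid * x_adv
        if oracle(x_mid) != y0:
            hi = mid             # still adversarial: tighten from above
            bad_queries += 1     # this query would look suspicious
        else:
            lo = mid             # benign query: cheap under the cost model
    return (1 - hi) * x_benign + hi * x_adv, bad_queries
```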

3.2 Functional and Semantic Preservation

Preserving the intended malicious functionality or domain semantics is essential for evasive samples to be both effective and realistic: representative constraints include retaining the malicious payload during APK manipulation (Bostani et al., 2021) and preserving the visual and functional fidelity of phishing pages under DOM-level edits (Lei et al., 2020).

3.3 Attack Budget and Detection Tradeoffs

Attacks are often constrained by maximal perturbation budgets (e.g., $\ell_p$ bounds or a maximum appended size) and must balance evasion rate against stealth or detectability. For instance, EvadeDroid (Bostani et al., 2021) achieves evasion rates of 80–95% with ≤9 queries and remains stealthy against commercial antiviruses, but recognizes possible tradeoffs for small APKs, where payload increase raises suspicion.

4. Technical Algorithms and Empirical Results

4.1 Query-efficient Attacks via Random Search and Genetic Algorithms

EvadeDroid (Bostani et al., 2021) demonstrates that random search over problem-space transformations enables near-optimal evasion with few queries. Similarly, the functionality-preserving optimizer in (Demetrio et al., 2020) uses a genetic algorithm (GAMMA) to minimize the classifier score plus a payload-size penalty, returning evasive PEs that evade standard static detectors and up to 12 commercial antivirus engines.
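A GAMMA-style objective, classifier score plus a payload-size penalty over a binary mask of benign sections to inject, can be sketched with a simple genetic algorithm. The population size, mutation rate, and truncation selection below are illustrative, not the paper's configuration.

```python
import numpy as np

def gamma_like_ga(score_fn, section_sizes, lam=1e-6, pop=20, gens=30, rng=None):
    """Genetic search over binary injection masks (which benign sections to
    append), minimizing score_fn(mask) + lam * injected payload bytes.
    Uses elitist truncation selection plus bit-flip mutation."""
    rng = rng or np.random.default_rng(0)
    n = len(section_sizes)
    P = rng.integers(0, 2, size=(pop, n))            # population of masks

    def fitness(mask):
        return score_fn(mask) + lam * float(mask @ section_sizes)

    for _ in range(gens):
        f = np.array([fitness(m) for m in P])
        elite = P[np.argsort(f)[: pop // 2]]         # keep best half
        children = elite[rng.integers(0, len(elite), pop - len(elite))]
        flips = rng.random(children.shape) < 0.1     # 10% bit-flip mutation
        children = children ^ flips.astype(children.dtype)
        P = np.vstack([elite, children])
    f = np.array([fitness(m) for m in P])
    return P[np.argmin(f)]
```

The penalty weight `lam` is the knob trading evasion confidence against payload growth, mirroring the budget/detectability tradeoff discussed in Section 3.3.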

4.2 RL-based Attack Pipelines

In (Domico et al., 3 Mar 2025), adversarial example generation is cast as a Markov Decision Process, employing PPO for policy optimization. The agent perturbs $N$ pixels per step, receives a reward for reducing the classifier margin, and learns to minimize the query count per successful evasion. In (Liu et al., 16 Oct 2025), NetMasquerade integrates a Traffic-BERT pre-trained model for realistic sequence generation and Soft Actor-Critic RL for sequential traffic manipulation, achieving >96.65% evasion in hard-label settings with <10 modifications per attack.

4.3 Transfer-based Risk Quantification

Quantifying adversarial risk in practical deployments (especially regulatory-driven security domains) is addressed in (Cox et al., 7 Nov 2025) by selecting surrogate models with maximally diversified (high and low) CKA similarity, generating adversarial examples for each, and using regression models to estimate adversarial risk. The framework yields estimates within 1.2–3.5% MSE of true risk, using as few as 4–10 surrogates.
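Linear CKA itself is compact: given two representation matrices (samples × features), it is the normalized Frobenius alignment of their centered cross-covariance. A minimal sketch:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices
    of shape (n_samples, features). Returns a similarity in [0, 1]; it is
    invariant to isotropic scaling and orthogonal transforms of either side."""
    X = X - X.mean(axis=0)                  # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den
```

In a surrogate-selection pipeline, `X` and `Y` would be layer activations of the surrogate and (estimated) target on a shared probe set; picking surrogates with both high and low CKA diversifies the adversarial subspaces covered.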

4.4 Experimental Metrics and Real-world Evaluation

Key results include:

  • EvadeDroid (Bostani et al., 2021): Evasion rates 88.9–94.8% for academic detectors, 63.6–100% across five commercial engines at ≤9 average queries.
  • Single-shot attacks: MalGPT (Hu et al., 2021) reaches 24.51% evasion on MalConv with a single query per instance, outperforming RNN and enhanced benign-append baselines.
  • Stealthy attacks (Debenedetti et al., 2023): StealthyRayS achieves 79 flagged queries (vs. 172 for RayS baseline) on a commercial API, validating superior cost efficiency for realistic threat models.

5. Domain Applications and Impact

5.1 Malware and Security Applications

Static and dynamic malware detectors, on both Android and Windows, are consistently vulnerable to realistic black-box evasion using problem-space and semantics-preserving transformations (Demetrio et al., 2020, Bostani et al., 2021, Ling et al., 3 Jul 2024, Ebrahimi et al., 2020, Hu et al., 2021). Beyond malware, universal adversarial perturbations and input-specific attacks apply to both near-RT and non-RT RAN applications, with real-world performance degradation observed in O-RAN testbeds (Gajjar et al., 20 Oct 2025).

5.2 Graph and Link-Prediction Attacks

Advanced methods leveraging graph sequential embeddings and deep RL achieve efficient attacks against state-of-the-art LPDG models with strict perturbation/query constraints, outperforming both random and previous RL baselines on multiple real-world graph datasets (Li et al., 17 Dec 2024, Fan et al., 2020). LP-based influence functions extend the attack's applicability to arbitrarily deep GNNs (Wang et al., 2020).

5.3 Phishing and Webpage Classifiers

Mutation-based DOM edits and additions, guided by black-box queries and minimal visual/functional distortion, yield near-perfect evasion rates against deployed phishing detectors, with observed transferability to industrial classifiers (Lei et al., 2020).

6. Limitations, Defenses, and Future Research

6.1 Limitations

Limitations reported across these works include payload growth that raises suspicion for small samples (Bostani et al., 2021), increased benign-query overhead when minimizing flagged queries (Debenedetti et al., 2023), and the dependence of score-based attacks on confidence feedback that deployed systems may withhold (Sethi et al., 2018).

6.2 Defenses

  • Adversarial training: Integrating adversarially generated examples into the training loop is shown to increase the perturbation required for successful evasion, especially in O-RAN xApps/rApps (Gajjar et al., 20 Oct 2025).
  • Randomization and output obfuscation: Randomized defenses and withholding confidence values can impede exploitation of gradient/score-based attacks (Sethi et al., 2018, Debenedetti et al., 2023).
  • Similarity-based prefilters: Pre-screeners such as Pelican utilize structural similarity between recent phishing and new webpage submissions to flag likely evasions (Lei et al., 2020).
  • Query-limiting and anomaly detection: Monitoring for query patterns, flagged query rates, or input distributions forms a practical defense layer (Debenedetti et al., 2023, Gajjar et al., 20 Oct 2025).
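A stateful query-monitoring defense can be sketched as flagging clients whose recent queries cluster within a small radius, the signature of iterative black-box optimization. The radius, window size, and neighbor threshold below are illustrative placeholders.

```python
import numpy as np
from collections import deque

class QueryMonitor:
    """Stateful defense sketch: flag a client once too many of its recent
    queries fall within a small L2 radius of the current query, since
    iterative attacks submit long runs of near-duplicate inputs."""
    def __init__(self, radius=0.5, window=50, max_neighbors=10):
        self.radius, self.max_neighbors = radius, max_neighbors
        self.history = deque(maxlen=window)   # sliding window of past queries

    def observe(self, x):
        x = np.asarray(x, dtype=float).ravel()
        neighbors = sum(
            np.linalg.norm(x - h) < self.radius for h in self.history
        )
        self.history.append(x)
        return neighbors >= self.max_neighbors  # True -> flag / throttle
```

Such detectors motivate the stealthy-attack line of work above: attackers either spread queries across accounts or reshape query sequences to stay below clustering thresholds.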

6.3 Open Research Directions

  • Optimization under asymmetric costs: Development of attacks from first-principles for realistic cost models (penalizing flagged queries over benign) (Debenedetti et al., 2023).
  • Generalizability and norm-agnosticity: Extending techniques to mixed or mismatched norms (e.g., $L_1$ vs. $L_\infty$) and human-perceptual domains, and scaling to high-resolution or transformer-based models (Tal et al., 2023).
  • Automated surrogate selection and meta-learning: Systematic approaches for efficient and comprehensive surrogate model selection to cover adversarial subspaces (Cox et al., 7 Nov 2025).
  • Defenses for dynamic/behavioral detectors, and RL-resilient architectures: Addressing the inherent challenges in defending sequence-based or RL-trained detection systems.

7. Summary Table: Representative Black-Box Evasion Methods

| Approach / Paper | Domain / Target | Success / Query Rate |
|---|---|---|
| Sign-based grad. (Al-Dujaili et al., 2019) | Images (MNIST, CIFAR10, ImageNet) | 2.5× fewer queries; as few as 12 queries (MNIST) |
| EvadeDroid (Bostani et al., 2021) | Android malware | 80–95% evasion, 1–9 queries |
| MalRNN (Ebrahimi et al., 2020) | Windows malware | 73.2–99.97% (40% append) |
| CMA-ES (Qiu et al., 2021) | ImageNet | ~100% untargeted, <500 queries |
| RL-based (Domico et al., 3 Mar 2025) | CIFAR-10 | +19.4% success, −53.2% queries |
| StealthyRayS (Debenedetti et al., 2023) | ImageNet, commercial API | 2–7× fewer expensive queries |
| Graph RL (Li et al., 17 Dec 2024) | Dynamic graph link prediction | $F_1$ drop of 0.23–0.49 |
| O-RAN UAP (Gajjar et al., 20 Oct 2025) | RAN xApps/rApps | up to 78–95% success |

Black-box evasion attack research demonstrates not only the vulnerability of a wide class of modern ML systems under realistic threat models, but also the diversity of practical attack and defense strategies demanded by the constraints and operational realities of each application domain. Further work continues to explore the frontiers of efficiency, stealth, generalizability, and robust, scalable defenses.
