Adversarial Examples in Machine Learning

Updated 11 December 2025
  • Adversarial examples are deliberately perturbed inputs that cause ML models to misclassify while remaining nearly imperceptible to humans.
  • Systematic attack taxonomies cover methods such as FGSM, PGD, and CW, and distinguish between white-box and black-box threat models.
  • Defensive strategies like adversarial training and entropy regularization improve robustness, yet open challenges persist in physical and semantic attack domains.

Adversarial examples are inputs to machine learning models that have been deliberately perturbed—often in ways that are nearly imperceptible to humans—such that the model produces incorrect (or targeted) outputs with high confidence. Their existence reveals fundamental vulnerabilities in even state-of-the-art neural networks, with broad implications for both reliability and security in real-world deployments. Systematic research has exposed their universal characteristics, taxonomies of attack and defense, and the mathematical and geometric principles underlying their formation and transferability.

1. Formal Definitions and Attack Taxonomy

An adversarial example $x' = x + \delta$ is constructed so that $\|\delta\|_p$ is small (as measured by some norm), but $f(x') \neq y$, where $y = f(x)$ and $f$ is the classifier (Goodfellow et al., 2014, Serban et al., 2020, Wiyatno et al., 2019). The canonical problem formulation for untargeted attacks is:

$$\min_{\delta} \; \|\delta\|_p \quad \text{subject to} \quad f(x+\delta) \neq f(x), \quad \|\delta\|_p \leq \epsilon$$

For targeted attacks, the constraint is $f(x+\delta) = t$ for a chosen $t \neq y$.

The most common norm choices are:

  • $\ell_\infty$: Bounds the per-component change; used in most practical attacks and defenses.
  • $\ell_2$: Penalizes the global “energy” of the perturbation; closely tied to geometric distance.
  • $\ell_0$: Controls the number of changed features. A short sketch computing these norms follows this list.
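
As a concrete illustration, the sketch below computes the three perturbation norms with NumPy; the array shapes, tolerance, and example values are illustrative assumptions rather than anything prescribed by the cited papers.

```python
import numpy as np

def perturbation_norms(x: np.ndarray, x_adv: np.ndarray, tol: float = 1e-12):
    """Report common p-norms of the perturbation delta = x_adv - x."""
    delta = (x_adv - x).ravel()
    return {
        "linf": float(np.max(np.abs(delta))),    # largest per-component change
        "l2": float(np.linalg.norm(delta)),      # global "energy" of the perturbation
        "l0": int(np.sum(np.abs(delta) > tol)),  # number of changed features
    }

# Example: a 3x3 "image" with two pixels modified by 0.05
x = np.zeros((3, 3))
x_adv = x.copy()
x_adv[0, 0] += 0.05
x_adv[2, 1] -= 0.05
print(perturbation_norms(x, x_adv))  # {'linf': 0.05, 'l2': ~0.0707, 'l0': 2}
```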

Adversarial example threat models are defined by:

  • Attacker Goal: Targeted ($f(x') = t$) vs. untargeted ($f(x') \neq y$).
  • Knowledge: White-box (access to parameters/gradients), black-box (only outputs or labels), gray-box (partial).
  • Perturbation Constraint: Distance budget and domain restrictions (Serban et al., 2020, Fenaux et al., 22 Feb 2024).

2. Theoretical Explanations and Universal Scaling

Initial hypotheses attributed adversarial vulnerability to high input dimensionality or network linearity. Goodfellow et al. argued that “linear” behavior in high dimensions suffices to produce the effect: for models with $f(x) = w^\top x$, the perturbation $\delta = \epsilon \cdot \mathrm{sign}(w)$ can cause large output shifts even for tiny $\epsilon$ due to the scaling of $\|w\|_1$ (Goodfellow et al., 2014). Empirical evidence confirms that both shallow and deep networks are vulnerable.
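
A minimal numerical check of this linearity argument (dimensions, random seed, and $\epsilon$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # per-component perturbation budget

for d in (10, 1_000, 100_000):
    w = rng.normal(size=d)        # weights of a linear model f(x) = w @ x
    delta = eps * np.sign(w)      # worst-case l_inf-bounded perturbation
    shift = w @ delta             # equals eps * ||w||_1
    print(f"d={d:>7}  output shift = {shift:8.2f}  (eps * ||w||_1 = {eps * np.abs(w).sum():8.2f})")

# The output shift grows roughly linearly with dimension even though every
# input component changes by at most eps = 0.01.
```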

Recent work (Cubuk et al., 2017) finds that adversarial error at small perturbation sizes $\epsilon$ scales as a universal power law, $P[\text{error} \mid \epsilon] \approx C\epsilon^{\alpha}$, with exponent $\alpha \approx 1$ across architectures, datasets, and attack methods. This scaling holds even for linear models, networks trained on random labels, and random data, implying that adversarial vulnerability is a consequence of the finite density of near-ties in the top logits (prediction uncertainty), rather than dimension or model class alone.
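
To make the scaling claim concrete, the exponent $\alpha$ can be estimated from measured attack success rates by a log–log fit; the data below is synthetic and only illustrates the fitting procedure, not any reported result.

```python
import numpy as np

# Synthetic (epsilon, adversarial error rate) pairs -- placeholders, not real measurements.
eps = np.array([0.001, 0.002, 0.004, 0.008, 0.016])
err = np.array([0.011, 0.021, 0.043, 0.082, 0.168])

# Fit log(err) = alpha * log(eps) + log(C)
alpha, log_c = np.polyfit(np.log(eps), np.log(err), deg=1)
print(f"estimated alpha = {alpha:.2f}, C = {np.exp(log_c):.2f}")
# An alpha close to 1 would match the universal scaling reported by Cubuk et al. (2017).
```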

Reducing the entropy of model predictions (i.e., widening the gap between the top logits) can improve adversarial robustness: output-entropy regularization pushes apart the largest and second-largest logits, resulting in lower adversarial error rates with only a minor degradation in clean accuracy (Cubuk et al., 2017).
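
A minimal PyTorch sketch of such an output-entropy penalty added to the standard training loss; the weight `beta` and the training-loop interface are assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits: torch.Tensor, targets: torch.Tensor, beta: float = 0.1):
    """Cross-entropy plus a penalty on the entropy of the predicted distribution.

    Lower output entropy corresponds to a larger gap between the top logits,
    the effect associated with improved robustness in Cubuk et al. (2017).
    """
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return ce + beta * entropy

# Usage inside a training step (model, images, labels are placeholders):
#   logits = model(images)
#   loss = entropy_regularized_loss(logits, labels, beta=0.1)
#   loss.backward()
```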

3. Construction Algorithms and Transferability

Most adversarial example algorithms can be classified by their optimization strategy and required knowledge (Serban et al., 2020, Wiyatno et al., 2019, Goodfellow et al., 2014):

  • Fast Gradient Sign Method (FGSM): $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(f(x), y))$ (Goodfellow et al., 2014). Single-step, efficient, white-box.
  • Iterative FGSM (BIM, PGD): Repeated FGSM steps, projecting onto the $\ell_\infty$- or $\ell_2$-ball after each step; among the strongest practical white-box attacks (Kurakin et al., 2016, Serban et al., 2020). A minimal sketch of FGSM and PGD follows this list.
  • Carlini-Wagner (CW) Attack: Unconstrained optimization minimizing norm plus a penalized objective for targeted misclassification; produces minimal-norm attacks (Serban et al., 2020, Wiyatno et al., 2019).
  • JSMA, DeepFool, UAP, One-Pixel: Specialize for sparsity, minimal perturbation, or universality (Serban et al., 2020).
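
A minimal PyTorch sketch of FGSM and $\ell_\infty$-PGD, assuming a differentiable classifier `model`, cross-entropy loss, and inputs scaled to [0, 1]; step size, iteration count, and the random start are illustrative choices.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: x' = x + eps * sign(grad_x L(f(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd_linf(model, x, y, eps, alpha=None, steps=10):
    """Iterative FGSM with projection back onto the l_inf ball of radius eps."""
    alpha = alpha if alpha is not None else eps / 4
    x_orig = x.clone().detach()
    x_adv = x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)  # random start
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # project onto the l_inf ball
        x_adv = x_adv.clamp(0, 1)                           # keep valid pixel range
    return x_adv.detach()
```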

Transferability describes the phenomenon where adversarial examples crafted for one model also fool other models—even with different architectures or training sets. Transferability is strongest for non-targeted, single-step attacks (e.g., FGSM) and for models trained on similar data. This forms the basis for “black-box” attacks using surrogate models (Kurakin et al., 2016, Serban et al., 2020).

4. Extensions: Semantics-Aware and Physical Adversarial Examples

Several works have extended adversarial attacks beyond small-norm noise (Hosseini et al., 2018, Zhang et al., 2023):

  • Semantic Adversarial Examples: Instead of small perturbations, transformations preserve human-recognizable object identity (shape, structure). For example, shifting image hue and saturation in HSV space produces images that humans perceive as unchanged, yet cause models to err catastrophically. On CIFAR-10 with VGG16, such color-shifted adversaries reduce accuracy from 93.5% (clean) to 5.7% (Hosseini et al., 2018). A hue-shift sketch follows this list.
  • Semantics-Aware Product-of-Experts Methods: Recent probabilistic schemes construct adversarial distributions as a product of a “victim” expert (driving misclassification) and a “distance” expert (enforcing semantic similarity), using energy-based or diffusion models to encode semantic preservation. These attacks produce pixel-wise modifications that are much larger than pixel-norm balls allow yet remain difficult for humans to detect, yielding high attack success rates even against defended networks (Zhang et al., 2023).
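
An illustrative sketch of the hue/saturation shift described above, using matplotlib's RGB–HSV conversion; the random search, shift ranges, and `predict_fn` interface are assumptions for illustration rather than the exact procedure of Hosseini et al.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def hue_sat_shift(image_rgb: np.ndarray, hue_shift: float, sat_scale: float) -> np.ndarray:
    """Shift hue (mod 1) and rescale saturation of an RGB image with values in [0, 1]."""
    hsv = rgb_to_hsv(image_rgb)
    hsv[..., 0] = (hsv[..., 0] + hue_shift) % 1.0
    hsv[..., 1] = np.clip(hsv[..., 1] * sat_scale, 0.0, 1.0)
    return hsv_to_rgb(hsv)

def semantic_attack(predict_fn, image_rgb, true_label, trials=50, seed=0):
    """Randomly search hue/saturation shifts until the classifier's label changes."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        shifted = hue_sat_shift(image_rgb, rng.uniform(0, 1), rng.uniform(0.5, 1.5))
        if predict_fn(shifted) != true_label:
            return shifted  # unchanged object identity to a human, misclassified by the model
    return None  # no fooling shift found within the trial budget
```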

Adversarial examples have been realized in the physical world, e.g., printed images remain adversarial after photography and image transformations, and manufactured objects (adversarial stop-signs, face masks) can physically fool detectors (Kurakin et al., 2016, Lu et al., 2017). These attacks require careful construction (multi-frame/projection optimization, Expectation-over-Transformations) to survive viewpoint and environmental variation.
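
A hedged sketch of the Expectation-over-Transformations idea: average the loss gradient over randomly sampled (differentiable) transformations so that the perturbation survives viewpoint and imaging changes. The transformation sampler and all hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def eot_perturbation(model, x, y, sample_transform, eps=0.03, alpha=0.005,
                     steps=40, samples=8):
    """PGD-style attack whose gradient is averaged over random transformations."""
    x_orig = x.clone().detach()
    delta = torch.zeros_like(x_orig, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for _ in range(samples):
            t = sample_transform()  # e.g. a random differentiable rotation/brightness change
            loss = loss + F.cross_entropy(model(t(x_orig + delta)), y)
        grad, = torch.autograd.grad(loss / samples, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)  # stay inside the l_inf budget
    return (x_orig + delta).detach()
```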

5. Defenses: Adversarial Training, Robust Architectures, and Detection

The primary empirical defense is adversarial training: augmenting the training objective with adversarially perturbed inputs via min–max optimization:

$$\min_\theta \; \mathbb{E}_{(x,y) \sim D} \left[ \max_{\|\delta\|_p \leq \epsilon} L(f_\theta(x+\delta), y) \right]$$

This approach, especially with multi-step PGD-generated examples, significantly increases robustness to $\ell_\infty$-constrained attacks at the cost of additional training time and a slight drop in clean accuracy (Kurakin et al., 2016, Serban et al., 2020). Batch normalization, sufficient model capacity, and mini-batches that mix clean and adversarial examples are essential for adversarial robustness at scale.
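
A hedged sketch of this min–max training loop, with the inner maximization approximated by an attack such as the `pgd_linf` sketch above, passed in as `attack_fn`; the optimizer, the clean/adversarial batch mixing, and the budget `eps` are illustrative choices.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, attack_fn, eps=8/255):
    """One epoch of adversarial training on mixed clean/adversarial mini-batches."""
    model.train()
    for x, y in loader:
        x_adv = attack_fn(model, x, y, eps)              # inner max: approximate worst-case delta
        x_mixed = torch.cat([x, x_adv])                  # mixed mini-batch of clean + adversarial
        y_mixed = torch.cat([y, y])
        loss = F.cross_entropy(model(x_mixed), y_mixed)  # outer min over model parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```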

Additional mechanisms include:

  • Input transformations: JPEG compression, total-variation minimization, median filtering, and feature squeezing can remove some adversarial noise, but are generally bypassed by adaptive attacks (Serban et al., 2020, Wiyatno et al., 2019). A feature-squeezing sketch follows this list.
  • Certified defenses: Methods provide guarantees for bounded perturbations using convex relaxations, interval bound propagation, and randomized smoothing, though they presently scale poorly to large models (Serban et al., 2020).
  • Entropy regularization and architecture search: Penalizing output entropy (prediction uncertainty) and optimizing architectures for adversarial robustness via neural architecture search provide further improvements (Cubuk et al., 2017).
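
As an illustration of the input-transformation family, the sketch below implements a simple feature-squeezing-style detector: compare the model's predicted distribution on the original input with that on a bit-depth-reduced copy and flag large disagreement. The squeezer, the L1 distance, and the threshold are assumptions, not a prescribed configuration.

```python
import numpy as np

def reduce_bit_depth(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Squeeze inputs in [0, 1] down to a coarser bit depth."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def feature_squeeze_detect(predict_proba, x, threshold=0.3, bits=4) -> bool:
    """Flag x as adversarial if predictions on the original and squeezed input disagree.

    predict_proba: function mapping an input array to a class-probability vector.
    threshold: maximum tolerated L1 distance between the two probability vectors.
    """
    p_orig = predict_proba(x)
    p_squeezed = predict_proba(reduce_bit_depth(x, bits))
    return float(np.abs(p_orig - p_squeezed).sum()) > threshold
```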

Novel “adversarial example games” cast adversarial example generation as a min–max game over hypothesis classes, producing transferable attacks with principled worst-case guarantees (Bose et al., 2020).

6. Systematization of Adversary Knowledge and Threat Models

A rigorous taxonomy of adversarial threat models is grounded in adversary knowledge, formalized via information extraction oracles and analyzed through order theory (Fenaux et al., 22 Feb 2024). The Adversarial Example Game provides a cryptographic-style protocol to standardize comparison of attack scenarios:

  • White-box: Full parameter/gradient access.
  • Score-only black-box: Only prediction probabilities.
  • Label-only black-box: Only predicted classes.
  • Transfer/no-box: Surrogate data or training knowledge, but no direct access to target model parameters or predictions.
  • Hybrid: Intermediate settings that combine various forms of partial knowledge.

An order-theoretic partial order ($\preceq$) among oracles formalizes the intuition that more knowledge yields more attack power. An empirical survey confirms that transfer-only (no-box) attacks with sufficient data or training information are nearly as effective as classic black-box attacks, especially on undefended models. Standardized threat-model specification remains an open need to ensure comparability across research works.
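
A toy illustration of such a knowledge partial order, modeling each oracle as the set of information it exposes; the oracle contents are schematic, not the formal information-extraction oracles of Fenaux et al.

```python
# An adversary A is at most as powerful as B (A ⪯ B) when B's oracle exposes
# everything A's does; settings whose oracles are not nested are incomparable.
ORACLES = {
    "no-box":     frozenset({"training_data"}),
    "label-only": frozenset({"predicted_label"}),
    "score-only": frozenset({"predicted_label", "class_scores"}),
    "white-box":  frozenset({"training_data", "predicted_label", "class_scores",
                             "parameters", "gradients"}),
}

def weaker_or_equal(a: str, b: str) -> bool:
    """a ⪯ b iff every piece of knowledge available to a is also available to b."""
    return ORACLES[a] <= ORACLES[b]

print(weaker_or_equal("label-only", "score-only"))  # True: strictly less knowledge
print(weaker_or_equal("no-box", "white-box"))       # True
print(weaker_or_equal("no-box", "label-only"),      # False and
      weaker_or_equal("label-only", "no-box"))      # False: incomparable settings
```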

7. Open Challenges and Future Directions

Key unresolved issues include:

  • Metric uncertainty: Most robust classifiers assume a fixed metric; uncertainty in the adversary’s perturbation metric can render compact robust classification impossible, even for small hypothesis classes. Results demonstrate cryptographic lower bounds for robust classification under metric uncertainty (Döttling et al., 2020).
  • Features and universality: Re-examinations of the “non-robust features are features” hypothesis find that non-robust features do not generalize across learning paradigms (e.g., self-supervised, generative models), and even robust features fail to confer universal robustness (Li et al., 2023).
  • Physical and semantic attacks: Realizing robust defenses against semantic-preserving, physically-realizable, and unrestricted perturbations (far beyond p\ell_p-balls) remains largely unsolved (Hosseini et al., 2018, Zhang et al., 2023, Lu et al., 2017).
  • Evaluating and benchmarking: Lack of standardized benchmarks, public threat-model codebooks, and robust evaluation protocols impedes fair assessment (Fenaux et al., 22 Feb 2024, Serban et al., 2020, Wiyatno et al., 2019).

Continued progress will require both refined theoretical tools (multi-metric robustness, order-theoretic formalizations) and systematized empirical practices. Cross-paradigm training, multi-objective robustification, and certified defenses against unrestricted and physically realizable adversarial examples remain at the frontier of research.
