
Adversarially Trained Models Overview

Updated 27 December 2025
  • Adversarially trained models are neural networks optimized via a min–max framework that incorporates perturbed inputs to enhance robustness against norm-bounded attacks.
  • They rely on attack-driven inner maximization, most commonly iterative PGD, along with cheaper variants (e.g., free and FGSM-based training), trading some clean accuracy for improved resilience in image, audio, and language tasks.
  • These models not only improve empirical robustness but also influence feature representation and attack transferability, raising important ecosystem-level security considerations.

Adversarially trained models are neural networks or other machine learning systems trained under a robust optimization protocol that incorporates adversarial examples—inputs adversarially perturbed to maximize the network's loss—directly into the training process. The adversarial training paradigm formalizes this as a min–max optimization, aiming to learn model parameters that minimize the worst-case loss under norm-bounded input perturbations. This approach substantially increases empirical robustness to many classes of adversarial attacks and has wide-ranging implications for transfer learning, feature representations, attack transferability, and theoretical understanding of model robustness.

1. Adversarial Training: Objective and Implementation

A canonical adversarial training objective is formulated as a saddle-point (min–max) problem:

$$\min_{\theta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\;\max_{\|\delta\|_p\le\epsilon}\;\ell\big(h_\theta(x+\delta),y\big)\Big]$$

where $\theta$ denotes the model parameters, $h_\theta$ is the model output (logits or softmax), $\ell$ is the loss (e.g., cross-entropy), $\delta$ is constrained to an $\ell_p$-ball of radius $\epsilon$, and the inner maximization is typically approximated by projected gradient descent (PGD-$k$) (Utrera et al., 2020, Awad et al., 2 Dec 2025).

PGD attacks iteratively update $\delta$:

$$\delta \leftarrow \operatorname{Proj}_{\|\delta\|_p\le\epsilon}\Big\{\delta + \alpha\,\operatorname{sign}\big(\nabla_{\delta}\,\ell(h_\theta(x+\delta),y)\big)\Big\}$$

with step size $\alpha$, using the gradient sign in the $\ell_\infty$ setting or a normalized gradient step in $\ell_2$ settings.
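A minimal PyTorch sketch of this inner/outer loop under an $\ell_\infty$ threat model is given below. The function names, hyperparameter values, and the omission of input-range clipping and data normalization are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: PGD-k under an l_inf constraint of radius eps."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascent on the gradient sign, then projection back onto the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

def adversarial_training_epoch(model, loader, optimizer, eps=8/255):
    """Outer minimization: one epoch of training on worst-case perturbed inputs."""
    model.train()
    for x, y in loader:
        delta = pgd_attack(model, x, y, eps=eps)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        optimizer.step()
```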

Variants include:

  • Free Adversarial Training: Interleaves multiple PGD-like updates per minibatch, amortizing computation over replays (Awad et al., 2 Dec 2025); a sketch of this variant follows the list.
  • Distributional Training: Maximizes loss over a distribution of perturbations $p(\delta)$, typically regularized by entropy to avoid collapse (Dong et al., 2020).
  • FGSM-based (one-step) Training: Used in audio and other domains for computational efficiency (Sallo et al., 2020).
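The following is a hedged sketch of the "free" variant under the same $\ell_\infty$ assumptions as the PGD sketch above. It is a simplification rather than the cited implementation: for instance, the perturbation is reset for every minibatch instead of being carried across batches, and names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def free_adversarial_training_epoch(model, loader, optimizer, eps=8/255, replays=4):
    """'Free' adversarial training: each minibatch is replayed several times, and the
    single backward pass per replay is reused both to update the model parameters and
    to take an FGSM-like ascent step on the persistent perturbation delta."""
    model.train()
    for x, y in loader:
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(replays):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x + delta), y)
            loss.backward()              # gradients w.r.t. both the parameters and delta
            optimizer.step()             # descent step on the model parameters
            with torch.no_grad():
                delta += eps * delta.grad.sign()   # ascent step on the perturbation
                delta.clamp_(-eps, eps)            # projection onto the eps-ball
            delta.grad.zero_()
```

Because the gradient with respect to the perturbation is a by-product of the parameter update's backward pass, the extra cost over standard training is roughly that of the replays alone.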

Adversarially trained models are built on diverse architectures, most commonly ResNet-50 on ImageNet-scale datasets, but the approach extends to audio CNNs, language models (with adversarial noise applied in the output embedding space), vision transformers, and even system surrogates such as LSTM-based reduced-order models (ROMs) of physical systems.

2. Empirical Properties: Robustness and Transfer

Adversarially trained models reliably improve robustness to norm-constrained attacks, most dramatically for the specific norm and magnitude used during training:

  • Image Classification: A robust (PGD-trained, $\ell_2$, $\epsilon=3$) ResNet-50 source model outperforms naturally trained models when fine-tuned on small-to-moderate data in diverse target domains. For $N=100$ samples: CIFAR-10 adversarial 51.9% vs. natural 48.5%; MNIST adversarial 83.8% vs. natural 68.7%; the gap remains at larger $N$ but diminishes (Utrera et al., 2020).
  • Learning Speed: Fine-tuned robust source models exceed natural model accuracy within the first 11 epochs in data-scarce settings, consistently preserving a positive margin.
  • Audio and Language Models: Adversarial training on audio spectrograms (with FGSM) doubles the perturbation amplitude required to reach a given fooling rate, while incurring a 10–25 percentage-point loss in clean accuracy (Sallo et al., 2020). In RNN language models, adversarial noise in the output embedding layer enforces a margin between output vectors, yielding both improved perplexity and robustness (Wang et al., 2019).
  • Linear Regression: $\ell_\infty$-adversarial training provably yields sparse solutions (akin to Lasso) and recovers the minimum-$\ell_1$-norm interpolator in the overparameterized regime for any sufficiently small $\epsilon$, sharply transitioning away from this regime as $\epsilon$ increases (Ribeiro et al., 2022, Ribeiro et al., 2023).

3. Feature Representation: Shape Bias, Semantics, and Diversity

Adversarial training biases models toward extracting feature representations that preferentially encode global, human-aligned semantic cues (shape, objectness) and actively suppress textural or spurious statistical patterns:

  • Shape vs Texture Bias: On Stylized ImageNet (SIN), robust models (PGD-trained) generalize better than natural models without fine-tuning (robust 20.1% vs natural 11.4%), and maintain a large performance gap after fine-tuning (robust 64.2% vs natural 36.1%) (Utrera et al., 2020).
  • Fourier or Low-Resolution Cues: Robust models outperform naturally trained models on downsampled or low-pass filtered images (Caltech101), indicating reliance on global shape features.
  • Influence Functions: Fine-tuned robust models yield a higher label-match rate for the top-1 most influential training point (78.6% vs 55.1% for natural); the top-5 majority label-match is 77.3% vs 53.8%, revealing a semantic clustering effect (Utrera et al., 2020).
  • Redundancy and Activation: Robust CNNs exhibit a proliferation of "always-on" activations and highly redundant, correlated feature maps, possibly as an implicit error-correcting mechanism. However, this induces simplification, linear activation regions, and some capacity trade-offs (Carletti et al., 2022).
  • Hidden Non-Robust Features: While adversarial training aligns the classifier with robust features, the underlying representation retains non-robust (but highly predictive) directions, which can be immediately accessed by retraining only the final layer. This reveals that adversarial training suppresses but does not excise non-robust features (Carletti et al., 2022).

4. Transferability and Ecosystem-Level Risks

Adversarially trained models exhibit paradoxical behavior regarding the transferability of adversarial examples:

  • Transfer Attacks: Adversarial examples crafted on robust (AT) models transfer more effectively to other AT models than those crafted on naturally trained (ST) models. For ImageNet models, AT→AT transfer reduces target accuracy to 13.2%, twice as strong as ST→ST (32.9%) (Awad et al., 2 Dec 2025); an evaluation sketch follows the list.
  • Cross-Architecture Transfer: This phenomenon is observed across CNNs and Vision Transformers, and even cross-family (CNN→ViT, ViT→CNN); adversarially trained source models dominate transfer performance.
  • Feature Alignment as Root Cause: Attribution analyses show that robust models are aligned on semantically meaningful feature gradients (object parts, shapes) that become universal attack vectors, raising ecosystem-level risk as these models become widespread.
  • Black-box Extraction: Adversarially trained models tend to leak more actionable information in their output distributions, facilitating more accurate and robust surrogate extraction via black-box queries (up to 1.2× higher accuracy/agreement, 0.75× the queries required vs. natural models) (Khaled et al., 2022).
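A minimal sketch of how such transfer potency is typically measured is shown below, reusing the hypothetical pgd_attack helper from the Section 1 sketch. The protocol (craft white-box examples on a source model, then evaluate an independent target model on them) follows the description above, while function names and the perturbation budget are illustrative assumptions.

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    """Top-1 accuracy of a classifier on a batch."""
    return (model(x).argmax(dim=1) == y).float().mean().item()

def transfer_attack_eval(source_model, target_model, x, y, eps=8/255):
    """Craft l_inf PGD examples on the white-box source model, then measure how far
    they reduce the accuracy of an unseen target model (transfer potency)."""
    delta = pgd_attack(source_model, x, y, eps=eps)   # helper from the Section 1 sketch
    x_adv = x + delta
    return {
        "clean_target_acc": accuracy(target_model, x, y),
        "transfer_target_acc": accuracy(target_model, x_adv, y),
    }
```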

A recommended best practice is to evaluate both defense robustness and attack potency of any proposed robust model, with metrics quantifying both target resistance and surrogate effectiveness (Awad et al., 2 Dec 2025).

5. Theoretical Foundations and Generalizations

Recent theoretical analysis confirms the empirical intuition that adversarial training alters the implicit bias of model optimization, privileging robust, sparse, or minimum-norm features:

  • Feature Competition: In structured data models containing both robust (sparse, invariant) and non-robust (dense, perturbation-vulnerable) features, standard empirical risk minimization overfits to dense, non-robust directions; adversarial training provably suppresses these and prioritizes the sparse, stable feature set (Li et al., 11 Oct 2024).
  • Linear Models: Adversarial training in overparameterized linear regression (e.g., $d>n$) recovers the minimum-norm interpolator for small enough $\epsilon$, and can be matched with Lasso or ridge in the underparameterized regime, with $\ell_\infty$ adversarial training corresponding to a sparsity-inducing penalty (Ribeiro et al., 2022, Ribeiro et al., 2023); a worked form of the $\ell_\infty$ case is given after this list.
  • Distributional Variants: Generalizations include maximizing loss over a distribution of perturbations rather than a single worst-case example, with entropy regularization to encourage exploration of the adversarial polytope (Dong et al., 2020).
  • Calibrated Robust Error: Efforts to mitigate semantic drift induced by adversarial training leverage adaptive pixel-level masks and calibration with oracle decision boundaries, resulting in superior trade-offs between clean and robust accuracy (Huang et al., 2021).
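To make the $\ell_\infty$ linear-regression connection above concrete, consider a single example $(x,y)$, a linear predictor $h_\beta(x)=x^\top\beta$, and squared loss; this worked form is a standard derivation sketch and is not stated in the cited papers' exact notation. The inner maximization can be solved in closed form:

$$\max_{\|\delta\|_\infty\le\epsilon}\big(y-(x+\delta)^\top\beta\big)^2=\big(|y-x^\top\beta|+\epsilon\,\|\beta\|_1\big)^2,$$

attained at $\delta=-\epsilon\,\operatorname{sign}(\beta)\,\operatorname{sign}(y-x^\top\beta)$. Hence $\ell_\infty$-adversarial training of a linear model minimizes an objective containing an explicit $\epsilon\,\|\beta\|_1$ term, which is the source of the Lasso-like sparsity.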

6. Application Domains and Extensions

Adversarial training and its variants are widely deployed across multiple domains:

  • Image Recognition: The principal test bed for adversarial robustness; adversarially trained models on large-scale datasets (ImageNet, CIFAR-10/100) consistently define the empirical state-of-the-art.
  • Audio Classification: FGSM-based adversarial augmentation of spectrograms provides meaningful gains in adversarial resilience while highlighting the cost in nominal accuracy (Sallo et al., 2020).
  • Language Modeling & Translation: Output-embedding adversarial training regularizes RNNs and Transformers, increasing embedding diversity and improving both perplexity and sequence-level robustness (Wang et al., 2019).
  • Model-Agnostic and Modular Defenses: Adversarially trained autoencoder augmentations (AAA) provide transferable protection to fixed, black-box downstream classifiers and improve robustness to natural corruptions (Vaishnavi et al., 2019); a structural sketch of this modular pattern follows the list.
  • Physical Models & Surrogate Dynamics: LSTM-based reduced-order models for fluid simulation, adversarially trained against a critic, yield greater forecast fidelity on air pollution dynamics (Quilodrán-Casas et al., 2021).
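The sketch below illustrates the modular pattern described in the AAA bullet above: a purifying autoencoder is placed in front of a frozen, black-box classifier, and only the autoencoder is (adversarially) trained. The architecture and names are generic assumptions for illustration, not the cited authors' implementation.

```python
import torch
import torch.nn as nn

class PurifyingAutoencoder(nn.Module):
    """Illustrative convolutional autoencoder intended to be adversarially trained
    and prepended to a fixed classifier (generic sketch, not the AAA architecture)."""
    def __init__(self, channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def defended_forward(autoencoder, frozen_classifier, x):
    """Purify the (possibly adversarial) input, then query the unmodified classifier."""
    with torch.no_grad():
        return frozen_classifier(autoencoder(x))
```

Because the downstream classifier is never modified, the same trained autoencoder can in principle be reused in front of other classifiers, which is what makes this style of defense modular and transferable.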

7. Limitations and Open Research Questions

Despite significant progress, several limitations and new vulnerabilities have emerged:

  • Clean Accuracy Trade-off: Robust models typically incur a drop in clean accuracy proportional to the attack strength used during training (Sallo et al., 2020, Carletti et al., 2022).
  • Overfitting to Training Norms: Adversarial robustness achieved in one threat model may not generalize to unseen attack classes or larger perturbation budgets (Dong et al., 2020, Nemcovsky et al., 2019).
  • Induced Failure Modes: Robust models may develop brittle reliance on summary statistics (e.g., color mean, position), leading to catastrophic errors on seemingly innocuous transformations (Carletti et al., 2022).
  • Latent Layer Vulnerabilities: Adversarial defenses typically target the input space; internal feature layers may remain susceptible to adversarial perturbations unless specifically hardened (e.g., via Latent Adversarial Training) (Singh et al., 2019); a sketch of latent-space hardening follows the list.
  • Certification and Verification: Provably certifying adversarial robustness remains challenging, but nonconvex low-rank SDP relaxations substantially tighten empirical certification gaps (Chiu et al., 2022).
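A hedged sketch of latent-space hardening, assuming a model split into a feature extractor $f$ and a classification head $g$, is given below. It perturbs the intermediate code $z=f(x)$ within an $\ell_\infty$ ball and trains the head on the worst-case latent; the split, budget, and training scope are illustrative simplifications rather than the cited method.

```python
import torch
import torch.nn.functional as F

def latent_adversarial_loss(feature_extractor, head, x, y, eps=0.1, alpha=0.02, steps=5):
    """Perturb the intermediate features z = f(x) within an l_inf ball of radius eps
    and return the loss of the head on the worst-case latent code found by PGD."""
    z = feature_extractor(x).detach()
    delta = torch.zeros_like(z, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(head(z + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascent step in latent space
            delta.clamp_(-eps, eps)        # projection onto the latent eps-ball
    # Backpropagating this loss trains the head against latent perturbations;
    # the feature extractor is held fixed in this simplified sketch.
    return F.cross_entropy(head(z + delta.detach()), y)
```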

Future directions highlighted across multiple studies include scaling distributional adversarial training, diversifying robust features across architectures to mitigate ecosystem risk, developing efficient robust verification for deep and large networks, and bridging the gap between formal certified defenses and empirically robust training.

