Adversarial Fine-Tuning for Robust ML Models

Updated 18 March 2026
  • Adversarial fine-tuning is a technique that updates pre-trained models using adversarial examples to increase robustness against perturbations.
  • It offers compute savings by requiring only a few fine-tuning epochs compared to full adversarial training from scratch.
  • The method is applicable across vision, NLP, and reinforcement learning, preserving generalization while mitigating overfitting.

Adversarial fine-tuning is a class of procedures that seek to enhance the robustness of machine learning models to adversarial perturbations by updating model parameters on adversarially crafted variants of training data, typically after an initial pre-training stage. Originating from adversarial training in supervised contexts, the fine-tuning paradigm expands to various settings including self-supervised learning, transfer from robust pre-trained models, multimodal architectures, compressed/deployed models, and reinforcement learning. The primary goals are to increase final robust accuracy, prevent catastrophic forgetting of useful pre-training information, and/or reduce the computational burden relative to adversarial training from scratch.

1. Core Objectives and Mathematical Formulation

Adversarial fine-tuning generally implements a min–max robust optimization procedure over model parameters $\theta$:

$$\min_{\theta}\;\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\Big[\,\max_{\|\delta\|\leq\epsilon}\;\ell\big(f_\theta(x+\delta),y\big)\,\Big]$$

where $\ell$ is the loss function (e.g., cross-entropy), $\delta$ is a perturbation constrained by $\|\delta\|\leq\epsilon$ (typically in the $\ell_\infty$ or $\ell_2$ norm), and the inner maximization is approximated via PGD or FGSM.
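
The inner maximization above is what PGD approximates in practice. The following minimal NumPy sketch implements $\ell_\infty$-constrained PGD against a logistic-regression model; the model choice and all parameter values are illustrative assumptions, not from any cited paper:

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """Approximate max_{||delta||_inf <= eps} loss(f(x + delta), y)
    for a logistic-regression model f(x) = sigmoid(w.x + b)."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = np.dot(x + delta, w) + b
        p = 1.0 / (1.0 + np.exp(-z))           # model probability of class 1
        grad = (p - y) * w                     # d(cross-entropy)/d(input)
        delta = delta + alpha * np.sign(grad)  # ascent step in L-inf geometry
        delta = np.clip(delta, -eps, eps)      # project back onto the eps-ball
    return x + delta
```

Each outer minimization step then takes a gradient step on the loss evaluated at `pgd_linf(x, y, w, b)` rather than at the clean input `x`.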

Fine-tuning, as opposed to full adversarial training, employs a model pre-trained on either natural or adversarial data. Fine-tuning can involve only a few epochs or steps of adversarial updates, with or without updating all model parameters. This provides two key benefits:

  • Compute savings: Substantially less training time is required, especially when starting from robust pre-trained backbones (Chen et al., 2020, Jeddi et al., 2020).
  • Retention of generalization capacity: Properly designed schedules and architectures avoid overfitting to adversarial directions, which can degrade test-time clean accuracy (Jeddi et al., 2020).

Key variations exist in the structure of the fine-tuning objective, in the construction of adversarial examples (input or feature space; embedding- or token-level for NLP), and in the explicit regularizers used to retain pre-training knowledge (Chen et al., 2020, Ebrahimi et al., 2021, Dong et al., 2021).

2. Canonical Fine-Tuning Protocols and Computational Considerations

Standard fine-tuning proceeds from a pre-trained checkpoint $\theta_0$ and involves a short regime ($T\sim8$–$12$ epochs) of adversarial updates:

  • Adversarial examples are generated via multi-step PGD (e.g., $K=10$–$20$ steps, an $\epsilon$ schedule aligned with the application, step size $\alpha=\epsilon/K$).
  • Learning rate scheduling is pivotal. The "slow start, fast decay" policy introduces adversarial samples gently and quickly anneals the learning rate to prevent overfitting to adversarial neighborhoods (Jeddi et al., 2020):

$$\eta(t) = \begin{cases} \eta_{0} + (\eta_{1}-\eta_{0})\,\dfrac{t}{\alpha T}, & t \leq \alpha T \\[4pt] \eta_{1}\Big(1 - \dfrac{t-\alpha T}{(1-\alpha)T}\Big), & t > \alpha T \end{cases}$$

with $\eta_0$ typically an order of magnitude smaller than the pre-training learning rate and warm-up fraction $\alpha\sim0.3$ (Jeddi et al., 2020).
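
A direct implementation of this schedule (the function and argument names are illustrative, with $t$ an epoch index in $[0, T]$):

```python
def lr_schedule(t, T, eta0, eta1, alpha=0.3):
    """'Slow start, fast decay': linear warm-up from eta0 to eta1 over the
    first alpha*T epochs, then linear decay from eta1 down to zero at t = T."""
    if t <= alpha * T:
        return eta0 + (eta1 - eta0) * t / (alpha * T)
    return eta1 * (1.0 - (t - alpha * T) / ((1.0 - alpha) * T))
```

The learning rate thus peaks at $\eta_1$ exactly at the end of warm-up ($t = \alpha T$) and reaches zero at $t = T$.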

Compute efficiency: On CIFAR-10, standard PGD adversarial training requires $\sim$100 epochs, while adversarial fine-tuning from a pre-trained model achieves comparable or better robustness in $8$–$12$ epochs (a $\sim$10$\times$ speedup) (Jeddi et al., 2020). Similar multipliers are found for ImageNet and large-scale networks.

Effectiveness: Adversarial fine-tuning yields robust accuracy gains over baselines (e.g., +3.83% robust accuracy on CIFAR-10 vs. end-to-end AT) (Chen et al., 2020). Ensemble strategies across self-supervised tasks or final models can further boost robustness.

3. Architectures, Domain-Specific Extensions, and Hybrid Approaches

Adversarial fine-tuning has been adapted to a wide range of model and task scenarios:

Vision and VLMs:

  • Self-supervised representations: Integrating adversarial objectives into self-supervised pre-training yields robust backbones, which, upon fine-tuning, provide compute-efficient downstream robustness and higher final robust accuracy (Chen et al., 2020).
  • CLIP and vision–LLMs: Methods such as PMG-AFT (Wang et al., 2024), SAFT (Zhang et al., 12 Feb 2026), and Sim-CLIP (Hossain et al., 2024) introduce auxiliary branches, semantic-ensemble attack losses, or unsupervised Siamese similarity objectives to prevent overfitting and retain zero-shot generalization after adversarial fine-tuning. These methods report up to +5% robust accuracy over prior state-of-the-art and a reduction in clean accuracy loss (from ~13% to <5%).

NLP and Pre-trained LLMs:

  • Multi-step PGD is preferred over single-step FGSM for strong regularization (Ebrahimi et al., 2021), with a carefully tuned $\epsilon$ in embedding space.
  • To prevent catastrophic forgetting (loss of pre-trained linguistic syntax/structure), methods employ information-theoretic regularizers (RIFT (Dong et al., 2021)), mutual information constraints between student and teacher representations, or explicit parameter update masking based on Fisher information (RoAST (Kim et al., 2023)).
  • Domain-adversarial fine-tuning (AFTER (Vernikos et al., 2020)) uses a gradient-reversal domain classifier on pooled representations to force invariance between downstream and out-of-domain representations, regularizing the model to retain pre-training features.
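
The gradient-reversal trick at the heart of such domain-adversarial setups can be sketched framework-independently: the layer is the identity on the forward pass and multiplies incoming gradients by $-\lambda$ on the backward pass, so the encoder is trained to fool the domain classifier. This minimal class is an illustrative sketch, not code from the AFTER paper:

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the way
    back, pushing upstream features toward domain invariance."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                           # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output     # reversed gradient for the encoder
```

In an autograd framework this would be registered as a custom backward rule; the encoder then receives the negated domain-classification gradient while the classifier itself trains normally.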

Reinforcement Learning and Control:

  • Adversarial fine-tuning bridges offline pre-training and online robustness by injecting worst-case action-space perturbations during the online phase, allowing the agent to acquire compensatory behaviors. Adaptive curricula on perturbation probability balance robustness and nominal performance, with robust policies achieved in $\sim$200–300K online steps vs. $\sim$1M from scratch (Ayabe et al., 15 Oct 2025).
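
One way such an adaptive curriculum could look in outline (the update rule, thresholds, and names below are illustrative assumptions, not the exact scheme of Ayabe et al.): the probability of injecting a worst-case action perturbation rises only while the agent's nominal performance stays acceptable, and backs off otherwise.

```python
def update_perturb_prob(p, episode_return, target_return,
                        step=0.05, p_max=0.5):
    """Adaptive curriculum on the action-perturbation probability p."""
    if episode_return >= target_return:
        return min(p + step, p_max)   # agent is coping: harden the attack
    return max(p - step, 0.0)         # performance dropped: ease off
```

Clamping at `p_max` keeps some fraction of unperturbed experience, which helps preserve nominal (non-adversarial) performance.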

Speech and Time-Series:

  • In ASR systems, adversarial fine-tuning applied jointly to front-end denoisers and ASR backbones (or with ASR frozen) achieves significant WER reductions under both FGSM and PGD attacks. Freezing the ASR during denoiser adversarial fine-tuning yields the best robustness–performance trade-off under the strongest attacks (Joshi et al., 2022).

4. Theoretical Insights and Empirical Findings

Benefits of Adversarial Fine-Tuning:

  • Feature stability and flatter minima: Adversarial pre-training imparts local stability to features, causing fine-tuning to initialize in a flatter (locally robust) region of the loss landscape; thus, the expensive and sample-intensive exploration required by cold-start adversarial training is avoided (Chen et al., 2020).
  • Mitigation of overspecialization and representational collapse: For LLMs, adversarial fine-tuning with appropriate regularizers spreads attention, maintains hierarchical syntax and representational diversity, and counteracts low-rank, bag-of-words collapse induced by loss surfaces optimized for a single downstream task (Ebrahimi et al., 2021, Dong et al., 2021).
  • Preservation of zero-shot and OOD generalization: Auxiliary branches (e.g., PMG-AFT) or mutual information constraints ensure that the robust fine-tuned model’s features remain close to the pre-trained manifold, avoiding overfitting to adversarial bubbles and maintaining transferability (Wang et al., 2024, Zhou et al., 2024).

Limitations and Trade-Offs:

  • Pruning or quantization alone can destructively affect robustness unless combined with adversarial fine-tuning; a few epochs of such fine-tuning on compressed models restore nearly all robustness, enabling efficient deployment (Thorsteinsson et al., 2024).
  • In models with strong domain shift between pre-training and downstream data, traditional adversarial training or input-level defenses can fail or destroy clean accuracy (test-accuracy drops of >50%); more nuanced fine-tuning or regularization is required (Zhou et al., 2024).

5. Advanced Strategies: Ensembles, Semantics, and Game-Theoretic Formulations

Recent advances develop richer adversarial fine-tuning protocols exploiting model capacity, semantic knowledge, or game-theoretic principles:

  • Task-ensemble and diversity regularization: By adversarially pre-training across diverse self-supervised objectives and enforcing orthogonality in adversarial directions, downstream fine-tuning gains further robustness (+3.59% robust accuracy in task-ensembles, up to +7% in brute-force model ensembles) (Chen et al., 2020).
  • Semantic-ensemble adversarial fine-tuning: SAFT (Zhang et al., 12 Feb 2026) constructs adversarial examples against an ensemble of hallucination-filtered, LLM- or MLLM-generated descriptions, leading to universally adversarial perturbations and top performance across 16 zero-shot datasets. Prompt set diversity and hallucination filtering are critical.
  • Game-theoretic min–max equilibrium: MAT (Zhong et al., 2023) recasts adversarial fine-tuning as a mixed-strategy zero-sum game, solving for Nash equilibria via Entropy Mirror Descent. Approximating mixed distributions over model parameters and adversarial directions yields superior generalization compared to pure-strategy PGD-based methods.

6. Implementation Guidelines and Benchmarks

  • Number of fine-tuning epochs: Empirical studies recommend 8–12 adversarial fine-tuning epochs for large-scale CV/NLP models; longer runs tend to overfit to adversarial examples (Jeddi et al., 2020).
  • PGD settings: Use PGD with $K=10$–$20$ steps per batch, $\alpha=\epsilon/K$; $\epsilon$ matches the intended threat model (e.g., $8/255$ for CIFAR-10).
  • Learning rate scheduling: Employ a small-to-peak-to-zero schedule; do not use standard step-decay plateaus in fine-tuning (Jeddi et al., 2020).
  • Model-specific tips: For batch-normalized vision models, TWINS (Liu et al., 2023) locks in the pre-training BN statistics for a “frozen” parallel path, mixing these with adaptive path gradients to accelerate learning and curb overfitting (empirically +1–2% gains in robust accuracy vs. standard AT).
  • Compressed deployment: After aggressive compression, perform $\sim$3 epochs of adversarial fine-tuning for optimal robustness/efficiency balance (Thorsteinsson et al., 2024).
  • Feature space tuning: For enhancing adversarial transferability (e.g., targeted attacks), fine-tune in the feature space to encourage target-class features and suppress original-class features at an intermediate layer (Zeng et al., 2024).
  • Domain invariance and generalization: For NLP, use adversarial domain classifiers (AFTER (Vernikos et al., 2020)) or selective gradient updates (RoAST (Kim et al., 2023)) to prevent catastrophic task-specific drift.
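
Putting several of these guidelines together, a self-contained end-to-end sketch might look as follows; the logistic-regression model, the toy data, and every hyperparameter value are illustrative assumptions chosen so the loop runs in milliseconds, not a reference implementation of any cited method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x, y, w, b, eps, steps):
    # Inner maximization: K-step PGD in the L-inf ball, step size eps/K.
    alpha, delta = eps / steps, np.zeros_like(x)
    for _ in range(steps):
        grad = (sigmoid(np.dot(x + delta, w) + b) - y) * w  # d(loss)/d(input)
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return x + delta

def robust_loss(X, y, w, b, eps=0.1, steps=10):
    # Mean cross-entropy measured on PGD-perturbed inputs.
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(np.dot(pgd_linf(xi, yi, w, b, eps, steps), w) + b)
        total += -(yi * np.log(p + 1e-12) + (1 - yi) * np.log(1 - p + 1e-12))
    return total / len(X)

def adv_finetune(X, y, w, b, T=10, eps=0.1, steps=10,
                 eta0=0.01, eta1=0.1, warm=0.3):
    for t in range(T):
        # "Slow start, fast decay" learning-rate schedule (Section 2).
        if t <= warm * T:
            lr = eta0 + (eta1 - eta0) * t / (warm * T)
        else:
            lr = eta1 * (1.0 - (t - warm * T) / ((1.0 - warm) * T))
        for xi, yi in zip(X, y):
            x_adv = pgd_linf(xi, yi, w, b, eps, steps)   # attack, then learn
            g = sigmoid(np.dot(x_adv, w) + b) - yi       # d(loss)/d(logit)
            w, b = w - lr * g * x_adv, b - lr * g        # outer minimization
    return w, b
```

On a linearly separable toy set, the robust loss after a short fine-tuning run falls well below its value at initialization, mirroring (in miniature) the few-epoch regime described above.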

7. Impact, Open Challenges, and Future Directions

Adversarial fine-tuning has emerged as a critical tool for robustifying models beyond plain adversarial training or standard fine-tuning. Its impacts span:

  • Enabling the transfer of adversarial robustness to previously unseen tasks, domains, and model architectures;
  • Dramatically reducing computational requirements for robust model deployment;
  • Addressing real-world requirements such as robust compressed models, detection of problematic content in LLMs, and robustness in safety-critical systems (e.g., robotics, medical signal analysis).

Open challenges and active lines of research:

  • Universal and certified robustness: Extending generalizable fine-tuning protocols to stronger, diverse, or certified adversarial threat models;
  • Joint vision–language attacks and multimodal robustness: Adaptation to non-image modalities and joint model components;
  • Theoretical understanding of robustness-generalization trade-offs: Formal guarantees for mutual information, task-ensemble, or game-theoretic methods;
  • Integration with new architectures (e.g., transformers in vision) and emerging settings (e.g., foundation models, continual learning).

For comprehensive empirical and methodological benchmarks, refer to (Chen et al., 2020, Jeddi et al., 2020, Ebrahimi et al., 2021, Dong et al., 2021, Liu et al., 2023, Kim et al., 2023, Wang et al., 2024, Thorsteinsson et al., 2024, Hossain et al., 2024, Ayabe et al., 15 Oct 2025), and (Zhang et al., 12 Feb 2026).
