DiffPure: Diffusion-Based Purification
- DiffPure is a framework that leverages diffusion models for both adversarial purification and image restoration by combining a forward noising process with a learned reverse denoising step.
- The approach demonstrates robust defense against diverse attacks and achieves higher robust accuracy than traditional defenses on benchmarks such as CIFAR-10, ImageNet, and CelebA-HQ.
- DiffPure extends to image restoration tasks, using a decoupled variable-splitting framework to efficiently address inverse problems such as denoising, deblurring, and super-resolution.
DiffPure is a framework that leverages diffusion models for adversarial purification and image restoration. It purifies adversarial or corrupted images by combining a forward noising process with a learned reverse generative (denoising) process, reusing diffusion models originally developed for generative modeling. The approach is classifier-agnostic and attack-agnostic, which enables both robust defense against adversarial threats and state-of-the-art performance across a range of tasks. DiffPure has since been extended in several research directions, with technical, theoretical, and empirical contributions spanning adversarial robustness, multi-modal security, black-box transferability, and inverse problem solving.
1. Algorithmic Principles and Mathematical Formulation
At the core of DiffPure is the use of diffusion models, specifically Denoising Diffusion Probabilistic Models (DDPMs), for adversarial purification and image restoration. The technique consists of two sequential stochastic processes:
1. Forward Diffusion (Noising) Process:
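Given an input $x_0$ (potentially adversarial or corrupted), the forward process gradually adds Gaussian noise via a Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$
with variance schedule $\{\beta_t\}_{t=1}^{T}$. Composing the steps gives the closed form $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$ with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$.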
2. Reverse Generative (Denoising) Process:
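A neural network, typically a U-Net-based score or noise predictor $\epsilon_\theta(x_t, t)$, is trained to reverse the noising process:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big),$$
where $\mu_\theta$ is parameterized to predict either the original clean image or the added noise.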
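Purification relies on being able to jump straight to a noise level instead of running the chain step by step. The sketch below checks that sampling the Markov chain one step at a time and sampling the closed-form marginal $q(x_t \mid x_0)$ give the same mean and standard deviation. The linear schedule (1e-4 to 0.02 over T = 1000 steps), the constant input, and the choice t = 300 are assumptions made for illustration, not values taken from the DiffPure papers.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule beta_1..beta_T
alpha_bars = np.cumprod(1.0 - betas)    # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def forward_markov(x0, t, rng):
    """Sample x_t by applying q(x_s | x_{s-1}) for s = 1..t, one step at a time."""
    x = x0.copy()
    for s in range(t):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * noise
    return x

def forward_closed_form(x0, t, rng):
    """Sample x_t directly from q(x_t | x_0) = N(sqrt(ab) * x0, (1 - ab) I)."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = np.ones(50_000)     # many copies of one scalar "pixel" so moments are easy to check
t = 300
chain = forward_markov(x0, t, rng)
direct = forward_closed_form(x0, t, rng)

# Both empirical moments should be close to sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t).
print(chain.mean(), direct.mean(), np.sqrt(alpha_bars[t - 1]))
print(chain.std(), direct.std(), np.sqrt(1.0 - alpha_bars[t - 1]))
```

The two samplers differ only in cost. The chain needs t noise draws per sample, while the closed form needs one, which is why purification can start directly at a chosen timestep.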
Purification Pipeline:
To remove an adversarial perturbation $\delta$, the adversarial example $x_{\text{adv}} = x + \delta$ is first diffused forward to a small timestep $t^*$ using the closed-form forward marginal, and the reverse generative process is then run from $t^*$ back to $0$. This two-stage operation “washes out” structured adversarial noise and maps the sample back toward the clean image manifold (Nie et al., 2022, Ankile et al., 2023). Pseudocode for the process appears consistently across studies: closed-form forward noising followed by discretized reverse denoising with solvers such as Euler-Maruyama or DDIM.
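A minimal sketch of this pipeline in discrete DDPM form follows. It assumes a linear schedule and standard ancestral sampling with the noise-prediction mean $\mu_\theta = \big(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta\big)/\sqrt{1-\beta_t}$. `eps_model` is a hypothetical stand-in for the pretrained U-Net. Here it is the exact noise predictor for a toy data distribution concentrated on a single clean image `c`, which lets the example run without a trained network. Nie et al. (2022) formulate the same idea with the continuous VP-SDE and an SDE solver, so treat this as an illustration of the mechanics rather than their implementation.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # assumed linear DDPM schedule
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def purify(x_adv, eps_model, t_star, betas, alphas, alpha_bars, rng):
    """Diffuse x_adv to timestep t_star in closed form, then denoise back to t = 0."""
    ab = alpha_bars[t_star - 1]
    x = np.sqrt(ab) * x_adv + np.sqrt(1.0 - ab) * rng.standard_normal(x_adv.shape)
    for t in range(t_star, 0, -1):            # t = t_star, ..., 1 (1-indexed timesteps)
        i = t - 1
        eps_hat = eps_model(x, t)
        # mean of p_theta(x_{t-1} | x_t) in the noise-prediction parameterization
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        if t > 1:
            # standard deviation of the forward posterior q(x_{t-1} | x_t, x_0)
            sigma = np.sqrt(betas[i] * (1.0 - alpha_bars[i - 1]) / (1.0 - alpha_bars[i]))
            x = mean + sigma * rng.standard_normal(x.shape)
        else:
            x = mean                           # no noise is added on the final step
    return x

def point_mass_eps_model(c, alpha_bars):
    """Hypothetical oracle: the exact noise predictor when every data point equals c."""
    def eps_model(x, t):
        ab = alpha_bars[t - 1]
        return (x - np.sqrt(ab) * c) / np.sqrt(1.0 - ab)
    return eps_model

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
c = np.full((3, 8, 8), 0.5)                                # toy "clean image"
delta = (8 / 255) * np.sign(rng.standard_normal(c.shape))  # toy l_inf-style perturbation
x_adv = c + delta
x_pur = purify(x_adv, point_mass_eps_model(c, alpha_bars), t_star=100,
               betas=betas, alphas=alphas, alpha_bars=alpha_bars, rng=rng)
print(np.abs(x_adv - c).max(), np.abs(x_pur - c).max())
```

With the oracle predictor, the last reverse step maps any $x_1$ exactly onto `c`, so `x_pur` matches the clean image up to floating-point error. With a real network, the output is instead a sample near the data manifold, and `t_star` sets how aggressively the perturbation is washed out.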
Adjoint-Based Gradient Computation:
For rigorous evaluation against adaptive adversaries, DiffPure supports adjoint sensitivity analysis to compute exact gradients through the reverse generative SDE. Instead of storing the solver trajectory, the adjoint method integrates the state together with its gradient adjoint backward in time as an augmented SDE. Memory therefore stays constant in the number of solver steps, which makes full white-box attacks feasible (Nie et al., 2022).
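The adjoint idea is easiest to see in a deterministic setting. The sketch below is an ODE analogue, not the stochastic adjoint used for DiffPure. It integrates a toy nonlinear drift forward while keeping only the current state. It then integrates the state and the adjoint $a(t) = \partial L / \partial x(t)$ backward together using $\dot a = -J_f(x)^\top a$, and checks the result against finite differences. The drift `f`, the matrix `W`, the loss, and the step count are hypothetical. The SDE version additionally has to replay the same Brownian path during the backward pass.

```python
import numpy as np

W = np.array([[0.8, -0.6], [0.5, 0.9]])   # hypothetical toy parameters

def f(x):
    """Toy nonlinear drift standing in for the learned reverse-time dynamics."""
    return np.tanh(W @ x)

def jac_f(x):
    """Jacobian of f at x: diag(1 - tanh^2(W x)) @ W."""
    s = 1.0 - np.tanh(W @ x) ** 2
    return s[:, None] * W

def integrate(x0, n_steps=2000, T=1.0):
    """Forward Euler from t = 0 to T, keeping only the current state (no trajectory)."""
    h = T / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        x = x + h * f(x)
    return x

def loss(x0):
    xT = integrate(x0)
    return 0.5 * float(xT @ xT)

def adjoint_grad(x0, n_steps=2000, T=1.0):
    """dL/dx(0) via the adjoint ODE, using O(1) memory in n_steps."""
    h = T / n_steps
    x = integrate(x0, n_steps, T)
    a = x.copy()                        # a(T) = dL/dx(T) for L = 0.5 * ||x(T)||^2
    for _ in range(n_steps):
        a = a + h * jac_f(x).T @ a      # da/dt = -J_f(x)^T a, stepped from t to t - h
        x = x - h * f(x)                # reconstruct x(t - h) instead of storing it
    return a

x0 = np.array([0.3, -0.7])
g_adj = adjoint_grad(x0)
fd_eps = 1e-5
g_fd = np.array([(loss(x0 + fd_eps * e) - loss(x0 - fd_eps * e)) / (2 * fd_eps)
                 for e in np.eye(2)])
print(g_adj, g_fd)   # should agree up to O(h) discretization error
```

The memory saving comes from recomputing the state backward rather than caching every step. The price is an extra backward integration and a small discretization mismatch between the two passes.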
2. Theoretical Foundations and Trade-offs
The theoretical underpinning of DiffPure lies in the properties of diffusion processes:
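- The forward SDE (Variance-Preserving) is governed by:
$$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw,$$
which incrementally maps data to Gaussian noise.
- The reverse SDE recovers images:
$$dx = \Big[-\tfrac{1}{2}\beta(t)\,x - \beta(t)\,\nabla_x \log p_t(x)\Big]dt + \sqrt{\beta(t)}\,d\bar{w},$$
where $\bar{w}$ is a reverse-time Wiener process and the intractable score $\nabla_x \log p_t(x)$ is approximated by a learned score network $s_\theta(x, t)$ trained via denoising score matching.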
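A minimal Euler-Maruyama sketch of the reverse VP-SDE follows. It assumes the linear schedule $\beta(t) = \beta_{\min} + t(\beta_{\max} - \beta_{\min})$ with $\beta_{\min} = 0.1$ and $\beta_{\max} = 20$, a common choice that is assumed here. The learned score network is replaced by the exact score of a toy 1-D Gaussian data distribution $\mathcal{N}(0, s^2)$. Under the VP-SDE, $p_t$ then stays Gaussian with variance $\bar\alpha(t)s^2 + 1 - \bar\alpha(t)$, where $\bar\alpha(t) = \exp(-\int_0^t \beta)$, and this is what makes the example checkable. Starting the solver at `t_start` $= t^*$ instead of 1 gives the purification variant.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear VP schedule

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_bar(t):
    # exp(-integral_0^t beta(s) ds) for the linear schedule
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))

def gaussian_score(x, t, s=0.5):
    """Exact score of p_t when data ~ N(0, s^2); a stand-in for s_theta(x, t)."""
    var = alpha_bar(t) * s ** 2 + (1.0 - alpha_bar(t))
    return -x / var

def reverse_vp_sde(x, score_fn, t_start, t_end=1e-3, n_steps=1000, rng=None):
    """Euler-Maruyama for the reverse VP-SDE, integrated from t_start down to t_end."""
    rng = np.random.default_rng() if rng is None else rng
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next                              # positive step, moving backward in time
        b = beta(t)
        drift = -0.5 * b * x - b * score_fn(x, t)    # f(x, t) - g(t)^2 * score
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal(20_000)         # p_1 is approximately N(0, 1) under this schedule
x0 = reverse_vp_sde(x1, gaussian_score, t_start=1.0, rng=rng)
print(x0.mean(), x0.std())               # should be close to 0 and to the data std s = 0.5
```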
Noise–Recovery Trade-off:
There is a fundamental trade-off: increasing the noising time $t^*$ removes the adversarial perturbation more thoroughly but degrades the recovery of fine image details through over-noising. Empirical analysis shows that robust accuracy rises while clean accuracy falls as purification strength grows (Li et al., 2024). This trade-off is intrinsic to vanilla DiffPure, because the reverse model is trained only on clean data and cannot distinguish Gaussian noise from adversarial corruption.
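Substituting $x_{\text{adv}} = x + \delta$ into the closed-form forward marginal makes the trade-off explicit. Here $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ is the cumulative product of the forward schedule. The short derivation below illustrates only the scaling argument and is not a result quoted from the cited papers:

$$x_{t^*} = \sqrt{\bar\alpha_{t^*}}\,x + \sqrt{\bar\alpha_{t^*}}\,\delta + \sqrt{1-\bar\alpha_{t^*}}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Relative to the injected noise, the clean signal and the perturbation are scaled by the same factor, $\sqrt{\bar\alpha_{t^*}/(1-\bar\alpha_{t^*})}$. This factor decreases monotonically in $t^*$ because $\bar\alpha_t$ is decreasing. The forward step therefore cannot attenuate $\delta$ without attenuating the fine details of $x$ equally, and the reverse model must regenerate whatever detail the noise has drowned.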
Extensions with Specialized Diffusion:
To address these limits, the Adversarial Diffusion Bridge Model (ADBM) constructs a direct reverse “bridge” from the diffused adversarial distribution back to the clean data manifold. It uses a custom training loss that targets the distributional mismatch and comes with provable recovery bounds. Both its theoretical bounds and experimental re-evaluations under reliable attacks indicate that the bridge formulation yields a strictly higher probability of exact recovery than DiffPure (Li et al., 2024).
3. Empirical Performance across Adversarial Defense
DiffPure is an effective defense against a broad range of adversarial attacks and compares favorably with GAN/EBM-based purification and adversarial training:
Key Benchmarks (Nie et al., 2022, Ankile et al., 2023, Li et al., 2024):
- CIFAR-10 (WideResNet-28-10):
  - Madry-PGD: 62.8% robust accuracy
  - DiffPure: 70.6% robust accuracy
- ImageNet (ResNet-50):
  - Salman et al.: 37.9% robust accuracy
  - DiffPure: 40.9% robust accuracy
- CelebA-HQ (eyeglasses, BPDA+EOT):
  - GAN inversion: 75.0% robust accuracy
  - DiffPure: 90.6% robust accuracy
On the PatchCamelyon histopathology dataset, DiffPure recovers ~88% of the original model’s accuracy under PGD attack, outperforming both vanilla and adversarially trained classifiers (Ankile et al., 2023). Robust accuracy remains high across a range of noise levels, indicating defense stability.
Adaptive Attacks:
DiffPure’s reliability was originally gauged under white-box adaptive attacks. Later evaluations with more exhaustive adaptive threat models (e.g., PGD+EOT with more attack steps) report lower robust accuracy, and models such as ADBM provide further improvement (Li et al., 2024).
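Adaptive evaluations of stochastic defenses typically combine PGD with Expectation over Transformation (EOT). Before each step, the attacker averages gradients over several draws of the defense’s randomness. The sketch below shows only these mechanics. A hypothetical linear classifier with logistic loss sits behind a stand-in “purifier” that just adds Gaussian noise, so the gradient is available in closed form. Attacking real DiffPure requires differentiating through the diffusion solver, for example with the adjoint method, and is far more expensive.

```python
import numpy as np

def eot_grad(x, y, w, sigma, n_samples, rng):
    """Average the loss gradient over independent draws of the defense's randomness."""
    grads = []
    for _ in range(n_samples):
        z = x + sigma * rng.standard_normal(x.shape)   # hypothetical stochastic "purifier"
        margin = y * (w @ z)
        # gradient of log(1 + exp(-margin)) w.r.t. x, since dz/dx = I
        grads.append(-y * w / (1.0 + np.exp(margin)))
    return np.mean(grads, axis=0)

def pgd_eot(x0, y, w, eps, alpha, steps, sigma=0.1, n_samples=20, seed=0):
    """l_inf PGD that ascends the EOT-averaged loss and projects onto the eps-ball."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        g = eot_grad(x, y, w, sigma, n_samples, rng)
        x = x + alpha * np.sign(g)            # signed gradient ascent step
        x = np.clip(x, x0 - eps, x0 + eps)    # projection onto the l_inf ball around x0
    return x

rng = np.random.default_rng(1)
w = rng.standard_normal(16)     # hypothetical linear classifier weights
x0 = rng.standard_normal(16)    # hypothetical clean input
y = 1.0                         # label in {-1, +1}
eps = 0.1
x_adv = pgd_eot(x0, y, w, eps=eps, alpha=0.02, steps=10)
print(y * (w @ x0), y * (w @ x_adv))   # the classification margin drops after the attack
```

Averaging over noise draws matters because a single draw yields a noisy gradient for a randomized defense. EOT estimates the gradient of the expected loss instead, and robust-accuracy numbers obtained without it can overstate robustness.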
Multi-modal and Vision-LLMs:
DiffPure-VLM applies diffusion purification to images before they are passed to multimodal LLMs (VLMs) such as MiniGPT-4, LLaVA, and InternVL2. When combined with noise-augmented fine-tuning of the VLM (“Robust-VLGuard”), it significantly reduces attack success rates (ASR) for both perturbation-based and optimization-based attacks while keeping helpfulness metrics near baseline. For example, ASR falls from 70.6% to 33.4% for InternVL2 (Wang et al., 2 Apr 2025).
4. Evaluation under Transferable Black-Box and Refusal Attacks
Black-Box Transferability:
Systematic evaluation reveals that DiffPure blocks most augmentation-based and gradient-stabilization black-box attacks (e.g., DI, TI, MI), with post-purification success rates around 16–28%. However, feature-disruption and generative modeling attacks (e.g., CDA, GAPF) largely bypass DiffPure, with success rates of up to 62% in the case of CDA, exposing a distributional mismatch vulnerability (Zhao et al., 2023).
Refusal Perturbations in Multi-modal LLMs:
DiffPure, when deployed as a test-time defense against “refusal” perturbations in state-of-the-art multimodal LLMs (e.g., LLaVA-1.5), drastically reduces the MLLM refusal rate from >0.9 to ≈0.0 with two diffusion steps. However, clean-image VQA accuracy also drops (from 0.92 to 0.78), and latency overhead is non-trivial (up to 13% increase). Additionally, comparable defense can be achieved with simple Gaussian noise addition, with slightly better accuracy trade-offs (Shao et al., 2024). All effective countermeasures, including DiffPure, incur accuracy and/or efficiency penalties.
5. DiffPure for Image Restoration and Inverse Problems
DiffPure extends beyond adversarial defense into inverse problems such as denoising, deblurring, inpainting, and super-resolution. The decoupled variable-splitting framework alternates between (i) a data-consistency optimizer (e.g., proximal step or gradient descent) and (ii) a diffusion-purification phase (DPUR):
- Data-consistency step:
Approximately minimize the measurement-fidelity loss $\tfrac{1}{2}\|y - \mathcal{A}(x)\|_2^2$ (measurements $y$, forward operator $\mathcal{A}$) with gradient descent.
- Diffusion purification step (DPUR):
Forward noising followed by reverse diffusion using any diffusion sampler (DDPM, DDIM, Tweedie, consistency models).
- Efficiency:
Substantially faster than coupled diffusion-posterior samplers: e.g., 27.7 s versus 387 s (roughly 14×) with a pixel-space DDPM, and 2.1 s with a Tweedie-based model.
- Restoration metrics:
On LSUN-Bedrooms, pixel-space DiffPure (V1) achieves PSNR/SSIM/LPIPS of 26.1/0.768/0.215 for super-resolution, outperforming prior methods (Li et al., 2024).
This decoupling allows “plug-and-play” use of latent diffusion models and accelerated samplers, and it extends to consistency models with single-step inference. The approach achieves state-of-the-art trade-offs between speed and restoration quality.
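The alternation in the steps listed above can be sketched as follows. The sketch assumes a linear forward operator given as a matrix `A` and plain gradient descent for the data-consistency step. `purify` is a hypothetical placeholder for DPUR, which performs forward noising plus reverse diffusion with a pretrained model. To keep the example runnable, the demo passes a simple moving-average smoother, which is not a diffusion model and only plays the role of a prior. The toy problem, inpainting half the samples of a sinusoid, is also an assumption.

```python
import numpy as np

def data_consistency(x, y, A, n_steps=5, lr=1.0):
    """A few gradient steps on 0.5 * ||y - A x||^2 (lr must stay below 2 / ||A||^2)."""
    for _ in range(n_steps):
        x = x - lr * A.T @ (A @ x - y)
    return x

def decoupled_restore(y, A, purify, n_iters=200):
    """Alternate (i) measurement fidelity and (ii) a purification / prior step."""
    x = A.T @ y                          # zero-filled initialization for a masking operator
    for _ in range(n_iters):
        x = data_consistency(x, y, A)    # (i) pull the estimate toward the measurements
        x = purify(x)                    # (ii) DPUR in the source; a stand-in here
    return data_consistency(x, y, A)

def smooth(x):
    """Hypothetical stand-in for DPUR: circular 3-tap moving average (NOT a diffusion model)."""
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

n = 128
x_true = np.sin(2 * np.pi * np.arange(n) / 64)       # toy smooth signal
rng = np.random.default_rng(0)
obs = np.sort(rng.choice(n, n // 2, replace=False))   # indices of observed samples
A = np.eye(n)[obs]                                    # inpainting (row-selection) operator
y = A @ x_true
x_hat = decoupled_restore(y, A, smooth)
print(np.linalg.norm(A.T @ y - x_true), np.linalg.norm(x_hat - x_true))
```

Because the two steps only exchange the current estimate, `purify` can be swapped for any sampler without touching the data-consistency code. This is the property the decoupled framework exploits.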
6. Limitations, Open Questions, and Future Directions
Limitations:
- Noise–Fidelity Trade-off:
Larger purification steps improve robustness to adversarial and structured perturbations but reduce the preservation of high-frequency image details and clean accuracy.
- Distributional Mismatch:
DiffPure can be bypassed by attacks with low-frequency or semantic structure not modeled during score network training (feature or generative attacks) (Zhao et al., 2023).
- Inference Cost:
Inference is slower because purification runs multiple reverse diffusion steps, and per-image latency grows with the number of steps (Wang et al., 2 Apr 2025, Shao et al., 2024).
- Sensitivity to Color Shifts:
Diffusion purification may wash out color cues, affecting downstream classifier performance (Nie et al., 2022).
- Applicability to Complex Tasks:
Extension to nonlinear inverse operators or heavy-noise regimes can require adaptive or learned schedules (Li et al., 2024).
Research Directions:
- Development of bridge models (e.g., ADBM) that directly reverse the diffused adversarial trajectory, with provable recovery guarantees (Li et al., 2024).
- Joint pretraining or end-to-end optimization of diffusion models and task networks (e.g., classifiers or VLM vision encoders) for task-conditioned purification.
- Integration of accelerated diffusion samplers (e.g., DDIM, DPM-Solver, consistency models) and schedule adaptation for speed–accuracy trade-offs.
- Robustification of score models against a wider range of perturbation classes, including low-frequency or semantic attacks.
- Data-driven or meta-learned purification schedules and control of the noise–fidelity boundary.
7. Context and Significance
DiffPure has established diffusion-based purification as a versatile paradigm for adversarial and out-of-distribution defense, offering high sample fidelity, mode coverage, and intrinsic stochasticity unattainable by GAN- or EBM-based models. Its agnosticism to classifier architecture and attack formulation permits “plug-and-play” deployment across domains. Nonetheless, recent benchmarks indicate the need for tighter theoretical integration between purification dynamics and adversarial objectives. Specialized variants such as ADBM, or defenses coupled with task-aligned training and evaluation, point to the next direction for robust purification in high-stakes applications (Nie et al., 2022, Li et al., 2024, Wang et al., 2 Apr 2025, Zhao et al., 2023, Li et al., 2024).