Blur-Robust AIGI Detection Framework

Updated 23 November 2025
  • The paper proposes a novel detection framework that integrates spatial, spectral, and semantic features to maintain high accuracy even under severe blur conditions.
  • It employs a multi-feature fusion pipeline combining noise pattern residuals, image gradient features, and pretrained vision-language priors to capture distinct forensic cues.
  • The framework utilizes frequency-aware backbones and teacher-student distillation, sustaining robust detection performance, for example more than 73% average precision (AP) under severe Gaussian blur.

Blur-robust AI-generated image (AIGI) detection frameworks are specialized methodologies designed to maintain high detection accuracy of AI-synthesized images even when those images are subjected to realistic degradations such as Gaussian or motion blur, compression, or transmission artifacts. The demand for robust detection arises from the escalating fidelity of generative AI systems and the increasing prevalence of artifacts during image transmission or capture, which commonly occlude high-frequency forensic cues that many detectors rely on. Modern frameworks address these issues by fusing spatial, spectral, and semantic clues, leveraging foundation models, devising training-free paradigms, and employing knowledge distillation techniques to achieve strong generalization and resilience under blur.

1. Forensic Principles and Multi-Feature Fusion

Recent frameworks are grounded in three canonical forensic observations: (i) localized pixel-level inconsistencies left by GANs and diffusion models (e.g., upsampling grid artifacts), (ii) global deviations from natural image semantics, and (iii) frequency-domain anomalies typically suppressed or shifted by blur or compression.

A prominent blur-robust paradigm employs a multi-feature fusion pipeline with three parallel branches:

  • Noise Pattern Residuals (NPR): Amplifies subtle periodic fingerprints induced by transposed convolution via exhaustive within-patch differencing of tiled, non-overlapping patches. This channel captures spatial artifacts unique to generative upsampling pipelines.
  • Image Gradient Features: Uses guided backpropagation on a frozen ResNet-50, extracting high-frequency response maps sensitive to edge and texture anomalies, which remain partially informative even for diffusion-model outputs.
  • Pretrained Vision-Language Priors: Extracts embeddings (e.g., CLIP ViT-L/14, 768-dim) to encode the global semantic plausibility of imagery.

These clues are fused using a multi-head cross-source attention mechanism—where the semantic prior serves as the query, recalibrating spatial cue salience across patches according to their alignment with the global distribution of real images. This fusion yields a robust local-global feature representation that is resistant to spatial or spectral forensic degradation (Cai et al., 2 Apr 2025).
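The fusion step can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the toy NPR residual, tensor shapes, and module names below are illustrative assumptions; only the idea of the semantic embedding querying per-patch forensic tokens via multi-head cross-attention follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def npr_residual(img: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Toy noise-pattern residual: difference between each pixel and the
    nearest-neighbour upsampling of its non-overlapping patch mean,
    highlighting periodic artifacts left by generative upsampling."""
    pooled = F.avg_pool2d(img, patch)
    upsampled = F.interpolate(pooled, scale_factor=patch, mode="nearest")
    return img - upsampled

class CrossSourceFusion(nn.Module):
    """Semantic prior (query) recalibrates per-patch forensic tokens (keys/values)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, semantic_q: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # semantic_q: (B, 1, D) global CLIP-style embedding; patch_feats: (B, N, D)
        fused, _ = self.attn(semantic_q, patch_feats, patch_feats)
        return fused.squeeze(1)  # (B, D) local-global feature

# Usage with random tensors standing in for the branch outputs:
img = torch.randn(2, 3, 224, 224)
residual = npr_residual(img)                          # spatial branch input
fused = CrossSourceFusion()(torch.randn(2, 1, 768),   # semantic prior as query
                            torch.randn(2, 196, 768)) # per-patch forensic tokens
```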

2. Frequency-Aware Backbone Architectures

The fused forensic representation is processed by a lightweight, frequency-aware convolutional backbone, notably through Frequency-Adaptive Dilated Convolution (FADC) modules:

  • High-Frequency Path: Convolves with filters decomposed into low-/high-frequency components, where the high-frequency weight is explicitly modulated by a learned spatial gating $\lambda_h(p)$. Dilation rates $\hat{D}(p)$ are inferred dynamically, tracking local frequency energy.
  • Low-Frequency Skip: Preserves smooth components via identity mapping.
  • Frequency Adaptation Module: Predicts the required dilation per spatial location from features, adapting to blur-imposed spectral shifts.
  • Frequency Selection: Decomposes features into Fourier sub-bands, applies learned masking and reweighting, and recombines via inverse FFT.
  • Output Integration: Aggregates the above via a nonlinear activation to produce joint spatial-spectral indicators of synthesis artifacts.

This design exploits the joint spatial-spectral distribution of artifacts, enabling the model to maintain discriminative capacity even as blur mutates canonical signal energy (Cai et al., 2 Apr 2025).
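As a concrete illustration of the frequency-selection idea, the sketch below decomposes a feature map into low- and high-frequency Fourier sub-bands with a radial mask, reweights each band with learned scalars, and recombines via the inverse FFT. The published FADC module is more elaborate (spatially adaptive dilation, per-location gating $\lambda_h(p)$); the cutoff radius and scalar band weights here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class FrequencySelection(nn.Module):
    def __init__(self, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff                       # band boundary, as a fraction of the sampling rate
        self.w_low = nn.Parameter(torch.ones(1))   # learned low-frequency weight
        self.w_high = nn.Parameter(torch.ones(1))  # learned high-frequency weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                        # (B, C, H, W//2 + 1)
        fy = torch.fft.fftfreq(h, device=x.device).abs().view(-1, 1)   # vertical frequencies
        fx = torch.fft.rfftfreq(w, device=x.device).view(1, -1)        # horizontal frequencies
        low_mask = (torch.sqrt(fy**2 + fx**2) <= self.cutoff).float()
        spec = spec * (self.w_low * low_mask + self.w_high * (1.0 - low_mask))
        return torch.fft.irfft2(spec, s=(h, w), norm="ortho")

x = torch.randn(2, 64, 32, 32)
print(FrequencySelection()(x).shape)  # torch.Size([2, 64, 32, 32])
```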

3. Training-Free and Distillation-Based Blur-Robust Detection

Several approaches bypass explicit supervised training on generative samples, targeting generalization by leveraging pretrained foundation models and knowledge of representation stability:

  • RIGID measures the cosine similarity between clean and noise-perturbed embeddings produced by a large self-supervised ViT (e.g., DINOv2 ViT-L/14); real images yield high intra-similarity even under small random perturbations, while AI-generated images show greater feature instability. Classification operates by thresholding this similarity, calibrated on a real-image validation set (a minimal sketch follows this list). Blur robustness arises because the feature embedding of real photographs remains relatively stable under both pixel-level noise and moderate blur, while that of AI-generated images does not. Experimental results demonstrate robustness to Gaussian blur, with average precision (AP) remaining above 73% even at high blur ($\sigma = 5$), far outperforming classic and autoencoder-based detectors (He et al., 30 May 2024).
  • Teacher-Student Knowledge Distillation (DINO-Detect): A fixed, pretrained DINOv3 ViT serves as the teacher, modeling sharp, clean images. Paired with a student network trained on synthetically blurred counterparts, the system distills feature, logit, and ordinal contrastive cues from the teacher to the student, enforcing consistent, blur-invariant representations. The student is thus optimized to maintain the teacher's discriminative boundaries (learned on clean data) even under motion or transmission blur. Multiple supervision signals—including classification loss, feature-level alignment, logit-level KL divergence, and blur-ordered contrastive structuring—combine to enforce both semantic and forensic invariance to blur types. This yields state-of-the-art accuracy (e.g., 81% on a motion-blur AIGI benchmark) and preserves generalization across out-of-domain and video contexts (Shen et al., 16 Nov 2025).
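A minimal sketch of the RIGID-style training-free test, assuming the public DINOv2 torch.hub entry point and an illustrative noise scale and threshold (in practice the threshold is calibrated on real images only):

```python
import torch
import torch.nn.functional as F

# Frozen self-supervised feature extractor (public DINOv2 hub model)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

@torch.no_grad()
def rigid_score(img: torch.Tensor, noise_scale: float = 0.05) -> torch.Tensor:
    """img: (B, 3, 224, 224), normalized. Returns per-image cosine similarity
    between the clean and noise-perturbed embeddings; real photos tend to
    score high, AI-generated images lower."""
    perturbed = img + noise_scale * torch.randn_like(img)
    return F.cosine_similarity(model(img), model(perturbed), dim=-1)

def is_ai_generated(img: torch.Tensor, threshold: float = 0.95) -> torch.Tensor:
    # threshold calibrated on a held-out set of real images (no fakes needed)
    return rigid_score(img) < threshold
```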

4. Robustness to Blur and Compression: Empirical Evaluations

Extensive evaluations across Gaussian, motion, and defocus blur (plus compression and minor corruptions) establish the effectiveness and limitations of these frameworks:

| Method | Clean AP/Accuracy | Blur AP/Accuracy (severe) | Notable Findings |
|---|---|---|---|
| Blur-Robust Multi-Feature (FADC) (Cai et al., 2 Apr 2025) | >90% ACC | >88% AP at $\sigma = 5$ | Maintains margin even under severe blur; best when all three fusion branches are present |
| RIGID (He et al., 30 May 2024) | 86.7% AP | 73.1% AP at $\sigma = 5$ | Training-free; no AI images at training time, only real images for calibration |
| DINO-Detect (Shen et al., 16 Nov 2025) | 95.6% (WildRF clean) | 88.4% (WildRF blurred) | Outperforms prior art by >10-20 pp under all blur types |
| SVD-RND (Choi et al., 2019) | 0.99 AUROC (CelebA/SVHN) | 0.92 AUROC (CIFAR, blurred) | Augmenting with blurred proxies enhances OOD/AIGI detection |

Mild blur sometimes increases AP on diffusion model fakes, possibly due to denoising mechanisms in the diffusion pipeline exaggerating intrinsic artifact gaps (Cai et al., 2 Apr 2025). In contrast, traditional patch-level or vanilla ResNet-50 classifiers experience performance collapse under equivalent distortions.

5. Ablation and Branchwise Analysis

Ablation studies reinforce that blur robustness requires a synergistic combination of spatial, gradient-based, and semantic components. Removing CLIP/global embeddings can degrade GAN detection accuracy by up to 16 points; suppressing the gradient branch leads to near-total collapse (<60% ACC) under blur, and excluding noise residuals similarly impairs performance. Both multi-feature fusion and frequency-aware modeling are therefore essential (Cai et al., 2 Apr 2025). For knowledge distillation, supervision at both the feature and logit levels is necessary to preserve robustness, while contrastive alignment across blur intensities reinforces the model's invariance (Shen et al., 16 Nov 2025).
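A schematic sketch of such a combined distillation objective, assuming precomputed teacher outputs on the sharp image and student outputs on its blurred counterpart; the loss weights are illustrative and the blur-ordered contrastive term is omitted:

```python
import torch.nn.functional as F

def distillation_loss(student_feat, student_logits,
                      teacher_feat, teacher_logits,
                      labels, temperature: float = 2.0,
                      w_feat: float = 1.0, w_kl: float = 1.0):
    # Supervised classification on the blurred student input
    ce = F.cross_entropy(student_logits, labels)
    # Feature-level alignment: student (blurred view) mimics frozen teacher (sharp view)
    feat_align = F.mse_loss(student_feat, teacher_feat.detach())
    # Logit-level distillation via temperature-scaled KL divergence
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce + w_feat * feat_align + w_kl * kl
```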

6. Adaptation Strategies and Implementation Considerations

Robustness across domains and deployment scenarios is achieved through several strategies:

  • Blur Proxy Expansion: SVD-RND generates multiple levels of SVD-based blur proxies per image, optionally augmented with geometric and low-pass transforms (DCT, Gaussian blur), training a predictor to discriminate genuine data from any plausible low-rank imitation (Choi et al., 2019); see the sketch after this list.
  • Parameter Tuning: Key hyperparameters (e.g., blur rank and effective log-rank for SVD; noise scale $\lambda$ for RIGID; trajectory length and kernel for DINO-Detect) are chosen via OOD validation or automatic calibration to maximize blur discrimination without inducing overfitting.
  • Backbone Selection: Robustness is optimal with Transformers pre-trained by self-supervised objectives (e.g., DINOv2/DINOv3), as CNNs (e.g., ResNet-50) experience significant performance drop-off in heavy blur, especially as global high-frequency artifacts are suppressed (He et al., 30 May 2024).
  • Efficiency: Frequency-aware architectures and teacher-student heads are lightweight (sub-25M parameters for FADC, $\sim$1M for the classifier heads in DINO-Detect), yielding real-time inference ($256^2$ images in $\sim$15 ms on an RTX 4090 for FADC (Cai et al., 2 Apr 2025)) and simplifying on-device or edge deployment (Shen et al., 16 Nov 2025).
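A minimal sketch of an SVD-based blur proxy in the spirit of SVD-RND: each channel is reconstructed from its top-k singular vectors, producing low-rank imitations at several blur levels. The rank schedule is an illustrative assumption.

```python
import torch

def svd_blur_proxy(img: torch.Tensor, rank: int = 16) -> torch.Tensor:
    """img: (C, H, W) float tensor. Returns its rank-`rank` reconstruction."""
    u, s, vh = torch.linalg.svd(img, full_matrices=False)  # batched over channels
    s_trunc = s.clone()
    s_trunc[..., rank:] = 0.0                               # drop small singular values
    return u @ torch.diag_embed(s_trunc) @ vh

img = torch.rand(3, 256, 256)
proxies = [svd_blur_proxy(img, r) for r in (8, 16, 32, 64)]  # multiple blur levels per image
```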

7. Impact, Limitations, and Future Directions

Blur-robust AIGI detection frameworks achieve state-of-the-art accuracy in detecting synthesized images from a wide range of GAN and diffusion sources, maintain performance under challenging real-world degradations, and generalize well to out-of-distribution generators and deployment artifacts. These results have significant implications for media forensics, social-media authenticity verification, and digital safety pipelines.

Future work may target dynamic weighting of blur-aware feature branches, extending frequency-adaptive representations to temporal artifacts in video, leveraging further lightweight self-supervised backbones, and automating the adaptation to unknown or mixed-blur domains. For training-free and teacher-student approaches, continued refinement of contrastive and distillation objectives—incorporating more nuanced semantic priors—promises enhanced generalization and resilience (Cai et al., 2 Apr 2025, Shen et al., 16 Nov 2025, He et al., 30 May 2024).
