Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection

Published 30 Apr 2026 in cs.CV | (2604.27875v1)

Abstract: AI-generated images are becoming increasingly realistic and diverse, posing significant challenges for generalizable detection. While Vision Foundation Models (VFMs) provide rich semantic representations and frequency-based methods capture complementary artifact cues, existing approaches that combine these modalities still suffer from limited generalization, with notable performance degradation on unseen generative models. We attribute this limitation to two key factors: frequency shortcut bias toward easily distinguishable cues associated with specific generators and cross-domain representation conflict between high-level semantics and low-level frequency patterns. To address these issues, we propose a Frequency-aware Gated Injection Network (FGINet) to improve generalization. Specifically, we design a Band-Masked Frequency Encoder (BMFE) that applies cross-band masking in the frequency domain to reduce reliance on generator-specific patterns and encourage more diverse and generalizable representations. We further introduce a Layer-wise Gated Frequency Injection (LGFI) mechanism to progressively inject frequency cues into the VFM backbone with adaptive gating, aligning with its hierarchical abstraction and alleviating representation conflict. Moreover, we propose a Hyperspherical Compactness Learning (HCL) framework with a cosine margin objective to learn compact and well-separated representations. Extensive experiments demonstrate that FGINet achieves state-of-the-art performance and strong generalization across multiple challenging datasets.


Authors (7)

Summary

The paper introduces a dual-stream architecture that integrates frequency details with semantic cues to overcome cross-domain and cross-generator shifts.
It employs a Band-Masked Frequency Encoder and Layer-wise Gated Frequency Injection to enhance feature fusion and reduce generator-specific bias.
The approach achieves up to 96.7% accuracy on benchmarks, demonstrating improved detection and generalizability against AI-generated images.

Frequency-aware Semantic Fusion with Gated Injection for Detection of AI-generated Images

Motivation and Limitations of Prior Approaches

The proliferation of high-fidelity AI-generated images, produced by advanced diffusion, GAN, and transformer-based models, has made detection a formidable technical challenge. Conventional frequency-based detection methods exploit generator-induced spectral artifacts, while Vision Foundation Models (VFMs) such as DINOv3 provide robust, open-set semantic features. However, fusion approaches that use both modalities degrade substantially under cross-generator and cross-domain distribution shifts. The paper attributes this to two factors: frequency shortcut bias (over-reliance on generator-specific frequency cues) and cross-domain representation conflict (semantic and frequency features occupy different representation spaces, which leads to entangled and poorly separable fused representations).

FGINet Architecture and Core Innovations

FGINet is a dual-stream detection architecture built around three components:

Band-Masked Frequency Encoder (BMFE): Frequency components are isolated with a Haar discrete wavelet transform (DWT). Only the high-frequency sub-bands (XLH, XHL, XHH) are retained; the low-frequency LL sub-band is discarded because it is redundant given the semantic pathway. During training, BMFE applies spatially-independent random masking to the frequency patches, which promotes generator-invariant frequency features. Masking disrupts shortcut associations with specific generator artifacts and thereby improves generalization. Channel-wise representations are extracted by a lightweight convolutional stem and projected into frequency tokens compatible with the transformer backbone.
Layer-wise Gated Frequency Injection (LGFI): Instead of direct or late fusion, LGFI injects frequency tokens progressively at each transformer block of the DINOv3 backbone, using learnable gates to modulate the injection strength. Gates are initialized to low values, which permits gradual adaptation without destabilizing the pretrained semantic weights. Early and middle layers favor higher gate values, emphasizing low-level structural integration, while deeper layers suppress frequency information, preserving semantic abstraction.
Hyperspherical Compactness Learning (HCL): HCL adopts a CosFace loss: all representations are normalized onto a unit hypersphere and trained with an explicit cosine margin constraint. This enforces tight intra-class clusters and large inter-class separation, which is particularly effective against unseen generative models. Logit visualization indicates that, compared with a standard loss, HCL yields more compact and more confidently separated class distributions.
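The BMFE input stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (haar_dwt2, mask_patches, bmfe_input), the patch size, the masking ratio, and the choice to draw an independent mask for each sub-band are all assumptions made for illustration, and the convolutional stem and token projection are omitted. Sub-band naming (LH versus HL) also varies between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of an (H, W) array, H and W even."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency approximation
    lh = (a + b - c - d) / 2.0            # difference between the two rows
    hl = (a - b + c - d) / 2.0            # difference between the two columns
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def mask_patches(band, patch, ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch tiles."""
    h, w = band.shape
    assert h % patch == 0 and w % patch == 0, "band size must be a multiple of patch"
    keep = rng.random((h // patch, w // patch)) >= ratio   # True means the tile is kept
    mask = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    return band * mask

def bmfe_input(image, patch=4, ratio=0.25, seed=0, train=True):
    """Return the stacked high-frequency sub-bands, masked only in training.

    Assumption: each sub-band gets its own independent random mask.
    """
    rng = np.random.default_rng(seed)
    _, lh, hl, hh = haar_dwt2(image)      # the LL band is dropped on purpose
    bands = [lh, hl, hh]
    if train:
        bands = [mask_patches(band, patch, ratio, rng) for band in bands]
    return np.stack(bands, axis=0)        # shape (3, H/2, W/2)

# Usage on a synthetic grayscale image.
image = np.random.default_rng(1).random((32, 32))
freq_maps = bmfe_input(image, patch=4, ratio=0.5)
print(freq_maps.shape)
```

Because the mask is resampled on every call with a fresh seed, the encoder cannot rely on any single band or location being present, which is the regularization effect the BMFE description attributes to masking.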

Empirical Evaluation and Ablative Analysis

FGINet consistently outperforms prior state-of-the-art (SOTA) methods on diverse benchmarks:

GenImage: Accuracy 96.7%, demonstrating superior robustness across 8 generator types, outperforming SAFE and MRCL.
Synthbuster: Accuracy 94.3%, surpassing the next-best method (DDA) by 4.2 percentage points and maintaining high detection rates on diffusion-based forgeries.
WildRF and Chameleon: On real-world social media samples, FGINet achieves mAcc of 93.8% (WildRF) and 92.5% (Chameleon). Notably, on hard-to-detect fakes, FGINet's structured integration yields a fake accuracy of 89.9%, a marked leap over AIDE, SAFE, and DDA, substantiating its claim for cross-generator and cross-domain generalization.

Ablation Studies: BMFE reduces reliance on specific frequency artifacts, LGFI yields over 7% accuracy improvement relative to direct fusion, and combining BMFE, LGFI, and HCL achieves the highest accuracy and AP on Chameleon. LGFI applied to early transformer layers achieves optimal cross-modal alignment, validating the hierarchical integration strategy.
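The difference between direct fusion and gated injection in the ablation can be made concrete with a small NumPy sketch. Everything here is a hypothetical stand-in rather than the paper's code: the GatedInjection class, the sigmoid gate parameterization, the initial gate logit of -4, the linear projection, and the toy "blocks" that replace real transformer layers are assumptions. The sketch also assumes one frequency token per backbone token so that the two can be added.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedInjection:
    """Hypothetical per-layer gate: hidden + sigmoid(gate_logit) * (freq @ proj)."""

    def __init__(self, freq_dim, model_dim, gate_logit=-4.0, seed=0):
        rng = np.random.default_rng(seed)
        # Projection from frequency-token width to backbone width (assumed linear).
        self.proj = rng.normal(0.0, 0.02, size=(freq_dim, model_dim))
        # Would be a learnable scalar in training; a negative logit keeps the gate small.
        self.gate_logit = gate_logit

    @property
    def gate(self):
        return sigmoid(self.gate_logit)

    def __call__(self, hidden, freq_tokens):
        return hidden + self.gate * (freq_tokens @ self.proj)

def forward_with_injection(hidden, freq_tokens, blocks, injectors):
    """Inject frequency tokens before every block (the ordering is an assumption)."""
    for block, inject in zip(blocks, injectors):
        hidden = block(inject(hidden, freq_tokens))
    return hidden

# Usage with toy residual blocks standing in for transformer layers.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))     # 5 tokens, backbone width 16
freq = rng.normal(size=(5, 8))        # 5 frequency tokens, width 8
blocks = [lambda h: h + 0.1 * np.tanh(h)] * 3
injectors = [GatedInjection(8, 16, seed=i) for i in range(3)]
out = forward_with_injection(hidden, freq, blocks, injectors)

# Direct fusion corresponds to a gate fixed at 1; the small initial gate
# perturbs the pretrained features far less at the start of training.
direct_delta = freq @ injectors[0].proj
gated_delta = injectors[0](hidden, freq) - hidden
print(np.linalg.norm(gated_delta) / np.linalg.norm(direct_delta))
```

With one gate per layer, training is free to raise the gate where frequency cues help and keep it small elsewhere, which matches the layer-dependent gate behaviour reported for LGFI.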

Robustness Testing: On RRDataset, which includes transmission and re-digitization (scanning, reprography), FGINet exhibits minimal degradation, outperforming all baselines under severe distortions. This resilience is linked to BMFE’s masking regularization and LGFI’s preservation of foundational semantic space.

Practical and Theoretical Implications

Practically, the reported results suggest that FGINet's gated, structure-aware fusion holds up under distribution shift, supporting detection in real-world settings such as social media and compressed transmission pipelines. Conceptually, progressive gated modulation offers one way to address the alignment problem between semantic and frequency features and may inform future multimodal fusion architectures. Its hyperspherical margin learning could also be generalized to other cross-modal forensic tasks.
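For readers who want to try the hyperspherical margin objective on another task, the following NumPy sketch implements the standard CosFace (large-margin cosine) loss on which HCL is said to be based. The scale and margin values are illustrative assumptions, not the paper's settings, and the function name cosface_loss is hypothetical.

```python
import numpy as np

def cosface_loss(features, weights, labels, scale=30.0, margin=0.35):
    """Mean CosFace loss.

    features: (N, D) embeddings, weights: (C, D) class prototypes,
    labels: (N,) integer class indices. scale and margin are assumed values.
    """
    # Project embeddings and prototypes onto the unit hypersphere.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = scale * (f @ w.T)                  # scaled cosine similarities, (N, C)
    rows = np.arange(len(labels))
    logits[rows, labels] -= scale * margin      # subtract the margin from the true class only
    logits -= logits.max(axis=1, keepdims=True) # stabilize the softmax numerically
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

# Usage: two classes (real, fake) with random embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
protos = rng.normal(size=(2, 4))
y = np.array([0, 1, 0, 1, 1, 0])
print(cosface_loss(feats, protos, y, margin=0.0), cosface_loss(feats, protos, y))
```

The margin makes the true-class cosine look worse than it is during training, so the loss only becomes small once each embedding sits closer to its own prototype than to the other by at least the margin. Normalization means the loss depends only on direction, not on feature magnitude.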

Future Directions

Further research may extend gated fusion to more modalities (e.g., multimodal LLMs) and explore masking ratios that adapt dynamically to distributional characteristics. Incorporating self-supervised frequency representation learning could further improve performance in low-resource regimes. Robustness against adversarial attacks and deliberate artifact suppression remains an open challenge; FGINet's structure provides a promising foundation for defense-oriented detection.

Conclusion

FGINet introduces a frequency-aware, gated semantic fusion framework for AI-generated image detection, effectively addressing frequency shortcut bias and cross-domain representation conflict. The architecture demonstrates clear improvements in both performance and generalizability, validated across synthetic, real-world, and severely degraded datasets. The design principles underpinning BMFE, LGFI, and HCL have significant implications for multimodal representation learning and robust forensic detection (2604.27875).