- The paper introduces a dual-stream architecture that integrates frequency details with semantic cues to overcome cross-domain and cross-generator shifts.
- It employs a Band-Masked Frequency Encoder and Layer-wise Gated Frequency Injection to enhance feature fusion and reduce generator-specific bias.
- The approach achieves up to 96.7% accuracy on benchmarks, demonstrating improved detection and generalizability against AI-generated images.
Frequency-aware Semantic Fusion with Gated Injection for Detection of AI-generated Images
Motivation and Limitations of Prior Approaches
The proliferation of high-fidelity AI-generated images, produced by advanced diffusion, GAN, and transformer-based models, has made detection technically difficult. Conventional frequency-based detection methods exploit generator-induced spectral artifacts, while Vision Foundation Models (VFMs) such as DINOv3 provide robust, open-set semantic features. However, approaches that fuse the two modalities degrade substantially under cross-generator and cross-domain distribution shifts. The paper attributes this to two factors: frequency shortcut bias (over-reliance on generator-specific frequency cues) and cross-domain representation conflict (semantic and frequency features occupy different representation spaces, so fusing them yields entangled, poorly separable representations).
FGINet Architecture and Core Innovations
FGINet introduces a dual-stream detection architecture built around three components:
- Band-Masked Frequency Encoder (BMFE): Frequency components are isolated with a Haar discrete wavelet transform (DWT). Only the high-frequency sub-bands (X_LH, X_HL, X_HH) are retained; the low-frequency sub-band X_LL is discarded because it is redundant with the semantic pathway. During training, BMFE applies spatially independent random masking to the frequency patches, which disrupts shortcut associations with specific generator artifacts and promotes generator-invariant frequency features. Channel-wise representations are then extracted by a lightweight convolutional stem and projected into frequency tokens compatible with the transformer backbone.
- Layer-wise Gated Frequency Injection (LGFI): Instead of direct or late fusion, LGFI injects frequency tokens progressively at each transformer block of the DINOv3 backbone, using learnable gates to modulate the injection strength. Gates are initialized to low values, permitting gradual adaptation without destabilizing the pretrained semantic weights. Early and middle layers favor higher gate values, emphasizing low-level structural integration, while deeper layers suppress frequency information, preserving semantic abstraction.
- Hyperspherical Compactness Learning (HCL): HCL adopts a CosFace loss, so all representations are normalized onto a unit hypersphere and trained with an explicit cosine margin constraint. This enforces tight intra-class clusters and large inter-class separation, which is particularly effective against unseen generative models. Logit visualizations indicate that, compared to a standard loss, HCL yields more compact and more confidently separated class distributions.
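The BMFE input stage described in the first item above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`haar_dwt2`, `mask_patches`, `high_frequency_bands`), the patch size, the masking ratio, zero-filling of masked patches, and drawing a separate mask per sub-band are all hypothetical choices. The convolutional stem and token projection are omitted, and plain Python lists are used only to keep the sketch self-contained; a real model would use a tensor library.

```python
import random

def haar_dwt2(img):
    """Single-level 2D Haar DWT of a grayscale image with even height and width.

    `img` is a list of rows. Returns (LL, LH, HL, HH), each at half resolution.
    Sub-band naming conventions vary between libraries; this is one common choice.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(img), 2):
        rows = ([], [], [], [])
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]          # top-left, top-right
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + c + d) / 2)  # LL: scaled local sum (low-pass)
            rows[1].append((a + b - c - d) / 2)  # LH: top-vs-bottom difference
            rows[2].append((a - b + c - d) / 2)  # HL: left-vs-right difference
            rows[3].append((a - b - c + d) / 2)  # HH: diagonal difference
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh

def mask_patches(band, patch, ratio, rng):
    """Zero each patch x patch block independently with probability `ratio`."""
    out = [row[:] for row in band]
    for i in range(0, len(band), patch):
        for j in range(0, len(band[0]), patch):
            if rng.random() < ratio:
                for y in range(i, min(i + patch, len(band))):
                    for x in range(j, min(j + patch, len(band[0]))):
                        out[y][x] = 0.0
    return out

def high_frequency_bands(img, patch=2, ratio=0.5, training=True, seed=None):
    """Keep LH, HL, HH (LL is dropped); mask patches only during training."""
    rng = random.Random(seed)
    _, lh, hl, hh = haar_dwt2(img)
    bands = [lh, hl, hh]
    if training:
        # Assumption: each sub-band gets its own independent mask.
        bands = [mask_patches(b, patch, ratio, rng) for b in bands]
    return bands
```

At inference time the same function would be called with `training=False`, so the encoder sees the unmasked high-frequency bands.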
Empirical Evaluation and Ablative Analysis
FGINet is reported to outperform prior state-of-the-art methods across diverse benchmarks:
- GenImage: Accuracy 96.7%, demonstrating superior robustness across 8 generator types, outperforming SAFE and MRCL.
- Synthbuster: Accuracy 94.3%, surpassing next-best (DDA) by 4.2 percentage points, maintaining high detection rates on diffusion-based forgeries.
- WildRF and Chameleon: On real-world social media samples, FGINet achieves mAcc of 93.8% (WildRF) and 92.5% (Chameleon). Notably, on hard-to-detect fakes, FGINet's structured integration yields a fake accuracy of 89.9%, well above AIDE, SAFE, and DDA, which supports its claim of cross-generator and cross-domain generalization.
Ablation Studies: BMFE reduces reliance on specific frequency artifacts, LGFI yields an accuracy improvement of over 7% relative to direct fusion, and combining BMFE, LGFI, and HCL achieves the highest accuracy and AP on Chameleon. LGFI applied to early transformer layers achieves the best cross-modal alignment, validating the hierarchical integration strategy.
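The contrast between direct fusion and layer-wise gated injection in this ablation can be illustrated with a small sketch. Several details here are assumptions rather than facts from the paper: the additive form of the injection, a scalar sigmoid gate per layer, injecting before (rather than after) each block, and frequency tokens having the same shape as the hidden tokens. All function names are hypothetical, and `identity_block` merely stands in for a pretrained transformer block.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def add_scaled(hidden, freq, weight):
    """Return hidden + weight * freq, token by token."""
    return [[h + weight * f for h, f in zip(h_tok, f_tok)]
            for h_tok, f_tok in zip(hidden, freq)]

def direct_fusion(hidden, freq, blocks):
    """Baseline: add frequency tokens once, ungated, before the first block."""
    hidden = add_scaled(hidden, freq, 1.0)
    for block in blocks:
        hidden = block(hidden)
    return hidden

def gated_injection(hidden, freq, blocks, gate_logits):
    """Inject frequency tokens before every block, scaled by a per-layer gate.

    Assumption: gate = sigmoid(logit), so strongly negative logits start the
    gates near zero and leave the pretrained semantic path almost untouched.
    """
    for block, logit in zip(blocks, gate_logits):
        hidden = block(add_scaled(hidden, freq, sigmoid(logit)))
    return hidden

def identity_block(tokens):
    """Hypothetical stand-in for a pretrained transformer block."""
    return tokens
```

With gate logits initialized to a negative value such as `-6.0`, the output of `gated_injection` stays close to the purely semantic forward pass, whereas `direct_fusion` perturbs the input to the first pretrained block at full strength from the start of training.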
Robustness Testing: On RRDataset, which includes transmission and re-digitization (scanning, reprography), FGINet exhibits minimal degradation, outperforming all baselines under severe distortions. This resilience is linked to BMFE’s masking regularization and LGFI’s preservation of foundational semantic space.
Practical and Theoretical Implications
Practically, FGINet's gated, structure-aware fusion is designed to hold up under distribution shift, which matters for real-world scenarios such as social media and compressed transmission pipelines. Theoretically, the progressive, gated modulation offers one answer to longstanding modality alignment problems and a template for future multimodal fusion architectures. Its hyperspherical margin learning could be generalized to other cross-modal forensic tasks.
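Part of why the margin objective transfers to other tasks is that it only needs an embedding and one weight vector per class. The following is a minimal sketch of a CosFace-style loss for a single sample, assuming a two-class (real vs. fake) setup. The function names and the `scale` and `margin` defaults are illustrative choices, not values taken from the paper.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosface_loss(embedding, class_weights, label, scale=30.0, margin=0.35):
    """Cross-entropy over scaled cosines, with a margin subtracted from the
    true-class cosine so that class must win by at least `margin`."""
    z = l2_normalize(embedding)
    cosines = [sum(a * b for a, b in zip(z, l2_normalize(w)))
               for w in class_weights]
    logits = [scale * (c - margin if k == label else c)
              for k, c in enumerate(cosines)]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[label]
```

Because both the embedding and the class weights are normalized, the loss depends only on angles: rescaling an embedding leaves it unchanged, and a sample lying exactly between the two class directions is still penalized until it clears the margin.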
Future Directions
Further research may extend gated fusion to more modalities (e.g., multimodal LLMs) and explore masking ratios that adapt dynamically to distributional characteristics. Incorporating self-supervised frequency representation learning could further improve performance in low-resource regimes. Robustness against adversarial attacks and deliberate artifact suppression remains an open challenge; FGINet's structure provides a promising foundation for defense-oriented detection.
Conclusion
FGINet introduces a frequency-aware, gated semantic fusion framework for AI-generated image detection, effectively addressing frequency shortcut bias and cross-domain representation conflict. The architecture demonstrates clear improvements in both performance and generalizability, validated across synthetic, real-world, and severely degraded datasets. The design principles underpinning BMFE, LGFI, and HCL have significant implications for multimodal representation learning and robust forensic detection (2604.27875).