Multi-Feature Fusion Approach for Generative AI Images Detection

Published 31 Mar 2026 in cs.CV | (2603.29788v1)

Abstract: The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a multi-feature fusion approach that integrates MSCN, CLIP, and MLBP extractors for robust GenAI detection.
It demonstrates that fusing complementary features improves separation and stability in detecting synthetic images across diverse generative models.
Empirical results show enhanced accuracy, reduced variance, and superior generalization compared to state-of-the-art detection methods.

Multi-Feature Fusion Strategies for Generative AI Image Detection

Introduction

The proliferation of high-fidelity Generative AI (GenAI) models has escalated the challenge of detecting synthetic images, disrupting established multimedia forensics, authenticity assessment, and content verification protocols. While early detectors based on Natural Scene Statistics (NSS) could reliably identify low-level statistical irregularities, recent diffusion models (e.g., Stable Diffusion, DALL·E 3, Midjourney) have closed this gap, producing samples that are visually and statistically aligned with natural images. The central question addressed in "Multi-Feature Fusion Approach for Generative AI Images Detection" (2603.29788) is whether a single feature space can provide reliability across generator architectures and imaging scenarios, or whether multi-domain fusion is essential for robust GenAI detection.

Multi-Feature Fusion Framework

The paper introduces a modular pipeline that encodes each input image with three parallel extractors: statistical (MSCN), semantic (CLIP), and textural (MLBP) features. These standardized representations are concatenated and passed to conventional classifiers (Gradient Boosting, Random Forest, SVM) for binary discrimination.

Figure 1: The multi-feature fusion pipeline: MSCN, CLIP, and MLBP feature encoders in parallel, followed by normalization, concatenation, and classification.

MSCN (NSS features): Extracts Haralick statistics from GLCMs derived from mean-subtracted, contrast-normalized images, capturing deviations from natural luminance and contrast statistics.

CLIP embeddings: Provides high-level, semantically rich representations leveraging the CLIP-ViT/B-32 vision-LLM, sensitive to contextual and compositional priors not captured by sole fine-grained patterns.

MLBP: Quantifies local textural patterns at multiple scales via uniform LBP histograms, detecting micro-structure regularities and boundary artifacts often overlooked in other domains.

The central hypothesis posits that the compositional failures of GenAI images occur on orthogonal axes: realistic textures may mask statistical deviations, plausible semantics may coexist with repetitive patterns, or high-level alignment may hide low-level inconsistencies. By fusing representations, classifiers are exposed to manifold deviations inaccessible to any single feature modality.

Feature Analysis and Visualization

The paper provides a targeted illustration of the behavior of MSCN and MLBP features.

Figure 2: Comparative MSCN contrast and MLBP texture maps for natural versus DALLE- and Stable Diffusion–generated images. Synthetic samples exhibit local intensity irregularities ("halos") and excessive textural regularity ("tiling"), in contrast to the stochastic continuity of real images.

This visualization confirms that certain residual artifacts from generation pipelines persist in local contrast statistics and textures, even amid high global photorealism.

Manifold Divergence and Feature Space Dynamics

A core analytical contribution is the empirical examination of the real-vs-GenAI manifold separation in various feature subspaces. The Gaussian Fréchet distance quantifies feature space divergence between natural and synthetic distributions.

Figure 3: Log-scaled Fréchet distances between natural and GenAI distributions for individual and fused features across multiple datasets. Larger values indicate stronger separability in the corresponding projection.

Stronger manifold separation correlates with improved discrimination metrics, yet not all distributional shifts align with axes discriminative for classification. A combined scatter/correlation analysis between Fréchet distance and MCC demonstrates robust Spearman correlation within each dataset but also highlights that some features yield weak practical separation despite large statistical differences.

Figure 4: MCC vs. log10 Fréchet divergence for all feature configurations. Spearman $|\rho|$ summarizes the observed correlation between feature-space manifold separation and practical detection performance.

This supports the assertion that detection efficacy depends heavily on both the breadth of divergence and its alignment with decision-critical axes, strengthening the argument for orthogonal fusion.

Embedding Space Visualization

Applying t-SNE on CIFAKE samples, the paper illustrates improvement in cluster separability as more complementary features are fused.

Figure 5: t-SNE projections for all feature combinations on CIFAKE: individual features yield overlapping real/GenAI clusters, while full fusion produces more distinct and compact class clusters.

These results both visually and quantitatively (via silhouette scores) validate the fusion hypothesis: distinct subspaces are required to reliably expose generators’ diverse failure modes.

Robustness, Generalization, and Statistical Analysis

Synthbuster, PKU-4K, CIFAKE, and FakeBench are utilized for comprehensive evaluation. CLIP dominates single-feature performance on all datasets, but is not universally sufficient: low-level (MSCN) and texture-based (MLBP) cues compensate in cases of semantic conformity. Fusion configurations (particularly ALL and CLIP+MSCN/CLIP+MLBP) consistently yield higher Accuracy and MCC, demonstrating robust cross-generator and cross-dataset generalization.

The statistical tests (Kruskal–Wallis, effect size, FDR correction) reveal:

Performance variance is dominated by feature representation, not the choice of classifier, prior to fusion.
Feature fusion systematically reduces variance and effect size, stabilizing metric outcomes across classifiers and dataset regimes.
Figure 6: Effect size ( $\epsilon^2$ ) vs. statistical significance for classifier and feature domains across datasets, indicating representation-driven variance prior to fusion and classifier-induced variability only under high-heterogeneity fusion.

Correlation analysis between Accuracy and MCC further demonstrates that fusion yields more balanced, stable decision-making compared to the higher volatility of single-feature approaches.

Generalization in Mixed-Model and Heterogeneous Scenarios

Evaluating on random mixtures of images from diverse generative models, the fused feature configurations maintain superior accuracy and reduced variance, whereas individual features frequently falter under such distribution shifts. Semantic (CLIP) features enable the highest single-feature robustness, while fusion with statistical and texture cues consistently enhances deployment reliability.

Figure 7: Grouped bar-plots of Accuracy and MCC for individual and fusion features on mixed-generator settings. Fusion significantly enhances mean performance and reduces inter-classifier variance.

Benchmarking Against State-of-the-Art

The framework surpasses established detectors—CNNDetection, PatchForensics, DMImageDetection, and Synthbuster—on all considered benchmarks. Competing methods exhibit strong dataset dependency or focus on restricted artifact types, leading to poor cross-domain generalization. Fusion-based approaches, particularly integrating CLIP, establish consistent superiority.

Figure 8: Comparative bar-plot of detection accuracy across SOTA baselines on multiple benchmarks. The proposed fusion-based approach outperforms all competitors in every regime.

Implications and Future Directions

The empirical findings underscore that GenAI detection, as of current generative capabilities, is inherently representation-centric. While classifiers can fine-tune boundaries, their efficacy is bottlenecked by the capacity of feature spaces to expose diverse generative errors. Fusion is neither naive stacking nor mere redundancy: it exploits explicit complementarity, robustly reducing sensitivity to generator, domain, and classifier choice.

This architecture points towards a future in which:

Adaptive and context-sensitive fusion, potentially leveraging attention or sample-dependent weighting, could yield further boosts in unseen generator scenarios.
End-to-end differentiable optimization of fused representations may reduce overhead of manual preprocessing and benefit from task-driven joint learning.
Video and spatio-temporal extensions will be essential as generative content pipelines evolve beyond single images.

The reliance on pretrained models introduces biases from the source training sets, and some generator families (e.g., VQ-Diffusion) still lead to performance dips, indicating room for ongoing work on transferability and bias mitigation.

Conclusion

A systematic investigation of multi-feature fusion for GenAI image detection establishes the necessity of integrating statistical, semantic, and textural cues. The fusion architecture yields higher accuracy, greater robustness, and reduced representation variance compared to all state-of-the-art baselines and single-domain methods, providing a principled foundation for future general-purpose, domain-agnostic detectors—critical as generative models continue to advance in realism and diversity.

Markdown Report Issue