Cross-Generator Image Classification
- Cross-generator image classification is a technique that distinguishes real images from synthetic ones by analyzing universal authenticity cues rather than generator-specific artifacts.
- It leverages foundation models such as DINOv3, MAE, and CLIP to achieve robust zero-shot detection and high accuracy across unseen generative models.
- Benchmark results show significant improvements over artifact-based detectors, enabling efficient transfer learning, interpretability, and practical applications in visual forensics.
Cross-generator image classification is a paradigm in visual forensics and recognition where a model is required to distinguish real images from synthetic images ("forgeries") generated by a diverse and ever-evolving set of generative models. Rather than focusing on the artifacts or signatures specific to a single generator, robust cross-generator classification seeks universal cues of authenticity that generalize to images from unseen generators, including both GANs and diffusion models. This field has been accelerated by large-scale benchmarks (e.g., GenImage) and foundation architectures (e.g., DINOv3, MAE, CLIP), which expose the critical limitations of prior artifact-memorizing detectors and motivate the search for generator-invariant, interpretable, and efficient solutions.
1. Formal Problem Definition and Benchmarking
The cross-generator problem is formalized as learning a discriminant function $f: \mathcal{X} \to \{0, 1\}$, trained on samples from a set of seen generators $\mathcal{G}_{\text{seen}}$ and tested on a disjoint set of unseen generators $\mathcal{G}_{\text{unseen}}$, with $\mathcal{G}_{\text{seen}} \cap \mathcal{G}_{\text{unseen}} = \emptyset$ (Huang et al., 27 Nov 2025, Zhu et al., 2023). For an image $x$:

$$f(x) = \begin{cases} 1, & x \text{ synthetic} \\ 0, & x \text{ real,} \end{cases}$$

with the requirement that $f$ remain accurate on fakes drawn from generators in $\mathcal{G}_{\text{unseen}}$.
Benchmarks such as GenImage provide large, multi-generator testbeds where detectors are trained on fakes from one generator and tested on the others. Typical datasets involve up to eight generators (Midjourney, SD V1.4/V1.5, ADM, GLIDE, Wukong, VQDM, BigGAN), with balanced splits: 162k fakes + 162k reals per training subset, and 6k fakes + 6k reals per test generator. Metrics include accuracy, AUC, and average precision across train–test generator pairs (Zhu et al., 2023). Table 1 below illustrates the characteristic drop in accuracy when testing on unseen generators for standard CNN or Transformer-based detectors.
*Table 1. Cross-generator test accuracy (%): rows are training generators, columns are test generators.*

| Training ↓ | Midjourney | SD V1.4 | ADM | GLIDE | Wukong | BigGAN | Avg |
|---|---|---|---|---|---|---|---|
| Midjourney | 98.8 | 76.4 | 64.1 | 78.9 | 71.4 | 50.1 | 71.1 |
| SD V1.4 | 54.9 | 99.9 | 53.5 | 61.9 | 98.2 | 52.0 | 72.1 |
| ADM | 58.6 | 53.1 | 99.0 | 97.1 | 53.0 | 88.3 | 70.4 |
| GLIDE | 50.7 | 50.0 | 56.0 | 99.9 | 50.3 | 74.0 | 60.2 |
Within-generator accuracy is near-perfect; cross-generator accuracy often collapses toward chance except for closely related architectures.
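The train-on-one, test-on-the-rest protocol behind Table 1 can be sketched with simulated stand-in features. Everything here is synthetic and illustrative (no real backbone or GenImage data): fakes from the "seen" generator shift features along one axis, fakes from an "unseen" generator along another, so a linear probe fit on the former fails on the latter.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(fake_mean, n=500, d=16):
    """Synthetic stand-in for backbone features: reals ~ N(0, I),
    fakes ~ N(fake_mean, I). Each 'generator' shifts fakes differently."""
    reals = rng.normal(0.0, 1.0, (n, d))
    fakes = rng.normal(fake_mean, 1.0, (n, d))
    X = np.vstack([reals, fakes])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def fit_linear_probe(X, y):
    # Least-squares linear probe on frozen features (bias via augmentation).
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    pred = (Xb @ w > 0).astype(float)
    return (pred == y).mean()

# Seen generator: strong shift along dim 0; unseen: weak shift along dim 1.
seen_mean = np.zeros(16); seen_mean[0] = 3.0
unseen_mean = np.zeros(16); unseen_mean[1] = 0.5   # cue lives elsewhere

Xtr, ytr = make_split(seen_mean)
w = fit_linear_probe(Xtr, ytr)

Xin, yin = make_split(seen_mean)      # within-generator test
Xout, yout = make_split(unseen_mean)  # cross-generator test

print(f"within-generator acc: {accuracy(w, Xin, yin):.2f}")
print(f"cross-generator acc:  {accuracy(w, Xout, yout):.2f}")
```

The within-generator accuracy is high while the cross-generator accuracy collapses toward chance, mirroring the off-diagonal drop in Table 1.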
2. Failure Modes of Artifact-Memorizing Detectors
Traditional classifiers, whether based on supervised CNNs, patch-frequency analysis (Spec), or handcrafted spectral/gram features, tend to memorize artifact patterns unique to specific generators (upsampling kernels, quantization noise, spectral biases) (Zhu et al., 2023, Huang et al., 27 Nov 2025). This leads to:
- Overfitting: Detectors achieve high accuracy on the generator seen during training, but perform poorly on samples from new architectures.
- Memorization of Peripheral Artifacts: Models latch onto high-frequency generator-specific traces (e.g., checkerboard patterns, diffusion edge halos) rather than intrinsic semantic inconsistencies, resulting in non-transferable decision boundaries.
A typical cross-domain test reveals abrupt performance collapse when the test generator differs from the training generator.
3. Foundation Model-Based Approaches: DINOv3, MAE, CLIP
Recent research demonstrates that large-scale frozen visual transformers (e.g., DINOv3, Masked Autoencoders, CLIP) yield significantly improved cross-generator generalization, often with minimal or no task-specific fine-tuning (Huang et al., 27 Nov 2025, Jang et al., 9 Nov 2025, Shi et al., 3 Aug 2025).
DINOv3:
A vision-only transformer backbone trained by self-distillation, whose patch tokens encode the global low-frequency structures characteristic of real images—so-called "authenticity cues." Multiple analyses establish:
- Frequency Perspective: DINOv3’s accuracy is dominated by low-frequency components. Patch tokens, not the global CLS or register tokens, carry the separable cues, and performance collapses under high-pass filtering (Huang et al., 27 Nov 2025).
- Spatial Perspective: DINOv3 relies on globally coherent layouts; random masking has negligible impact while patch shuffling leads to significant accuracy loss (average 5.6%).
- Token-Level Analysis: Patch tokens alone reach 74.0% accuracy, matching or exceeding all-token usage.
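The frequency probes above rest on splitting an image into low- and high-frequency bands before feature extraction. The radial FFT filter below is a generic sketch of that band-splitting operation (the cutoff value and test image are illustrative, not the paper's exact setup):

```python
import numpy as np

def radial_filter(img, cutoff, keep="low"):
    """Keep only spatial frequencies below (or above) `cutoff`,
    mimicking the low-/high-pass probes used to localize authenticity cues."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)          # radial frequency per FFT bin
    mask = radius <= cutoff if keep == "low" else radius > cutoff
    spectrum = np.fft.fft2(img)
    return np.real(np.fft.ifft2(spectrum * mask))

rng = np.random.default_rng(0)
# Smooth ramp (low-frequency structure) plus white noise (high-frequency).
img = rng.normal(size=(64, 64)) + np.outer(np.linspace(0, 1, 64), np.ones(64))

low = radial_filter(img, cutoff=0.1, keep="low")
high = radial_filter(img, cutoff=0.1, keep="high")

# The two bands partition the signal: low + high reconstructs the input.
print(np.allclose(low + high, img))
```

Running a detector on `high` alone corresponds to the high-pass ablation under which DINOv3's accuracy collapses.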
Masked Autoencoders (MAE) and CINEMAE:
A MAE trained to reconstruct masked patches from visible context implicitly models the conditional distribution $p(x_{\text{masked}} \mid x_{\text{visible}})$. By measuring conditional negative log-likelihood (NLL) and aggregating patch-level reconstruction anomalies, CINEMAE attains over 95% zero-shot accuracy across eight unseen generators in GenImage (Jang et al., 9 Nov 2025). Generator-agnostic anomaly scores are achieved by fusing local NLL statistics with global MAE features.
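A toy sketch of the context-conditional anomaly idea follows, with a neighbor-mean "reconstructor" standing in for the frozen MAE decoder. The function, patch size, and noise scale are illustrative assumptions, not CINEMAE's implementation; the point is that a locally inconsistent region yields high conditional NLL while a globally coherent image does not.

```python
import numpy as np

def patch_anomaly_scores(img, patch=8, sigma=1.0):
    """Toy stand-in for a conditional-NLL score: each patch is 'reconstructed'
    as the mean of its visible neighbors, and the Gaussian NLL (up to a
    constant) of the true patch under that prediction is the anomaly score.
    A real system would use a frozen MAE decoder instead."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    means = img[:gh*patch, :gw*patch].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    scores = np.zeros((gh, gw))
    for i in range(gh):
        for j in range(gw):
            nb = [means[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                  if 0 <= a < gh and 0 <= b < gw]
            pred = np.mean(nb)
            scores[i, j] = 0.5 * ((means[i, j] - pred) / sigma) ** 2
    return scores

real = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # globally coherent
fake = real.copy()
fake[24:40, 24:40] += 0.8          # pasted-in region, inconsistent with context

score_real = patch_anomaly_scores(real).mean()
score_fake = patch_anomaly_scores(fake).mean()
print(score_fake > score_real)     # the inconsistent image scores higher
```

Aggregating such patch-level scores over the image gives a single anomaly statistic, analogous (very loosely) to CINEMAE's fused local NLL signal.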
CLIP and Multimodal Alignment:
Discriminative representation learning (MiraGe) tightly aligns real/fake image features to text side anchors ("Real" vs. "Fake"). By minimizing intra-class variation and maximizing inter-class separation, multimodal prompts ground generator-invariant embeddings, yielding robust transfer to new generators (Shi et al., 3 Aug 2025).
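The anchor-alignment idea can be sketched as follows. The random vectors stand in for CLIP text embeddings of the prompts "Real" and "Fake" (a real system would obtain them from a frozen CLIP text encoder), and the update rule is a deliberately simplified caricature of the contrastive pull/push objective, not MiraGe's actual loss.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
d = 32
# Hypothetical stand-ins for CLIP text embeddings of "Real" and "Fake".
anchor_real = rng.normal(size=d)
anchor_fake = rng.normal(size=d)

def anchor_align_step(feat, anchor_pos, anchor_neg, lr=0.1):
    """One simplified alignment step: pull the image feature toward its
    class text anchor and push it away from the other anchor."""
    return feat + lr * (anchor_pos - feat) - lr * 0.5 * (anchor_neg - feat)

feat = rng.normal(size=d)                  # image embedding of a fake
for _ in range(20):
    feat = anchor_align_step(feat, anchor_fake, anchor_real)

# Inference: nearest text anchor by cosine similarity.
pred = "fake" if cosine(feat, anchor_fake) > cosine(feat, anchor_real) else "real"
print(pred)
```

Because classification reduces to comparing similarities against fixed text anchors, the decision rule itself carries no generator-specific parameters, which is the intuition behind the transfer behavior reported for MiraGe.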
4. Algorithms and Training-Free Generalization
State-of-the-art methods for cross-generator detection emphasize transferability, minimal overfitting, and interpretable decision mechanisms.
- Fisher-Guided Token Selection (FGTS):
Selects patch tokens by maximizing Fisher score (class means difference over sum of variances), enabling a lightweight linear probe that aggregates token embeddings for classification (Huang et al., 27 Nov 2025).
- Training-Free Strategies:
Frozen backbones (DINOv3, MAE, CLIP) are never updated; only a small probe (logistic regression or an MLP head) is trained on reference images. For DINOv3, roughly 2k reference images suffice to reach state-of-the-art performance with fewer than 8k learnable parameters.
- Semi-supervised Clustering (TriDetect):
By exploiting latent sub-structure within the "fake" class, methods like TriDetect use balanced Sinkhorn-Knopp assignments and cross-view consistency regularization to discover GAN vs. diffusion submanifolds, improving robustness under architectural domain shift (Nguyen-Le et al., 23 Nov 2025).
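The Fisher-score criterion behind FGTS can be sketched on simulated patch-token embeddings. Averaging the per-dimension score over the embedding dimension, as done here, is one plausible simplification, not necessarily FGTS's exact formulation; the data is synthetic, with the discriminative cue planted at known token positions.

```python
import numpy as np

def fisher_scores(tokens_real, tokens_fake):
    """Per-token-position Fisher score (simplified): squared difference of
    class means over the sum of class variances, averaged over the
    embedding dimension."""
    mu_r, mu_f = tokens_real.mean(0), tokens_fake.mean(0)     # [T, D]
    var_r, var_f = tokens_real.var(0), tokens_fake.var(0)
    per_dim = (mu_r - mu_f) ** 2 / (var_r + var_f + 1e-8)
    return per_dim.mean(-1)                                   # [T]

rng = np.random.default_rng(0)
n, T, D = 200, 16, 8                # images, token positions, embedding dim
tokens_real = rng.normal(size=(n, T, D))
tokens_fake = rng.normal(size=(n, T, D))
tokens_fake[:, 3] += 1.5            # strong planted cue at position 3
tokens_fake[:, 7] += 0.5            # weaker planted cue at position 7

scores = fisher_scores(tokens_real, tokens_fake)
top2 = np.argsort(scores)[-2:]
print(sorted(top2.tolist()))        # → [3, 7]
```

A lightweight linear probe would then be trained only on the selected token positions, which is what keeps the FGTS probe small and resistant to overfitting.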
5. Quantitative Results and Comparative Analysis
Multiple large-scale benchmarks quantify the gains from foundation model-based algorithms.
| Method | GenImage Avg Acc (%) | So-Fake-OOD Avg Acc (%) | AIGCDetection Avg Acc (%) |
|---|---|---|---|
| DINOv3+FGTS+LinearProbe | 92.6 | 87.5 | 92.45 |
| CINEMAE (MAE-based) | 95.96 | — | — |
| MiraGe (CLIP-based) | 92.6 | — | 92.9 |
| TriDetect (CLIP+Sinkhorn) | 98.82 (AUC) | — | 98.69 (AUC) |
| CNNSpot, Spec, F3Net, GramNet | 64–74 | — | — |
Notably, the FGTS+linear-probe and CINEMAE methods consistently outperform artifact-based approaches and maintain high accuracy in zero-shot settings.
6. Interpretability, Limitations, and Extensions
Interpretability analyses show why foundation models generalize robustly:
- Global Coherence and Low-Frequency Structure:
Performance is maximized when low-frequency features and spatial layout are preserved, indicating reliance on authenticity cues rather than superficial texture (Huang et al., 27 Nov 2025).
- Anomaly Scores and Semantic Alignment:
Context-conditional reconstruction anomalies (CINEMAE) and multimodal anchoring (MiraGe) operate at the semantic level, not artifact-level, reinforcing generator-agnostic separability (Jang et al., 9 Nov 2025, Shi et al., 3 Aug 2025).
Limitations and open directions include:
- Reduced accuracy on ultra-realistic or adversarial fakes (e.g., the Chameleon set), which demand finer-grained modeling.
- Sensitivity to heavily degraded (blurred, compressed) inputs is improved but not fully solved.
- Potential future extensions involve dynamic token selection, application to multimodal/video backbones, and fast adaptation to evolving generator families (Huang et al., 27 Nov 2025).
7. Domain Adaptation and Cross-Domain Classification Strategies
For settings where test images reside in a different domain (e.g., cross-style translation), domain-adaptation frameworks such as DIPS use unsupervised image-to-image translation models together with pseudo-supervised checkpoint selection via Gaussian Mixture Models in feature space (Al-Hindawi et al., 2023). Pseudo-labels are assigned by clustering translated target-domain features, and checkpoints are selected by maximizing classifier agreement with those pseudo-labels under balanced accuracy, F1, or AUC. DIPS demonstrates near-perfect rank correlation with the true supervised metrics and outperforms popular image-quality criteria (e.g., FID) for model selection.
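A compact sketch of the pseudo-supervised selection loop follows. Plain 2-means stands in for the paper's Gaussian Mixture clustering, and the features and "checkpoints" are synthetic; only the overall shape of the procedure (cluster, pseudo-label, pick the checkpoint that agrees most) reflects DIPS.

```python
import numpy as np

def two_means(X, iters=50):
    """Stand-in for DIPS's feature-space GMM step: cluster target-domain
    features into two groups (here plain 2-means, deterministic init)."""
    c = X[[0, -1]].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - c[None]) ** 2).sum(-1)   # [n, 2] distances
        z = d.argmin(1)
        for k in (0, 1):
            if (z == k).any():
                c[k] = X[z == k].mean(0)
    return z

def balanced_accuracy(y_true, y_pred):
    return float(np.mean([(y_pred[y_true == k] == k).mean() for k in (0, 1)]))

rng = np.random.default_rng(0)
# Translated target-domain features: two separable clusters.
X = np.vstack([rng.normal(-2, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y_true = np.repeat([0, 1], 100)          # unknown at selection time

pseudo = two_means(X)
if balanced_accuracy(y_true, pseudo) < 0.5:
    pseudo = 1 - pseudo                  # cluster ids are arbitrary; align them

# Checkpoint selection: pick the classifier whose predictions agree most
# with the pseudo-labels (two hypothetical checkpoints' predictions).
ckpt_good = (X.mean(1) > 0).astype(int)  # tracks the cluster structure
ckpt_bad = rng.integers(0, 2, len(X))    # uninformative checkpoint
scores = {name: balanced_accuracy(pseudo, pred)
          for name, pred in [("good", ckpt_good), ("bad", ckpt_bad)]}
best = max(scores, key=scores.get)
print(best)  # → good
```

The pseudo-label agreement scores play the role that true labeled metrics would play if target-domain labels were available, which is why their rank correlation with the supervised metrics is the key quantity DIPS reports.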
References
- Rethinking Cross-Generator Image Forgery Detection through DINOv3 (Huang et al., 27 Nov 2025)
- CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection (Jang et al., 9 Nov 2025)
- Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection (Nguyen-Le et al., 23 Nov 2025)
- MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection (Shi et al., 3 Aug 2025)
- GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image (Zhu et al., 2023)
- Domain-knowledge Inspired Pseudo Supervision (DIPS) for Unsupervised Image-to-Image Translation Models to Support Cross-Domain Classification (Al-Hindawi et al., 2023)
- iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition (Wei et al., 2022)