
Cross-Generator Image Classification

Updated 17 January 2026
  • Cross-generator image classification is a technique that distinguishes real images from synthetic ones by analyzing universal authenticity cues rather than generator-specific artifacts.
  • It leverages foundation models such as DINOv3, MAE, and CLIP to achieve robust zero-shot detection and high accuracy across unseen generative models.
  • Benchmark results show significant improvements over artifact-based detectors, enabling efficient transfer learning, interpretability, and practical applications in visual forensics.

Cross-generator image classification is a paradigm in visual forensics and recognition where a model is required to distinguish real images from synthetic images ("forgeries") generated by a diverse and ever-evolving set of generative models. Rather than focusing on the artifacts or signatures specific to a single generator, robust cross-generator classification seeks universal cues of authenticity that generalize to images from unseen generators, including both GANs and diffusion models. This field has been accelerated by large-scale benchmarks (e.g., GenImage) and foundation architectures (e.g., DINOv3, MAE, CLIP), which expose the critical limitations of prior artifact-memorizing detectors and motivate the search for generator-invariant, interpretable, and efficient solutions.

1. Formal Problem Definition and Benchmarking

The cross-generator problem is formalized as learning a discriminant function $f_\theta : \mathcal{I} \to \{\text{real}, \text{fake}\}$, trained on samples from a set of seen generators $\mathcal{G}_{\mathrm{seen}}$ and tested on a disjoint set of unseen generators $\mathcal{G}_{\mathrm{unseen}}$, with $\mathcal{G}_{\mathrm{seen}} \cap \mathcal{G}_{\mathrm{unseen}} = \varnothing$ (Huang et al., 27 Nov 2025, Zhu et al., 2023). For an image $x$:

$$x \sim \begin{cases} p_{\mathrm{real}}(x), & x \in \mathcal{I}_{\mathrm{real}} \\ p_{\mathrm{fake}}^{(g)}(x), & x \in \mathcal{I}_{\mathrm{fake}}^{(g)}, \quad g \in \mathcal{G} \end{cases}$$

Benchmarks such as GenImage provide large, multi-generator testbeds where detectors are trained on fakes from one generator and tested on the others. Typical datasets involve up to eight generators (Midjourney, SD V1.4/V1.5, ADM, GLIDE, Wukong, VQDM, BigGAN), with balanced splits: 162k fakes + 162k reals per training subset, and 6k fakes + 6k reals per test generator. Metrics include accuracy, AUC, and average precision across train–test generator pairs (Zhu et al., 2023). The table below illustrates the characteristic drop in accuracy when testing on unseen generators for standard CNN or Transformer-based detectors.

Training ↓    Midjourney  SD V1.4  ADM   GLIDE  Wukong  BigGAN  Avg
Midjourney         98.8     76.4   64.1   78.9    71.4    50.1  71.1
SD V1.4            54.9     99.9   53.5   61.9    98.2    52.0  72.1
ADM                58.6     53.1   99.0   97.1    53.0    88.3  70.4
GLIDE              50.7     50.0   56.0   99.9    50.3    74.0  60.2

Within-generator accuracy is near-perfect; cross-generator accuracy often collapses toward chance except for closely related architectures.
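The leave-one-generator-out protocol behind such tables can be sketched with a toy simulation. All names here (`make_split`, `train_detector`, `evaluate`) are hypothetical stand-ins: a nearest-class-mean classifier on synthetic features, not a real detector, used only to show how within- and cross-generator accuracies are tabulated.

```python
# Sketch of the cross-generator evaluation protocol: train on fakes from one
# generator, test against every generator, tabulate the accuracy matrix.
import numpy as np

rng = np.random.default_rng(0)
generators = ["Midjourney", "SD V1.4", "ADM", "GLIDE"]

def make_split(gen, n=200):
    """Toy data: reals ~ N(0, I); each generator's fakes shift a different feature."""
    shift = np.zeros(4)
    shift[generators.index(gen)] = 2.0
    reals = rng.normal(0.0, 1.0, size=(n, 4))
    fakes = rng.normal(0.0, 1.0, size=(n, 4)) + shift
    X = np.vstack([reals, fakes])
    y = np.array([0] * n + [1] * n)
    return X, y

def train_detector(X, y):
    """Nearest-class-mean 'detector': memorizes the training generator's artifact."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def evaluate(model, X, y):
    mu_real, mu_fake = model
    d_real = ((X - mu_real) ** 2).sum(axis=1)
    d_fake = ((X - mu_fake) ** 2).sum(axis=1)
    return ((d_fake < d_real).astype(int) == y).mean()

acc = np.zeros((len(generators), len(generators)))
for i, g_train in enumerate(generators):
    model = train_detector(*make_split(g_train))
    for j, g_test in enumerate(generators):
        acc[i, j] = evaluate(model, *make_split(g_test))
```

Because each toy generator leaves its "artifact" in a different feature, the diagonal of `acc` is high while off-diagonal entries collapse toward chance, mirroring the benchmark table.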

2. Failure Modes of Artifact-Memorizing Detectors

Traditional classifiers, whether based on supervised CNNs, patch-frequency analysis (Spec), or handcrafted spectral/gram features, tend to memorize artifact patterns unique to specific generators (upsampling kernels, quantization noise, spectral biases) (Zhu et al., 2023, Huang et al., 27 Nov 2025). This leads to:

  • Overfitting: Detectors achieve high accuracy on the generator seen during training, but perform poorly on samples from new architectures.
  • Memorization of Peripheral Artifacts: Models latch onto high-frequency generator-specific traces (e.g., checkerboard patterns, diffusion edge halos) rather than intrinsic semantic inconsistencies, resulting in non-transferable decision boundaries.

A typical cross-domain test reveals abrupt performance collapse when the test generator differs from the training generator.

3. Foundation Model-Based Approaches: DINOv3, MAE, CLIP

Recent research demonstrates that large-scale frozen visual transformers (e.g., DINOv3, Masked Autoencoders, CLIP) yield significantly improved cross-generator generalization, often with minimal or no task-specific fine-tuning (Huang et al., 27 Nov 2025, Jang et al., 9 Nov 2025, Shi et al., 3 Aug 2025).

DINOv3:

DINOv3 is a vision-only transformer backbone trained by self-distillation; its patch tokens encode global low-frequency structures characteristic of real images, so-called "authenticity cues." Multiple analyses establish:

  • Frequency Perspective: DINOv3’s accuracy is dominated by low-frequency components. Patch tokens, not the global CLS or register tokens, carry the separable cues, and performance collapses under high-pass filtering (Huang et al., 27 Nov 2025).
  • Spatial Perspective: DINOv3 relies on globally coherent layouts; random masking has negligible impact, while patch shuffling leads to a significant accuracy loss (5.6% on average).
  • Token-Level Analysis: Patch tokens alone reach 74.0% accuracy, matching or exceeding all-token usage.
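The frequency ablations above can be reproduced in miniature with a radial mask in the 2-D Fourier domain. This is an illustrative sketch of the analysis style (low-pass vs. high-pass inputs), not the paper's implementation; `radial_mask` and `filter_image` are assumed helpers.

```python
# Split an image into low- and high-frequency bands via a radial FFT mask,
# the kind of input manipulation used to probe which band a detector relies on.
import numpy as np

def radial_mask(h, w, cutoff, keep_low=True):
    """Boolean mask selecting frequencies inside (low) or outside (high) a radius."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy ** 2 + fx ** 2)
    return (r <= cutoff) if keep_low else (r > cutoff)

def filter_image(img, cutoff, keep_low=True):
    """Apply the mask in frequency space and return the real-valued image."""
    F = np.fft.fft2(img)
    F = F * radial_mask(*img.shape, cutoff, keep_low)
    return np.fft.ifft2(F).real

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
low = filter_image(img, cutoff=0.1, keep_low=True)
high = filter_image(img, cutoff=0.1, keep_low=False)
# The two masks partition the spectrum exactly, so the bands sum back to img.
assert np.allclose(low + high, img)
```

Feeding only `high` to a detector simulates the high-pass condition under which DINOv3's accuracy collapses; feeding only `low` preserves the cues it relies on.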

Masked Autoencoders (MAE) and CINEMAE:

A MAE trained to reconstruct masked patches from visible context implicitly models the conditional distribution $p(X_{\mathrm{m}} \mid X_{\mathrm{v}})$. By measuring conditional negative log-likelihood (NLL) and aggregating patch-level reconstruction anomalies, CINEMAE attains over 95% zero-shot accuracy across eight unseen generators in GenImage (Jang et al., 9 Nov 2025). Generator-agnostic anomaly scores are achieved by fusing local NLL statistics with global MAE features.
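The scoring idea can be sketched as follows, with `image_score` aggregating per-patch Gaussian NLLs into an image-level anomaly score. The MAE itself is replaced by synthetic reconstructions, so everything here is a toy stand-in for the frozen model, not CINEMAE's code.

```python
# CINEMAE-style scoring sketch: treat per-patch reconstruction residuals as
# conditional NLL under a Gaussian assumption, then aggregate the most
# anomalous patches into one image-level score.
import numpy as np

def gaussian_nll(x, mu, sigma=1.0):
    """Per-element negative log-likelihood under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def image_score(patches, recon, top_k=8):
    """Mean NLL over the k most anomalous patches (focus on worst regions)."""
    nll = gaussian_nll(patches, recon).mean(axis=1)  # one NLL per patch
    return np.sort(nll)[-top_k:].mean()

rng = np.random.default_rng(0)
n_patches, dim = 196, 64
real_patches = rng.normal(size=(n_patches, dim))
fake_patches = rng.normal(size=(n_patches, dim))
# Toy stand-in: the "MAE" reconstructs real content well, fake content poorly.
recon_real = real_patches + rng.normal(scale=0.3, size=(n_patches, dim))
recon_fake = fake_patches + rng.normal(scale=1.0, size=(n_patches, dim))

s_real = image_score(real_patches, recon_real)
s_fake = image_score(fake_patches, recon_fake)
```

Thresholding such scores yields a zero-shot detector: no fake images are needed at training time, only the reconstruction statistics of real ones.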

CLIP and Multimodal Alignment:

Discriminative representation learning (MiraGe) tightly aligns real and fake image features to textual anchors ("Real" vs. "Fake"). By minimizing intra-class variation and maximizing inter-class separation, multimodal prompts ground generator-invariant embeddings, yielding robust transfer to new generators (Shi et al., 3 Aug 2025).
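A minimal sketch of anchor-based classification: an image embedding is assigned the label of the nearer text anchor in cosine similarity. The anchor vectors here are synthetic; in MiraGe they would come from a CLIP text encoder.

```python
# Classify by cosine similarity to two text-anchor embeddings ("Real"/"Fake").
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(img_emb, anchor_real, anchor_fake):
    """Assign the label of the nearer anchor in cosine similarity."""
    if cosine(img_emb, anchor_fake) > cosine(img_emb, anchor_real):
        return "fake"
    return "real"

rng = np.random.default_rng(0)
d = 128
anchor_real = rng.normal(size=d)  # stand-in for the "Real" text embedding
anchor_fake = rng.normal(size=d)  # stand-in for the "Fake" text embedding
# Toy embeddings clustered tightly around each anchor (small intra-class variation).
img_real = anchor_real + 0.1 * rng.normal(size=d)
img_fake = anchor_fake + 0.1 * rng.normal(size=d)
```

The training objective pulls each class's embeddings toward its anchor, which is what makes the nearest-anchor rule generator-invariant at test time.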

4. Algorithms and Training-Free Generalization

State-of-the-art methods for cross-generator detection emphasize transferability, minimal overfitting, and interpretable decision mechanisms.

  • Fisher-Guided Token Selection (FGTS):

Selects patch tokens by maximizing Fisher score (class means difference over sum of variances), enabling a lightweight linear probe that aggregates token embeddings for classification (Huang et al., 27 Nov 2025).
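A sketch of the selection step, assuming synthetic features in place of frozen DINOv3 patch tokens; `fisher_scores` implements the class-mean-gap-over-variance criterion per token position, averaged over embedding dimensions.

```python
# Fisher-guided token selection sketch: score each token position by the
# Fisher criterion and keep the top-k most class-discriminative tokens.
import numpy as np

def fisher_scores(tokens, labels):
    """tokens: (n_images, n_tokens, dim); labels: (n_images,) in {0, 1}.
    Returns one Fisher score per token position."""
    t0, t1 = tokens[labels == 0], tokens[labels == 1]
    num = (t0.mean(axis=0) - t1.mean(axis=0)) ** 2     # squared class-mean gap
    den = t0.var(axis=0) + t1.var(axis=0) + 1e-8       # summed class variances
    return (num / den).mean(axis=1)                    # average over dims

def select_tokens(tokens, labels, k):
    """Indices of the k highest-scoring token positions."""
    return np.argsort(fisher_scores(tokens, labels))[-k:]

rng = np.random.default_rng(0)
n, n_tokens, dim = 100, 16, 32
tokens = rng.normal(size=(n, n_tokens, dim))
labels = rng.integers(0, 2, size=n)
# Plant the real/fake signal in token positions 3 and 7 only.
tokens[labels == 1, 3, :] += 1.5
tokens[labels == 1, 7, :] += 1.5
top2 = select_tokens(tokens, labels, k=2)
```

The selected token embeddings are then aggregated and fed to the lightweight linear probe described above.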

  • Training-Free Strategies:

Frozen backbones (DINOv3, MAE, CLIP) are never updated; only a small probe (logistic regression or MLP head) is trained on reference images. For DINOv3, only 2k images suffice to achieve state-of-the-art performance with less than 8k learnable parameters.
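The recipe can be sketched with a fixed random projection standing in for the frozen backbone and a closed-form ridge regression standing in for the logistic probe (both are illustrative substitutions, not the actual pipeline):

```python
# Training-free recipe sketch: the "backbone" is frozen (never updated);
# only a tiny linear probe (d_emb + 1 parameters) is fit on reference images.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 256, 32
W_frozen = rng.normal(size=(d_in, d_emb)) / np.sqrt(d_in)  # fixed "backbone"

def embed(x):
    """Frozen feature extractor (stand-in for a DINOv3 forward pass)."""
    return np.tanh(x @ W_frozen)

def fit_probe(Z, y, reg=1e-3):
    """Closed-form ridge linear probe with a bias term."""
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return np.linalg.solve(Zb.T @ Zb + reg * np.eye(Zb.shape[1]),
                           Zb.T @ (2 * y - 1))

def predict(w, Z):
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return (Zb @ w > 0).astype(int)

# Toy reference set: fakes are mean-shifted versions of the real distribution.
n = 500
X_real = rng.normal(size=(n, d_in))
X_fake = rng.normal(size=(n, d_in)) + 1.0
X = np.vstack([X_real, X_fake])
y = np.array([0] * n + [1] * n)

w = fit_probe(embed(X), y)
acc = (predict(w, embed(X)) == y).mean()
```

Only `w` (33 parameters here) is learned; the backbone weights never change, which is what keeps the approach cheap and resistant to artifact memorization.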

  • Semi-supervised Clustering (TriDetect):

By exploiting latent sub-structure within the "fake" class, methods like TriDetect use balanced Sinkhorn-Knopp assignments and cross-view consistency regularization to discover GAN vs. diffusion submanifolds, improving robustness under architectural domain shift (Nguyen-Le et al., 23 Nov 2025).
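A sketch of the balanced assignment step: this is a generic Sinkhorn-Knopp normalization with uniform marginals, the mechanism named above, not TriDetect's exact code. Rows and columns of `exp(logits)` are alternately normalized so every cluster receives an equal share of samples.

```python
# Balanced Sinkhorn-Knopp assignment: soft cluster assignments whose column
# marginals are uniform, preventing cluster collapse during discovery of
# fake sub-classes (e.g., GAN vs. diffusion).
import numpy as np

def sinkhorn(logits, n_iters=50):
    """logits: (n_samples, n_clusters). Returns rows summing to 1 with
    (approximately) equal total mass per cluster."""
    Q = np.exp(logits)
    Q /= Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= k  # equalize cluster sizes
        Q /= Q.sum(axis=1, keepdims=True); Q /= n  # one unit of mass per sample
    return Q * n

rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 3))  # 12 fake samples, 3 candidate sub-classes
Q = sinkhorn(logits)
```

Each row of `Q` is a soft pseudo-label; cross-view consistency regularization then encourages two augmented views of the same image to receive the same assignment.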

5. Quantitative Results and Comparative Analysis

Multiple large-scale benchmarks quantify the gains from foundation model-based algorithms.

Method                         GenImage Avg Acc (%)  So-Fake-OOD Avg Acc (%)  AIGCDetection Avg Acc (%)
DINOv3+FGTS+LinearProbe        92.6                  87.5                     92.45
CINEMAE (MAE-based)            95.96                 —                        —
MiraGe (CLIP-based)            92.6                  —                        92.9
TriDetect (CLIP+Sinkhorn)      98.82 (AUC)           —                        98.69 (AUC)
CNNSpot, Spec, F3Net, GramNet  64–74                 —                        —

Notably, the FGTS+linear probe and CINEMAE methods consistently outperform artifact-based approaches and maintain high accuracy in zero-shot settings.

6. Interpretability, Limitations, and Extensions

Interpretability analyses show why foundation models generalize robustly:

  • Global Coherence and Low-Frequency Structure:

Performance is maximized when low-frequency features and spatial layout are preserved, indicating reliance on authenticity cues rather than superficial texture (Huang et al., 27 Nov 2025).

  • Anomaly Scores and Semantic Alignment:

Context-conditional reconstruction anomalies (CINEMAE) and multimodal anchoring (MiraGe) operate at the semantic level, not artifact-level, reinforcing generator-agnostic separability (Jang et al., 9 Nov 2025, Shi et al., 3 Aug 2025).

Limitations and open directions include:

  • Reduced accuracy on ultra-realistic or adversarial fakes, e.g., Chameleon set, requiring finer-grained modeling.
  • Sensitivity to heavily degraded (blurred, compressed) inputs is improved but not fully solved.
  • Potential future extensions involve dynamic token selection, application to multimodal/video backbones, and fast adaptation to evolving generator families (Huang et al., 27 Nov 2025).

7. Domain Adaptation and Cross-Domain Classification Strategies

For settings where testing images reside in a different domain (e.g., cross-style translation), domain adaptation frameworks such as DIPS utilize unsupervised image-to-image translation models and pseudo-supervised checkpoint selection via Gaussian Mixture Models in feature space (Al-Hindawi et al., 2023). Pseudo-labels are assigned by clustering translated target-domain features and maximizing classifier prediction agreement using balanced accuracy, F1, or AUC. DIPS demonstrates near-perfect rank correlation with true supervised metrics and outperforms popular image quality criteria (FID) for model selection.
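The selection loop can be sketched as follows. A simple 2-means clustering stands in for the Gaussian Mixture Model, and the candidate checkpoints are toy predictors; the point is only the mechanism, ranking checkpoints by pseudo-label agreement under balanced accuracy, with a max over the label swap since cluster indices are arbitrary.

```python
# DIPS-style checkpoint selection sketch: pseudo-label target-domain features
# by unsupervised clustering, then rank candidate classifiers by balanced
# accuracy against the pseudo-labels.
import numpy as np

def two_means(X, n_iters=20):
    """Crude 2-cluster assignment (stand-in for a fitted GMM)."""
    c = X[[0, -1]].copy()  # init from opposite ends of the dataset
    for _ in range(n_iters):
        d = ((X[:, None, :] - c[None]) ** 2).sum(axis=-1)
        z = d.argmin(axis=1)
        c = np.array([X[z == j].mean(axis=0) for j in (0, 1)])
    return z

def balanced_accuracy(y_true, y_pred):
    return np.mean([(y_pred[y_true == c] == c).mean() for c in (0, 1)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)),   # target-domain cluster A
               rng.normal(3.0, 0.5, (100, 2))])  # target-domain cluster B
pseudo = two_means(X)

# Two hypothetical checkpoints: one aligned with the clusters, one random.
good_pred = (X[:, 0] > 1.5).astype(int)
bad_pred = rng.integers(0, 2, size=200)
score_good = max(balanced_accuracy(pseudo, good_pred),
                 balanced_accuracy(pseudo, 1 - good_pred))  # swap-invariant
score_bad = max(balanced_accuracy(pseudo, bad_pred),
                balanced_accuracy(pseudo, 1 - bad_pred))
```

The checkpoint with the highest agreement score is selected, no target-domain labels required, which is the property DIPS validates via rank correlation with true supervised metrics.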

References

  • Rethinking Cross-Generator Image Forgery Detection through DINOv3 (Huang et al., 27 Nov 2025)
  • CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection (Jang et al., 9 Nov 2025)
  • Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection (Nguyen-Le et al., 23 Nov 2025)
  • MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection (Shi et al., 3 Aug 2025)
  • GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image (Zhu et al., 2023)
  • Domain-knowledge Inspired Pseudo Supervision (DIPS) for Unsupervised Image-to-Image Translation Models to Support Cross-Domain Classification (Al-Hindawi et al., 2023)
  • iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition (Wei et al., 2022)
