GenImage Benchmark for AI-Generated Image Detection
- GenImage Benchmark is a comprehensive evaluation suite for AI-generated image detection, featuring paired real and synthetic images across 1,000 classes.
- It employs robust protocols to assess both in-distribution and cross-generator performance using metrics like accuracy, precision, recall, F1, and AUROC.
- The benchmark exposes biases from compression and resolution differences and has driven advances through state-of-the-art methods and multi-task strategies.
GenImage Benchmark
The GenImage benchmark is a large-scale, multi-generator evaluation suite for AI-generated image detection, primarily focused on distinguishing synthetic (AI-generated) imagery from real photographs. It has become a de facto standard in AIGC (AI-generated content) forensics due to its scale, diversity, and systematic challenge protocols that assess both in-distribution detection and cross-model generalization. GenImage's design enables evaluation of detectors under realistic conditions: across generator families, compression/quality regimes, image classes, and adversarial constraints.
1. Dataset Architecture and Composition
GenImage consists of paired real and AI-generated images across 1,000 object classes aligned with ImageNet-1K categories. Its construction uses eight SOTA generative models:
- BigGAN: Class-conditional GAN, 128×128 pixels
- ADM: Classifier-guided diffusion, 256×256
- GLIDE: Classifier-free guided text-to-image diffusion, 256×256
- VQDM: Vector-quantized latent diffusion, 256×256
- Stable Diffusion V1.4/V1.5: Latent diffusion pre-trained on LAION-Aesthetics, 512×512
- Midjourney V5: Proprietary, 1024×1024
- Wukong: Large-scale Chinese diffusion model, 512×512
The structure is as follows:
| Subset | Real Images | Fake Images | Source | Resolution |
|---|---|---|---|---|
| Per-generator subset | ~166k | ~169k | ImageNet (real); GAN/DM (fake) | 128–1024 px |
| Combined | 1,331,167 | 1,350,000 | 8 generators × 1,000 classes | Varied |
All real samples come from ImageNet, ensuring class alignment for direct comparison. Generated images follow a per-class balancing strategy: each class contributes roughly 1,300 fake and ~1,280 real images for training, and 50 of each per class for testing.
In the default release, JPEG compression and image size are unbalanced: ImageNet images are mostly JPEG-compressed (commonly quality ≈96), whereas generated images are stored as lossless PNG.
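For concreteness, the sketch below loads one generator subset with torchvision. It assumes the released GenImage folder layout (`train/ai`, `train/nature`, and matching `val/` splits); the subset path is a placeholder to adjust for a local copy.

```python
# Minimal sketch of loading one GenImage generator subset with torchvision.
# The subdirectory names ("ai", "nature") follow the released GenImage
# layout, but verify them against your local copy of the dataset.
from pathlib import Path

import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

SUBSET_ROOT = Path("GenImage/stable_diffusion_v_1_4")  # placeholder path

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),  # bring 128-1024 px images to a common size
    T.ToTensor(),
])

# ImageFolder maps the two subdirectories ("ai", "nature") to labels 0/1.
train_set = ImageFolder(SUBSET_ROOT / "train", transform=transform)
val_set = ImageFolder(SUBSET_ROOT / "val", transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```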
2. Evaluation Protocols and Core Metrics
The GenImage benchmark defines two principal evaluation tracks:
A. Cross-Generator Classification
- Train a detector on one generator's images + real images; test on the remaining 7 generator subsets.
- Compute mean per-subset accuracy (ACC), precision, recall, F₁, EER, and sometimes AUROC.
- Main objective: assess generalization when the generator at test time differs from that seen during training.
B. Degraded Image Classification
- Detectors are trained on clean images (usually the SD V1.4 subset).
- Test data is degraded: downsampled (112×112 / 64×64), JPEG-compressed (Q=65/30), or strongly blurred (σ=3/5); see the sketch after this list.
- Metrics as above, reported per degradation type.
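These degradations are standard image operations; a minimal Pillow sketch is below. The parameter values mirror the list above, while the choice of resampling filter is an assumption.

```python
# Sketch of the degradations used in the degraded-image track, using Pillow.
# Values (112/64 px, JPEG Q=65/30, sigma=3/5) match the protocol above.
import io

from PIL import Image, ImageFilter

def downsample(img: Image.Image, size: int = 112) -> Image.Image:
    """Downsample to size x size (bicubic resampling is an assumption)."""
    return img.resize((size, size), Image.BICUBIC)

def jpeg_compress(img: Image.Image, quality: int = 65) -> Image.Image:
    """Round-trip the image through an in-memory JPEG at the given quality."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def gaussian_blur(img: Image.Image, sigma: float = 3.0) -> Image.Image:
    """Apply Gaussian blur; Pillow's radius is the standard deviation."""
    return img.filter(ImageFilter.GaussianBlur(radius=sigma))
```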
Metric Formalization:

$$\mathrm{ACC} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},$$

where $TP$ = true positives, $TN$ = true negatives, $FP$ = false positives, and $FN$ = false negatives.
Aggregated metrics (e.g., cross-generator mean) are averages over the 7 held-out generator test sets.
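These definitions translate directly into code. The sketch below computes the per-subset metrics from binary predictions (1 = fake) and the cross-generator mean over held-out subsets; the array and dictionary names are illustrative.

```python
# Sketch of the metric definitions above, plus the cross-generator mean.
import numpy as np

def binary_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Compute ACC, precision, recall, and F1 from 0/1 arrays (1 = fake)."""
    tp = int(np.sum((preds == 1) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    acc = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}

def cross_generator_mean(per_subset: dict) -> float:
    """Average per-subset accuracy over the held-out generator test sets."""
    return float(np.mean([m["acc"] for m in per_subset.values()]))
```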
3. Baseline Methods and SOTA Comparisons
GenImage is widely used to benchmark both general-purpose and domain-specific forensic classifiers. Backbone architectures include:
- CNN and vision-transformer backbones: ResNet-50, DeiT-Small, Swin-Tiny
- Frequency-based: Spec (FFT spectrum), F3Net (frequency decomposition), GramNet (texture), CNNSpot (ResNet-50 + JPEG/blur aug)
- Hybrid/foundation models: CLIP ViT-L/14, DINOv2, and segmentation- or attention-based models for pixel-level analysis
Performance is highly architecture- and protocol-dependent. Key table (ResNet-50, train on SD1.4):
| Model | Midj. | SD1.4 | SD1.5 | ADM | GLIDE | Wukong | VQDM | BigGAN | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 54.9 | 99.9 | 99.7 | 53.5 | 61.9 | 98.2 | 56.6 | 52.0 | 72.1 |
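For reference, a minimal sketch of such a baseline: fine-tuning a torchvision ResNet-50 with a binary head on one generator subset (e.g., SD V1.4 + real). The optimizer and learning rate are illustrative assumptions, not the exact recipe from the GenImage paper.

```python
# Minimal sketch of the ResNet-50 baseline: binary fine-tuning on one
# generator subset. Optimizer and schedule are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # real vs. fake head
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    """One pass over a loader such as the train_loader built earlier."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```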
Recent advances show that models using frozen masked autoencoders (CINEMAE), manipulation-augmented multi-task learning (GAMMA), and on-manifold adversarial fine-tuning (OMAT) can surpass 90–95% cross-generator accuracy, a significant leap over early CNN baselines (typically 65–75%) (Yan et al., 12 Sep 2025, Jang et al., 9 Nov 2025, Zhou et al., 1 Jun 2025).
4. Biases and Robustness: Compression and Size Control
A recurring concern is that detectors can exploit non-semantic artifacts, such as JPEG quality or resolution differences, rather than genuine generator fingerprints. By default, GenImage's real and synthetic images are not controlled for these biases, so detectors can learn such shortcuts during training.
Bias-matched experiments filter real images for JPEG Q=96, compress all generated images to the same quality, and select size-matched crops:
- Cross-generator accuracy improves by >11 percentage points (e.g., ResNet50: 71.7% → 82.7%) (Grommelt et al., 26 Mar 2024).
- Robustness to JPEG recompression also degrades less after constraint alignment, with reported accuracy rising from ~54% to ~67%.
The consensus is that future benchmarks and models should match compression and resolution between real/fake classes to evaluate true generative artifacts.
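A bias-matching preprocessing pass along these lines can be sketched as follows. The Q=96 value follows the ImageNet statistics noted above; the crop size and file paths are illustrative assumptions.

```python
# Sketch of the bias-matching step described above: re-encode lossless PNG
# fakes as JPEG Q=96 so both classes share a compression regime, and take
# size-matched crops so resolution cannot act as a shortcut feature.
from pathlib import Path

from PIL import Image

def match_compression(src: Path, dst: Path, quality: int = 96) -> None:
    """Save a PNG as a JPEG at the quality typical of ImageNet images."""
    img = Image.open(src).convert("RGB")
    img.save(dst.with_suffix(".jpg"), format="JPEG", quality=quality)

def size_matched_crop(img: Image.Image, size: int = 256) -> Image.Image:
    """Center-crop to a fixed size shared by real and fake images."""
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```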
5. Protocol Variants and Extensions
GenImage has spawned several forks and challenge extensions:
- GenImage++ introduces advanced diffusion generators (Flux.1, SD3), long-form prompts, 183 style templates, and photorealistic attribute prompts. It is a test-only set, designed to defeat memorization and shortcut learning (Zhou et al., 1 Jun 2025).
- GenHard / GenExplain: Subsets focusing on challenging “hard” samples (those misclassified by strong baselines) and rationalized flaw explanations attached to synthetic images (Wu et al., 8 Mar 2025).
- Manipulation-based and reasoning-based detection: Multi-task supervision (classification + segmentation heads), structured explanation pipelines (e.g., ThinkFake’s chain-of-thought + expert agents), and multi-level reward RL increasingly dominate state-of-the-art generalization (Huang et al., 24 Sep 2025, Yan et al., 12 Sep 2025).
6. Scientific Impact and Open Challenges
GenImage has driven methodological advances and exposed fundamental challenges in AIGC detection:
- Generator-invariant detection requires learning features orthogonal to JPEG, size, or prompt bias.
- Strong cross-generator performance is achieved only through regularization and domain augmentation (e.g., manipulation mixing, adversarial latent training).
- Interpretability and segmentation: Chain-of-thought pipelines, segmentation maps, and error explanations are increasingly required for practical deployment.
- Open problems remain in:
- Zero-shot detection for unseen/future generators;
- High robustness to aggressive post-processing (compression, stylization, adversarial manipulation);
- Generalization across domains (faces, scenes, artwork, medical imaging, remote sensing).
GenImage’s influence pervades both direct detection research and broader applications, such as explainable AI and multimodal pipeline evaluation.
7. Representative Results (Selected Benchmarks)
| Model | Cross-Generator ACC (%) | Robustness ACC (%, JPEG Q=80 unless noted) |
|---|---|---|
| ResNet-50 (raw) | ~72 | 50.6 |
| ResNet-50 (bias-constrained) | 82.7 | 59.4 |
| GAMMA | 95.1 | >92 at Q=75 |
| CINEMAE | 95.96 | n/a |
| LoL (1% data) | 92.7 | n/a |
| OMAT + CLIP+LoRA | 96.78 (GenImage++) | n/a |
| ThinkFake (MLLM) | 84.0 | n/a |
Highest performance requires bias control and model-specific regularization strategies.
8. Summary and Recommendations
GenImage has become the canonical testbed for AIGC detection across real-world domains. Key recommendations for future benchmark design and detector development include:
- Bias matching: rigorously control JPEG quality and image size between sources.
- Cross-generator protocols: Always report both in-distribution and cross-generator accuracy.
- Dataset extension: Incorporate new generators, prompting paradigms, and multi-domain samples.
- Interpretability and robustness: Combine detection with segmented flaw localization and structured explanations.
With its million-scale design, public availability, and comprehensive protocol, GenImage will likely remain a central resource for research into robust, scalable, and generalizable AI image detection for the foreseeable future.