GenImage Dataset: AI-Generated Image Benchmark
- GenImage dataset is a rigorously curated, million-scale benchmark pairing real ImageNet images with synthetic counterparts from eight generative models.
- It controls for biases such as JPEG compression and image size through bias-controlled splits and standardized preprocessing, preventing detectors from exploiting these cues as classification shortcuts.
- The dataset supports robust evaluation via cross-generator and degraded image classification protocols, with metrics such as accuracy and AUC.
The GenImage dataset is a million-scale, rigorously curated benchmark designed to advance the detection of AI-generated images. Built on the full ImageNet-1k validation set and paired with an approximately equal number of synthetic images from diverse state-of-the-art generative models, GenImage provides a controlled, high-diversity testbed for real-vs-fake image classification across challenging cross-generator and degraded-data regimes. It explicitly addresses confounding factors such as JPEG compression and input size, enabling reliable measurement of detector performance and generalization (Zhu et al., 2023, Grommelt et al., 2024, Chivaran et al., 7 Jul 2025).
1. Dataset Structure and Image Composition
GenImage consists of nearly 2.7 million images, partitioned into real and synthetic classes that are balanced both at dataset scale and within each generator/class pairing. The composition details are as follows:
| Subset | Image Count (Real) | Image Count (Fake) | Image Source | Generators |
|---|---|---|---|---|
| Training | 1,281,167 | 1,300,000 | ImageNet-1k | 8 models (see below) |
| Test | 50,000 | 50,000 | ImageNet-1k | 8 models (see below) |
| Total | 1,331,167 | 1,350,000 | ImageNet-1k (real), generators (fake) | 8 |
Every ImageNet class (1,000 in total) is paired with fake images produced using the prompt template “photo of {class}”, except for models like BigGAN and ADM which take categorical labels directly. For each class, approximately 1,350 fake images and 1,331 real images are included, yielding uniform class distribution (Zhu et al., 2023, Chivaran et al., 7 Jul 2025).
The “fake” subset comprises eight distinct generative sources:
- One GAN: BigGAN (128×128)
- Seven diffusion-based/text-to-image models: ADM (256×256), GLIDE (256×256), VQDM (256×256), Stable Diffusion V1.4 (512×512), Stable Diffusion V1.5 (512×512), Wukong (512×512), Midjourney (1024×1024).
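The conditioning scheme described above can be sketched as follows; the class name and index used in the example are illustrative, and the helper function is not part of the dataset's released tooling:

```python
# Sketch of GenImage's conditioning scheme (illustrative class example).
# Text-to-image models receive the prompt "photo of {class}"; label-
# conditioned models (BigGAN, ADM) receive the class index directly.

LABEL_CONDITIONED = {"BigGAN", "ADM"}  # take categorical labels, not text

def conditioning_for(generator, class_index, class_name):
    """Return the conditioning input a given generator expects."""
    if generator in LABEL_CONDITIONED:
        return class_index                    # categorical label
    return f"photo of {class_name}"           # text prompt

# Example: ImageNet class 281 ("tabby cat", illustrative)
gan_input = conditioning_for("BigGAN", 281, "tabby cat")
sd_input = conditioning_for("Stable Diffusion V1.4", 281, "tabby cat")
```

The same prompt template is reused for every text-conditional generator, which keeps the semantic content of fake images aligned with the paired real ImageNet classes.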
Real images are directly sourced from ImageNet with no additional curation or filtering, maintaining the full diversity of the original dataset (Grommelt et al., 2024).
2. Generator Diversity and Data Partitioning
GenImage is constructed as eight “splits”, each corresponding to a unique generator. Within each split, real and synthetic images are balanced by class and subset size. This stratification ensures that, for every generator, the training and validation sets contain real images whose classes (or prompt set) correspond to that generator’s synthetic data.
| Generator | Model Type | Native Resolution | Conditioning | Language |
|---|---|---|---|---|
| BigGAN | GAN | 128×128 | class conditional | English |
| ADM | Diffusion (DDPM) | 256×256 | class conditional | English |
| GLIDE | Diffusion | 256×256 | text conditional | English |
| VQDM | Diffusion | 256×256 | text conditional | English |
| Stable Diff. 1.4 | Diffusion | 512×512 | text conditional | English |
| Stable Diff. 1.5 | Diffusion | 512×512 | text conditional | English |
| Wukong | Diffusion | 512×512 | text conditional | Chinese |
| Midjourney | Diffusion | 1024×1024 | text conditional | English |
Each generator’s data is paired with a corresponding, non-overlapping real ImageNet subset, ensuring no mixing across splits and maximal sampling diversity (Zhu et al., 2023, Grommelt et al., 2024).
3. Preprocessing, Resolutions, and Bias Controls
GenImage contains both “raw” and “bias-controlled” subsets to address two critical dataset biases:
- JPEG Compression Bias: Most ImageNet real images are JPEG-encoded (modal Q ≃ 96, range Q ∈ [70, 100]), while fakes are typically lossless PNG (Q = ∞). Detectors trained on the raw dataset can trivially exploit the compression difference as a “shortcut” for detection.
- Image Size Bias: Generator outputs are of fixed, model-specific sizes; in contrast, ImageNet real images show a broad, multimodal size distribution. This creates a source of leakage—detectors may key onto size artifacts rather than generation mechanisms.
To mitigate these biases:
- Bias-controlled splits recompress both real and synthetic images to a uniform JPEG Q = 96.
- Real image selection is controlled so that native size distributions align with those of the generative split (e.g., both real and fake images center-cropped to 450×450 before resizing for detector input).
- Preprocessing for model pipelines enforces symmetric operations (resize, crop, resize) for both classes (Grommelt et al., 2024).
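The symmetric pipeline above can be sketched as pure geometry (no image library needed). The 450×450 crop follows the description in the text; the 224×224 detector input is an assumption typical of ResNet/Swin backbones, not a value stated in the source:

```python
# Sketch of the symmetric resize -> center-crop -> resize pipeline applied
# identically to real and fake images. Geometry only: the 450x450 crop is
# from the text; the 224x224 model input is an assumed backbone default.

def center_crop_box(width, height, crop=450):
    """Return (left, top, right, bottom) of a centered crop x crop box."""
    if width < crop or height < crop:
        raise ValueError("resize the image to at least crop x crop first")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return (left, top, left + crop, top + crop)

def preprocess_plan(width, height, crop=450, out=224):
    """Describe the operations applied to BOTH classes, in order."""
    scale = max(crop / width, crop / height)
    if scale > 1.0:  # upscale first so the crop fits
        width, height = round(width * scale), round(height * scale)
    box = center_crop_box(width, height, crop)
    return {"resized_to": (width, height), "crop_box": box,
            "model_input": (out, out)}

plan = preprocess_plan(512, 512)   # e.g., a Stable Diffusion output
```

Because the exact same plan is computed for real and fake inputs, neither class retains resolution artifacts the other lacks.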
4. Evaluation Protocols and Metrics
GenImage enables the evaluation of generative-image detectors through two principal tasks:
1. Cross-Generator Image Classification
- Train a binary classifier (real/fake) on images from one generator and test on the others.
- Primary metric: accuracy averaged across all eight test-generator splits for each training generator (and, overall, across the eight training splits).
2. Degraded Image Classification
- Apply typical degradations (low-resolution, JPEG compression, Gaussian blur) to test images alone.
- Metric: Classification accuracy under each degradation in addition to standard accuracy.
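The cross-generator protocol from task 1 can be sketched as a small evaluation loop. The `evaluate` callable and its toy accuracy numbers are illustrative placeholders, not results from the source:

```python
# Sketch of the cross-generator protocol: train on one generator's split,
# evaluate on every generator's split, then average. `evaluate(train, test)`
# is a placeholder standing in for a real detector's test accuracy.

GENERATORS = ["BigGAN", "ADM", "GLIDE", "VQDM",
              "SD1.4", "SD1.5", "Wukong", "Midjourney"]

def cross_generator_average(evaluate, train_gen):
    """Mean accuracy of a detector trained on `train_gen`, tested on all 8."""
    scores = [evaluate(train_gen, test_gen) for test_gen in GENERATORS]
    return sum(scores) / len(scores)

# Toy evaluator: near-perfect in-split, weaker cross-split (illustrative
# numbers mirroring the qualitative pattern reported for GenImage).
toy = lambda tr, te: 0.99 if tr == te else 0.70
avg = cross_generator_average(toy, "SD1.4")   # (0.99 + 7 * 0.70) / 8
```

Reporting the full 8×8 matrix, rather than only the average, makes generator-specific generalization gaps visible.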
Formal definitions:
- Accuracy: Acc = (TP + TN) / (TP + TN + FP + FN), with fake images treated as the positive class, evaluated on balanced real/fake test sets.
- AUC: the area under the ROC curve, i.e., TPR integrated over FPR across all decision thresholds; a threshold-independent ranking measure.
- For bias quantification:
  - Compression-accuracy gap: the drop in accuracy when test-time JPEG quality departs from training-time conditions (e.g., PNG-trained detectors evaluated at Q = 95).
  - Image-size impact: the change in accuracy when the native size distributions of real and fake images are matched versus left unmatched.
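Both metrics admit short self-contained implementations; the following is a minimal sketch, with AUC computed via the pairwise rank (Mann–Whitney) formulation rather than ROC-curve integration:

```python
# Minimal implementations of the two metrics: accuracy from hard
# predictions, and AUC as the probability that a randomly chosen fake
# (label 1) scores above a randomly chosen real (label 0); ties count 0.5.
# O(n^2) pairwise form, fine for a sketch; use a library for large sets.

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]  # fake
    neg = [s for t, s in zip(y_true, scores) if t == 0]  # real
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The compression-accuracy gap is then simply the difference of `accuracy` evaluated on the original and the degraded copies of the same test set.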
This structure enables rigorous assessment of detector generalization and robustness to domain shifts and low-level artifacts (Zhu et al., 2023, Grommelt et al., 2024).
5. Empirical Findings and Recommendations
Key findings include:
- Detectors trained on the raw splits overfit to JPEG-compression and size artifacts; e.g., a ResNet50 trained on uncompressed (PNG) fakes can lose ∼15 percentage points when testing with JPEG Q=95 images, indicating severe compression bias (Grommelt et al., 2024).
- After bias equalization (JPEG Q=96 for all images; native resolution matching), cross-generator performance increases substantially. For example, ResNet50’s average score rises from 71.68% to 82.74% (+11.06 pp), and Swin-T from 74.09% to 85.83% (+11.74 pp) when trained on the 512×512 generators (Grommelt et al., 2024).
- Similar improvements (+13.3 pp at JPEG Q=95) are observed for robustness to JPEG compression (Grommelt et al., 2024).
- Detectors tested within a generator split achieve near-perfect accuracy (>98%), but generalization across generators (under naive protocols) remains challenging (best models reach ≃70% average accuracy) (Zhu et al., 2023).
Recommended usage practices:
- Always match JPEG quality distributions between classes (e.g., force Q = 96 or randomly sample within the same Q range).
- Align native size distributions; train per-resolution or subsample real images accordingly.
- Apply identical preprocessing (resize, crop, resize) to both classes.
- Report cross-generator and degradation robustness across diverse Q-factors and size bins, not just at point estimates (Grommelt et al., 2024).
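The first recommendation, matching JPEG quality distributions across classes, can be sketched as assigning real and fake images qualities drawn from one shared distribution; the [70, 100] sampling range below mirrors the ImageNet spread noted earlier and is otherwise an illustrative choice:

```python
# Sketch of matched JPEG-quality assignment: draw Q from one shared
# distribution for real and fake images alike, so compression level
# cannot serve as a class cue. Range [70, 100] mirrors ImageNet's spread.

import random

def assign_quality(n_pairs, fixed_q=96, seed=0):
    """Return (real_qs, fake_qs); identical distributions by construction."""
    rng = random.Random(seed)
    if fixed_q is not None:                       # e.g., force Q = 96
        qs = [fixed_q] * n_pairs
    else:                                         # or sample a shared range
        qs = [rng.randint(70, 100) for _ in range(n_pairs)]
    return list(qs), list(qs)

real_q, fake_q = assign_quality(4)                # both classes at Q = 96
```

Whether Q is fixed or sampled, the key property is that the two class-conditional quality distributions are identical, so a detector cannot separate classes by compression artifacts alone.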
6. Downstream Application and Domain Usage
GenImage has been leveraged in studies of lightweight, efficient detection pipelines (e.g., LAID framework), where subsets (100K train, 16K val, 16K test) are used for benchmarking both spatial and frequency-domain models (Chivaran et al., 7 Jul 2025). In these contexts, further transformations—such as zero-centered 2D FFTs for spectral analysis—have been studied, with fusion ensembles improving adversarial robustness.
The dataset’s diversity in generative sources and careful construction makes it suitable for comprehensive real-vs-fake detection challenges. Its flexibility in supporting both spatial and spectral evaluations is notable. Ongoing work also emphasizes that, despite these advances, generalization to new generators and robustness under heavy degradations remain open problems, underpinning the continuing relevance of benchmarks like GenImage (Chivaran et al., 7 Jul 2025, Grommelt et al., 2024, Zhu et al., 2023).
7. Availability and Comparative Context
The GenImage dataset, along with both raw and bias-controlled splits plus training scripts, is publicly accessible at https://www.unbiased-genimage.org (Grommelt et al., 2024). By design, it is broader and more balanced than most prior datasets, offering an order-of-magnitude increase in image and domain diversity. While datasets such as GIM (Chen et al., 2024) provide large-scale benchmarks for local manipulation detection and localization, GenImage’s primary focus is on global real-vs-fake image discrimination across generator types and input perturbations.
A plausible implication is that best-practice benchmarks for AI-generated content detection must rigorously control for spurious cues—such as compression or size differences—to ensure fair and robust evaluation. This insight, concretized in GenImage, is increasingly reflected in the protocols for new dataset and detector releases in this field.