AGIN Database: Benchmarking AI-Generated Image Naturalness
- AGIN Database is a multi-perspective, human-labeled benchmark for assessing visual naturalness in AI-generated images, utilizing dual annotations of technical and rationality distortions.
- It comprises 6,049 images (5,849 AI-generated images across five generative tasks plus 200 natural reference images), each rated by 30 human evaluators under controlled quality protocols.
- Its robust metadata structure, APIs, and comprehensive distortion taxonomy enable effective benchmarking, error diagnostics, and model improvement in generative AI research.
The AGIN Database (AI-Generated Image Naturalness database) is a large-scale, multi-perspective human-labeled resource designed to benchmark and facilitate the assessment of visual naturalness in AI-generated images. AGIN provides a reference standard for evaluating generative models, analyzing the specific sources of visual artifacts, and developing computational predictors that align with human perceptions of both low-level technical and high-level rationality distortions. Its structure, rating protocols, and taxonomy of distortions enable the detailed interrogation and modeling of naturalness characteristics that are critical for the advancement of generative AI evaluation methodologies (Chen et al., 2023).
1. Composition and Scope
AGIN comprises 6,049 images, including 5,849 AI-generated images spanning five major generative image tasks and 200 natural reference images used for control and benchmarking against the upper distribution of perceived naturalness. All images are systematically annotated with per-image, per-task, and per-model metadata:
| Task Type | Model Count | Image Count |
|---|---|---|
| Text-to-Image | 5 | 2,140 |
| Image Translation | 5 | 1,406 |
| Image Inpainting | 2 | 670 |
| Image Colorization | 3 | 806 |
| Image Editing | 3 | 827 |
Data sources for generating prompts and initial images include established datasets such as COCO, ADE20K, FFHQ, CelebA-HQ, and synthetically constructed collections like MagicBrush. Model coverage includes canonical generators such as Stable Diffusion v1.5/v2.1, Openjourney, Dreamlike, RealisticVision (text-to-image), RABIT, DiffuseIT, StyleCLIP, MATEBIT, CoCosNet (translation), RePaint, MAT (inpainting), PDNLA-Net, DDColor, DISCO (colorization), and DragGAN, InstructPix2Pix, MagicBrush (editing) (Chen et al., 2023).
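
As a quick consistency check on the composition figures above, the per-task counts from the table can be summed and compared against the stated totals. This is only a sketch: the dictionary names are illustrative and are not part of any AGIN tooling.

```python
# Per-task counts copied from the composition table above.
TASK_IMAGE_COUNTS = {
    "Text-to-Image": 2140,
    "Image Translation": 1406,
    "Image Inpainting": 670,
    "Image Colorization": 806,
    "Image Editing": 827,
}
TASK_MODEL_COUNTS = {
    "Text-to-Image": 5,
    "Image Translation": 5,
    "Image Inpainting": 2,
    "Image Colorization": 3,
    "Image Editing": 3,
}
NATURAL_REFERENCE_IMAGES = 200  # natural control images stated in the text

ai_generated_total = sum(TASK_IMAGE_COUNTS.values())
database_total = ai_generated_total + NATURAL_REFERENCE_IMAGES
model_total = sum(TASK_MODEL_COUNTS.values())

print(ai_generated_total)  # 5849
print(database_total)      # 6049
print(model_total)         # 18
```

The sums agree with the 5,849 AI-generated and 6,049 total images reported above, and the 18 models match the generators listed per task.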
2. Human Subjective Rating Protocol
Each image in AGIN was evaluated by 30 subjects (18 male, 12 female, mean age 22.6±3.1), all with normal or corrected-to-normal vision. The assessment protocol followed a single-stimulus paradigm, displaying images on a 27″, 2560×1440 px monitor (viewing distance ≈ 70 cm) under ITU-R BT.500 compliant conditions. Subjects rated three distinct aspects on custom, anchored 5-point Likert scales:
- Overall Naturalness (MOS) – 1 to 5.
- Technical Quality (MOS_T) – rating and main low-level (T-factor) source.
- Rationality Quality (MOS_R) – rating and main high-level (R-factor) source.
Quality control integrated ten “golden” test images per session, with sessions passing only if subjects rated ≥70% of controls within ±1 of ground truth. Each image received 30 ratings on each dimension (totaling 90 ratings per image), and subjects selected one principal distortion type from defined lists whenever a flaw was detected (Chen et al., 2023).
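
The session-level quality check described above can be sketched as a small function. The function and parameter names are hypothetical, and computing MOS as a plain arithmetic mean of the 30 ratings is an assumption made here for illustration; the source does not state whether any score normalization was applied first.

```python
from statistics import mean


def session_passes(golden_ratings, golden_truth, tolerance=1.0, min_fraction=0.7):
    """Return True if enough golden-image ratings fall within ±tolerance of ground truth."""
    if not golden_ratings or len(golden_ratings) != len(golden_truth):
        raise ValueError("ratings and ground truth must be non-empty and equal in length")
    hits = sum(abs(r - t) <= tolerance for r, t in zip(golden_ratings, golden_truth))
    return hits / len(golden_ratings) >= min_fraction


def mean_opinion_score(ratings):
    """Plain arithmetic mean of the ratings (assumption: no per-subject normalization)."""
    return mean(ratings)


# Invented example: ten golden images with ground-truth scores on the 1-5 scale.
truth = [5, 4, 1, 3, 2, 5, 4, 2, 3, 1]
careful_subject = [5, 3, 2, 3, 2, 4, 4, 1, 5, 1]   # 9 of 10 within ±1
careless_subject = [2, 1, 4, 3, 5, 2, 4, 5, 3, 4]  # 3 of 10 within ±1

print(session_passes(careful_subject, truth))   # True
print(session_passes(careless_subject, truth))  # False
```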
3. Distortion Taxonomy
AGIN employs a dual-perspective distortion taxonomy:
- Low-level Technical (T-factors):
  - T-1: Luminance (exposure failures)
  - T-2: Contrast (dynamic range errors)
  - T-3: Detail (texture paucity or excess)
  - T-4: Blur (focus and acuity defects)
  - T-5: Artifacts (structural anomalies, seams, or false pixels)
- High-level Rationality (R-factors):
  - R-1: Existence (physically/semantically impossible objects)
  - R-2: Color (unnatural hues)
  - R-3: Layout (illogical element arrangement)
  - R-4: Context (semantic mismatches)
  - R-5: Sensory Clarity (scene ambiguity)
On each trial, the rater not only supplied an overall score but also attributed the leading technical and/or rationality factor when degradation was observed, providing granular error case annotation (Chen et al., 2023).
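
The taxonomy above maps naturally onto two lookup tables. The sketch below also shows one possible way to summarize the per-rater attributions for an image by taking the most frequently chosen factor. That aggregation rule is an assumption for illustration: the source does not specify how a per-image principal factor is derived from individual selections, and the function name is hypothetical.

```python
from collections import Counter

# Factor codes and names as listed in the taxonomy above.
T_FACTORS = {
    "T-1": "Luminance",
    "T-2": "Contrast",
    "T-3": "Detail",
    "T-4": "Blur",
    "T-5": "Artifacts",
}
R_FACTORS = {
    "R-1": "Existence",
    "R-2": "Color",
    "R-3": "Layout",
    "R-4": "Context",
    "R-5": "Sensory Clarity",
}


def principal_factor(votes, taxonomy):
    """Most frequently chosen factor code; None entries mean the rater saw no flaw.

    Assumption: a simple plurality vote. Ties resolve to the code seen first.
    """
    flagged = [v for v in votes if v is not None]
    unknown = set(flagged) - set(taxonomy)
    if unknown:
        raise ValueError(f"unknown factor codes: {sorted(unknown)}")
    if not flagged:
        return None
    code, _count = Counter(flagged).most_common(1)[0]
    return code


# Invented votes from six raters for one image.
votes = ["T-4", "T-5", "T-4", None, "T-4", "T-5"]
top = principal_factor(votes, T_FACTORS)
print(top, T_FACTORS[top])  # T-4 Blur
```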
4. Statistical Analysis and Correlations
AGIN presents statistical distributions and correlation matrices that formalize the relationship between its three principal measures:
- Score Ranges:
  - MOS: [1.28, 5.0]
  - MOS_T: [1.39, 5.0]
  - MOS_R: [1.17, 5.0]
- Distribution Characteristics:
  - Most images are rated 3–5 (moderate to high perceived naturalness), with a marked tail toward low scores for images with severe artifacts or semantic deviations.
- Correlation Analysis:
  - Spearman/Pearson between MOS_T and MOS: ≈ 0.82/0.81
  - Spearman/Pearson between MOS_R and MOS: ≈ 0.91/0.92
  - Best linear fit:
Taken together, these results indicate that high-level rationality failures predict perceived unnaturalness more strongly than technical issues, although both contribute independently (Chen et al., 2023).
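
For readers who want to reproduce this kind of analysis, the following is a minimal, dependency-free sketch of the Pearson and Spearman coefficients. The score lists are invented toy values and are not AGIN data, so the results will not match the figures reported above. In practice, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same quantities.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy scores for five images (not taken from AGIN).
mos = [1.5, 2.8, 3.2, 4.1, 4.7]
mos_t = [2.5, 2.0, 3.9, 3.5, 4.6]
mos_r = [1.2, 2.9, 3.0, 4.3, 4.8]

print("MOS_T vs MOS:", spearman(mos_t, mos), pearson(mos_t, mos))
print("MOS_R vs MOS:", spearman(mos_r, mos), pearson(mos_r, mos))
```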
5. Metadata Structure, Organization, and Access
AGIN provides comprehensive metadata and robust programmatic access:
- Image-level data fields: filename, task, model type, MOS, MOS_T, MOS_R, principal T-factor, principal R-factor.
- Data format: Delivered as CSV/JSON files with supporting scripts for loading and phase-specific data splits.
- APIs: The official Python loader offers an `AGINDataset` class supporting selection by data phase (“train”/“val”/“test”) and outcome perspective (“naturalness”, “technical”, “rationality”), enforcing clean train-test splits by prompt to prevent data leakage.
- Repository: Hosted at https://github.com/zijianchen98/AGIN, including code and detailed folder structure for reproducibility and extensibility by the research community (Chen et al., 2023).
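
To illustrate what a prompt-level split means in practice, the sketch below parses a small CSV and assigns every image that shares a prompt to the same phase. It does not use or reproduce the official `AGINDataset` loader. The column names, the `prompt_id` field, the hash-based assignment, the split fractions, and the sample rows are all assumptions made for this example.

```python
import csv
import hashlib
import io

# Invented sample rows; column names are hypothetical, not the official schema.
SAMPLE_CSV = """filename,task,model,prompt_id,mos,mos_t,mos_r,t_factor,r_factor
a.png,text-to-image,model_a,p1,4.1,4.3,3.9,T-4,R-3
b.png,text-to-image,model_b,p1,3.2,3.5,3.0,T-5,R-1
c.png,image-editing,model_c,p2,2.4,3.1,2.0,T-5,R-4
d.png,image-editing,model_c,p3,4.6,4.5,4.7,,
"""


def load_rows(text):
    """Parse CSV text into dicts, converting the three score columns to float."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        for key in ("mos", "mos_t", "mos_r"):
            row[key] = float(row[key])
    return rows


def phase_for_prompt(prompt_id, val_fraction=0.1, test_fraction=0.2):
    """Deterministically map a prompt to a phase by hashing its identifier."""
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


def split_by_prompt(rows):
    """Group rows by phase so that no prompt appears in more than one phase."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[phase_for_prompt(row["prompt_id"])].append(row)
    return splits


rows = load_rows(SAMPLE_CSV)
splits = split_by_prompt(rows)
for phase, members in splits.items():
    print(phase, [r["filename"] for r in members])
```

Because the phase depends only on the prompt identifier, images generated by different models from the same prompt can never straddle the train/test boundary, which is the leakage the prompt-level split is meant to prevent.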
6. Benchmarking, Applications, and Computational Modeling
AGIN’s annotation-rich data supports:
- Human-aligned benchmarking: Establishing a human-rated ground truth for evaluating generative models across multiple visual synthesis tasks.
- Machine learning model development: Training and evaluation of algorithms to predict not only raw MOS but causal attribution in terms of technical/rationality distortions (e.g., the JOINT model learns to jointly model these perspectives and aligns closely with aggregated human ratings).
- Error diagnostics and model comparison: Systematic inter-model and intra-task analysis for model improvement, including rank-ordering of distortions and failure typologies across architectures and datasets (Chen et al., 2023).
The database’s structure makes it possible to separate the effects of technical and semantic artifacts on human perception, supporting rigorous development of both diagnostic and generative capabilities within computer vision research.
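
The model-comparison and error-diagnostics use cases above can be sketched with a few lines of aggregation: rank models within a task by mean score, then tally which distortion factors were attributed to a given model. The record layout, model names, and scores below are invented for illustration and do not reflect actual AGIN results.

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented records; field names are hypothetical.
records = [
    {"task": "text-to-image", "model": "model_a", "mos": 4.0, "r_factor": "R-3"},
    {"task": "text-to-image", "model": "model_a", "mos": 3.0, "r_factor": None},
    {"task": "text-to-image", "model": "model_b", "mos": 2.0, "r_factor": "R-1"},
    {"task": "text-to-image", "model": "model_b", "mos": 3.0, "r_factor": "R-1"},
    {"task": "image-editing", "model": "model_c", "mos": 3.5, "r_factor": "R-4"},
]


def rank_models(records, task, score_key="mos"):
    """Return (model, mean score) pairs for one task, best model first."""
    scores = defaultdict(list)
    for rec in records:
        if rec["task"] == task:
            scores[rec["model"]].append(rec[score_key])
    return sorted(
        ((model, mean(vals)) for model, vals in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


def factor_histogram(records, model, factor_key="r_factor"):
    """Count how often each distortion factor was attributed to a model's images."""
    return Counter(
        rec[factor_key]
        for rec in records
        if rec["model"] == model and rec[factor_key] is not None
    )


ranking = rank_models(records, "text-to-image")
print(ranking)  # [('model_a', 3.5), ('model_b', 2.5)]
print(factor_histogram(records, "model_b"))  # Counter({'R-1': 2})
```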
7. Significance and Comparisons
AGIN is distinct from earlier image naturalness assessment (INA) and no-reference image quality assessment (NR-IQA) resources in its scale, its explicit separation of technical and rationality factors, and its multi-task coverage of contemporary generative models. It is presented as the first dataset to provide dual-perspective, fine-grained annotations at scale, with controlled subjective rating across all images, enabling both robust benchmarking and the training of interpretable predictors aligned with human visual judgment (Chen et al., 2023). A plausible implication is that, as generative models increasingly target photorealism and semantic coherence, AGIN will remain a useful reference for the design, evaluation, and improvement of both subjective and objective naturalness metrics under emerging generative paradigms.