Q-Real Bench: Fine-Grained AI Imagery Eval
- Q-Real Bench is a comprehensive benchmark suite that offers entity-level and attribute-level evaluations to measure realism and plausibility in AI-generated images.
- It employs a rigorous annotation pipeline using GPT-4o, object grounding, and expert human verification to capture both perceptual defects and semantic inconsistencies.
- The benchmark supports ObjectQA and ImageQA tasks with metrics such as IoU, AUC, and LLM-Score, guiding targeted improvements in multimodal language models.
Q-Real Bench is a comprehensive evaluation benchmark suite introduced to advance fine-grained assessment of realism and plausibility in AI-generated images. Developed to address the inadequacies of traditional single-score quality datasets, it introduces entity-level and attribute-level judgments that disambiguate distinct failure modes in generative models, particularly in the context of multimodal LLMs (MLLMs). Q-Real Bench is built on the Q-Real dataset, which combines dense, expert-verified annotations with systematically structured reasoning tasks, enabling both discriminative and grounding-focused evaluation of state-of-the-art models (Wang et al., 21 Nov 2025).
1. Dataset Construction and Annotation Pipeline
The Q-Real dataset comprises 3,088 images synthesized by ten leading text-to-image diffusion models: Flux.1.dev, Stable Diffusion 3.0, PixArt, Hunyuan-DiT, Kcolors, Dreamina, Midjourney, Lumina-T2X, WanX, and DALL·E 3. The images span a prompt set of 534 diverse text descriptions that elicit varied content, interactions, and visual complexity, covering persons, animals, generic objects, and complex composite scenes.
From these images, 17,879 entities are localized using a semi-automated pipeline:
- Entity Extraction: GPT-4o is applied to the image and prompt to enumerate visible objects.
- Object Grounding: Grounding DINO outputs bounding box coordinates for each detected entity.
- Pre-Scoring: Q-Eval-Score assigns preliminary quality scores to object crops.
- Initial Annotation Generation: GPT-4o generates two realism and four plausibility judgment questions, plus free-form rationales, per entity.
- Human Verification: Three trained annotators adjudicate binary judgments via majority voting, and verify or expand rationale text for every entity.
Entities are annotated at the object level with two realism and four plausibility binary questions, plus free-form explanations for each dimension, as sketched below. For a 400-image human-figure subset, plausibility explanations are produced on a per-anatomical-part basis before merging into whole-object attributions.
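To make the resulting annotation structure concrete, here is a minimal sketch of a per-entity record and the three-annotator majority vote; the field names are illustrative assumptions, not the released Q-Real schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EntityAnnotation:
    """Hypothetical per-entity record; field names are illustrative, not the released schema."""
    label: str                        # entity type enumerated by GPT-4o
    bbox: Tuple[int, int, int, int]   # (x, y, w, h) from Grounding DINO
    realism_answers: List[bool]       # two binary judgments (True = "Yes")
    plausibility_answers: List[bool]  # four binary judgments
    realism_rationale: str            # human-verified free-form explanation
    plausibility_rationale: str

def adjudicate(votes: List[bool]) -> bool:
    """Majority vote over the three trained annotators' binary judgments."""
    return sum(votes) >= 2

# Example: two of three annotators answered "Yes" to a realism question.
final_answer = adjudicate([True, True, False])  # -> True ("Yes")
```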
2. Task Definitions and Input/Output Specification
Q-Real Bench defines two principal evaluation settings:
- ObjectQA (entity-level binary judgment): Given a cropped entity and an explicitly phrased realism or plausibility question, the model answers Yes/No.
- ImageQA (image-level grounding and reasoning): Given a full AI-generated image, the model must both (1) localize all objects exhibiting a specified class of error (realism or plausibility) by returning labels and bounding coordinates, and (2) provide concise, targeted rationales for each predicted region.
The input/output formats are as follows:
- ObjectQA:
  - Input: <entity crop> <question>
  - Output: "Yes"/"No"
- ImageQA:
  - Input: <image> Analyze this AI-generated image, identify all entities with type labels, coordinates, and explain plausibility for each.
  - Output: List of <label> <x,y,w,h> <explanation> triples.
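The sketch below illustrates how these formats could be assembled and parsed; the exact question wording and the triple serialization are assumptions, since only paraphrased templates are given above.

```python
import re
from typing import Dict, List

# ObjectQA: the entity crop is supplied as the image input; only the question is text.
OBJECTQA_QUESTION = "Does this object look realistic? Answer Yes or No."  # illustrative wording

# ImageQA: fixed prompt over the full image.
IMAGEQA_PROMPT = ("Analyze this AI-generated image, identify all entities with type labels, "
                  "coordinates, and explain plausibility for each.")

# Hypothetical parser for <label> <x,y,w,h> <explanation> triples; the benchmark's
# actual serialization may differ.
TRIPLE_RE = re.compile(r"<(?P<label>[^>]+)>\s*<(?P<box>[\d\s,]+)>\s*<(?P<expl>[^>]+)>")

def parse_imageqa(response: str) -> List[Dict]:
    triples = []
    for m in TRIPLE_RE.finditer(response):
        # Assumes four comma-separated integers: x, y, w, h.
        x, y, w, h = (int(v) for v in m.group("box").split(","))
        triples.append({"label": m.group("label").strip(),
                        "bbox": (x, y, w, h),
                        "explanation": m.group("expl").strip()})
    return triples

print(parse_imageqa("<person> <120,40,200,380> <extra finger on the left hand>"))
```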
3. Evaluation Metrics
The assessment framework employs standard supervised classification and object detection metrics, computed separately for realism and plausibility:
- ObjectQA:
- Accuracy, Precision, Recall, F1 score.
- Area Under the ROC Curve (AUC).
- ImageQA:
- Intersection over Union (IoU): per-box matching via optimal one-to-one assignment (Hungarian algorithm); see the matching sketch after this metric list.
- Detection Rate at IoU ≥ 0.5.
- Grounding AUC: area under the curve as IoU threshold is swept.
- LLM-Score: a GPT-4o-based semantic similarity score between model- and human-generated explanations, which correlates better with human judgments than BLEU or embedding-cosine similarity.
Metrics are reported separately for the realism and plausibility axes, enabling differential diagnosis of generative model strengths and weaknesses.
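A minimal sketch of the ImageQA grounding metrics under these definitions, assuming (x, y, w, h) boxes; the threshold grid for the AUC sweep is an assumption, as only the 0.5 detection-rate cutoff is stated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_boxes(pred, gt):
    """One-to-one assignment of predicted to ground-truth boxes that maximizes
    total IoU (Hungarian algorithm); returns (pred_idx, gt_idx, iou) triples."""
    if not pred or not gt:
        return []
    cost = np.array([[-iou(p, g) for g in gt] for p in pred])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c, -cost[r, c]) for r, c in zip(rows, cols)]

def detection_rate(matches, n_gt, thr=0.5):
    """Fraction of ground-truth boxes recovered with IoU >= thr."""
    return sum(1 for _, _, v in matches if v >= thr) / max(n_gt, 1)

def grounding_auc(matches, n_gt, grid=np.linspace(0.0, 1.0, 21)):
    """Area under the detection-rate curve as the IoU threshold is swept."""
    rates = [detection_rate(matches, n_gt, t) for t in grid]
    return float(sum(0.5 * (rates[i] + rates[i + 1]) * (grid[i + 1] - grid[i])
                     for i in range(len(grid) - 1)))
```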
4. Fine-Tuning Architecture and Training Protocol
Models are instruction-tuned MLLMs with Transformer backbones (Qwen2.5-VL-7B, InternVL2.5-8B, LLaVA-v1.6-mistral-7B):
- Vision Encoder: Visual backbone (CLIP/ViT-like).
- Text Encoder: Large-scale transformer LLM.
- Fusion: Cross-modal layers built on multi-head attention.
- Training: LoRA is applied for parameter-efficient adaptation, targeting the cross-modal layers and language heads (see the PEFT sketch at the end of this section).
- ObjectQA: Batch=4, 3 epochs, standard binary cross-entropy.
- ImageQA: Batch=2, 5 epochs, token-level cross-entropy.
- The train-test split is fixed, with 400 of the 3,088 images reserved for held-out evaluation.
Fine-tuning proceeds on fixed prompt templates and deterministic splits to ensure comparability.
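A hedged sketch of this parameter-efficient setup with Hugging Face PEFT; the LoRA rank, alpha, dropout, and target-module names are assumptions, while the batch sizes and epoch counts follow the protocol above.

```python
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative LoRA hyperparameters; rank/alpha/dropout and target modules are
# assumptions, not values reported for Q-Real Bench.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Adapt attention projections in the language/cross-modal layers; the vision
    # encoder stays frozen.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def attach_lora(mllm):
    """Wrap an already-loaded MLLM (e.g., Qwen2.5-VL-7B) with LoRA adapters."""
    return get_peft_model(mllm, lora_cfg)

# Training schedules from the protocol above:
#   ObjectQA: batch size 4, 3 epochs, binary cross-entropy on the Yes/No answer
#   ImageQA:  batch size 2, 5 epochs, token-level cross-entropy over the output sequence
```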
5. Experimental Results and Analysis
Finetuned MLLMs demonstrate substantial improvements over zero-shot GPT-4o:
| Model (* = finetuned) | Realism IoU | Realism Det. Rate | Realism AUC | Realism LLM-Score | Plaus. IoU | Plaus. Det. Rate | Plaus. AUC | Plaus. LLM-Score |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.141 | 0.054 | 0.153 | 0.343 | 0.146 | 0.056 | 0.153 | 0.219 |
| Qwen2.5-VL-7B* | 0.660 | 0.626 | 0.620 | 0.731 | 0.648 | 0.621 | 0.604 | 0.603 |
| InternVL2.5-8B* | 0.498 | 0.503 | 0.454 | 0.732 | 0.518 | 0.503 | 0.456 | 0.588 |
| LLaVA-v1.6-mistral-7B* | 0.484 | 0.438 | 0.411 | 0.743 | 0.512 | 0.446 | 0.422 | 0.597 |
- ObjectQA yields up to 72.9% accuracy (Qwen2.5-VL-7B*) for realism and 78.6% for plausibility.
- ImageQA grounding IoU rises above 0.6 for realism after finetuning.
- LLM-Score for reasoning (rationales) exceeds 0.7 for best finetuned models on realism; plausibility explanations remain more challenging (∼0.60).
- On a 100-portrait-image subset, LLaVA* achieves IoU=0.632 and LLM-Score=0.479 for plausibility, a marked improvement over zero-shot performance.
Before finetuning, MLLMs conflate perceptual and semantic errors (e.g., blur with implausible anatomy); finetuning on Q-Real Bench yields a clean separation between these axes and improves both bounding-box and explanation quality.
Ablation studies reveal that unified, task-specific finetuning for ImageQA yields better grounding than multistage or pipeline approaches, reducing compounding errors in detection→reasoning cascades.
6. Motivation, Implications, and Broader Significance
Q-Real Bench is predicated on the distinction between realism (physical/perceptual fidelity: texture, lighting, material) and plausibility (semantic, world-consistent configuration: anatomy, logical object relationships). Collapsing these into a single quality score obscures critical system-level failures for real-world deployments in e-commerce, medical, or human-presentable contexts.
By providing orthogonal axes for diagnosis, Q-Real Bench enables:
- Targeted identification of generative failure modes (perceptual vs. semantic).
- Conditioning of generator training on realism/plausibility–aware loss terms.
- Benchmarking and cross-model evaluation for both detection (grounding) and explanation (reasoning), giving structure to fine-tuning and ablation studies.
- Direct guidance for iterative optimization in end-to-end AI content creation pipelines.
- Framework extensibility to future modalities such as video or 3D content, and to finer-grained attributes (composition, object co-occurrence, etc.).
The dataset and code will be open-sourced to promote reproducibility and further research (Wang et al., 21 Nov 2025).
In summary, Q-Real Bench establishes a structured, highly granular benchmark for assessing realism and plausibility in AI-generated images, integrating dense multi-annotator supervision, binary and free-form evaluations, and a rigorous, extensible protocol for evaluating modern MLLMs on both discriminative and generative perception tasks. It provides a new gold standard for diagnosing and repairing failures in generative vision-language models.