This paper introduces HRS-Bench, a new benchmark designed for the comprehensive evaluation of Text-to-Image (T2I) generative models. The authors argue that existing benchmarks often rely heavily on subjective human evaluations, cover a limited set of skills, and lack scalability, hindering holistic assessment of model capabilities.
To address these limitations, HRS-Bench focuses on being Holistic, Reliable, and Scalable.
1. Holism:
- It evaluates T2I models across 13 distinct skills grouped into five major categories:
- Accuracy: Counting, Visual Text generation, Emotion grounding, and Fidelity.
- Robustness:
- Invariance: Consistency (paraphrasing prompts), Typos (handling noisy input).
- Equivariance: Spatial composition, Attribute-specific composition (color, size), Action composition.
- Generalization: Creativity (generating novel, imaginative images).
- Fairness: Performance consistency across subgroups (gender, style).
- Bias: Tendency to generate specific attributes (gender, race, age) in the absence of explicit instruction.
- The benchmark covers 50 diverse scenarios (e.g., fashion, animals, food, transportation) using a large prompt dataset.
2. Reliability:
- While acknowledging the value of human evaluation, HRS-Bench emphasizes automatic evaluation metrics for scalability and consistency.
- It introduces a novel T2I alignment metric called AC-T2I (Augmented Captioner-based T2I alignment). This metric aims to overcome the limitations of existing metrics like CLIPScore, particularly in evaluating compositional understanding.
- AC-T2I first generates captions for the generated image using a captioning model (e.g., BLIP-2).
- It then augments the original text prompt using GPT-3.5 to create a set of diverse but semantically similar ground-truth texts.
- Finally, it compares the generated image captions against the augmented prompt set using n-gram based metrics (CIDEr, BLEU) and selects the maximum similarity score (a short sketch of this scoring step follows this list).
- Other metrics employed include:
- Detection-based: UniDet for object detection (used for counting and spatial/attribute composition), TextSnake/SAR for visual text detection/recognition, scored with CER and NED (a character-level scoring sketch also follows this list).
- Alignment-based (besides AC-T2I): Standard T2I alignment (CLIPScore), I2I alignment (CLIP similarity between images generated from original vs. perturbed prompts), visual emotion classification, face detection/attribute recognition for bias (MAD score).
- The benchmark's effectiveness was validated through a human evaluation study covering 10% of the data, showing high agreement (95% on average) between human judgments and the automatic metrics.
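To make the AC-T2I scoring step concrete, here is a minimal sketch in Python. It assumes the image captions (e.g., from BLIP-2) and the GPT-3.5-augmented prompt set have already been produced upstream; the function name and the BLEU-only scoring are illustrative simplifications (the paper additionally uses CIDEr, which can be plugged in the same way, e.g. via pycocoevalcap).

```python
# Hedged sketch of AC-T2I's scoring step: compare each generated caption
# against every augmented ground-truth prompt and keep the best n-gram score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def ac_t2i_bleu(captions, augmented_prompts):
    """Maximum BLEU similarity between any caption of the generated image
    and any augmented version of the original prompt (illustrative name)."""
    smooth = SmoothingFunction().method1
    best = 0.0
    for caption in captions:
        hypothesis = caption.lower().split()
        for prompt in augmented_prompts:
            reference = [prompt.lower().split()]
            best = max(best, sentence_bleu(reference, hypothesis,
                                           smoothing_function=smooth))
    return best

# Hypothetical example: one caption of the generated image vs. two paraphrases.
captions = ["a red apple on a wooden table"]
augmented = ["an apple that is red sits on a table",
             "a wooden table with a red apple on it"]
print(ac_t2i_bleu(captions, augmented))
```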
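Similarly, the visual-text scores are edit-distance based. The sketch below computes CER and NED between the text requested in the prompt and the text recognized in the generated image; the OCR step itself (TextSnake/SAR) is assumed to have run beforehand, and the exact normalization used in HRS-Bench may differ slightly.

```python
# Minimal sketch of character-level scores for visual text, assuming the OCR
# text has already been extracted from the generated image.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: edits needed, normalized by the reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def ned(reference: str, hypothesis: str) -> float:
    # Normalized Edit Distance: edits normalized by the longer string.
    return levenshtein(reference, hypothesis) / max(len(reference), len(hypothesis), 1)

# The prompt asked for the word "OPEN"; the OCR model read "OPFN".
print(cer("OPEN", "OPFN"), ned("OPEN", "OPFN"))
```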
3. Scalability:
- The benchmark comprises 45,000 prompts in total (around 3,000 per skill).
- Prompts were collected by a combination of filtering existing datasets and generating new ones from templates filled with object names (e.g., from LVIS), then refined using GPT-3.5 (a template-filling sketch follows this list).
- Prompts are categorized into easy, medium, and hard difficulty levels for fine-grained analysis.
- The reliance on automatic metrics allows for easy application to new T2I models.
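As an illustration of the template-based prompt construction, the sketch below fills a hypothetical counting template with object names and assigns a difficulty level. The template, object list, and difficulty rule here are stand-ins, not the authors' actual ones, and in HRS-Bench GPT-3.5 is used downstream to rewrite such prompts more naturally.

```python
# Illustrative sketch of template-based prompt generation in the spirit of
# HRS-Bench; templates, objects, and the difficulty rule are hypothetical.
import random

OBJECTS = ["apple", "dog", "bicycle", "laptop"]   # e.g. sampled from LVIS categories
TEMPLATE = "a photo of {count} {obj}s and {count2} {obj2}s"  # naive pluralization

def make_counting_prompt(rng):
    obj, obj2 = rng.sample(OBJECTS, 2)
    count, count2 = rng.randint(1, 5), rng.randint(1, 5)
    prompt = TEMPLATE.format(count=count, obj=obj, count2=count2, obj2=obj2)
    # A simple difficulty proxy: more objects to count -> harder prompt.
    total = count + count2
    level = "easy" if total <= 4 else "medium" if total <= 7 else "hard"
    return {"prompt": prompt, "counts": {obj: count, obj2: count2}, "level": level}

rng = random.Random(0)
print(make_counting_prompt(rng))
```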
Experiments and Findings:
- Nine T2I models were evaluated: Stable Diffusion (V1 and V2), DALL-E 2, GLIDE, CogView2, Paella, minDALL-E, DALL-E Mini, and Struct-Diff.
- Key Findings:
- Models struggle significantly with counting objects accurately, especially beyond simple cases; surprisingly, more detailed prompts yielded better recall/F1 than simpler ones (the counting-score sketch after this list shows how such scores can be computed from detections).
- Generating correct visual text within images remains a major challenge for all tested models; they often render imagery evoking the text's meaning rather than the actual characters.
- Models fail to reliably generate images grounded in specific emotions. Both automatic metrics and human evaluation showed poor performance.
- Compositionality (spatial, attribute, action) is difficult, particularly for medium and hard prompts where most models scored near zero. Even composition-focused models like Struct-Diff showed limited improvement.
- Models are generally robust to prompt paraphrasing (consistency) and typos.
- Generating truly creative (novel yet prompt-aligned) images is challenging; models often generate images close to the training data distribution.
- Models exhibit good fairness (similar performance across gender/style subgroups).
- However, models show slight bias, particularly gender bias, when generating humans from prompts that do not specify such attributes.
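As referenced in the counting finding above, here is a hedged sketch of how such scores can be derived: per-class counts detected in the image (e.g., by UniDet) are compared with the counts requested in the prompt, and precision/recall/F1 are computed. The exact matching rule used in HRS-Bench may differ.

```python
# Hedged sketch of counting evaluation from detector output: compare per-class
# detected counts with the counts requested in the prompt.
def counting_prf(requested: dict, detected: dict):
    classes = set(requested) | set(detected)
    tp = sum(min(requested.get(c, 0), detected.get(c, 0)) for c in classes)
    n_detected = sum(detected.values())
    n_requested = sum(requested.values())
    precision = tp / n_detected if n_detected else 0.0
    recall = tp / n_requested if n_requested else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Prompt asked for 3 apples and 2 dogs; the detector found 2 apples and 3 dogs.
print(counting_prf({"apple": 3, "dog": 2}, {"apple": 2, "dog": 3}))
```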
Conclusion:
HRS-Bench provides a more comprehensive, reliable, and scalable framework for evaluating T2I models. The results highlight significant weaknesses in current state-of-the-art models regarding complex reasoning, compositionality, visual text, and emotion grounding. The authors hope the benchmark will guide future research towards addressing these limitations. The code and data are publicly available.