HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models (2304.05390v2)

Published 11 Apr 2023 in cs.CV, cs.AI, and cs.LG

Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess the model's capabilities. Furthermore, there is a significant gap between efforts in developing new T2I architectures and those in evaluation. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on limited aspects, HRS-Bench measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. A human evaluation aligned with 95% of our evaluations on average was conducted to probe the effectiveness of HRS-Bench. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark helps ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.io

This paper introduces HRS-Bench, a new benchmark designed for the comprehensive evaluation of Text-to-Image (T2I) generative models. The authors argue that existing benchmarks often rely heavily on subjective human evaluations, cover a limited set of skills, and lack scalability, hindering holistic assessment of model capabilities.

To address these limitations, HRS-Bench focuses on being Holistic, Reliable, and Scalable.

1. Holism:

  • It evaluates T2I models across 13 distinct skills grouped into five major categories:
    • Accuracy: Counting, Visual Text generation, Emotion grounding, and Fidelity.
    • Robustness:
      • Invariance: Consistency (paraphrasing prompts), Typos (handling noisy input).
      • Equivariance: Spatial composition, Attribute-specific composition (color, size), Action composition.
    • Generalization: Creativity (generating novel, imaginative images).
    • Fairness: Performance consistency across subgroups (gender, style).
    • Bias: Tendency to generate specific attributes (gender, race, age) in the absence of explicit instruction.
  • The benchmark covers 50 diverse scenarios (e.g., fashion, animals, food, transportation) using a large prompt dataset.

2. Reliability:

  • While acknowledging the value of human evaluation, HRS-Bench emphasizes automatic evaluation metrics for scalability and consistency.
  • It introduces a novel T2I alignment metric called AC-T2I (Augmented Captioner-based T2I alignment). This metric aims to overcome the limitations of existing metrics like CLIPScore, particularly in evaluating compositional understanding.
    • AC-T2I works by first generating captions for the T2I output image using a captioning model (e.g., BLIP2).
    • It then augments the original text prompt using GPT-3.5 to create a set of diverse but semantically similar ground-truth texts.
    • Finally, it compares the generated image captions against the augmented prompt set using n-gram based metrics (CIDEr, BLEU) and selects the maximum similarity score (a rough sketch of this scoring step follows this list).
  • Other metrics employed include:
    • Detection-based: UniDet for object detection (used for counting, spatial/attribute composition), TextSnake/SAR for visual text detection/recognition (CER, NED).
    • Alignment-based (besides AC-T2I): Standard T2I alignment (CLIPScore), I2I alignment (CLIP similarity between images generated from original vs. perturbed prompts), visual emotion classification, and face detection/attribute recognition for bias (MAD score). A CLIP-based sketch of these alignment scores also appears after this list.
  • The benchmark's effectiveness was validated through a human evaluation study covering 10% of the data, showing high alignment (95% on average) with the automatic metrics.
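
For concreteness, here is a minimal sketch of the AC-T2I scoring step. It assumes the generated image has already been captioned (e.g., with BLIP2), uses a hypothetical augment_prompt() helper in place of the GPT-3.5 paraphrasing step, and substitutes NLTK's sentence-level BLEU where the paper combines CIDEr and BLEU. This is an illustration, not the authors' implementation.

```python
# Minimal sketch of AC-T2I scoring (illustrative only, not the authors' code).
# `image_caption` is a caption of the generated image (e.g., from BLIP2);
# `augment_prompt` is a hypothetical helper returning GPT-3.5-style paraphrases
# of the original prompt.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def ac_t2i_score(image_caption: str, prompt: str, augment_prompt) -> float:
    references = [prompt] + list(augment_prompt(prompt))
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref.lower().split()],
                      image_caption.lower().split(),
                      smoothing_function=smooth)
        for ref in references
    ]
    return max(scores)  # keep the best match over the augmented prompt set

# Toy usage with a trivial "augmenter":
score = ac_t2i_score("a red car parked beside a blue bus",
                     "a red car next to a blue bus",
                     lambda p: [p.replace("next to", "beside")])
```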
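The CLIP-based alignment scores can likewise be approximated with an off-the-shelf CLIP model. The sketch below assumes the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the benchmark may use a different CLIP variant.

```python
# Rough sketch of CLIP-based T2I and I2I alignment scores (illustrative only).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def t2i_alignment(prompt, image):
    """CLIPScore-style cosine similarity between a prompt and a generated image."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    return F.cosine_similarity(txt, img).item()

def i2i_consistency(image_a, image_b):
    """CLIP image-image similarity between the image generated from the original
    prompt and the one generated from its paraphrased or typo-perturbed version."""
    inputs = processor(images=[image_a, image_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    return F.cosine_similarity(feats[0:1], feats[1:2]).item()
```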

3. Scalability:

  • The benchmark utilizes 45,000 prompts in total (3k per skill).
  • Prompts were collected by filtering existing datasets and by filling templates with object names from LVIS, then refining them with GPT-3.5 (a toy template-filling example follows this list).
  • Prompts are categorized into easy, medium, and hard difficulty levels for fine-grained analysis.
  • The reliance on automatic metrics allows for easy application to new T2I models.
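
As a toy illustration of the template-based prompt construction, counting prompts could be assembled roughly as below. The template strings and object names are placeholders, not the benchmark's actual templates or the LVIS vocabulary, and the GPT-3.5 refinement step is omitted.

```python
# Toy template filling for counting prompts (placeholder templates and objects).
import random

TEMPLATES = [
    "a photo of {count} {obj}s",                        # easier
    "a photo of {count} {obj}s and {count2} {obj2}s",   # harder: two object types
]
OBJECTS = ["dog", "car", "apple", "chair"]  # stand-in for LVIS category names

def make_counting_prompts(n, seed=0):
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        obj, obj2 = rng.sample(OBJECTS, 2)
        template = rng.choice(TEMPLATES)
        prompts.append(template.format(count=rng.randint(1, 5), obj=obj,
                                       count2=rng.randint(1, 5), obj2=obj2))
    return prompts

print(make_counting_prompts(3))
```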

Experiments and Findings:

  • Nine T2I models were evaluated, including Stable Diffusion (V1, V2), DALL-E 2, GLIDE, CogView2, Paella, minDALL-E, DALL-E Mini, and Struct-Diff.
  • Key Findings:
    • Models struggle significantly with counting objects accurately, especially beyond simple cases. Detailed prompts surprisingly yielded better recall/F1 than simpler ones (see the counting-score sketch after this list).
    • Generating correct visual text within images remains a major challenge for all tested models. Models often confuse rendering the concept of the text with rendering the letters.
    • Models fail to reliably generate images grounded in specific emotions. Both automatic metrics and human evaluation showed poor performance.
    • Compositionality (spatial, attribute, action) is difficult, particularly for medium and hard prompts where most models scored near zero. Even composition-focused models like Struct-Diff showed limited improvement.
    • Models are generally robust to prompt paraphrasing (consistency) and typos.
    • Generating truly creative (novel yet prompt-aligned) images is challenging; models often generate images close to the training data distribution.
    • Models exhibit good fairness (similar performance across gender/style subgroups).
    • However, models show slight bias, particularly gender bias, when generating humans from agnostic prompts.
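
To make the counting result concrete, per-prompt precision/recall/F1 of this kind can be computed from detector output roughly as below. This assumes a mapping from class names to requested and detected counts (e.g., counts derived from UniDet detections) and is not the benchmark's exact scoring code.

```python
# Illustrative per-prompt counting scores from requested vs. detected counts.
def counting_scores(requested: dict, detected: dict):
    tp = sum(min(requested.get(c, 0), detected.get(c, 0)) for c in requested)
    n_detected = sum(detected.get(c, 0) for c in requested)  # prompted classes only
    n_requested = sum(requested.values())
    precision = tp / n_detected if n_detected else 0.0
    recall = tp / n_requested if n_requested else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Prompt asks for 3 dogs and 2 cats; the detector finds 2 dogs and 3 cats:
print(counting_scores({"dog": 3, "cat": 2}, {"dog": 2, "cat": 3}))  # (0.8, 0.8, 0.8)
```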

Conclusion:

HRS-Bench provides a more comprehensive, reliable, and scalable framework for evaluating T2I models. The results highlight significant weaknesses in current state-of-the-art models regarding complex reasoning, compositionality, visual text, and emotion grounding. The authors hope the benchmark will guide future research towards addressing these limitations. The code and data are publicly available.

Authors (6)
  1. Eslam Mohamed Bakr (8 papers)
  2. Pengzhan Sun (10 papers)
  3. Xiaoqian Shen (14 papers)
  4. Faizan Farooq Khan (7 papers)
  5. Li Erran Li (37 papers)
  6. Mohamed Elhoseiny (102 papers)
Citations (59)