OneIG-Bench: Omni T2I Evaluation

Updated 30 June 2025
  • OneIG-Bench is an omni-dimensional framework that evaluates text-to-image models on key aspects like text rendering, reasoning, and stylization.
  • It employs modular, category-specific metrics across six dimensions, enabling robust, reproducible comparison of T2I systems.
  • The benchmark supports multilingual prompts and diverse style assessments, offering actionable insights for both research and practical deployment.

OneIG-Bench is an omni-dimensional, fine-grained benchmark framework developed for the thorough evaluation of text-to-image (T2I) models. It is designed to address deficiencies in earlier T2I evaluation schemes, which historically failed to cover critical aspects such as reasoning ability, text rendering within generated images, and artistic stylization. OneIG-Bench provides a comprehensive, modular platform that supports nuanced, scientific comparison and diagnosis of T2I systems across all major facets of image generation.

1. Motivation and Benchmark Overview

The unprecedented progress of text-to-image generation models—encompassing high-fidelity synthesis, complex semantic understanding, and even basic reasoning—has outpaced the evaluative capacity of traditional benchmarks. Previous benchmarks focused primarily on prompt-image alignment or compositionality, neglecting more advanced capabilities observed in state-of-the-art models, such as robust text rendering within images or sophisticated reasoning-based image content.

OneIG-Bench is introduced to fill these gaps by offering a holistic, multi-perspective standard for T2I evaluation. It contains over 1000 real-world prompts, methodically curated and distributed across six rigorous evaluation dimensions relevant to practical T2I deployment.

2. Benchmark Structure and Categories

OneIG-Bench structures its evaluation around six principal dimensions, each detailed below along with methodological and metric-specific information:

  1. General Object: Prompts covering standard object or scene generation, supporting classic prompt-image alignment tests.
  2. Portrait: Focused on human/figure generation, including variety in pose, age, and demographic.
  3. Anime and Stylization: Artistic and stylized content, with prompts specifying visual style (e.g., clay model, pixel art, pointillism).
  4. Text Rendering: Prompts requiring the explicit inclusion and accurate rendering of textual content within the image.
  5. Knowledge and Reasoning: Tasks necessitating reasoning, background knowledge, or sequential logic (e.g., process illustration, commonsense depiction).
  6. Multilingualism: Prompt sets in English and Chinese, with topics reflecting linguistic and cultural specificity.

Each category contains approximately 200 prompts, selected and refined to represent diverse, real-world user needs. Prompts are extensively deduplicated and categorized by length (short, medium, long), ensuring robust coverage of both linguistic and visual complexity. Prompt construction and refinement employ a combination of human review and LLM-driven rewriting (e.g., using GPT-4o) to ensure clarity, sensitivity filtering, and consistent diversity.
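
As a purely illustrative aid, the sketch below shows one way such per-prompt metadata (category, language, length bucket) could be represented; the field names and example values are assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkPrompt:
    """Hypothetical per-prompt metadata record (illustrative only)."""
    prompt_id: str
    category: str       # e.g. "general_object", "portrait", "anime_stylization",
                        # "text_rendering", "knowledge_reasoning", "multilingual"
    language: str       # "en" or "zh"
    length_bucket: str  # "short", "medium", or "long"
    text: str           # the prompt itself, after review and rewriting

example = BenchmarkPrompt(
    prompt_id="text_rendering_0042",
    category="text_rendering",
    language="en",
    length_bucket="medium",
    text='A storefront sign that reads "OPEN DAILY" in bold red letters.',
)
```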

3. Evaluation Metrics and Methodologies

OneIG-Bench applies custom, automated, and category-specific metrics for rigorous and granular model assessment:

a. Prompt-Image Alignment (Semantic Alignment)

  • Employs a question dependency graph for each prompt, generated by GPT-4o and covering semantic, spatial, and attribute information. For a given image, the Qwen2.5-VL-7B model answers these questions; the proportion of correct answers forms the alignment score.
  • Alignment Score per prompt:

$$\text{Score} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}}$$

  • Scores are reported separately by prompt category, prompt type (natural, tag, phrase), and prompt length; a minimal scoring sketch follows this list.
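
The scoring step reduces to a ratio of correct answers. The sketch below illustrates it in Python, assuming the question answering by the VLM (e.g., Qwen2.5-VL-7B) has already produced per-question correctness verdicts; it is an illustrative sketch, not the benchmark's released code.

```python
def alignment_score(answers: list[bool]) -> float:
    """Per-prompt alignment score: fraction of dependency-graph
    questions the VQA model answered correctly.

    `answers` holds one correctness verdict (True = correct) per
    question generated for the prompt, mirroring the
    Score = correct / total definition above.
    """
    if not answers:
        raise ValueError("A prompt must have at least one question.")
    return sum(answers) / len(answers)

# Example: 7 of 9 questions answered correctly -> score ~0.78
print(alignment_score([True] * 7 + [False] * 2))
```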

b. Text Rendering Precision

  • Evaluates the accuracy of model-rendered text within images, critical for applications like marketing and informational graphics.
  • Main metrics:

    • Edit Distance (ED): Mean Levenshtein distance between the target and rendered text per image-prompt pair.
    • Completion Rate (CR): Fraction of images with perfect string reproduction (ED = 0).
    • Word Accuracy (WAC): Fraction of correctly rendered words.
    • Composite Text Score:

    $$S_{\text{text}} = 1 - \frac{\min(\phi, \mathrm{ED}) \cdot (1 - \mathrm{CR}) \cdot (1 - \mathrm{WAC})}{\phi}$$

    where $\phi$ is the normalization upper bound. A computational sketch of these metrics follows this list.
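
The sketch below illustrates these text-rendering metrics, assuming the rendered text has already been extracted from each image (e.g., via OCR) and using $\phi = 100$ as a placeholder normalization bound; the exact value and extraction pipeline used by OneIG-Bench may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def text_rendering_score(targets: list[str], rendered: list[str],
                         phi: float = 100.0) -> float:
    """Composite text score mirroring the formula above.

    `targets` are the strings requested by the prompts and `rendered`
    the text extracted from the corresponding generated images; `phi`
    is the normalization upper bound (100 is a placeholder assumption).
    Illustrative sketch, not OneIG-Bench's released implementation.
    """
    eds = [levenshtein(t, r) for t, r in zip(targets, rendered)]
    ed = sum(eds) / len(eds)                       # mean edit distance (ED)
    cr = sum(e == 0 for e in eds) / len(eds)       # completion rate (CR)
    wacs = []                                      # per-pair word accuracy
    for t, r in zip(targets, rendered):
        words = t.split()
        wacs.append(sum(w in r.split() for w in words) / max(len(words), 1))
    wac = sum(wacs) / len(wacs)                    # word accuracy (WAC)
    return 1 - min(phi, ed) * (1 - cr) * (1 - wac) / phi
```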

c. Reasoning-Generated Content (Knowledge and Reasoning)

  • Assesses whether generated images correspond to answers for prompts requiring nontrivial background knowledge or logical reasoning.
  • For each prompt, GPT-4o provides the canonical reasoning “answer”; LLM2CLIP is used to compute the cosine similarity between this answer and the generated image.
  • The reasoning score is the average cosine similarity across all samples (a sketch of this aggregation follows this list).
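
The sketch below illustrates that aggregation step, assuming answer and image embeddings from a joint text-image encoder such as LLM2CLIP are already available; the embedding calls themselves are omitted, so this is an illustrative sketch rather than the released code.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def reasoning_score(answer_embeddings: list[np.ndarray],
                    image_embeddings: list[np.ndarray]) -> float:
    """Mean cosine similarity between each embedded reference answer
    (text) and the embedded generated image, one pair per prompt."""
    sims = [cosine(a, i) for a, i in zip(answer_embeddings, image_embeddings)]
    return sum(sims) / len(sims)
```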

d. Stylization

  • Evaluates the ability of models to follow explicit style prompts.
  • Each stylization task uses multiple reference images per style.
  • Style embeddings are computed with both CSD and CLIP-derived encoders. Cosine similarity between generated and reference image features is averaged:

$$S_{[\text{csd},\,\text{clip}]} = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{1}{m} \sum_{i=1}^{m} \max_{j} \cos\big( \mathcal{F}(G_i^k), \mathcal{F}(R_j^k) \big) \right)$$

  • The final stylization score is the mean of the CSD-based and CLIP-based scores (see the sketch below).
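
A minimal sketch of this computation is given below, assuming per-style feature matrices from one encoder (CSD or a CLIP-derived model) have already been extracted; it mirrors the formula above but is not the benchmark's released implementation.

```python
import numpy as np

def style_score(gen_feats: list[np.ndarray], ref_feats: list[np.ndarray]) -> float:
    """Style score for a single encoder (CSD or a CLIP-derived model).

    `gen_feats[k]` is an (m, d) array of features for the m images
    generated for style k; `ref_feats[k]` is an (r_k, d) array for that
    style's reference images. Each generated image is matched to its
    most similar reference (max over j), then averaged over generated
    images and over styles. Illustrative sketch under assumed shapes.
    """
    per_style = []
    for G, R in zip(gen_feats, ref_feats):
        G = G / np.linalg.norm(G, axis=1, keepdims=True)
        R = R / np.linalg.norm(R, axis=1, keepdims=True)
        sims = G @ R.T                         # (m, r_k) cosine similarities
        per_style.append(sims.max(axis=1).mean())
    return float(np.mean(per_style))

# The reported stylization score is then the mean of the two encoder scores:
# s_style = 0.5 * (style_score(gen_csd, ref_csd) + style_score(gen_clip, ref_clip))
```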

e. Diversity

  • Quantifies intra-prompt variation across generated images.
  • DreamSim is used to embed images; pairwise similarities are inverted (1 − SIM) and averaged to give a diversity score (a computational sketch follows the formula below):

$$S_{\text{diversity}} = \frac{1}{n} \sum_{k=1}^{n} \left[ \frac{1}{C_m^2} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \big(1 - \mathrm{SIM}_{ij}^{k}\big) \right]$$
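
The sketch below illustrates this aggregation, assuming per-prompt pairwise similarity matrices (e.g., from DreamSim) have already been computed; it is an illustrative sketch, not the released code.

```python
import numpy as np

def diversity_score(sim_matrices: list[np.ndarray]) -> float:
    """Diversity score from per-prompt pairwise similarity matrices.

    `sim_matrices[k]` is an (m, m) matrix of pairwise similarities
    among the m images generated for prompt k. Dissimilarities
    (1 - SIM) over the C(m, 2) unordered pairs are averaged per prompt
    and then across prompts, as in the formula above.
    """
    per_prompt = []
    for S in sim_matrices:
        iu = np.triu_indices(S.shape[0], k=1)   # indices of the unordered pairs
        per_prompt.append(np.mean(1.0 - S[iu]))
    return float(np.mean(per_prompt))
```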

f. Multilingualism

  • Ensures that both alignment and text rendering are accurately assessed for prompts in English and Chinese, reflecting practical applicability in multicultural and multilingual contexts.

4. Comparison to Previous Benchmarks

Earlier T2I benchmarks such as T2ICompBench, GenEval, DSG-1k, and DPG-Bench primarily focus on compositionality, short-text prompts, or dense scene understanding, but lack systematic analysis in text rendering, reasoning, and stylistic adherence. WorldGenBench addresses some reasoning and background knowledge but misses comprehensive, multi-axis coverage and multilingual aspects.

OneIG-Bench uniquely delivers:

  • Simultaneous assessment of alignment, textuality, stylization, diversity, reasoning, and multilingual competence.
  • Explicitly modular design, allowing model evaluation over any subset of capabilities.
  • Automated, dimension-specific metrics tailored for large-scale, reproducible comparison—not solely relying on CLIP/FID or generic statistical summaries.
  • Systematic, controlled prompt diversity, type, and length, supporting robust evaluation across varied linguistic and visual domains.

5. Practical Application and Usage Flexibility

OneIG-Bench is structured for both full and partial evaluation protocols. Users may selectively evaluate T2I models on relevant categories (e.g., only text rendering and stylization for graphical design applications), with independent prompt sets and scoring scripts for each dimension.

Each category is provided as a separately indexed prompt set, and all scripts are available for public use, enabling direct and reproducible performance comparison. The modular nature supports granular model diagnosis—researchers can pinpoint strengths or limitations in specific abilities, facilitating targeted improvements or deployment guidance.
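
A hypothetical sketch of how such category-selective evaluation might be wired together is shown below; the scorer registry, function names, and placeholder values are illustrative assumptions, not OneIG-Bench's actual scripts.

```python
from typing import Callable, Dict

def evaluate(scorers: Dict[str, Callable[[], float]],
             categories: tuple) -> Dict[str, float]:
    """Run only the requested per-category scorers and return their scores."""
    return {name: scorers[name]() for name in categories}

# Example: a graphic-design-oriented evaluation touching only two dimensions.
# The lambdas stand in for full per-category scoring runs (prompt loading,
# image generation, metric computation); the values are placeholders.
scorers = {
    "text_rendering": lambda: 0.81,
    "stylization": lambda: 0.64,
    "alignment": lambda: 0.77,
}
print(evaluate(scorers, categories=("text_rendering", "stylization")))
```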

6. Availability and Community Impact

All benchmark data, including bilingual prompts, reference answers, images, and evaluation scripts, are openly released for public use.

By enabling fair, reproducible, and fine-grained evaluation, OneIG-Bench serves as a central infrastructure for the T2I community. It allows for detailed cross-model, cross-dimension comparison and establishes best practices for scientific progress in T2I research and deployment.

7. Significance in the Text-to-Image Research Landscape

OneIG-Bench is the first benchmark to provide unified and systematic coverage of all salient T2I evaluation axes, facilitating nuanced, multidimensional model auditing. Its comprehensive design, coupled with custom automated metrics and multilingual, real-world prompt design, ensures continued relevance as T2I model capabilities rapidly broaden. This suggests a shift from monolithic, single-metric benchmarks toward category-based, modular frameworks in T2I and similar generative model assessments. Such a trend is likely to inform not only academic progress but also practical adoption in industry and creative domains.