Understanding the Evaluation of Text-to-Image Models with TIIF-Bench
The paper introduces TIIF-Bench, a benchmark designed to evaluate how faithfully modern Text-to-Image (T2I) models follow user instructions. TIIF-Bench addresses limitations of existing benchmarks, such as insufficient prompt diversity and coarse evaluation granularity, which have made it difficult to measure the alignment between textual instructions and generated images. The benchmark analyzes T2I models along multiple dimensions, including prompt complexity, text rendering, and style control.
TIIF-Bench comprises 5000 prompts spanning three difficulty levels, with variation in prompt length and semantic attributes. Notably, it emphasizes the ability of T2I models to handle intricate textual instructions by introducing dimensions such as text rendering and style control. The benchmark distinguishes itself by organizing concepts into ten carefully curated pools and combining them into novel configurations reflective of real-world scenarios (a simplified sketch of this compositional construction follows below). In addition, it includes 100 designer-level prompts curated to evaluate T2I models under advanced, high-fidelity aesthetic conditions.
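To make the compositional construction concrete, the following sketch samples one concept from each of several attribute pools and joins them into a single instruction. The pool names, contents, and template are hypothetical stand-ins for illustration only; the benchmark's actual ten pools and combination rules are richer.

```python
import random

# Illustrative only: hypothetical attribute pools standing in for the
# benchmark's ten concept pools. Each prompt combines one draw per pool.
POOLS = {
    "attribute": ["bright red", "translucent", "weathered"],
    "object":    ["ceramic teapot", "vintage bicycle", "paper lantern"],
    "relation":  ["on a wooden table", "beside a rain-streaked window"],
    "style":     ["rendered in watercolor", "as a flat vector illustration"],
    "text":      ['with the words "OPEN 24 HOURS" clearly visible'],
}

def compose_prompt(rng: random.Random, pools: dict = POOLS) -> str:
    """Sample one concept per pool and join them into a single instruction."""
    p = {name: rng.choice(options) for name, options in pools.items()}
    return (f"A {p['attribute']} {p['object']} {p['relation']}, "
            f"{p['style']}, {p['text']}.")

print(compose_prompt(random.Random(0)))
# e.g. 'A weathered paper lantern on a wooden table, rendered in watercolor,
#       with the words "OPEN 24 HOURS" clearly visible.'
```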
One of TIIF-Bench's key methodological advances is its evaluation protocol. Instead of conventional CLIP-based metrics, which often fail to capture nuanced semantic alignment, TIIF-Bench uses a VLM-based (vision-language model) strategy that enables attribute-specific, fine-grained evaluation: each prompt is paired with a set of yes/no questions, and a VLM answers them against the generated image to judge how well the output matches the textual instructions (a sketch of this loop follows below). Another significant contribution is the GNED (Global Normalized Edit Distance) metric, a measure of text fidelity that accounts for both character accuracy and length discrepancies in rendered text.
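The following sketch illustrates what such an attribute-level yes/no evaluation loop might look like. The `AttributeCheck` structure and the `ask_vlm` helper are assumptions made for illustration; they are not part of the paper's released tooling, and any VLM client could be plugged in behind the placeholder.

```python
from dataclasses import dataclass

@dataclass
class AttributeCheck:
    attribute: str   # e.g. "color", "spatial relation", "text rendering"
    question: str    # yes/no question tied to one element of the prompt

def ask_vlm(image_path: str, question: str) -> bool:
    """Placeholder: query a vision-language model and parse a yes/no answer."""
    raise NotImplementedError("plug in a VLM client here")

def score_image(image_path: str, checks: list[AttributeCheck]) -> dict[str, float]:
    """Return per-attribute accuracy: the fraction of checks answered 'yes'."""
    per_attribute: dict[str, list[bool]] = {}
    for check in checks:
        answer = ask_vlm(image_path, check.question)
        per_attribute.setdefault(check.attribute, []).append(answer)
    return {attr: sum(answers) / len(answers)
            for attr, answers in per_attribute.items()}

# Example checks for the prompt "a red umbrella to the left of a wooden chair":
checks = [
    AttributeCheck("color", "Is the umbrella in the image red?"),
    AttributeCheck("spatial relation", "Is the umbrella to the left of the chair?"),
]
# scores = score_image("generated.png", checks)  # e.g. {"color": 1.0, ...}
```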
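The summary does not reproduce the exact GNED formula, so the sketch below shows one plausible reading of a globally normalized edit distance for text fidelity: Levenshtein distance normalized by the longer of the target and rendered strings, averaged over all text elements, so that both wrong characters and missing or extra text reduce the score. Treat the function and its normalization choices as assumptions rather than the paper's definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic single-row dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete ca
                        dp[j - 1] + 1,      # insert cb
                        prev + (ca != cb))  # substitute (free if chars match)
            prev = cur
    return dp[-1]

def gned_like_score(rendered: list[str], targets: list[str]) -> float:
    """Text-fidelity score in [0, 1]; higher is better.

    Normalizing by the longer string penalizes both wrong characters and
    length mismatches; averaging over targets penalizes missing renders.
    """
    if not targets:
        return 1.0
    scores = [1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
              for ref, hyp in zip(targets, rendered)]
    return sum(scores) / len(targets)

print(gned_like_score(["OPEN 24 HRS"], ["OPEN 24 HOURS"]))  # 1 - 2/13 ≈ 0.846
```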
The benchmarking results reveal several notable patterns. GPT-4o stands out for its instruction-following capability, leading in nearly all evaluation dimensions owing to its strong autoregressive architecture and comprehension abilities. The comparison between diffusion-based and autoregressive models shows that autoregressive models such as Janus-Pro achieve competitive semantic understanding despite lower image fidelity. Across T2I models, robustness of prompt comprehension correlates positively with resilience to longer prompts, a pattern especially visible in FLUX.1 dev and SD 3.5.
The paper's empirical findings offer useful guidance for the development of next-generation T2I systems, highlighting the need to improve instruction comprehension and to maintain generation quality across prompts of varying length and complexity. Proposed future directions include expanding the linguistic scope of prompts beyond English and exploring further stylistic variations to test T2I models' flexibility in generating diverse, contextually aligned visual content. Such efforts could help close remaining gaps in understanding multimodal composition and foster advances in AI-driven visual content generation.
Overall, TIIF-Bench emerges as a valuable tool for refining the evaluation landscape, offering nuanced, fine-grained assessments of T2I models' ability to follow complex instructions and directly supporting the more grounded development of these models.