Understanding the Evaluation of Text-to-Image Models with TIIF-Bench
The paper introduces TIIF-Bench, a benchmark designed to evaluate how faithfully modern Text-to-Image (T2I) models follow user instructions. TIIF-Bench addresses limitations of existing benchmarks, such as insufficient prompt diversity and coarse evaluation granularity, which have made it difficult to measure the alignment between textual instructions and generated images. The benchmark analyzes T2I models along multiple dimensions, including prompt complexity, text rendering, and style control.
TIIF-Bench comprises 5000 prompts spanning three difficulty levels, with variation in prompt length and semantic attributes. Notably, it emphasizes the ability of T2I models to handle intricate textual instructions by introducing dimensions such as text rendering and style control. The benchmark distinguishes itself by organizing concepts into ten carefully curated pools and combining them into novel configurations reflective of real-world scenarios (a simplified sketch of this compositional construction follows below). In addition, it includes 100 designer-level prompts curated to evaluate T2I models under advanced, high-fidelity aesthetic conditions.
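To make the compositional construction concrete, the following sketch samples one concept from each of several attribute pools and joins them into a single instruction. The pool names, contents, and template are hypothetical stand-ins for illustration only; the benchmark's actual ten pools and combination rules are richer.

```python
import random

# Illustrative only: hypothetical attribute pools standing in for the
# benchmark's ten concept pools. Each prompt combines one draw per pool.
POOLS = {
    "attribute": ["bright red", "translucent", "weathered"],
    "object":    ["ceramic teapot", "vintage bicycle", "paper lantern"],
    "relation":  ["on a wooden table", "beside a rain-streaked window"],
    "style":     ["rendered in watercolor", "as a flat vector illustration"],
    "text":      ['with the words "OPEN 24 HOURS" clearly visible'],
}

def compose_prompt(rng: random.Random, pools: dict = POOLS) -> str:
    """Sample one concept per pool and join them into a single instruction."""
    p = {name: rng.choice(options) for name, options in pools.items()}
    return (f"A {p['attribute']} {p['object']} {p['relation']}, "
            f"{p['style']}, {p['text']}.")

print(compose_prompt(random.Random(0)))
# e.g. 'A weathered paper lantern on a wooden table, rendered in watercolor,
#       with the words "OPEN 24 HOURS" clearly visible.'
```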
One of TIIF-Bench's key methodological advances is its evaluation protocol. Instead of conventional CLIP-based metrics, which often fail to capture nuanced semantic alignment, TIIF-Bench uses a VLM-based (vision-language model) strategy that enables attribute-specific, fine-grained evaluation: each prompt is paired with a set of yes/no questions, and a VLM answers them against the generated image to judge how well the output matches the textual instructions (a sketch of this loop follows below). Another significant contribution is the GNED (Global Normalized Edit Distance) metric, a measure of text fidelity that accounts for both character accuracy and length discrepancies in rendered text.
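The following sketch illustrates what such an attribute-level yes/no evaluation loop might look like. The `AttributeCheck` structure and the `ask_vlm` helper are assumptions made for illustration; they are not part of the paper's released tooling, and any VLM client could be plugged in behind the placeholder.

```python
from dataclasses import dataclass

@dataclass
class AttributeCheck:
    attribute: str   # e.g. "color", "spatial relation", "text rendering"
    question: str    # yes/no question tied to one element of the prompt

def ask_vlm(image_path: str, question: str) -> bool:
    """Placeholder: query a vision-language model and parse a yes/no answer."""
    raise NotImplementedError("plug in a VLM client here")

def score_image(image_path: str, checks: list[AttributeCheck]) -> dict[str, float]:
    """Return per-attribute accuracy: the fraction of checks answered 'yes'."""
    per_attribute: dict[str, list[bool]] = {}
    for check in checks:
        answer = ask_vlm(image_path, check.question)
        per_attribute.setdefault(check.attribute, []).append(answer)
    return {attr: sum(answers) / len(answers)
            for attr, answers in per_attribute.items()}

# Example checks for the prompt "a red umbrella to the left of a wooden chair":
checks = [
    AttributeCheck("color", "Is the umbrella in the image red?"),
    AttributeCheck("spatial relation", "Is the umbrella to the left of the chair?"),
]
# scores = score_image("generated.png", checks)  # e.g. {"color": 1.0, ...}
```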
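The summary does not reproduce the exact GNED formula, so the sketch below shows one plausible reading of a globally normalized edit distance for text fidelity: Levenshtein distance normalized by the longer of the target and rendered strings, averaged over all text elements, so that both wrong characters and missing or extra text reduce the score. Treat the function and its normalization choices as assumptions rather than the paper's definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic single-row dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete ca
                        dp[j - 1] + 1,      # insert cb
                        prev + (ca != cb))  # substitute (free if chars match)
            prev = cur
    return dp[-1]

def gned_like_score(rendered: list[str], targets: list[str]) -> float:
    """Text-fidelity score in [0, 1]; higher is better.

    Normalizing by the longer string penalizes both wrong characters and
    length mismatches; averaging over targets penalizes missing renders.
    """
    if not targets:
        return 1.0
    scores = [1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
              for ref, hyp in zip(targets, rendered)]
    return sum(scores) / len(targets)

print(gned_like_score(["OPEN 24 HRS"], ["OPEN 24 HOURS"]))  # 1 - 2/13 ≈ 0.846
```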
The benchmarking results reveal several notable patterns. GPT-4o stands out for its instruction-following capability, leading in nearly all evaluation dimensions owing to its strong autoregressive architecture and comprehension abilities. The comparison between diffusion-based and autoregressive models shows that autoregressive models such as Janus-Pro achieve competitive semantic understanding despite lower image fidelity. Across T2I models, robustness of prompt comprehension correlates positively with resilience to longer prompts, a pattern especially visible in FLUX.1 dev and SD 3.5.
The paper's empirical findings offer useful guidance for the development of next-generation T2I systems, highlighting the need to improve instruction comprehension and to maintain generation quality across prompts of varying length and complexity. Proposed future directions include expanding the linguistic scope of prompts beyond English and exploring further stylistic variations to test T2I models' flexibility in generating diverse, contextually aligned visual content. Such efforts could help close remaining gaps in understanding multimodal composition and foster advances in AI-driven visual content generation.
Overall, TIIF-Bench emerges as a valuable tool for refining the evaluation landscape, offering nuanced, fine-grained assessments of T2I models' ability to follow complex instructions and directly supporting the more grounded development of these models.