Evaluating Text-to-Image Faithfulness: The TIFA Framework
The paper introduces TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic metric that measures how faithfully images generated by text-to-image models reflect their textual prompts. The challenge it addresses is that current generation models frequently fail to produce images that accurately match the input text. TIFA tackles this by leveraging visual question answering (VQA) to assess the alignment between a generated image and its prompt.
Overview of TIFA
TIFA combines large language models (LLMs) with VQA to create a fine-grained assessment framework. First, an LLM such as GPT-3 generates a set of question-answer pairs from the input text, designed to cover the elements of the image described in the prompt. The generated questions are then filtered with a question-answering model for validation. To evaluate a generated image, VQA models answer these questions about it, and faithfulness is quantified as the VQA accuracy on those questions. The result is a fine-grained metric that is interpretable and more closely aligned with human judgments than existing techniques.
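To make the scoring procedure concrete, here is a minimal sketch of the TIFA-style loop under the assumptions above. The helpers `generate_qa_pairs` (LLM question generation) and `vqa_answer` (VQA inference) are hypothetical placeholders standing in for the actual models, not the authors' released API.

```python
# Sketch of a TIFA-style faithfulness score: the fraction of
# LLM-generated questions a VQA model answers correctly about the image.
from typing import Callable, List, Tuple


def tifa_score(
    prompt: str,
    image_path: str,
    generate_qa_pairs: Callable[[str], List[Tuple[str, List[str], str]]],
    vqa_answer: Callable[[str, str, List[str]], str],
) -> float:
    """Return VQA accuracy over questions generated from the prompt.

    Each QA pair is (question, answer choices, expected answer); both
    helper callables are assumed stand-ins for an LLM and a VQA model.
    """
    qa_pairs = generate_qa_pairs(prompt)
    if not qa_pairs:
        return 0.0
    correct = 0
    for question, choices, expected in qa_pairs:
        predicted = vqa_answer(image_path, question, choices)
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```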
TIFA Benchmark
Alongside the metric, the authors present TIFA v1.0, a benchmark of 4,081 diverse text prompts and 25,829 generated questions spanning 12 categories, including objects, activities, attributes, and spatial relations. The benchmark enables systematic evaluation of text-to-image models and provides the tooling to apply TIFA across a range of text inputs and image generation systems. It establishes a common standard for comparing models and highlights specific areas, such as counting and spatial reasoning, where current models still face significant challenges.
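Because every question carries a category label, per-category accuracy is a natural way to surface these weaknesses. The sketch below illustrates that aggregation; the record fields ("category", "correct") are assumed for illustration and may not match the released TIFA v1.0 format.

```python
# Sketch: average VQA accuracy per question category (objects, counting,
# spatial relations, ...), exposing where a given model struggles.
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def per_category_accuracy(results: Iterable[Mapping]) -> Dict[str, float]:
    """Aggregate correct/total counts per category and return accuracies."""
    totals: Dict[str, list] = defaultdict(lambda: [0, 0])
    for record in results:
        counts = totals[record["category"]]
        counts[0] += int(record["correct"])  # correctly answered questions
        counts[1] += 1                       # total questions in category
    return {cat: correct / total for cat, (correct, total) in totals.items()}
```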
Experimental Evaluation
The empirical evaluation on TIFA v1.0 demonstrates the robustness of the TIFA metric relative to existing metrics such as CLIPScore and caption-based scoring approaches. Notably, TIFA achieves a higher correlation with human judgments, validating its efficacy in measuring text-to-image fidelity. The experiments cover several text-to-image models, including different versions of Stable Diffusion and VQ-Diffusion, and show progressive improvement in model performance over time. However, notable gaps remain, especially for complex scenes involving multiple entities or intricate spatial relationships.
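The kind of meta-evaluation described here, checking how well an automatic metric tracks human judgments, typically boils down to a rank correlation. The sketch below shows one common way to compute it; the score and rating values are made-up examples, not results from the paper.

```python
# Sketch: correlate automatic faithfulness scores with human ratings
# using Spearman rank correlation (illustrative, fabricated values).
from scipy.stats import spearmanr

tifa_scores = [0.90, 0.45, 0.70, 0.30, 0.85]   # hypothetical metric scores per image
human_ratings = [5, 2, 4, 1, 5]                 # hypothetical 1-5 human judgments

rho, p_value = spearmanr(tifa_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")
```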
Implications and Future Directions
Theoretically, this work underscores the value of VQA-based evaluation for gaining detailed insight into model capabilities, offering more informative feedback than prior methods that reduce alignment to a single score. Practically, TIFA provides a more reliable metric for guiding the development and fine-tuning of text-to-image models, helping researchers pinpoint specific deficiencies in model outputs.
Future research could expand TIFA's scope, for example by customizing benchmarks for specific aspects such as abstract art or highly detailed scenes. As VQA models improve, the accuracy and applicability of TIFA should grow as well, potentially extending to additional domains and modalities such as video or 3D generation. Such developments would broaden TIFA's utility and establish it as a standard tool for assessing generative models.
In conclusion, TIFA represents a significant advance in evaluating the alignment of generated images with their textual descriptions, addressing the limitations of prior metrics and setting the stage for further research and improvement in text-to-image synthesis.