
Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation (2509.21227v1)

Published 25 Sep 2025 in cs.CV and cs.CL

Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page: https://amirkasaei.com/eval-the-evals/

Summary

  • The paper presents a comprehensive analysis of 12 evaluation metrics for compositional text-to-image generation, assessing their alignment with human judgment.
  • It categorizes metrics into embedding-based, VQA-based, and image-only families, detailing their performance across eight compositional challenges.
  • The study advocates multi-metric evaluation, highlighting the importance of careful metric selection for training and optimizing generative T2I systems.

Evaluating Metrics for Compositional Text-to-Image Generation

Introduction

The evaluation of compositional text-to-image (T2I) generation models is a critical bottleneck in the advancement of multimodal generative systems. While recent models such as Stable Diffusion and DALL-E have demonstrated impressive capabilities in generating visually appealing images from textual prompts, the challenge remains in reliably assessing whether generated images faithfully reflect the compositional semantics—objects, attributes, and relations—specified in the input text. This paper presents a comprehensive analysis of twelve widely adopted evaluation metrics, spanning embedding-based, content-based (VQA-style), and image-only families, and benchmarks their alignment with human judgment across eight compositional categories using the T2I-CompBench++ dataset.

Taxonomy of Evaluation Metrics

The study categorizes metrics into three principal families:

  • Embedding-based Metrics: These include CLIPScore, PickScore, HPS, ImageReward, and BLIP-2, which quantify text-image alignment via similarity in a shared representation space or by leveraging models trained on human preference data.
  • Content-based (VQA-based) Metrics: Metrics such as VQAScore, TIFA, DA Score, DSG, and B-VQA operationalize compositional alignment as a visual question answering task, querying the image for entities, attributes, and relations derived from the prompt.
  • Image-only Metrics: CLIP-IQA and Aesthetic Score focus on perceptual quality and aesthetics, independent of textual alignment.

This taxonomy reflects the diverse approaches to quantifying compositional faithfulness, with each family targeting distinct aspects of the text-image correspondence.

Experimental Design and Benchmarks

The evaluation leverages the T2I-CompBench++ benchmark, which provides a fine-grained categorization of compositional challenges: entity existence, attribute binding (color, shape, texture), spatial relations (2D/3D), non-spatial relations, numeracy, and complex prompts. The dataset comprises 2,400 text-image pairs generated by six state-of-the-art T2I models, each annotated with human ratings. The metrics are assessed for their correlation with human judgment using Spearman, Pearson, and Kendall statistics, and their joint predictive power is analyzed via linear regression.
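As a rough illustration of this setup (not the authors' released code), the sketch below computes the three correlation statistics between one metric's scores and human ratings within a single compositional category. The column names and data layout are assumptions made for the example.

```python
import pandas as pd
from scipy.stats import spearmanr, pearsonr, kendalltau

def correlate_with_humans(df: pd.DataFrame, metric_col: str, human_col: str = "human_rating") -> dict:
    """Correlation of one automated metric with human ratings on one category slice.

    `df` is assumed to hold one row per generated image, with a column of metric
    scores and a column of human ratings (this layout is hypothetical).
    """
    x = df[metric_col].to_numpy()
    y = df[human_col].to_numpy()
    rho, _ = spearmanr(x, y)    # rank correlation
    r, _ = pearsonr(x, y)       # linear correlation
    tau, _ = kendalltau(x, y)   # pairwise-ordering agreement
    return {"spearman": rho, "pearson": r, "kendall": tau}

# Hypothetical usage: one correlation table per compositional category.
# results = {
#     (cat, m): correlate_with_humans(group, m)
#     for cat, group in annotations.groupby("category")
#     for m in ["clip_score", "image_reward", "vqa_score"]
# }
```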

Correlation and Regression Analysis

Category-wise Metric Performance

The results reveal that no single metric consistently achieves high correlation with human scores across all compositional categories. For example, DA Score leads in color, while ImageReward excels in shape and texture. VQA Score is strongest for 2D spatial relations and complex prompts, whereas DSG is optimal for 3D spatial relations. TIFA is most effective for numeracy. Image-only metrics (CLIP-IQA, Aesthetic) are consistently weak, confirming their limited utility for compositional evaluation.

Regression Insights

Linear regression models fitted per category indicate that embedding-based metrics (HPS, PickScore, ImageReward) and VQA-based metrics (DA Score, VQA Score, TIFA) provide complementary signals. The relative importance of each metric shifts depending on the compositional challenge, and some metrics (e.g., CLIP-IQA, Aesthetic) receive negligible or negative coefficients, further underscoring their lack of relevance for compositional alignment.
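A minimal sketch of this kind of per-category regression, assuming z-scored metric scores as features and human ratings as the target (variable names are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def per_category_coefficients(X: np.ndarray, y: np.ndarray, metric_names: list[str]) -> dict:
    """Fit human ratings on standardized metric scores for one compositional category.

    X: (n_samples, n_metrics) matrix of metric scores; y: human ratings.
    Standardization puts metrics on a comparable scale so coefficient magnitudes
    hint at each metric's relative contribution (illustrative only).
    """
    Xz = StandardScaler().fit_transform(X)   # z-score each metric column
    reg = LinearRegression().fit(Xz, y)      # ordinary least squares per category
    return dict(zip(metric_names, reg.coef_))
```

Metrics with near-zero or negative coefficients in such a fit add little beyond the others for that category, which is the pattern the paper reports for the image-only metrics.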

Distributional Properties

The distribution of metric scores exhibits distinct patterns (Figure 1):

Figure 1: Value distributions of all analyzed metrics over T2I-CompBench++ generations (bin counts normalized by frequency).

  • Embedding-based metrics (CLIPScore, HPS, BLIP-2) are concentrated in mid-range values, limiting their discriminative power.
  • ImageReward spans a broader range, offering better separation.
  • VQA-based metrics are skewed toward high values and often saturate near 1.0, reflecting their quasi-binary nature and reduced ability to distinguish strong candidates.
  • Image-only metrics show either narrow (Aesthetic) or wide (CLIP-IQA) distributions, but neither correlates well with compositional faithfulness.

These findings highlight two major concerns: restricted value ranges in embedding-based metrics and saturation in VQA-based metrics, both of which can undermine their effectiveness as evaluation signals.
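One simple way to surface these two pathologies, value compression and near-ceiling saturation, is to summarize spread and ceiling mass per metric. The thresholds below are arbitrary illustrations, not values from the paper.

```python
import numpy as np
from scipy.stats import skew

def distribution_diagnostics(scores: np.ndarray, ceiling: float = 0.95) -> dict:
    """Summarize one metric's score distribution over a set of generations."""
    s = np.asarray(scores, dtype=float)
    return {
        "iqr": float(np.percentile(s, 75) - np.percentile(s, 25)),  # narrow -> compressed range
        "skew": float(skew(s)),                                     # asymmetry of the distribution
        "frac_near_ceiling": float(np.mean(s >= ceiling)),          # large -> saturation
    }
```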

Implications for Metric Selection and Model Development

The analysis demonstrates that metric selection must be context-dependent and transparent, as reliance on a single metric can misrepresent model performance. The complementary strengths of embedding-based and VQA-based metrics suggest that ensemble approaches or multi-metric evaluation protocols are necessary for robust assessment. Furthermore, the limitations identified in score distributions and category-specific performance have direct implications for the use of these metrics as reward models in training and inference-time optimization of T2I systems.

For practical deployment, developers should:

  • Avoid using CLIPScore or image-only metrics as sole evaluation criteria for compositional alignment.
  • Prefer ImageReward, HPS, DA Score, and VQA Score, but tailor metric selection to the specific compositional challenge.
  • Consider combining metrics from different families to capture a broader spectrum of alignment properties (a minimal combination sketch follows this list).
  • Be aware of saturation and mid-range compression effects when interpreting metric outputs, especially in reward-based optimization.
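The sketch below shows one simple way to combine metrics across families: z-score each metric over the evaluation set so differently scaled scores are comparable, then take a (optionally weighted) average. The weights and metric choices are illustrative assumptions, not recommendations from the paper.

```python
import numpy as np

def combine_metrics(scores: dict[str, np.ndarray], weights: dict[str, float] | None = None) -> np.ndarray:
    """Combine several metrics into one alignment score per image.

    `scores` maps metric name -> array of raw scores over the same images.
    Each metric is z-scored, then averaged (optionally weighted). Illustrative only.
    """
    names = list(scores)
    weights = weights or {n: 1.0 for n in names}
    z = np.stack([
        (scores[n] - scores[n].mean()) / (scores[n].std() + 1e-8) for n in names
    ])
    w = np.array([weights[n] for n in names])[:, None]
    return (w * z).sum(axis=0) / w.sum()

# Hypothetical usage, mixing embedding- and VQA-based families:
# combined = combine_metrics({"image_reward": ir, "da_score": da, "vqa_score": vqa})
```

Z-scoring also mitigates the mid-range compression and saturation effects noted above, since each metric contributes on a common scale rather than through its raw value range.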

Theoretical and Practical Implications

The study underscores the need for more faithful and discriminative evaluation metrics that can generalize across compositional categories. The observed limitations motivate future research into hybrid metrics that integrate structured reasoning with learned representations, as well as the development of benchmarks that better capture the diversity of compositional challenges. The findings also have implications for reinforcement learning and inference-time optimization strategies, where reward model selection can significantly impact the compositional fidelity of generated images.

Conclusion

This work provides a rigorous comparative analysis of compositional T2I evaluation metrics, revealing that no single metric is universally reliable. Embedding-based and VQA-based metrics each contribute valuable but incomplete signals, and their effectiveness varies by compositional category. The results advocate for multi-metric evaluation and careful metric selection to ensure trustworthy assessment and progress in compositional T2I generation. Future research should focus on developing metrics with improved discriminative power and generalizability, as well as on refining benchmarks to better reflect real-world compositional demands.
