Effect of embedding and prompt conditioning choices on Vendi Score-based autoevaluation

Investigate whether different image and multimodal embedding models, and different conditioning prompts, used to compute Vendi Score diversity scores improve agreement with human-annotated rankings of attribute-specific diversity across text-to-image models, and determine which specific embeddings and conditioning prompts outperform those already evaluated (Inception, ViT, DINOv2, CLIP, and PALI variants with attribute-only or object+attribute conditioning).

Background

The paper evaluates automated diversity metrics for text-to-image models by pairing the Vendi Score with various embedding spaces (Inception, ViT, DINOv2, CLIP, and PALI) and different text conditioning schemes (attribute-only and object+attribute). While autoevaluation results broadly align with human judgments, the authors find limited and inconsistent advantages from text conditioning and note that performance depends on the embedding and conditioning choices.
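
As a concrete sketch of the autoevaluation pipeline, the snippet below computes the Vendi Score from a matrix of image embeddings under a cosine-similarity kernel; the kernel choice and the NumPy-based implementation are illustrative assumptions, and the embeddings themselves would come from whichever model is under study (Inception, ViT, DINOv2, CLIP, or PALI).

    import numpy as np

    def vendi_score(embeddings: np.ndarray) -> float:
        """Vendi Score: exp of the Shannon entropy of the eigenvalues
        of the normalized similarity matrix K / n."""
        # L2-normalize rows so K is a cosine-similarity kernel with unit diagonal.
        X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        K = X @ X.T
        n = K.shape[0]
        eigvals = np.linalg.eigvalsh(K / n)
        eigvals = eigvals[eigvals > 1e-12]  # drop numerically zero eigenvalues
        return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

The score ranges from 1 (all embeddings identical) to n (mutually orthogonal embeddings), so it reads as an effective number of distinct samples; swapping the embedding model or conditioning scheme changes only how the rows of the embeddings matrix are produced.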

Given these observations, the authors explicitly state that it remains to be determined whether alternative embedding models and conditioning prompts could yield better autoevaluation alignment with human diversity assessments, leaving this as a concrete direction for future investigation.
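
A minimal way to operationalize that comparison, assuming agreement is measured with a rank correlation such as Kendall's tau (the paper's exact agreement statistic is not reproduced here, and the dictionary inputs are placeholders), is to score each candidate embedding-and-prompt configuration by how well its induced model ranking matches the human one:

    from scipy.stats import kendalltau

    def ranking_agreement(auto_scores: dict, human_ranks: dict) -> float:
        """Kendall's tau between autoeval scores and human-annotated ranks.

        auto_scores: model name -> diversity score (higher = more diverse)
        human_ranks: model name -> human rank (1 = most diverse)
        """
        models = sorted(human_ranks)
        auto = [auto_scores[m] for m in models]
        human = [-human_ranks[m] for m in models]  # negate so both orders align
        tau, _ = kendalltau(auto, human)
        return tau

The open question then reduces to a search over configurations: for each candidate embedding model and conditioning prompt, compute per-model Vendi Scores, evaluate ranking_agreement against the human annotations, and retain the configurations whose agreement exceeds that of the variants already studied.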

References

"It is possible that better choices of models and conditioning prompts can lead to better results, but we leave this question open for future investigation."

Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation (Albuquerque et al., 13 Nov 2025, arXiv:2511.10547), Section 3, subsection "Ranking models with autoevaluation approaches" (label: sec:autoeval-ranking).