- The paper presents GRADE, a method for quantifying sample diversity in T2I models without relying on reference images.
- It pairs a large language model (LLM) with a visual question-answering (VQA) model to measure diversity across 400 concept-attribute pairs, scoring each pair with entropy.
- Results indicate an inverse-scaling trend in which larger models yield less diverse outputs, with low diversity traced largely to homogeneous images paired with underspecified captions in the training data.
A Formal Analysis of "GRADE: Quantifying Sample Diversity in Text-to-Image Models"
The paper "GRADE: Quantifying Sample Diversity in Text-to-Image Models," introduces a novel method for evaluating diversity in outputs generated by text-to-image (T2I) models when given underspecified prompts. The researchers address two critical questions: do T2I models generate diverse outputs under such conditions, and how can this diversity be measured? The authors propose GRADE (Granular Attribute Diversity Evaluation), a method that evaluates diversity without relying on reference images, offering an improvement over traditional metrics like Frechet Inception Distance (FID) and Precision-and-Recall, which have limitations in capturing the nuanced diversity in T2I model outputs.
The paper observes that current T2I models often default to generating outputs with limited variation, a phenomenon the authors call default behavior: models consistently produce similar images concentrated on a few attribute values, such as cookies that are predominantly round despite the variability one would expect. To quantify this, GRADE combines an LLM with a visual question-answering (VQA) model: the LLM proposes concept-specific axes of diversity (e.g., shape, color) and corresponding questions, the VQA model answers those questions for each generated image, and the resulting answer distribution is scored with entropy. Using this measure, GRADE finds that even state-of-the-art models like FLUX.1-dev exhibit low diversity.
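The scoring step reduces to an entropy computation over VQA answers. Below is a minimal sketch, assuming the answers for a single concept-attribute pair have already been collected; the function name and the normalization choice (dividing by the log of the number of attribute values) are illustrative, not the paper's exact implementation.

```python
import math
from collections import Counter

def normalized_entropy(answers, num_possible=None):
    """Diversity score for one concept-attribute pair: normalized
    Shannon entropy of the VQA answer distribution (1.0 means answers
    are uniform, 0.0 means every image got the same answer)."""
    counts = Counter(answers)
    total = len(answers)
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    # Normalize by the log of the number of possible attribute values
    # (e.g., as proposed by the LLM) so scores are comparable across
    # attributes; fall back to the number of observed values.
    k = num_possible or len(counts)
    return h / math.log2(k) if k > 1 else 0.0

# e.g., VQA answers to "What is the shape of the cookie?" over 100 images
answers = ["round"] * 93 + ["square"] * 4 + ["heart-shaped"] * 3
print(f"diversity = {normalized_entropy(answers):.2f}")  # ~0.27: mostly round
```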
The researchers employ GRADE to evaluate the diversity of 12 prominent T2I models across 400 concept-attribute pairs, revealing consistently limited diversity in their outputs. A key finding is a negative correlation between model size and diversity: larger models often produce less diverse outputs. This suggests an inverse-scaling trend, contrary to the common expectation that capability gains from scale would also boost diversity.
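To make the size-diversity claim concrete, a rank correlation between parameter counts and per-model mean diversity captures the trend. The sketch below is illustrative only: the parameter counts and mean scores are invented placeholders, not figures reported in the paper.

```python
# Hypothetical data: model sizes vs. mean GRADE diversity scores.
from scipy.stats import spearmanr

params_b       = [0.9, 2.6, 3.5, 8.1, 12.0]      # model size (billions of parameters)
mean_diversity = [0.41, 0.33, 0.38, 0.29, 0.24]  # mean normalized entropy per model

rho, p = spearmanr(params_b, mean_diversity)
print(f"Spearman rho = {rho:.2f}")  # -0.90: bigger models, less diversity
```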
Additionally, the paper probes the cause of low diversity and attributes it primarily to homogeneity in the training data: when training captions are underspecified, the images paired with them tend to lack diversity, and models reproduce this homogeneity in their outputs. Experiments show a strong correlation between training-data diversity and generated-image diversity, corroborating the hypothesis that underspecified training data fosters a lack of diversity in T2I outputs.
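One can approximate the caption-side analysis with a toy underspecification check. The keyword list, helper, and example captions below are hypothetical; the paper's actual analysis of training captions is more involved than simple keyword matching.

```python
# Toy check: does a caption specify the attribute of interest (shape)?
SHAPE_WORDS = {"round", "square", "rectangular", "heart-shaped", "star-shaped"}

def specifies_shape(caption: str) -> bool:
    return any(word in caption.lower() for word in SHAPE_WORDS)

captions = [
    "a plate of cookies",               # underspecified
    "freshly baked square cookies",     # specifies shape
    "chocolate chip cookies on a rack", # underspecified
]
frac = sum(specifies_shape(c) for c in captions) / len(captions)
print(f"{frac:.0%} of captions specify shape")  # 33%
```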
The implications for the field are significant, particularly in improving training datasets to enhance diversity in generated images, addressing bias, and refining evaluation metrics for more accurate model assessments. Future work, as suggested by the authors, could focus on enriching training data diversity, developing training methods that inherently promote diversity, and extending GRADE to explore relationships between different concepts and attributes simultaneously.
In conclusion, the paper effectively challenges the current paradigms of diversity evaluation in T2I models, providing with GRADE a granular approach that removes the dependence on reference images and aligns more closely with real-world expectations of diversity. Such comprehensive assessments and insights are pivotal for advancing T2I systems toward more varied, creative, and ultimately more useful visual content.