- The paper introduces the human-calibrated W1KP metric to quantitatively measure image variability from text prompts.
- The paper validates W1KP on three curated datasets across diffusion models, achieving up to an 18-point accuracy improvement over baselines.
- The paper identifies 56 linguistic features influencing image diversity, guiding optimal prompt reusability for state-of-the-art models.
Measuring and Understanding Perceptual Variability in Text-to-Image Generation
Overview
Text-to-image generation has advanced considerably with the advent of diffusion models, which achieve state-of-the-art results in producing high-quality images from textual descriptions. Despite these advances, perceptual variability, that is, how much the images generated from the same prompt differ from one another, remains underexplored. The paper "Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation" addresses this gap by introducing W1KP, a human-calibrated measure of image variability built on existing perceptual distance metrics.
Key Contributions
- W1KP Metric Development: The researchers propose the W1KP (Words Worth a Thousand Pictures) measure to quantify the perceptual variability of images generated from a text prompt. The measure is bootstrapped from perceptual distances between pairs of generated images and calibrated to correspond to graded human judgements of similarity; a minimal sketch of this pairwise scoring appears after this list.
- Evaluation and Validation: The paper curates three new datasets and validates the W1KP framework against nine baseline measures, which it outperforms by up to 18 points in accuracy.
- Prompt Reusability: Using the W1KP metric, the paper investigates how many times a text prompt can be reused with different random seeds before the generated images become too similar. Findings show that prompts can be reused 10–50 times for Imagen and 50–200 times for Stable Diffusion XL (SDXL) and DALL-E 3.
- Linguistic Feature Analysis: The paper identifies 56 linguistic features influencing the variability in generated images, highlighting that prompt length, CLIP embedding norm, concreteness, and the number of word senses are the most influential.
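As a concrete illustration of the pairwise scoring mentioned in the first bullet, here is a minimal sketch, not the paper's exact formulation: it assumes a `perceptual_distance` function (a hypothetical stand-in for a DreamSim-style backbone) that returns a distance in [0, 1] between two images generated from the same prompt with different seeds.

```python
# Minimal sketch of a W1KP-style raw variability score.
# Assumption (not from the paper): `perceptual_distance` is a stand-in for a
# DreamSim-like backbone returning a distance in [0, 1] between two images.
from itertools import combinations
from typing import Callable, Sequence

def prompt_variability(
    images: Sequence,                # images generated from one prompt
    perceptual_distance: Callable,   # distance in [0, 1] between two images
) -> float:
    """Average pairwise perceptual distance across all images from one prompt."""
    pairs = list(combinations(range(len(images)), 2))
    if not pairs:
        raise ValueError("Need at least two images to measure variability.")
    total = sum(perceptual_distance(images[i], images[j]) for i, j in pairs)
    return total / len(pairs)
```

In the paper, such raw distances are then normalized and calibrated against human judgements (see Methodology below). The same machinery underlies the prompt-reusability analysis, since a prompt stops being usefully reusable once newly sampled images are too perceptually close to the ones already generated.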
Methodology
W1KP leverages a systematic approach to measure and understand the perceptual variability in generated images:
- Normalization and Calibration: Pairwise perceptual distances between generated images are normalized to a standard uniform distribution and then calibrated against human judgements, yielding interpretable thresholds for low, medium, and high similarity (a sketch of this normalization step appears after this list).
- Diffusion Models Studied: The paper evaluates three state-of-the-art diffusion models—Stable Diffusion XL (SDXL), DALL-E 3, and Imagen—using the W1KP measure. These models differ in architectural nuances, affecting their variability and prompt reusability.
- Statistical Analysis: Exploratory factor analysis of the 56 linguistic features uncovers four key factors: style keyword presence, syntactic complexity, linguistic unit length, and semantic richness. A subsequent confirmatory lexical analysis demonstrates how these features correlate with perceptual variability across the different diffusion models (an illustrative factor-analysis example also follows this list).
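The normalization step in the first bullet can be illustrated with a short sketch. This is a hedged example, not the paper's code: the empirical-CDF mapping and the low/medium/high cut-offs are placeholders, whereas the paper calibrates its thresholds against human similarity judgements.

```python
# Illustrative normalization of raw perceptual distances to a standard uniform
# distribution via the empirical CDF, plus a coarse bucketing step.
# The cut-offs below are hypothetical placeholders, not calibrated values.
import numpy as np

def fit_empirical_cdf(reference_distances: np.ndarray):
    """Return a function mapping a raw distance to its percentile in [0, 1]."""
    sorted_ref = np.sort(reference_distances)

    def to_uniform(raw_distance: float) -> float:
        rank = np.searchsorted(sorted_ref, raw_distance, side="right")
        return rank / len(sorted_ref)

    return to_uniform

def similarity_level(score: float, low: float = 0.33, high: float = 0.66) -> str:
    """Bucket a normalized distance score: a low score means the images look alike."""
    if score < low:
        return "high similarity (low variability)"
    if score < high:
        return "medium similarity"
    return "low similarity (high variability)"

# Example usage with synthetic reference distances (illustrative only):
to_uniform = fit_empirical_cdf(np.random.default_rng(0).uniform(size=1000))
print(similarity_level(to_uniform(0.2)))
```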
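For the statistical analysis in the last bullet, the sketch below shows how an exploratory factor analysis could be run on a prompt-by-feature matrix. The tooling (scikit-learn's FactorAnalysis with varimax rotation) and the random placeholder data are assumptions for illustration; the paper's exact procedure may differ.

```python
# Illustrative exploratory factor analysis over a prompt-by-feature matrix
# (rows = prompts, columns = the 56 linguistic features). The data here is
# random placeholder noise; only the shape mirrors the setup described above.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_matrix = rng.normal(size=(500, 56))  # placeholder: 500 prompts x 56 features

scaled = StandardScaler().fit_transform(feature_matrix)
fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
fa.fit(scaled)

# Each column of the loading matrix indicates how strongly every linguistic
# feature contributes to one of the four extracted factors.
loadings = fa.components_.T  # shape: (56 features, 4 factors)
print(loadings.shape)
```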
Numerical Results
The evaluation results underscore W1KP's robustness:
- DreamSimℓ2, the backbone perceptual distance model, performs best, reaching up to 78.3% accuracy and surpassing the other backbones and baselines.
- After calibration, W1KP scores align with human-judged similarity levels at roughly 78% accuracy.
Implications and Future Directions
The implications of these findings are twofold:
- Practical Applications: In graphical asset creation, W1KP lets creators gauge how often a prompt can be reused before the generated images become redundant, which is particularly valuable in artistic and design tasks.
- Theoretical Contributions: From a theoretical standpoint, the paper bridges linguistic constructs with visual generation variability, enriching the understanding of cross-modal interactions. Identifying linguistic features that statistically correlate with visual variability guides future research on engineering prompts for a desired level of variability.
Future work might explore training-time influences on variability, expand the annotated datasets for a more thorough classification, and investigate additional factors such as classifier-free guidance. There is also potential in applying these insights to refine and calibrate newer models beyond the diffusion paradigm.
Conclusion
This paper provides a rigorous exploration into perceptual variability within text-to-image generation, presenting a novel metric and thorough analyses that add significant value to the field. By bridging the gap between linguistic characteristics and visual outputs, it sets the stage for nuanced future research in both practical AI applications and theoretical AI development.