Evaluating Human Preferences in Text-to-Image Synthesis
The paper "Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis" presents a refined method for assessing the quality of images produced by text-to-image generative models. This research introduces Human Preference Dataset v2 (HPD v2) and its accompanying scoring metric, Human Preference Score v2 (HPS v2), which together offer a systematic approach to evaluating human preferences concerning generative image models.
Overview of Dataset and Scoring Model
HPD v2 is a large-scale dataset comprising 798,090 human preference annotations collected from pairwise comparisons of 433,760 image pairs. Notably, the images come from nine text-to-image generative models as well as real images from the COCO Captions dataset. The authors emphasize reducing the biases present in earlier preference datasets, particularly those arising from heavy reliance on a few generative models or on heavily stylized prompts. To this end, they employed ChatGPT to cleanse prompts drawn from DiffusionDB, removing stylistic biases and contradictions that are common in human-written prompts.
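To make the annotation format concrete, the sketch below shows one plausible way to represent a single pairwise preference record in Python. The field names and example values are illustrative assumptions, not the actual schema of the released HPD v2 files.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One pairwise human preference annotation (illustrative schema only)."""
    prompt: str     # cleansed prompt (e.g. a ChatGPT-rewritten DiffusionDB prompt)
    image_a: str    # path to the first generated (or COCO) image
    image_b: str    # path to the second image
    preferred: int  # 0 if annotators chose image_a, 1 if they chose image_b

# Hypothetical record: annotators preferred the second image for this prompt.
example = PreferencePair(
    prompt="a red fox resting in a snowy forest clearing",
    image_a="images/model_x/000123.png",
    image_b="images/model_y/000123.png",
    preferred=1,
)
```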
The proposed model, HPS v2, is an advancement over existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID), and CLIP Score, which have been shown to correlate poorly with human preferences. By fine-tuning the CLIP model on HPD v2, HPS v2 predicts human preferences more reliably, providing an evaluation metric that is sensitive to algorithmic improvements in text-to-image generation.
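As a rough illustration of how a CLIP-based preference score is computed, the following sketch uses the open_clip package to embed an image and its prompt and report their scaled cosine similarity. It loads base CLIP weights (ViT-H-14) rather than the fine-tuned HPS v2 checkpoint, so the model name, pretraining tag, and score scaling are assumptions made purely for illustration.

```python
import torch
import open_clip
from PIL import Image

# Base CLIP weights for illustration; HPS v2 would load its fine-tuned checkpoint here.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

def clip_preference_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings, scaled to roughly [0, 100]."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([prompt])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return float((img_feat @ txt_feat.T).item() * 100)
```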
Experimental Validation and Significance
Strong numerical results validate the advantage of HPS v2 over previous preference models. It demonstrates superior generalization across diverse image distributions compared to HPS v1, ImageReward, and PickScore. These findings support its use as the preferred metric, offering a quantifiable measure that aligns closely with human judgment.
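Agreement with human judgment is typically quantified as pairwise accuracy: the fraction of annotated pairs on which a metric assigns the higher score to the image the annotators preferred. The sketch below assumes the illustrative PreferencePair records and scoring function from the earlier snippets, not the paper's exact evaluation code.

```python
from typing import Callable, Iterable

def preference_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the metric ranks the human-preferred image higher."""
    correct, total = 0, 0
    for pair in pairs:
        score_a = score(pair.image_a, pair.prompt)
        score_b = score(pair.image_b, pair.prompt)
        predicted = 0 if score_a >= score_b else 1
        correct += int(predicted == pair.preferred)
        total += 1
    return correct / max(total, 1)
```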
The practical implications of this research are significant. As the field of AI-generated content continues to expand, the ability to quantitatively evaluate image quality from a human-centric perspective becomes critically valuable. HPS v2 thus facilitates the development of algorithms that are more attuned to producing outputs that meet human aesthetic and qualitative expectations.
Future Prospects
Looking ahead, the establishment of a benchmark using HPS v2 paves the way for further advancements both within and beyond the scope of text-to-image synthesis. The paper demonstrates that using prompt categories spanning Animation, Concept Art, Painting, and Photo provides a stable and fair metric for model evaluation. Researchers now have the opportunity to explore enhancements in model architectures, optimization strategies, and prompt engineering that can be validated against this robust benchmark.
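A benchmark run along these lines reduces to scoring a model's generations for each prompt in every style category and averaging per category. The sketch below assumes a hypothetical generate() callable for the model under test and a dictionary of prompt lists per style; neither reflects the paper's released tooling.

```python
from statistics import mean

STYLES = ["Animation", "Concept Art", "Painting", "Photo"]

def benchmark_model(generate, prompts_by_style, score) -> dict:
    """Average preference score per style category for one generative model."""
    results = {}
    for style in STYLES:
        scores = []
        for prompt in prompts_by_style[style]:
            image_path = generate(prompt)           # model under evaluation
            scores.append(score(image_path, prompt))
        results[style] = mean(scores)
    results["overall"] = mean(results[s] for s in STYLES)
    return results
```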
Furthermore, the case studies in the paper, such as the exploration of retrieval-based initialization, point to future directions for improving model performance through targeted adjustments to the generation process.
Conclusions
This paper contributes significantly to the ongoing discourse around AI generative models and their evaluation. By formulating a benchmark grounded in extensive human preference data, the authors have enabled more aligned development and assessment of text-to-image synthesis models. The introduction of HPD v2 and HPS v2 advances the field by offering a metric that better encapsulates human preferences, ultimately supporting the development of algorithms that produce more human-aligned outputs. The availability of the dataset and code further encourages widespread application and refinement, promising continued progress in understanding and enhancing generative model performance.