An Analysis of the "Pick-a-Pic" Dataset and the PickScore Scoring Function for Text-to-Image Generation
The paper "Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation" presents a comprehensive approach to enhancing the field of text-to-image generation through the collection of a large-scale dataset of user preferences and the development of a superior scoring function, PickScore. This research addresses a significant gap in the availability of large, publicly accessible datasets that model genuine user preferences in text-to-image tasks and proposes a novel framework for generating and evaluating model outputs.
The authors introduce a web application that generates images from user-provided text prompts. Crucially, the application also captures user preferences, enabling the accumulation of a dataset that reflects what users actually want. This dataset, named Pick-a-Pic, comprises over a million examples, each consisting of a text prompt, a pair of generated images, and a user annotation indicating which image is preferred, or a tie. This stands in contrast to datasets built from crowd-sourcing platforms, where annotators often lack intrinsic motivation or personal investment in the prompts they evaluate.
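For readers who want to inspect the data, a minimal sketch is shown below, assuming the dataset is published on the Hugging Face Hub under the identifier yuvalkirstain/pickapic_v1; the identifier and field names reflect the public release as I understand it and are assumptions, not details stated in the paper.

```python
# Minimal sketch of inspecting Pick-a-Pic, assuming it is hosted on the
# Hugging Face Hub. The dataset identifier and field names below are
# assumptions about the public release, not guaranteed by the paper.
from datasets import load_dataset

# Stream the validation split to avoid downloading the full dataset.
ds = load_dataset("yuvalkirstain/pickapic_v1", split="validation", streaming=True)

for example in ds.take(3):
    # Each example pairs a prompt with two images and a preference label.
    print(example["caption"])
    # label_0 / label_1 are assumed to encode the preference; ties are 0.5/0.5.
    print(example["label_0"], example["label_1"])
```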
The dataset serves as the basis for training PickScore, a CLIP-based scoring function whose objective is to predict user preferences. PickScore is trained with a preference prediction objective akin to the reward-model objective used in InstructGPT, which raises the probability that the preferred image is scored above the less preferred one. PickScore reaches 70.5% accuracy in predicting user preferences, surpassing the 68.0% accuracy achieved by human annotators. This finding underscores the utility of real-world user-generated data for training models that exceed human-level performance in specific domains.
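As a concrete illustration of this objective, the PyTorch sketch below scores each image against the prompt via a scaled dot product of CLIP embeddings and minimizes the divergence between the predicted and annotated preference distributions (which also handles ties). The names are illustrative, not the authors' code.

```python
# A minimal sketch of an InstructGPT-style preference objective for a
# CLIP-based scorer. Function and variable names are illustrative.
import torch
import torch.nn.functional as F

def preference_loss(text_emb, img_emb_0, img_emb_1, labels, temperature=100.0):
    """text_emb, img_emb_*: L2-normalized embeddings of shape (batch, dim).
    labels: target preference distribution of shape (batch, 2);
    a tie is encoded as (0.5, 0.5)."""
    # Score each image against the prompt with a scaled dot product.
    s0 = temperature * (text_emb * img_emb_0).sum(dim=-1)
    s1 = temperature * (text_emb * img_emb_1).sum(dim=-1)
    logits = torch.stack([s0, s1], dim=-1)  # (batch, 2)

    # Cross-entropy against the annotated preference distribution,
    # equivalent (up to a constant) to minimizing the KL divergence.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(labels * log_probs).sum(dim=-1).mean()
```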
The paper further discusses the implications of using PickScore as an evaluation metric for text-to-image generation models. Comparing PickScore with the widely used FID metric and with aesthetics-based predictors, the authors show that PickScore correlates more closely with human judgment. This suggests that Pick-a-Pic's prompts better capture the nuances of user interests and expectations, providing a valuable standard for assessing the performance of text-to-image models.
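To make this use concrete, the hypothetical helper below compares two models on a shared prompt set by counting how often PickScore prefers one model's image over the other's; score_fn stands in for a trained PickScore model and is an assumption, not an API from the paper.

```python
# Hypothetical model-level evaluation: the win rate of model A over model B
# under a preference scorer. score_fn(prompt, image) -> float is a stand-in
# for a trained PickScore model.
def win_rate(prompts, images_a, images_b, score_fn):
    wins = sum(
        score_fn(p, a) > score_fn(p, b)
        for p, a, b in zip(prompts, images_a, images_b)
    )
    return wins / len(prompts)
```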
In addition to evaluation, PickScore can be leveraged to improve output quality through image ranking. By selecting the image with the highest PickScore from a set of candidates generated for the same prompt, the authors show that the selected outputs are preferred by users over those chosen by other scoring functions. This technique highlights the utility of PickScore as a practical mechanism for enhancing the quality of generated images.
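A minimal best-of-n sketch of this ranking procedure follows; generate_images and score_fn are hypothetical placeholders for a text-to-image model and a trained PickScore function, not interfaces from the paper.

```python
# Best-of-n selection: generate several candidates for a prompt and keep
# the one the preference scorer ranks highest. Both callables are
# placeholders for a text-to-image model and a trained PickScore function.
def best_of_n(prompt, generate_images, score_fn, n=9):
    candidates = generate_images(prompt, num_images=n)
    return max(candidates, key=lambda img: score_fn(prompt, img))
```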
This research has wide-reaching implications for the future development of text-to-image systems. Beyond the immediate benefits in evaluation and generation quality, the methodology outlined in this work could inform other domains where modeling genuine user preferences is pivotal. Future work could explore the application of reinforcement learning with human feedback (RLHF) techniques to further align model outputs with intricate user requirements.
In conclusion, the Pick-a-Pic dataset and PickScore underscore a shift towards more robust, preference-aware frameworks in generative AI, underlining the importance of authentic user interaction data. Through open access to these resources, the research community is equipped to advance state-of-the-art text-to-image models, aligning their outputs more closely with human expectations and preferences.