
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (2305.01569v2)

Published 2 May 2023 in cs.CV and cs.AI

Abstract: The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.

An Analysis of the "Pick-a-Pic" Dataset and the PickScore Scoring Function for Text-to-Image Generation

The paper "Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation" presents a comprehensive approach to improving text-to-image generation through the collection of a large-scale dataset of real user preferences and the development of PickScore, a CLIP-based scoring function trained on that data. This research addresses a significant gap in the availability of large, publicly accessible datasets that capture genuine user preferences in text-to-image tasks and proposes a framework for both evaluating and improving model outputs.

The authors introduce a web application that lets users generate images from text prompts they write themselves. Crucially, the application also captures which generated image each user prefers, enabling the accumulation of a dataset that reflects genuine user interests. This dataset, named Pick-a-Pic, comprises over a million examples, each consisting of a text prompt and a pair of generated images, annotated with the user's preferred image or a tie. This stands in contrast to datasets built on crowd-sourcing platforms, where annotators often lack intrinsic motivation or personal investment in the prompts they evaluate.
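To make the data shape concrete, the sketch below shows one plausible way to represent a single example. The field names and types are illustrative assumptions for exposition, not the dataset's published schema.

    # Illustrative record for a Pick-a-Pic-style example.
    # Field names and types are assumptions, not the official schema.
    from dataclasses import dataclass

    @dataclass
    class PreferenceExample:
        prompt: str      # user-written text prompt
        image_a: bytes   # first generated image (e.g., encoded PNG)
        image_b: bytes   # second generated image
        label: float     # 1.0 if image_a preferred, 0.0 if image_b, 0.5 for a tie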

The dataset serves as the basis for training PickScore, a CLIP-based scoring function whose objective is to predict user preferences. PickScore is trained with a preference prediction objective akin to InstructGPT's reward model, maximizing the probability that a preferred image is selected over a less preferred one. The resulting model predicts user preferences with 70.5% accuracy, surpassing the 68.0% accuracy achieved by human annotators on the same task. This finding underscores the utility of real-world, user-generated data for training models that exceed human-level performance in specific domains.
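As a concrete illustration, here is a minimal PyTorch sketch of such a pairwise preference objective, assuming precomputed CLIP text and image embeddings. The function names and the temperature constant are illustrative, not the paper's released code.

    import torch
    import torch.nn.functional as F

    def pickscore(text_emb, image_emb, temperature=100.0):
        # Scaled cosine similarity between text and image embeddings,
        # in the style of CLIP scoring. The temperature is an assumed constant.
        return temperature * F.cosine_similarity(text_emb, image_emb, dim=-1)

    def preference_loss(score_a, score_b, target):
        # Cross-entropy between the softmax over the two image scores and
        # the human preference distribution: (1, 0) if image A was chosen,
        # (0, 1) if image B was chosen, and (0.5, 0.5) for a tie.
        logits = torch.stack([score_a, score_b], dim=-1)   # shape (batch, 2)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(target * log_probs).sum(dim=-1).mean()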

The paper further discusses the implications of using PickScore as an evaluation metric for text-to-image generation models. In comparing PickScore with the traditionally used FID metric and other aesthetics-based predictors, the authors demonstrate that PickScore correlates more closely with human judgment. This suggests that Pick-a-Pic's prompts better capture the nuances of user interests and expectations, providing a valuable standard for assessing the performance of text-to-image algorithms.
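In practice, model-level evaluation with a learned scorer can be as simple as averaging scores over a shared prompt set and ranking models by the mean. The sketch below illustrates this pattern; all names (evaluate_models, generate, scorer) are hypothetical rather than the paper's evaluation code.

    def evaluate_models(models, prompts, scorer):
        # models: mapping from model name to a generate(prompt) -> image callable
        # scorer: callable scoring (prompt, image) pairs, e.g., a PickScore-like model
        mean_scores = {}
        for name, generate in models.items():
            scores = [scorer(p, generate(p)) for p in prompts]
            mean_scores[name] = sum(scores) / len(scores)
        # Higher mean score = closer agreement with modeled user preferences.
        return sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True)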

In addition to evaluation, PickScore is also leveraged for improving output quality through image ranking. By selecting the image with the highest PickScore from a selection of generated images, the authors show that the selected outputs are preferred by users over those chosen by other scoring systems. This technique highlights the utility of PickScore as a mechanism for enhancing the quality of generated images in practice.
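This corresponds to a standard best-of-n selection loop; a minimal sketch, again with hypothetical names:

    def select_best(prompt, generate, scorer, n=4):
        # Generate n candidate images and keep the one the scorer ranks highest.
        candidates = [generate(prompt) for _ in range(n)]
        scores = [scorer(prompt, image) for image in candidates]
        return candidates[scores.index(max(scores))]

Because scoring a candidate is far cheaper than generating one, this trades a modest amount of extra sampling compute for higher expected user preference.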

This research has wide-reaching implications for the future development of text-to-image systems. Beyond the immediate benefits in evaluation and generation quality, the methodology outlined here could inform other domains where modeling genuine user preferences is pivotal. Future work could explore applying reinforcement learning from human feedback (RLHF) to further align model outputs with nuanced user preferences.

In conclusion, the Pick-a-Pic dataset and PickScore mark a shift toward more robust, preference-aware frameworks in generative AI and underline the importance of authentic user-interaction data. With open access to these resources, the research community is better equipped to advance state-of-the-art text-to-image models and align their outputs more closely with human expectations and preferences.

References (18)
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, abs/2204.05862, 2022.
  2. Microsoft COCO Captions: Data collection and evaluation server. arXiv, abs/1504.00325, 2015.
  3. Deep reinforcement learning from human preferences. arXiv, abs/1706.03741, 2017.
  4. Arpad E. Elo. The Rating of Chessplayers, Past and Present. 1978.
  5. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
  6. Jonathan Ho. Classifier-free diffusion guidance. arXiv, abs/2207.12598, 2022.
  7. OpenCLIP, July 2021.
  8. Aligning text-to-image models using human feedback. arXiv, abs/2302.12192, 2023.
  9. Microsoft COCO: Common objects in context. In ECCV, 2014.
  10. Training language models to follow instructions with human feedback. arXiv, abs/2203.02155, 2022.
  11. Simulacra Aesthetic Captions. Technical Report Version 1.0, Stability AI, 2022. URL: https://github.com/JD-P/simulacra-aesthetic-captions
  12. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  13. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
  14. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  15. Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
  16. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv, abs/2210.14896, 2022.
  17. Better aligning text-to-image models with human preference. arXiv, abs/2303.14420, 2023.
  18. ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv, abs/2304.05977, 2023.
Authors (6)
  1. Yuval Kirstain
  2. Adam Polyak
  3. Uriel Singer
  4. Shahbuland Matiana
  5. Joe Penna
  6. Omer Levy