Scalable Ranked Preference Optimization for Text-to-Image Generation (2410.18013v2)

Published 23 Oct 2024 in cs.CV

Abstract: Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset "Syn-Pic" improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.
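
The abstract describes two ideas that lend themselves to a concrete sketch: scoring each prompt's candidate images with an ensemble of pre-trained reward models to obtain a ranking (rather than a single pairwise label), and using that ranking to weight a DPO-style pairwise objective. The Python sketch below is a minimal illustration under those assumptions, not the paper's implementation: the ensemble averaging, the NDCG-style position discount, and the `pair_margins` input (standing in for the usual DPO log-ratio margin between two images) are all hypothetical stand-ins for components the paper defines precisely.

```python
import torch
import torch.nn.functional as F


def rank_images(scores_per_model: torch.Tensor) -> torch.Tensor:
    """Rank candidate images for one prompt by averaging reward-model scores.

    scores_per_model: shape [num_models, num_images], the score each
    pre-trained reward model assigns to each generated image.
    Returns ranks[i] = 0 for the most-preferred image, 1 for the next, etc.
    """
    mean_scores = scores_per_model.mean(dim=0)            # ensemble average
    order = torch.argsort(mean_scores, descending=True)   # best image first
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(order.numel())            # position of each image
    return ranks


def ranked_dpo_loss(pair_margins: torch.Tensor,
                    ranks: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Discount-weighted pairwise DPO-style loss over a ranked list (illustrative).

    pair_margins[i, j]: a DPO-style log-ratio margin of image i over image j
    (policy log-ratio minus reference log-ratio); how this margin is computed
    for a diffusion model is outside the scope of this sketch.
    """
    loss = pair_margins.new_zeros(())
    weight_sum = 0.0
    n = ranks.numel()
    for i in range(n):
        for j in range(n):
            if ranks[i] < ranks[j]:  # image i is ranked above image j
                # NDCG-style discount: pairs whose winner sits higher in the
                # ranking get more weight (an assumption of this sketch).
                w = 1.0 / torch.log2(ranks[i].float() + 2.0)
                loss = loss - w * F.logsigmoid(beta * pair_margins[i, j])
                weight_sum += float(w)
    return loss / max(weight_sum, 1e-8)


# Example: 4 images for one prompt, scored by 3 reward models.
scores = torch.tensor([[0.8, 0.2, 0.5, 0.9],
                       [0.7, 0.1, 0.6, 0.8],
                       [0.9, 0.3, 0.4, 0.9]])
ranks = rank_images(scores)                      # tensor([1, 3, 2, 0])
margins = torch.randn(4, 4, requires_grad=True)  # stand-in for real DPO margins
loss = ranked_dpo_loss(margins, ranks)
```

Averaging several reward models and keeping a full ranking, rather than a single winner-loser pair, is what makes such a synthetic dataset usable for listwise objectives like the RankDPO loss the abstract mentions; the particular discount scheme above is only one plausible choice.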
