
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation (2401.05675v2)

Published 11 Jan 2024 in cs.CV

Abstract: Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate the Pareto-optimal set. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvements in image quality while allowing the trade-off between rewards to be controlled with a reward-related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.

Introduction

The field of text-to-image (T2I) generation has evolved remarkably thanks to diffusion models and pre-trained text encoders, which make it possible to generate images directly from textual descriptions. Despite these advances, producing images that satisfy multiple quality criteria at once, such as aesthetic appeal, adherence to human preferences, and emotional resonance, remains a challenge. To address this, researchers have introduced Parrot, a multi-reward reinforcement learning (RL) framework that optimizes the T2I process using Pareto-optimal selection to balance various image-quality rewards effectively.

Fine-tuning T2I Models with Multiple Rewards

Past methods have explored the use of RL to refine T2I models, achieving quality improvements by using individual quality metrics as reward functions. However, optimizing for multiple quality metrics has typically required manual tuning of reward weights, which is labor-intensive and can over-optimize some metrics at the expense of others. Parrot, in contrast, determines the trade-offs among the rewards automatically. By focusing on the Pareto-optimal set, the subset of images within a training batch that are not dominated on any objective, the model enhances image quality on several fronts jointly.
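
To make the batch-wise selection concrete, here is a minimal sketch in plain NumPy (not the authors' code; the function name and the toy reward values are illustrative) of how the non-dominated, i.e. Pareto-optimal, subset of a training batch can be identified from per-image reward vectors.

```python
import numpy as np

def pareto_optimal_mask(rewards: np.ndarray) -> np.ndarray:
    """rewards: (batch, num_rewards) array, higher is better.
    Returns a boolean mask marking the non-dominated samples."""
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # j dominates i if it is >= on every reward and > on at least one
            if np.all(rewards[j] >= rewards[i]) and np.any(rewards[j] > rewards[i]):
                mask[i] = False
                break
    return mask

# Example: 4 images scored by aesthetics, human preference, and alignment
rewards = np.array([
    [0.7, 0.6, 0.8],
    [0.5, 0.5, 0.5],   # dominated by the first image
    [0.9, 0.4, 0.6],
    [0.6, 0.8, 0.7],
])
print(pareto_optimal_mask(rewards))  # -> [ True False  True  True]
```

In an RL fine-tuning loop, the update would then emphasize the samples flagged by this mask, which is the role batch-wise Pareto-optimal selection plays in Parrot.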

Joint Optimization and Prompt-Centered Guidance

Parrot goes a step further by tuning the prompt expansion network (PEN) jointly with the T2I model. This joint optimization creates better synergy between detailed text prompts and image generation, leading to higher-quality outputs. To keep the expanded prompt from drifting away from the user's intent, the framework also applies original prompt-centered guidance during inference, ensuring that generated images remain true to the user's original input.
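
As a rough illustration, and assuming a classifier-free-guidance-style combination (the weights, the `denoise` callable, and the exact composition below are placeholders rather than the paper's formulation), prompt-centered guidance can be pictured as steering the denoiser with both the original and the expanded prompt, so the added detail cannot fully override the user's text:

```python
# Hedged sketch of prompt-centered guidance at a single denoising step.
# `denoise(latent, t, cond)` stands in for the diffusion model's noise
# prediction; the combination rule and weights are illustrative only.
def guided_noise(denoise, latent, t, orig_prompt_emb, expanded_prompt_emb,
                 null_emb, w_orig=5.0, w_expand=5.0):
    eps_uncond = denoise(latent, t, null_emb)              # unconditional prediction
    eps_orig = denoise(latent, t, orig_prompt_emb)         # conditioned on the user's prompt
    eps_expand = denoise(latent, t, expanded_prompt_emb)   # conditioned on the expanded prompt
    # Pull toward the original prompt first, then layer the expanded detail on top,
    # preserving fidelity to the user's input.
    return (eps_uncond
            + w_orig * (eps_orig - eps_uncond)
            + w_expand * (eps_expand - eps_orig))
```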

Experimental Evaluation

Extensive testing and user studies show that Parrot sets a new standard against various baselines. Compared to methods that do not involve prompt expansion or that fine-tune only part of the generation pipeline, Parrot shows marked improvements in text-image alignment, aesthetics, human preference, and image sentiment. The user study corroborates these findings, with Parrot outperforming the competition across all evaluated criteria.

Conclusion

Parrot's introduction is a significant step towards enhancing the quality of T2I generation. With its novel use of multi-reward RL and Pareto optimization, Parrot improves image quality on multiple fronts. Simultaneously, joint optimization and original prompt-centered guidance safeguard the relevance of the generated images to the original text prompts. As T2I technology continues to evolve, frameworks like Parrot pave the way for increasingly sophisticated digital image creation tools that cater to a variety of quality metrics.

Further Considerations

While the framework advances T2I generation, it's important to note that the quality and biases of the reward models it uses will influence its performance. As the field progresses, refinements in these reward metrics are anticipated to continually improve Parrot's output quality. Additionally, given the potential for misuse in generating inappropriate content, ethical considerations around the user's influence on T2I generation remain critical. As such, responsible development and deployment of such technology are paramount.

Authors (14)
  1. Seung Hyun Lee
  2. Yinxiao Li
  3. Junjie Ke
  4. Innfarn Yoo
  5. Han Zhang
  6. Jiahui Yu
  7. Qifei Wang
  8. Fei Deng
  9. Glenn Entis
  10. Junfeng He
  11. Gang Li
  12. Sangpil Kim
  13. Irfan Essa
  14. Feng Yang
Citations (12)