
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training (2312.16204v4)

Published 23 Dec 2023 in cs.CV

Abstract: Diffusion models have shown impressive performance in many domains. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. In this work, we propose Iterative Prompt Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. IPR first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. Our code is publicly available at https://github.com/xinyan-cxy/IPR-RLDF.
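The relabeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ToyImage`, `detect_relation`, `make_prompt`, and `relabel_batch` are hypothetical names, the "image" is only a pair of x-coordinates, and the "classifier" is a simple coordinate comparison standing in for whatever detector supplies the feedback. The sketch assumes that an unmatched pair is relabeled with a prompt describing what the image actually shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyImage:
    # Stand-in for a sampled image: names and x-centers of two objects.
    obj_a: str
    obj_b: str
    x_a: float
    x_b: float


def detect_relation(img: ToyImage) -> str:
    """Hypothetical classifier feedback: the spatial relation the image shows."""
    return "to the left of" if img.x_a < img.x_b else "to the right of"


def make_prompt(obj_a: str, relation: str, obj_b: str) -> str:
    return f"a {obj_a} {relation} a {obj_b}"


def relabel_batch(samples):
    """Relabel each (obj_a, requested_relation, obj_b, image) sample.

    Returns (prompt, image, matched) triples. Matched pairs keep their
    original prompt; unmatched pairs get a prompt that describes the
    relation observed in the image (an assumption of this sketch).
    """
    relabeled = []
    for obj_a, requested, obj_b, img in samples:
        observed = detect_relation(img)
        prompt = make_prompt(obj_a, observed, obj_b)
        relabeled.append((prompt, img, observed == requested))
    return relabeled


if __name__ == "__main__":
    batch = [
        ("dog", "to the left of", "cat", ToyImage("dog", "cat", 0.2, 0.8)),
        ("dog", "to the left of", "cat", ToyImage("dog", "cat", 0.9, 0.1)),
    ]
    for prompt, _, matched in relabel_batch(batch):
        print(prompt, "| matched" if matched else "| relabeled")
```

In an iterative setup, the relabeled text-image pairs would presumably be used to fine-tune the diffusion model before the next round of sampling; that training step is omitted here.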

