Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models (2312.06712v2)

Published 10 Dec 2023 in cs.CV and cs.AI

Abstract: Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability.
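The abstract describes two objectives computed over cross-attention maps: a Separate loss that penalizes overlap between the attention masks of different object tokens, and an Enhance loss that pushes each object token's attention activation higher. The paper's exact formulations are not given here, so the sketch below is a hypothetical minimal version of that idea, assuming each token's cross-attention map is available as a 2D array:

```python
import numpy as np

def separate_loss(attn_a, attn_b):
    """Hypothetical Separate loss: overlap between the normalized
    cross-attention maps of two object tokens (0 = disjoint, 1 = identical)."""
    a = attn_a / (attn_a.sum() + 1e-8)
    b = attn_b / (attn_b.sum() + 1e-8)
    # Elementwise minimum measures the shared attention mass.
    return float(np.minimum(a, b).sum())

def enhance_loss(attn):
    """Hypothetical Enhance loss: low when the token's peak
    attention activation is high, encouraging strong activation."""
    return float(1.0 - attn.max())
```

In the paper's setting these quantities would be summed over the object tokens in the prompt and backpropagated into the finetuned parameters (rather than into the latent at test time, as in test-time-adaptation baselines); the function names and normalization here are illustrative assumptions, not the authors' definitions.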

Authors (5)
  1. Zhipeng Bao (13 papers)
  2. Yijun Li (56 papers)
  3. Krishna Kumar Singh (46 papers)
  4. Yu-Xiong Wang (87 papers)
  5. Martial Hebert (72 papers)
Citations (3)