
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization (2404.04650v1)

Published 6 Apr 2024 in cs.CV

Abstract: Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, achieving seamless alignment between the generated image and the provided prompt remains a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Given a text prompt, not all random noise is effective for synthesizing semantically faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at https://github.com/xiefan-guo/initno.
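The abstract's core idea — scoring an initial noise sample with attention-derived criteria and nudging it toward the "valid" sector before denoising — can be illustrated with a toy sketch. This is not the paper's implementation (see the linked repository for that); the score definitions and the finite-difference optimizer below are simplified stand-ins for the authors' cross-attention response score, self-attention conflict score, and optimization pipeline.

```python
import numpy as np

def cross_attention_response_score(attn_maps):
    """Toy cross-attention response score.

    attn_maps: list of (H, W) arrays, one per subject token. The idea from the
    paper is that every target token should elicit a strong response somewhere;
    here the score is the weakest token's maximum activation.
    """
    return min(float(m.max()) for m in attn_maps)

def self_attention_conflict_score(attn_a, attn_b):
    """Toy self-attention conflict score.

    Measures overlap between two subjects' (normalized) attention maps; high
    overlap suggests the subjects compete for the same spatial region.
    """
    a = attn_a / attn_a.sum()
    b = attn_b / attn_b.sum()
    return float(np.minimum(a, b).sum())

def optimize_initial_noise(noise, score_fn, steps=100, lr=0.5,
                           eps=1e-3, threshold=0.99):
    """Hypothetical optimization loop: push `noise` toward the region where
    `score_fn` exceeds `threshold`, via finite-difference gradient ascent.
    (InitNO itself backpropagates through the denoising network instead.)
    """
    noise = noise.astype(float).copy()
    for _ in range(steps):
        base = score_fn(noise)
        if base >= threshold:
            break  # noise already lies in the valid sector
        grad = np.zeros_like(noise)
        for i in range(noise.size):
            bumped = noise.copy()
            bumped.flat[i] += eps
            grad.flat[i] = (score_fn(bumped) - base) / eps
        noise += lr * grad
    return noise
```

Usage: with a toy score such as `score(z) = exp(-||z - z*||^2)`, starting noise with a low score is pulled toward the high-score (valid) region, mirroring the paper's claim that optimized initial noise yields more prompt-faithful generations.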

