InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization (2404.04650v1)
Abstract: Recent advances in diffusion models, exemplified by Stable Diffusion, have demonstrated a remarkable ability to generate visually compelling images. However, aligning the generated image with the given prompt remains a persistent challenge. This paper traces the difficulty to invalid initial noise and proposes Initial Noise Optimization (InitNO), a paradigm that refines this noise. Given a text prompt, not all random noise is equally effective at synthesizing semantically faithful images. We design a cross-attention response score and a self-attention conflict score to evaluate the initial noise, partitioning the initial latent space into valid and invalid regions. A carefully designed noise optimization pipeline then guides the initial noise toward the valid region. Extensive experiments validate that our method reliably generates images that closely adhere to the given text prompts. Our code is available at https://github.com/xiefan-guo/initno.
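The abstract only names the two scores and the optimization pipeline; the sketch below illustrates how such an initial-noise optimization loop could look in a PyTorch/Stable Diffusion setting. It is a minimal illustration, not the paper's implementation: the score formulas, the helper names (`cross_attention_response_score`, `self_attention_conflict_score`, `optimize_initial_noise`), and the hyperparameters (`steps`, `lr`, `threshold`) are all assumptions, and `score_fn` stands in for a single denoising step that exposes the model's cross- and self-attention maps.

```python
import torch

def cross_attention_response_score(cross_attn):
    # Assumed shape: (num_pixels, num_subject_tokens), the cross-attention
    # maps for the prompt's subject tokens. A weak maximum response for any
    # subject token suggests that token is not expressed in the latent.
    max_per_token = cross_attn.max(dim=0).values  # strongest response per token
    return 1.0 - max_per_token.min()              # penalize the most-neglected token

def self_attention_conflict_score(self_attn_a, self_attn_b):
    # Assumed shape: (num_pixels,) self-attention maps associated with two
    # different subject tokens; high overlap indicates the two subjects
    # collapse into one image region (subject mixing).
    return (self_attn_a * self_attn_b).sum()

def optimize_initial_noise(noise, score_fn, steps=50, lr=1e-2, threshold=0.2):
    # Gradient-based refinement of the initial latent toward the "valid"
    # region where the combined attention score is low. `score_fn` is assumed
    # to run one denoising step and combine the two scores above.
    noise = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        loss = score_fn(noise)
        if loss.item() < threshold:  # noise deemed valid; stop early
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```

A full pipeline would presumably also keep the optimized latent close to the standard Gaussian prior the diffusion model expects at the first timestep; that constraint is omitted from this sketch.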