Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation (2403.16990v1)

Published 25 Mar 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.

Bounded Attention for Multi-Subject Text-to-Image Generation

The paper proposes a novel approach to address the challenges faced by existing text-to-image diffusion models in generating scenes with multiple, semantically or visually similar subjects. The authors identify a phenomenon termed "semantic leakage," wherein attention layers in the diffusion models inadvertently blend features between distinct subjects during the denoising process. This blending interferes with the model's ability to generate images that faithfully represent given complex prompts.

Methodology

The central contribution of the paper is the introduction of "Bounded Attention," a training-free method aimed at constraining the information flow in these generative models. Bounded Attention operates by modifying the attention computation to mitigate feature leakage, thereby enabling better control over the individuality of each subject in the generated image.
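To make the idea concrete, here is a minimal, hypothetical sketch of such a masked attention step, assuming per-subject boolean masks over query and key tokens (the function and tensor names are illustrative and not taken from the authors' implementation): a query inside one subject's region may attend to that subject's own tokens and to background tokens, but not to tokens belonging to another subject.

```python
import torch

def bounded_attention(q, k, v, subject_masks_q, subject_masks_k):
    """Illustrative bounded (masked) attention; not the authors' implementation.

    q: (B, Nq, d) queries; k, v: (B, Nk, d) keys/values.
    subject_masks_q: (S, Nq) bool, True where a query token belongs to subject s.
    subject_masks_k: (S, Nk) bool, the same for key tokens (for cross-attention
                     these would mark each subject's prompt tokens instead).
    Tokens covered by no subject mask are treated as background and stay visible.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,bkd->bqk", q, k) * scale             # (B, Nq, Nk)

    # Queries of subject s may attend to subject s's keys and to background keys,
    # but not to keys belonging to a different subject.
    background_k = ~subject_masks_k.any(dim=0)                      # (Nk,)
    allowed = background_k[None, :].expand(q.shape[1], -1).clone()  # (Nq, Nk)
    for mq, mk in zip(subject_masks_q, subject_masks_k):
        allowed[mq] |= mk[None, :]
    # Background queries remain unconstrained in this simplified sketch.
    allowed[~subject_masks_q.any(dim=0)] = True

    logits = logits.masked_fill(~allowed[None], float("-inf"))
    return torch.einsum("bqk,bkd->bqd", logits.softmax(dim=-1), v)
```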

The approach is divided into two phases:

  1. Bounded Guidance: During the initial denoising steps, a guidance loss steers the cross- and self-attention maps to align with the intended subject layout, nudging the latent representation toward accurate subject placement without imposing hard mask constraints (a simplified sketch of such a loss follows this list).
  2. Bounded Denoising: Throughout the entire denoising process, subject-specific attention masks are applied to both cross- and self-attention layers, preventing unwanted information leakage between subjects while still allowing interaction with the background to preserve overall image coherence.
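The guidance loss of phase 1 can be illustrated, under simplifying assumptions, as a penalty on the cross-attention mass that falls outside each subject's layout region, with the latent updated along the negative gradient of this loss during the early steps. The region-mask representation, the `get_cross_attention` helper, and the step size `eta` below are hypothetical and not the paper's exact formulation.

```python
import torch

def bounded_guidance_loss(cross_attn, subject_token_ids, subject_region_masks):
    """Penalize cross-attention mass that leaks outside each subject's region.

    cross_attn: (Nq, T) map from Nq spatial tokens to T prompt tokens,
                averaged over heads (and optionally layers).
    subject_token_ids: one prompt-token index per subject.
    subject_region_masks: list of (Nq,) bool masks, one layout region per subject.
    """
    loss = 0.0
    for tok, region in zip(subject_token_ids, subject_region_masks):
        attn_t = cross_attn[:, tok]              # (Nq,) attention to this subject token
        inside = attn_t[region].sum()
        total = attn_t.sum() + 1e-8
        loss = loss + (1.0 - inside / total)     # fraction of mass outside the region
    return loss / len(subject_token_ids)


# Hypothetical usage in an early denoising step (classifier-guidance style update):
# latent = latent.detach().requires_grad_(True)
# loss = bounded_guidance_loss(get_cross_attention(latent), token_ids, region_masks)
# latent = latent - eta * torch.autograd.grad(loss, latent)[0]
```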

The method is validated on both Stable Diffusion and SDXL, where it proves effective compared to existing layout-guided generation methods.

Results and Implications

Experiments demonstrate that Bounded Attention significantly reduces semantic leakage, allowing for accurate generation of multiple subjects with distinct attributes even in scenarios where subjects share visual similarity. This is achieved without any retraining or fine-tuning, offering an efficient solution applicable to pre-existing models.

Quantitatively, the approach achieves strong results on complex prompt-based generation tasks, outperforming state-of-the-art methods in both the training-based and training-free categories.

Practically, this technique enhances user control in applications demanding precise image synthesis from textual descriptions. Theoretically, it opens new avenues for research into attention mechanisms and their role in multi-subject generative tasks.

Future Directions

The paper lays the groundwork for further exploration into automatic seed generation aligned with complex prompts and investigating more advanced segmentation techniques during the denoising stages. Additionally, the method may be extended to other generative frameworks that rely heavily on attention mechanisms, contributing to a broader understanding of feature alignment in high-fidelity image synthesis.

In summary, Bounded Attention provides a robust framework for improving multi-subject text-to-image generation by addressing intrinsic architectural biases in diffusion models. This work not only advances the practical capabilities of generative models but also deepens the theoretical understanding of their operational dynamics.

Authors (4)
  1. Omer Dahary
  2. Or Patashnik
  3. Kfir Aberman
  4. Daniel Cohen-Or
Citations (13)