Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation (2403.16990v1)

Published 25 Mar 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.

Bounded Attention for Multi-Subject Text-to-Image Generation

The paper proposes a novel approach to address the challenges faced by existing text-to-image diffusion models in generating scenes with multiple, semantically or visually similar subjects. The authors identify a phenomenon termed "semantic leakage," wherein attention layers in the diffusion models inadvertently blend features between distinct subjects during the denoising process. This blending interferes with the model's ability to generate images that faithfully represent given complex prompts.

Methodology

The central contribution of the paper is the introduction of "Bounded Attention," a training-free method aimed at constraining the information flow in these generative models. Bounded Attention operates by modifying the attention computation to mitigate feature leakage, thereby enabling better control over the individuality of each subject in the generated image.
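To make the idea concrete, here is a minimal, hypothetical sketch of such a masked attention step, assuming per-subject boolean masks over query and key tokens (the function and tensor names are illustrative and not taken from the authors' implementation): a query inside one subject's region may attend to that subject's own tokens and to background tokens, but not to tokens belonging to another subject.

```python
import torch

def bounded_attention(q, k, v, subject_masks_q, subject_masks_k):
    """Illustrative bounded (masked) attention; not the authors' implementation.

    q: (B, Nq, d) queries; k, v: (B, Nk, d) keys/values.
    subject_masks_q: (S, Nq) bool, True where a query token belongs to subject s.
    subject_masks_k: (S, Nk) bool, the same for key tokens (for cross-attention
                     these would mark each subject's prompt tokens instead).
    Tokens covered by no subject mask are treated as background and stay visible.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,bkd->bqk", q, k) * scale             # (B, Nq, Nk)

    # Queries of subject s may attend to subject s's keys and to background keys,
    # but not to keys belonging to a different subject.
    background_k = ~subject_masks_k.any(dim=0)                      # (Nk,)
    allowed = background_k[None, :].expand(q.shape[1], -1).clone()  # (Nq, Nk)
    for mq, mk in zip(subject_masks_q, subject_masks_k):
        allowed[mq] |= mk[None, :]
    # Background queries remain unconstrained in this simplified sketch.
    allowed[~subject_masks_q.any(dim=0)] = True

    logits = logits.masked_fill(~allowed[None], float("-inf"))
    return torch.einsum("bqk,bkd->bqd", logits.softmax(dim=-1), v)
```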

The approach is divided into two phases:

  1. Bounded Guidance: During the initial denoising steps, a guidance loss steers the cross- and self-attention maps to align with the intended subject layout, nudging the latent representation toward accurate subject placement without imposing hard mask constraints (a simplified sketch of such a loss follows this list).
  2. Bounded Denoising: Throughout the entire denoising process, subject-specific attention masks are applied to both cross- and self-attention layers, preventing unwanted information leakage between subjects while still allowing interaction with the background to preserve overall image coherence.
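The guidance loss of phase 1 can be illustrated, under simplifying assumptions, as a penalty on the cross-attention mass that falls outside each subject's layout region, with the latent updated along the negative gradient of this loss during the early steps. The region-mask representation, the `get_cross_attention` helper, and the step size `eta` below are hypothetical and not the paper's exact formulation.

```python
import torch

def bounded_guidance_loss(cross_attn, subject_token_ids, subject_region_masks):
    """Penalize cross-attention mass that leaks outside each subject's region.

    cross_attn: (Nq, T) map from Nq spatial tokens to T prompt tokens,
                averaged over heads (and optionally layers).
    subject_token_ids: one prompt-token index per subject.
    subject_region_masks: list of (Nq,) bool masks, one layout region per subject.
    """
    loss = 0.0
    for tok, region in zip(subject_token_ids, subject_region_masks):
        attn_t = cross_attn[:, tok]              # (Nq,) attention to this subject token
        inside = attn_t[region].sum()
        total = attn_t.sum() + 1e-8
        loss = loss + (1.0 - inside / total)     # fraction of mass outside the region
    return loss / len(subject_token_ids)


# Hypothetical usage in an early denoising step (classifier-guidance style update):
# latent = latent.detach().requires_grad_(True)
# loss = bounded_guidance_loss(get_cross_attention(latent), token_ids, region_masks)
# latent = latent - eta * torch.autograd.grad(loss, latent)[0]
```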

The method is validated on both Stable Diffusion and SDXL, where it proves effective compared to existing layout-guided generation methods.

Results and Implications

Experiments demonstrate that Bounded Attention significantly reduces semantic leakage, allowing for accurate generation of multiple subjects with distinct attributes even in scenarios where subjects share visual similarity. This is achieved without any retraining or fine-tuning, offering an efficient solution applicable to pre-existing models.

Quantitatively, the approach achieves strong results on complex prompt-based generation tasks, outperforming state-of-the-art methods in both the training-based and training-free categories.

Practically, this technique enhances user control in applications demanding precise image synthesis from textual descriptions. Theoretically, it opens new avenues for research into attention mechanisms and their role in multi-subject generative tasks.

Future Directions

The paper lays the groundwork for further exploration into automatic seed generation aligned with complex prompts and investigating more advanced segmentation techniques during the denoising stages. Additionally, the method may be extended to other generative frameworks that rely heavily on attention mechanisms, contributing to a broader understanding of feature alignment in high-fidelity image synthesis.

In summary, Bounded Attention provides a robust framework for improving multi-subject text-to-image generation by addressing intrinsic architectural biases in diffusion models. This work not only advances the practical capabilities of generative models but also deepens the theoretical understanding of their operational dynamics.

Authors (4)
  1. Omer Dahary
  2. Or Patashnik
  3. Kfir Aberman
  4. Daniel Cohen-Or
Citations (13)