Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation
Overview
In the paper "MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation," the authors introduce Mixture-of-Attention (MoA), a new architecture for personalizing text-to-image diffusion models. The design is inspired by the Mixture-of-Experts mechanism used in large language models: MoA runs two attention pathways that separately handle personalized and non-personalized content, improving subject-context disentanglement without sacrificing the diversity and richness of the generated images.
Architectural Design
MoA comprises two branches, a personalized branch and a non-personalized prior branch, combined by a learned router. The prior branch reuses the frozen attention layers of the pretrained base model, preserving its original capabilities. The personalized branch, in contrast, is trained to inject subject-specific features, with a routing mechanism deciding how much each branch contributes:
- Personalized Branch: Learns to embed personalized features driven by subject imagery.
- Non-Personalized Prior Branch: Retains the original functionality of the foundational model, ensuring diversity and richness are preserved.
- Routing Mechanism: Dynamically assigns each pixel (latent feature) a soft weighting between the personalized and prior branches, conditioned on the input context and subject.
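The structure above can be sketched in PyTorch. This is a simplified, illustrative implementation, not the authors' code: the class name, the use of `nn.MultiheadAttention` for both branches, and the two-way softmax router are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MoALayer(nn.Module):
    """Illustrative Mixture-of-Attention layer: a frozen 'prior' attention
    branch plus a trainable 'personalized' branch, blended per pixel by a
    learned soft router. Names and details are hypothetical."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Prior branch: a copy of the base model's attention, kept frozen
        # so the original generative behavior is preserved.
        self.prior_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.prior_attn.parameters():
            p.requires_grad = False
        # Personalized branch: same shape, trained to inject subject features.
        self.personal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Router: per-pixel soft assignment over the two branches.
        self.router = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, pixels, dim) latent features; context: conditioning tokens.
        prior_out, _ = self.prior_attn(x, context, context)
        personal_out, _ = self.personal_attn(x, context, context)
        w = self.router(x)  # (batch, pixels, 2); each pixel's weights sum to 1
        return w[..., 0:1] * prior_out + w[..., 1:2] * personal_out
```

Because the router output is a per-pixel softmax, the layer degenerates to the frozen base model wherever the router assigns full weight to the prior branch, which is what lets MoA keep the base model's behavior on non-subject pixels.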
Key Findings and Contributions
The paper demonstrates that MoA disentangles subject content from contextual elements in generated images, enabling high-fidelity personalized content creation. Key contributions and findings include:
- Prior Preservation: MoA effectively preserves the generative capabilities of the base model, enabling it to produce diverse and contextually rich images without retraining or extensive fine-tuning.
- Enhanced Subject-Context Disentanglement: Through its dual-pathway architecture, MoA provides precise control over the integration of subject and context, producing natural-looking images where personalized subjects interact seamlessly with generated environments.
- Compatibility and Extensibility: The architecture allows for integration with existing diffusion model enhancements like ControlNet for pose adjustments, and can be adapted for applications such as subject morphing and real-image editing.
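Prior preservation of the kind described above is typically encouraged by regularizing the router so that background pixels fall back to the frozen branch. The sketch below is a guess at the flavor of such an objective, assuming a per-pixel subject mask is available; the function name and formulation are hypothetical, not taken from the paper.

```python
import torch

def router_prior_loss(router_weights: torch.Tensor,
                      subject_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer (hypothetical, not the paper's exact loss):
    outside the subject mask, push the router's prior-branch weight
    (index 0) toward 1 so the frozen branch dominates background pixels.

    router_weights: (batch, pixels, 2) softmax outputs of the router.
    subject_mask:   (batch, pixels), 1 where the personalized subject is.
    """
    background = 1.0 - subject_mask          # 1 on non-subject pixels
    prior_w = router_weights[..., 0]         # weight given to the prior branch
    # Penalize any personalized-branch weight spent on background pixels.
    return ((1.0 - prior_w) * background).mean()
```

Minimizing this term drives the router toward the prior branch everywhere except the subject region, which matches the paper's observation that the base model's diversity is retained without retraining.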
Implications and Future Directions
The MoA architecture advances personalized image generation from both theoretical and practical perspectives. Theoretically, it demonstrates a viable route to subject-context disentanglement that does not compromise the base model's intrinsic capabilities. Practically, it opens new possibilities for personalized digital media, with applications ranging from personalized advertising to virtual reality environments.
The research suggests potential future developments in AI such as:
- Expanding to Other Media Types: Extending MoA's approach to video, 3D, and 4D models could revolutionize personalized content generation across a broader spectrum of media.
- Enhanced Multi-subject Interactions: Further refinement could lead to improved handling of scenes with multiple interacting subjects, enhancing the model's utility for complex image generation tasks.
- Integration with Larger Models: Applying MoA to larger, more capable foundational models could yield even more impressive results in terms of image diversity and quality.
Conclusion
MoA represents an important step forward in the personalization of generative models, balancing the retention of pre-existing capabilities with the introduction of novel, subject-specific enhancements. This technology holds significant promise for the future of personalized digital media creation, offering a sophisticated toolset for users seeking to generate highly customized content automatically.