
Style Aligned Image Generation via Shared Attention (2312.02133v2)

Published 4 Dec 2023 in cs.CV, cs.GR, and cs.LG

Abstract: Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs.

Authors (4)
  1. Amir Hertz (21 papers)
  2. Andrey Voynov (15 papers)
  3. Shlomi Fruchter (8 papers)
  4. Daniel Cohen-Or (172 papers)
Citations (73)

Summary

Overview of StyleAligned: Style Aligned Image Generation via Shared Attention

The paper "Style Aligned Image Generation via Shared Attention" introduces a novel technique named StyleAligned, which addresses the deficiency in existing large-scale Text-to-Image (T2I) generative models to consistently maintain style across a set of generated images. T2I models have become prominent tools in creative fields such as art and graphic design, aiming to produce visually compelling images aligned with textual prompts. Yet, these models often fail to consistently interpret the same stylistic descriptor across multiple images, necessitating time-consuming manual interventions and fine-tuning for precision in style control.

StyleAligned addresses this challenge by introducing minimal 'attention sharing' during the diffusion process. The approach enforces style consistency across varied inputs without optimization or fine-tuning: by sharing attention within the self-attention layers, the method synchronizes the style interpretation across the set of generated images, yielding high-quality synthesis that adheres faithfully both to the text prompts and to the reference style.

Methodology

The core innovation lies in modifying the self-attention layers within diffusion models, a technique inspired by recent advances in attention-based image generation and editing. The method uses shared self-attention, in which the features of each generated image also attend to those of a reference image, so every image in the set draws on the same style source. Importantly, StyleAligned applies adaptive instance normalization (AdaIN) to the queries and keys of the target images, normalizing them against the reference image's statistics to balance the attention flow and keep the style transfer semantically meaningful. A minimal sketch of this step is given below.
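To make the mechanism concrete, the following is a minimal PyTorch sketch of a shared self-attention step with AdaIN-normalized queries and keys. The tensor shapes, the `adain` helper, and the function names are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of shared self-attention with AdaIN-normalized queries and keys.
# Shapes and names here are illustrative assumptions, not the released code.
import torch
import torch.nn.functional as F

def adain(x, ref, eps=1e-5):
    """Shift x's per-channel statistics (over tokens) to match the reference."""
    mu_x, std_x = x.mean(dim=0, keepdim=True), x.std(dim=0, keepdim=True) + eps
    mu_r, std_r = ref.mean(dim=0, keepdim=True), ref.std(dim=0, keepdim=True) + eps
    return (x - mu_x) / std_x * std_r + mu_r

def shared_self_attention(q, k, v, q_ref, k_ref, v_ref):
    """One target image's tokens attend to its own and the reference tokens.

    q, k, v:             (tokens, dim) features of a target image
    q_ref, k_ref, v_ref: (tokens, dim) features of the reference image
    """
    # Normalize target queries/keys toward the reference statistics (AdaIN),
    # which balances the attention flow between target and reference tokens.
    q = adain(q, q_ref)
    k = adain(k, k_ref)
    # Append the reference keys/values so style information can flow in.
    k_all = torch.cat([k, k_ref], dim=0)
    v_all = torch.cat([v, v_ref], dim=0)
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ k_all.T * scale, dim=-1)
    return attn @ v_all
```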

Empirical Evaluation

The paper provides a comprehensive evaluation across diverse styles and text prompts, demonstrating that StyleAligned generates high-fidelity, style-consistent image sets. Quantitative metrics (CLIP similarity for text alignment and DINO embedding similarity for style consistency) show the method outperforming existing personalization techniques such as DreamBooth and StyleDrop, which struggle to disentangle style from content effectively.
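As a rough illustration of how such metrics can be computed, the sketch below scores text alignment with a CLIP score and style consistency as the mean pairwise cosine similarity of DINO ViT embeddings. The checkpoint names and preprocessing are assumptions for illustration; the paper's exact evaluation pipeline may differ.

```python
# Rough sketch of the two evaluation metrics: CLIP score for text alignment and
# mean pairwise cosine similarity of DINO embeddings for style consistency.
# Checkpoint names and preprocessing are assumptions for illustration.
import torch
import torch.nn.functional as F
from torchmetrics.multimodal import CLIPScore

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def text_alignment(images_uint8, prompts):
    """CLIP similarity between each generated image and its text prompt.

    images_uint8: (N, 3, H, W) uint8 tensor; prompts: list of N strings.
    """
    metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    return metric(images_uint8, prompts)

def style_consistency(images_uint8):
    """Mean pairwise cosine similarity of DINO ViT embeddings within a set."""
    dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
    dino.eval()
    x = (images_uint8.float() / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
    with torch.no_grad():
        emb = F.normalize(dino(x), dim=-1)              # (N, D)
    sim = emb @ emb.T                                   # (N, N)
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool)   # ignore self-similarity
    return sim[mask].mean()
```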

Implications and Future Work

Practically, StyleAligned holds potential implications for fields requiring consistent style generation, such as animation and game design, where stylistic coherence across various visual elements is crucial. Theoretically, the method contributes to the ongoing research in attention mechanisms within generative models, offering insights into attention-based style transfer without optimization.

Future developments may explore scalability aspects, such as style alignment across larger image sets or more nuanced style control. Additionally, further research could investigate overcoming limitations in current inversion techniques to better support style alignment from existing reference images, paving the way for training style-conditioned models.

Conclusion

StyleAligned presents a meaningful advancement in the pursuit of style consistency in T2I generative models. By circumventing optimization processes, it offers a scalable, zero-shot solution that enhances the potential for personalized and stylistically coherent image generation. Both the practical applications and theoretical insights of this approach make it a valuable contribution to the generative modeling community.
