Background
Large-scale text-to-image (T2I) diffusion models have advanced rapidly, enabling the creation of imaginative scenes from textual descriptions. Despite this creative potential, portraying the same subject consistently across varying prompts remains a significant challenge. Traditional approaches, such as fine-tuning or image conditioning, typically demand substantial computational resources and often trade consistency against prompt alignment, especially when multiple characters must stay consistent.
Introducing ConsiStory
In the paper under discussion, the authors present "ConsiStory," a training-free method that generates visually consistent subjects across multiple prompts without per-subject optimization or pre-training. By sharing the model's internal feature representations during diffusion-based image generation, ConsiStory achieves cross-frame consistency during the generative process itself rather than imposing it post hoc.
Technical Approach
The technique hinges on subject-driven shared self-attention and correspondence-based feature injection. Unlike prior approaches that rely on personalization or dedicated encoders, it manipulates the diffusion model's internal activations so that the generated images align with one another rather than with an external source image. The process entails:
- Localizing the subject within each of the noisy generated images, using the model's cross-attention maps.
- Letting each generated image attend to the subject patches of the other images, which promotes subject consistency (a sketch of this shared self-attention follows the list).
- Applying self-attention dropout and query-feature blending to maintain layout diversity.
- Injecting features across images, guided by dense correspondences, to refine fine-grained consistency (a sketch of this injection step appears at the end of this section).
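To make the mechanism concrete, below is a minimal PyTorch sketch of the subject-localization and shared self-attention steps. It is not tied to any particular library's API; the tensor shapes, the `mask_dropout` rate, the query-blending weight `nu`, and the threshold in `subject_masks_from_cross_attn` are illustrative assumptions, not the authors' exact implementation or hyperparameters.

```python
import torch


def subject_masks_from_cross_attn(cross_attn, subject_token_idx, threshold=0.3):
    """Derive a boolean subject mask per image from cross-attention maps.

    cross_attn: (B, N, T) attention from N image patches to T text tokens,
    averaged over heads/layers; subject_token_idx indexes the subject token.
    """
    maps = cross_attn[..., subject_token_idx]               # (B, N)
    maps = maps / maps.amax(dim=-1, keepdim=True)           # per-image normalization
    return maps > threshold                                  # (B, N) bool


def shared_self_attention(q, k, v, subject_mask, vanilla_q=None,
                          mask_dropout=0.5, nu=0.9):
    """Each image attends to its own patches plus the subject patches of
    every other image in the batch.

    q, k, v: (B, H, N, D); subject_mask: (B, N) bool.
    """
    B, H, N, D = q.shape

    # Blend queries with those from a non-shared ("vanilla") pass to keep
    # layouts diverse across the batch.
    if vanilla_q is not None:
        q = nu * q + (1.0 - nu) * vanilla_q

    # Randomly thin each subject mask so consistency is encouraged without
    # freezing every image into the same composition.
    keep = torch.rand(subject_mask.shape, device=subject_mask.device) > mask_dropout
    dropped_mask = subject_mask & keep                       # (B, N)

    # Pool keys/values from the whole batch: (H, B*N, D).
    k_all = k.permute(1, 0, 2, 3).reshape(H, B * N, D)
    v_all = v.permute(1, 0, 2, 3).reshape(H, B * N, D)

    # Attention mask: image b sees all of its own patches, but only the
    # (dropout-thinned) subject patches of the other images.
    allow = dropped_mask.reshape(1, B * N).expand(B, B * N).clone()
    for b in range(B):
        allow[b, b * N:(b + 1) * N] = True
    bias = torch.zeros(B, B * N, device=q.device).masked_fill(~allow, float("-inf"))
    bias = bias[:, None, None, :]                            # (B, 1, 1, B*N)

    attn = torch.softmax(q @ k_all.transpose(-1, -2) / D ** 0.5 + bias, dim=-1)
    return attn @ v_all                                       # (B, H, N, D)
```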
Because no optimization or backpropagation is required, these steps run entirely at inference time, making generation roughly twenty times faster than the prior state of the art. ConsiStory also extends to multi-subject scenarios, a setting where other methods often falter.
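The correspondence-based feature injection can be sketched in a similar spirit: each patch of a target image is matched to its most similar patch in an anchor image's feature map, and subject patches are blended toward their matches. In the sketch below, the blending weight `alpha` and the similarity threshold are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F


def inject_features(target_feats, anchor_feats, target_mask, anchor_mask,
                    alpha=0.8, sim_threshold=0.5):
    """Blend target subject patches toward their best-matching anchor patches.

    target_feats, anchor_feats: (N, D) per-patch features from the same layer.
    target_mask, anchor_mask: (N,) boolean subject masks.
    """
    # Cosine similarity between every target patch and every anchor patch.
    t = F.normalize(target_feats, dim=-1)
    a = F.normalize(anchor_feats, dim=-1)
    sim = t @ a.T                                            # (N, N)

    # Only allow matches into the anchor's subject region.
    sim = sim.masked_fill(~anchor_mask[None, :], float("-inf"))
    best_sim, best_idx = sim.max(dim=-1)                     # (N,), (N,)

    # Blend each target subject patch toward its match, skipping weak
    # correspondences so unrelated background patches are left untouched.
    blend = target_mask & (best_sim > sim_threshold)
    out = target_feats.clone()
    out[blend] = alpha * anchor_feats[best_idx[blend]] + (1 - alpha) * target_feats[blend]
    return out
```

The key design point is that alignment is always toward other generated images in the batch rather than toward an external reference image, which is what keeps the method training-free.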
Performance Evaluation
ConsiStory was empirically compared against several baselines and showed superior subject consistency and prompt alignment, without requiring costly training or backpropagation. The authors report a comprehensive series of evaluations:
- Qualitative Assessments: Visual comparisons show that the method preserves subject consistency while closely following the prompts.
- Quantitative Measurements: Using CLIP scores for prompt alignment and DreamSim scores for subject consistency, alongside a user study, ConsiStory demonstrates its efficacy (a minimal CLIP-scoring sketch follows the list).
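For illustration, prompt alignment of this kind is commonly measured as the cosine similarity between CLIP image and text embeddings. The sketch below uses the Hugging Face `transformers` CLIP model; the exact CLIP variant and scoring protocol used in the paper may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())
```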
Practical Implications and Extensions
The authors highlight several practical extensions, such as compatibility with spatial control tools like ControlNet and training-free personalization for common objects. Although the technique performs well in many scenarios, it may struggle with unusual artistic styles and depends on the model's internal features for reliable subject localization.
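As a point of reference, the snippet below shows generic ControlNet usage with the `diffusers` library, i.e., the kind of spatial conditioning the authors report combining with their method. It does not include the ConsiStory consistency mechanism itself, and the model identifiers and file names are illustrative placeholders.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Illustrative model IDs; any Stable Diffusion checkpoint with a matching
# ControlNet can be substituted.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# A pre-computed Canny edge map serves as the layout/pose condition.
edge_map = load_image("pose_edges.png")  # hypothetical local file
image = pipe("a plush dragon reading a book", image=edge_map,
             num_inference_steps=30).images[0]
image.save("controlled_frame.png")
```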
Conclusion
ConsiStory represents a significant advance in the generative-model landscape, offering a fast, training-free alternative to earlier personalized text-to-image methods. With its feature-alignment strategies and emphasis on consistency, it is a practical tool for creators who want to tell cohesive visual stories without heavy computational overhead.