StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
The paper "StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation" presents an innovative approach to enhancing personalization in text-to-image generation models. Traditional tuning-free personalized image generation methods have predominantly focused on maintaining facial consistency, often at the expense of other essential aspects such as clothing and hairstyles. StoryMaker addresses this limitation by preserving a holistic consistency across facial identities, clothing, hairstyles, and body structures, thereby enabling more coherent narrative generation.
Methodology
StoryMaker utilizes several advanced techniques to achieve its goals:
- Reference Information Extraction: The method uses ArcFace to extract facial identity embeddings and the CLIP vision encoder to capture details of clothing, hairstyles, and bodies from reference images. This dual extraction ensures that all relevant character features are preserved.
- Positional-aware Perceiver Resampler (PPR): The PPR module plays a critical role in refining the extracted information. It combines facial embeddings and character embeddings while introducing positional embeddings and a learnable background embedding to distinguish between different characters and the background.
- Decoupled Cross-attention: The extracted and refined embeddings are injected into the text-to-image model using a decoupled cross-attention mechanism, which helps integrate these details without extensive re-training.
- Pose Decoupling: To enhance pose diversity, pose is decoupled from the reference image during training by conditioning on ControlNet pose inputs. As a result, generated characters can adopt new poses appropriate to the narrative context provided by text prompts, rather than copying the pose of the reference.
- Training with LoRA: To further refine the details and enhance fidelity, the model employs Low-Rank Adaptation (LoRA) in training the cross-attention modules, ensuring high-quality outputs even in complex scenes.
- Loss Constraints on Cross-attention Maps: An innovative use of Mean Squared Error (MSE) loss with segmentation masks helps to regulate the cross-attention regions, thereby preventing the intermingling of character and background features.
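The decoupled cross-attention and attention-map loss described above can be sketched roughly as follows. This is an illustrative PyTorch mock-up under assumed dimensions, not the authors' implementation: the image branch gets its own key/value projections (in the style of IP-Adapter), and an MSE term encourages the character tokens' attention to stay inside the character's segmentation mask.

```python
# Hypothetical sketch of decoupled cross-attention plus a masked
# attention-map loss. Module names, dimensions, and the exact loss
# formulation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, ctx_dim: int = 768):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Text branch (frozen in practice) and a separate image branch
        # with its own key/value projections.
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_img = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, x, txt_ctx, img_ctx):
        # x: (B, HW, dim) latent tokens; txt_ctx/img_ctx: (B, N, ctx_dim).
        q = self.to_q(x)
        # Standard text cross-attention.
        attn_txt = torch.softmax(
            q @ self.to_k_txt(txt_ctx).transpose(-2, -1) * self.scale, dim=-1)
        out = attn_txt @ self.to_v_txt(txt_ctx)
        # Character-image cross-attention, added to the text result.
        attn_img = torch.softmax(
            q @ self.to_k_img(img_ctx).transpose(-2, -1) * self.scale, dim=-1)
        out = out + attn_img @ self.to_v_img(img_ctx)
        return out, attn_img

def attn_map_loss(attn_img, char_mask):
    # attn_img: (B, HW, N_img) attention of latent pixels over character
    # tokens; char_mask: (B, HW) character segmentation mask downsampled
    # to the attention resolution. Pull the average character attention
    # toward the mask so character features stay out of the background.
    return F.mse_loss(attn_img.mean(dim=-1), char_mask)
```

The design point is that the pretrained text pathway stays untouched; only the added image-branch projections (and, per the paper, LoRA on the cross-attention modules) need training.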
Experimental Results
The experiments demonstrate the efficacy of StoryMaker in preserving holistic consistency across character attributes. It outperforms existing models such as InstantID, IP-Adapter-FaceID, MM-Diff, and PhotoMaker-V2 on CLIP image similarity (CLIP-I). Although text adherence (measured by CLIP-T) is slightly lower, overall fidelity and character consistency are markedly superior.
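For intuition, a CLIP-I style score is simply the mean cosine similarity between CLIP image embeddings of the reference and the generated images. A minimal sketch over precomputed embeddings (obtaining the embeddings from an actual CLIP vision encoder is assumed and omitted here):

```python
# Illustrative CLIP-I computation from precomputed embeddings.
# The embedding source (a CLIP vision encoder) is assumed, not shown.
import numpy as np

def clip_i(ref_emb: np.ndarray, gen_embs: np.ndarray) -> float:
    """Mean cosine similarity between one reference embedding (D,)
    and N generated-image embeddings (N, D)."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    return float((gen @ ref).mean())
```

CLIP-T is computed analogously, but between the text prompt's CLIP embedding and each generated image's embedding, which is why gains in character consistency (CLIP-I) can trade off against prompt adherence (CLIP-T).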
Applications
StoryMaker's versatility is highlighted through several applications:
- Single and Multiple-Character Image Generation: The model successfully generates both single and two-character scenes, maintaining consistency in faces, clothing, and poses.
- Story Creation: Enables the generation of sequential images that form a coherent narrative based on text prompts.
- Clothing and Pose Variations: The method supports downstream applications such as clothing swapping and character pose variation, showcasing its adaptability.
Implications
The innovations introduced by StoryMaker have significant implications for both practical applications and future research. Practically, it holds potential for enhancing digital storytelling, comic creation, and other narrative-based visual content. Theoretically, it sets a new benchmark for holistic character consistency in text-to-image generation models, paving the way for future advancements in this domain.
Conclusion and Future Work
While StoryMaker has achieved impressive results, there are areas for improvement. The model occasionally struggles with pose anomalies in characters, and the generation of images involving three or more characters remains challenging. Further research could focus on enhancing pose accuracy and extending the model's capabilities to more complex scenes involving multiple characters.
The development of StoryMaker marks a significant step forward in text-to-image generation by emphasizing holistic consistency in character features. This approach not only enhances the narrative potential of generated images but also unlocks new possibilities for applications where character individuality and coherence are paramount. Future advancements building upon this work could further refine its capabilities and broaden its application scope.