- The paper presents a novel method that integrates a face encoder with editable identity priors to preserve and transfer a subject's identity from a single image.
- It employs a masked two-phase diffusion loss that stabilizes pixel-level details while ensuring adaptable customization across varied contexts.
- Experimental results demonstrate superior identity consistency and scalability for applications like personalized portraits, virtual try-ons, videos, and 3D models.
Overview
StableIdentity is a novel approach for inserting a target subject's identity, taken from a single image, into diverse contexts guided by textual descriptions. The method not only preserves identity attributes with remarkable consistency but also offers flexible editability across applications such as personalized portraits, virtual try-ons, and art & design.
Methodology
At the core of StableIdentity lies a face encoder integrated with an identity prior. The face encoder is pretrained to recognize facial features effectively, and this capability is used to encode the identity of an input face image. The method further leverages an editable prior constructed from celebrity names: because such names appear abundantly in the training data of text-to-image models, their embeddings carry a rich prior that keeps the learned identity consistent and editable across different contexts. The authors integrate this identity prior and editability prior into a single model, addressing earlier limitations in identity preservation and flexibility of customization.
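The description above suggests mapping face-recognition features into the word-embedding space and aligning them with statistics drawn from celebrity-name embeddings. The PyTorch sketch below illustrates one plausible realization; the module name, dimensions, and the AdaIN-style alignment step are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class IdentityMapper(nn.Module):
    """Minimal sketch: project a face-recognition embedding into word-embedding
    space, then align it with celebrity-name embedding statistics so that the
    learned identity inherits an editable prior (names here are hypothetical)."""

    def __init__(self, face_dim=512, word_dim=768, num_tokens=2):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(
            nn.Linear(face_dim, 1024), nn.GELU(),
            nn.Linear(1024, word_dim * num_tokens),
        )

    def forward(self, face_feat, celeb_mean, celeb_std):
        # face_feat: (B, face_dim) from a pretrained face encoder (e.g. ArcFace)
        # celeb_mean / celeb_std: (word_dim,) statistics of celebrity-name embeddings
        tokens = self.mlp(face_feat).view(-1, self.num_tokens, celeb_mean.shape[-1])
        # AdaIN-style alignment: normalize each token, then rescale it into the
        # distribution of celebrity-name embeddings (the "editable prior").
        mu = tokens.mean(dim=-1, keepdim=True)
        sigma = tokens.std(dim=-1, keepdim=True) + 1e-6
        tokens = (tokens - mu) / sigma
        return tokens * celeb_std + celeb_mean
```

The resulting token embeddings can then stand in for the subject's identity wherever a placeholder word would appear in a text prompt.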
The approach is further augmented by a masked two-phase diffusion loss. This loss is designed to optimize the generative model's ability to reconstruct and stabilize the identity across diverse generated contexts. It keeps pixel-level facial details precise while ensuring that diversity in generation does not compromise the underlying identity features.
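As a rough illustration of such an objective, the sketch below combines a standard noise-prediction loss with a stochastically applied, face-masked reconstruction of the predicted clean image. The function signature, the schedule handling, and the phase-switching probability are assumptions made for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_two_phase_loss(unet, x0, cond, face_mask, alphas_cumprod, p_mask=0.5):
    """Sketch of a masked two-phase diffusion objective (names are hypothetical):
    phase 1 applies the usual noise-prediction loss over the full image; phase 2,
    triggered stochastically, reconstructs the predicted clean image and supervises
    only the face-mask region to stabilize pixel-level identity details."""
    b = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

    # Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    noise_pred = unet(x_t, t, cond)

    # Phase 1: standard epsilon-prediction loss over the whole image.
    loss = F.mse_loss(noise_pred, noise)

    # Phase 2: recover x0 from the prediction and penalize it only inside the face mask.
    if torch.rand(()) < p_mask:
        x0_pred = (x_t - (1 - a_bar).sqrt() * noise_pred) / a_bar.sqrt()
        loss = loss + F.mse_loss(x0_pred * face_mask, x0 * face_mask)

    return loss
```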
Experimental Results
Extensive experiments show superior performance over previous customization methods, with a marked ability to maintain identity consistency. The method combines readily with existing image-level modules and generalizes to inject the identity learned from a single image into video or 3D generation without further fine-tuning.
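Because the learned identity lives in the text-embedding space, reusing it in other text-conditioned pipelines can amount to substituting the learned vectors for placeholder tokens in the prompt embedding, as the hypothetical helper below sketches; the tensor layout and function name are assumptions rather than part of the released code.

```python
import torch

def inject_identity(prompt_embeds, placeholder_positions, identity_tokens):
    """Sketch: splice the learned identity word embeddings into a prompt embedding
    at the placeholder positions, so any text-conditioned pipeline (image, video,
    or 3D) can consume them without further fine-tuning."""
    # prompt_embeds: (B, seq_len, word_dim), identity_tokens: (B, num_tokens, word_dim)
    out = prompt_embeds.clone()
    for b, pos in enumerate(placeholder_positions):
        out[b, pos:pos + identity_tokens.shape[1]] = identity_tokens[b]
    return out
```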
Implications and Future Directions
The significance of such a framework is manifold. The capability to combine identity priors and editability into a unified architecture is a remarkable stride in the field of human-centric generation. It is not just the preservation of identity or the fidelity of the output that is laudable but the efficiency with which these results are achieved. StableIdentity's ability to extend identity-driven customization to video and 3D models without the need for elaborate fine-tuning demonstrates a potential paradigm shift in how personalized content can be generated.
The implications of this technology extend to various domains, from entertainment and personal digital content creation to potential applications in virtual reality and AI-driven avatar creation. Moving forward, this approach could transform the nexus between personalized digital identity and a multitude of virtual platforms, making identity a flexible, yet stable construct, adaptable to contexts limited only by textual creativity.