An Analysis of "Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models"
The paper "Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models" presents MuDI, a novel framework designed to address the prevalent issue of identity mixing in the personalization of text-to-image models for multiple subjects. Despite the notable advancements in single-subject personalization within text-to-image diffusion models, the simultaneous handling of multiple subjects presents a complex challenge, often leading to the undesirable mixing of identities. This paper aims to circumvent these difficulties by employing a mechanism that decouples the identities of multiple subjects using segmentation, specifically leveraging the Segment Anything Model (SAM).
Core Contributions and Methodology
The primary innovation is the Seg-Mix data augmentation method, which provides a robust mechanism for identity decoupling during fine-tuning of pre-trained text-to-image models. Seg-Mix takes the subjects segmented by SAM and composes them into training images with minimal background, placing them in randomized or controlled configurations. This reduces identity-irrelevant information and avoids the stitching artifacts typical of earlier strategies such as Cut-Mix.
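As a rough illustration of this kind of augmentation, the sketch below pastes pre-segmented RGBA subject crops onto a plain canvas at random positions; the canvas size, background value, and placement scheme are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def seg_mix(subject_crops, canvas_size=(1024, 1024), bg_value=255, seed=None):
    """Compose pre-segmented subjects onto a plain canvas at random positions.

    `subject_crops` is a list of (h, w, 4) uint8 arrays whose alpha channel
    comes from the SAM mask.  Crops are assumed smaller than the canvas;
    random resizing and overlap handling are omitted to keep the sketch short.
    """
    H, W = canvas_size
    canvas = np.full((H, W, 3), bg_value, dtype=np.uint8)
    rng = np.random.default_rng(seed)
    for crop in subject_crops:
        h, w = crop.shape[:2]
        y = int(rng.integers(0, max(1, H - h)))
        x = int(rng.integers(0, max(1, W - w)))
        alpha = crop[..., 3:4].astype(np.float32) / 255.0
        region = canvas[y:y + h, x:x + w].astype(np.float32)
        blended = alpha * crop[..., :3].astype(np.float32) + (1.0 - alpha) * region
        canvas[y:y + h, x:x + w] = blended.astype(np.uint8)
    return canvas
```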
Moreover, the paper introduces a novel inference-time initialization strategy based on mean-shifted noise composed from the segmented subjects. This technique provides structured starting noise that encodes coarse layout information about the subjects, which helps preserve multiple distinct identities during image generation.
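Conceptually, this amounts to adding a small multiple of a latent-space composition of the segmented subjects to the usual Gaussian starting noise. The sketch below assumes the subject crops have already been encoded by the VAE into `subject_latents` and uses an illustrative scale `gamma`; the exact scaling and placement in the paper may differ.

```python
import torch

def mean_shifted_init(subject_latents, positions, latent_shape, gamma=0.1, generator=None):
    """Build an initial diffusion latent: standard Gaussian noise shifted by a
    scaled layout of subject latents placed at coarse offsets.

    `subject_latents`: list of (1, C, h, w) tensors from the VAE encoder.
    `positions`: list of (row, col) offsets in latent coordinates.
    `gamma`: illustrative shift scale, not the paper's value.
    """
    noise = torch.randn(latent_shape, generator=generator)
    layout = torch.zeros(latent_shape)
    for lat, (row, col) in zip(subject_latents, positions):
        _, _, h, w = lat.shape
        layout[:, :, row:row + h, col:col + w] = lat
    return noise + gamma * layout
```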
Experimental Validation
The MuDI framework's efficacy is examined on a newly constructed dataset of similar subject pairs prone to identity mixing, spanning categories such as animals and objects. Both qualitative and quantitative evaluation substantiates the framework's ability to generate personalized images without identity mixing. Metrics including D&C (Detect-and-Compare), a new measure of multi-subject fidelity, show that MuDI outperforms existing methods such as DreamBooth and Cut-Mix. Additionally, human evaluations reveal a significant preference for MuDI over these baselines in preserving subject identity and maintaining fidelity to the text prompts.
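A simplified sketch of a Detect-and-Compare style score is given below; the `detect_subjects` and `embed` callables (e.g., an open-vocabulary detector and a DINO feature extractor) are hypothetical, and the scoring rule is an illustrative assumption rather than the paper's exact protocol.

```python
import torch.nn.functional as F

def detect_and_compare(generated_image, subject_names, reference_embs, detect_subjects, embed):
    """Score multi-subject fidelity: each detected subject should resemble its
    own reference more than any other subject's reference.

    `detect_subjects(image, name)` -> list of crops (hypothetical detector call).
    `embed(crop)` -> feature tensor (e.g., DINO features; hypothetical).
    `reference_embs`: dict mapping subject name -> reference feature tensor.
    """
    scores = []
    for name in subject_names:
        crops = detect_subjects(generated_image, name)
        if not crops:
            scores.append(0.0)                    # subject missing from the image
            continue
        gen = embed(crops[0])
        sims = {n: F.cosine_similarity(gen, ref, dim=-1).item()
                for n, ref in reference_embs.items()}
        best_other = max(v for n, v in sims.items() if n != name)
        scores.append(sims[name] - best_other)    # positive when identity is preserved
    return sum(scores) / len(scores)
```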
Practical and Theoretical Implications
Practically, MuDI's use of automated segmentation for identity decoupling shows promise for applications that demand precise multi-subject arrangements, such as content creation, digital art, and potentially virtual reality. Theoretically, the paper opens new avenues for exploring model personalization through structured augmentations and initialization methods, which could be further developed or adapted for more complex scene generation and interaction tasks.
Speculation on Future Developments
The authors hint at extending MuDI's framework beyond identity separation to modeling interactions among multiple subjects in more complex environments. Given advances in LLMs and the evolving capabilities of diffusion models, future research could focus on dynamic scene generation in which subjects not only exhibit distinct identities but also engage in detailed interactions described by a narrative text prompt.
Overall, the paper effectively addresses identity mixing through its innovative use of segmentation and structured noise initialization, demonstrating substantial advances in multi-subject personalization for text-to-image models. As such, it constitutes a commendable stride toward more nuanced and scalable personalization in generative AI.