IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Published 20 Mar 2024 in cs.CV | (2403.13535v2)

Abstract: Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images.



Summary

The paper introduces IDAdapter, which eliminates test-time fine-tuning: it is trained on mixed facial features drawn from multiple reference images of one identity, and at inference it needs only a single face image.
It integrates a lightweight transformer and adapter layer within the Stable Diffusion architecture to preserve identity while enabling style variation.
Empirical evaluations on CelebA-HQ and VGGFace2 confirm superior identity preservation and diverse facial expression and pose generation.

An Academic Overview of IDAdapter: Tuning-Free Personalization of Text-to-Image Models

The paper "IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models" targets four limitations of existing text-to-image personalization methods: test-time fine-tuning, the need for multiple input images, weak identity preservation, and limited diversity in the outputs. The authors propose IDAdapter, which removes the need for test-time fine-tuning and generates diverse, high-fidelity images from a single face image. The work sits within recent advances in text-to-image (T2I) synthesis and emphasizes efficiency and ease of use.

Core Contributions

Elimination of Test-Time Fine-Tuning: IDAdapter removes the usual per-subject fine-tuning step at inference. The model is trained once to accept a single reference image, so personalizing to a new identity requires no additional optimization, which shortens the process and avoids the compute cost of fine-tuning.
Introduction of Mixed Facial Features (MFF): At the heart of IDAdapter is the Mixed Facial Features module. This module combines visual patch features extracted from reference images with identity features produced by a face recognition encoder (ArcFace). During training, MFF merges these features across multiple reference images of the same identity to enrich identity-related detail while allowing variation in expression, style, and angle, which promotes both identity fidelity and diversity in the output images.
Integration with Stable Diffusion Architecture: A lightweight transformer injects the MFF features into the Stable Diffusion model, so generated images keep the subject's identity across a range of prompts. An adapter layer in the UNet lets the model use these visual features while leaving most of the pretrained weights unchanged.
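The feature mixing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to average identity embeddings, and the assumption that patch tokens and identity embeddings share one width are all hypothetical, and the learned transformer and adapter layers are omitted. It only shows how tokens from several reference images of one identity might be pooled into a single sequence.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mix_facial_features(
    patch_tokens_per_image: List[List[Vector]],
    identity_embeddings: List[Vector],
) -> List[Vector]:
    """Build one token sequence from several reference images of one identity.

    Assumption (not from the paper): patch tokens and identity embeddings
    have already been projected to the same width.
    """
    if len(patch_tokens_per_image) != len(identity_embeddings):
        raise ValueError("need one identity embedding per reference image")
    # Pool the patch tokens of every reference image into a single sequence.
    mixed = [
        token
        for image_tokens in patch_tokens_per_image
        for token in image_tokens
    ]
    # Append one identity token: the mean of the per-image identity embeddings.
    mixed.append(mean_vector(identity_embeddings))
    return mixed


# Toy input: two reference images, two patch tokens each, feature width 3.
patches = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
identities = [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]
tokens = mix_facial_features(patches, identities)
print(len(tokens), tokens[-1])
```

In a real model the resulting sequence would be passed through learned layers before conditioning the diffusion model; here it is left as a plain list so the data flow stays visible.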

Empirical Evaluation

The effectiveness of the IDAdapter framework is substantiated through extensive empirical evaluations. The model is trained and assessed on both Multi-Modal CelebA-HQ and VGGFace2 datasets, demonstrating its ability to generate images that score highly on both identity preservation and diversity metrics. Specifically:

Identity Preservation: The paper reports high cosine similarity between generated images and their reference images, indicating that the model preserves identity well.
Diversity Measurement: The model's outputs vary substantially in facial angle and expression across generated images. This is measured quantitatively with the Pose-Div and Expr-Div metrics, on which IDAdapter outperforms the baseline methods.
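The two kinds of measurement above can be sketched as follows. The cosine similarity is the standard formula, but applying it to face embeddings is an assumption here, since this overview does not say which features are compared. The diversity function is only a hypothetical proxy (mean pairwise distance between per-image attributes such as yaw and pitch); the overview does not give the actual Pose-Div or Expr-Div definitions, and all numbers are toy values.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def identity_score(
    reference: Sequence[float], generated: Sequence[Sequence[float]]
) -> float:
    """Mean cosine similarity between one reference embedding and each
    generated image's embedding (embeddings are assumed inputs)."""
    return sum(cosine_similarity(reference, g) for g in generated) / len(generated)


def mean_pairwise_distance(attributes: Sequence[Sequence[float]]) -> float:
    """Hypothetical diversity proxy: mean Euclidean distance over all pairs.

    Not the paper's Pose-Div or Expr-Div definition.
    """
    pairs = list(combinations(attributes, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Toy embeddings: one generated image matches the reference, one is orthogonal.
reference_embedding = [1.0, 0.0]
generated_embeddings = [[1.0, 0.0], [0.0, 1.0]]
id_score = identity_score(reference_embedding, generated_embeddings)

# Toy (yaw, pitch) angles in degrees for three generated images.
poses = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
pose_spread = mean_pairwise_distance(poses)
print(id_score, pose_spread)
```

A higher identity score with a larger spread would correspond to the combination the overview describes: outputs that stay close to the reference identity while differing from one another in pose or expression.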

Speculative Implications and Future Work

IDAdapter paves the way for more computationally efficient and accessible T2I personalization models. This could have significant implications for applications ranging from personalized avatar creation to entertainment and art. Future work could extend the IDAdapter framework to domains and subjects beyond faces, and could integrate multi-modal datasets to further improve the model's understanding of textual and visual context.

In conclusion, IDAdapter represents a significant advance in T2I synthesis, balancing identity preservation against the creative freedom of style variation. Its tuning-free personalization promises to make high-quality custom image synthesis more accessible and efficient, which may encourage broader adoption in practical applications.
