ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning (2404.15449v1)

Published 23 Apr 2024 in cs.CV and cs.AI

Abstract: The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity characteristics of reference portraits accurately, (2) The generated images lack aesthetic appeal especially while enforcing identity retention, and (3) There is a limitation that cannot be compatible with LoRA-based and Adapter-based methods simultaneously. To address these issues, we present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance. To resolve identity features lost, we introduce identity consistency reward fine-tuning to utilize the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. Project Page: https://idaligner.github.io/

Enhancing Identity-Preserving Text-to-Image Generation with ID-Aligner: A General Feedback Learning Framework

Introduction

Recent advancements in diffusion models have significantly impacted text-to-image generation, particularly tasks that must preserve the identity of a reference portrait while following a text prompt. The ID-Aligner framework introduced by Chen et al. addresses critical challenges in identity-preserving text-to-image (ID-T2I) generation with a feedback learning approach. The framework employs identity consistency and aesthetic rewards to fine-tune model outputs, improving identity preservation and image quality across diffusion models including SD1.5 and SDXL.

Key Challenges and ID-Aligner Framework

Existing ID-T2I methods face several challenges: accurately maintaining the identity features of reference portraits, ensuring that generated images retain aesthetic appeal, and the lack of a single approach compatible with both LoRA-based and Adapter-based methods. ID-Aligner confronts these challenges with a dual-strategy framework:

  1. Identity Consistency Reward Fine-Tuning: This component uses feedback from face detection and recognition models to improve identity alignment between generated images and reference portraits. Identity consistency is measured as the cosine similarity between the face embeddings of the generated and reference images; a minimal sketch of this reward follows the list.
  2. Identity Aesthetic Reward Fine-Tuning: To enhance the visual appeal of generated images and overcome the rigidity often exhibited in ID-T2I, this component utilizes human-annotated preference data and automatically constructed feedback on character structure. This reward guides the generation process to produce more aesthetically pleasing images.
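As a rough illustration of the identity consistency reward in item 1, the sketch below assumes the face regions have already been detected and aligned (e.g. by an MTCNN-style detector) and that face_embed is any face recognition backbone producing fixed-length embeddings; all names here are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def identity_consistency_reward(gen_faces: torch.Tensor,
                                ref_faces: torch.Tensor,
                                face_embed) -> torch.Tensor:
    """Cosine similarity between face embeddings of generated and reference faces.

    gen_faces, ref_faces: aligned face crops of shape (B, 3, H, W).
    face_embed: any face recognition backbone mapping images to embeddings;
                it stands in for the FaceNet/ArcFace-style model implied by the paper.
    """
    e_gen = F.normalize(face_embed(gen_faces), dim=-1)  # (B, D) unit vectors
    e_ref = F.normalize(face_embed(ref_faces), dim=-1)
    return (e_gen * e_ref).sum(dim=-1)                  # per-sample cosine similarity

# Toy usage with a stand-in embedding network; a real setup would first run a
# face detector and load a pretrained recognition model instead.
dummy_embed = torch.nn.Sequential(torch.nn.Flatten(),
                                  torch.nn.Linear(3 * 112 * 112, 512))
reward = identity_consistency_reward(torch.randn(4, 3, 112, 112),
                                     torch.randn(4, 3, 112, 112),
                                     dummy_embed)
```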

Implementation and Flexibility

The ID-Aligner framework can be seamlessly integrated into both LoRA and Adapter models, offering a flexible solution that adapts to existing ID-T2I methodologies. Because the feedback fine-tuning framework is agnostic to the underlying adaptation method, the performance improvements are not confined to a single type of diffusion model or approach. The paper further details how the identity and aesthetic rewards are applied as training signals during fine-tuning; a minimal sketch of one such step is given below.
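The following sketch shows one plausible reward fine-tuning step in the spirit of reward feedback learning, assuming diffusers-style unet, vae, and scheduler objects and a caller-supplied reward_fn (identity consistency, an aesthetic score, or a weighted combination). The fixed intermediate timestep and the single-step x0 estimate are simplifying assumptions, not the paper's exact recipe.

```python
import torch

def reward_feedback_step(unet, vae, scheduler, latents, text_emb,
                         reward_fn, optimizer, t_mid: int = 200):
    """One reward fine-tuning step in the spirit of reward feedback learning.

    Assumes diffusers-style `unet`, `vae`, and `scheduler` objects, with only the
    trainable parameters (e.g. LoRA layers or an identity adapter) requiring grad.
    `reward_fn` maps decoded images to a per-sample scalar reward.
    """
    # Noise the clean latents to an intermediate timestep and predict the noise.
    t = torch.full((latents.size(0),), t_mid, device=latents.device, dtype=torch.long)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    eps = unet(noisy, t, encoder_hidden_states=text_emb).sample

    # One-step estimate of x0 under the epsilon-prediction parameterization.
    alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    x0_pred = (noisy - (1.0 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()

    # Decode to pixel space and maximize the reward by minimizing its negative.
    images = vae.decode(x0_pred / vae.config.scaling_factor).sample
    loss = -reward_fn(images).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the reward gradient flows only into whichever parameters are left trainable, the same loop applies unchanged whether those parameters belong to a LoRA or an Adapter, which is consistent with the universality claim above.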

Experimental Validation

Extensive experiments validate the effectiveness of ID-Aligner. These tests cover a range of scenarios and benchmarks, comparing identity preservation and aesthetic quality against existing methods such as IP-Adapter, PhotoMaker, and InstantID. The results show significant improvements in maintaining identity features and in generating visually appealing images.

Theoretical and Practical Implications

The application of feedback learning to ID-T2I not only improves model performance but also offers insights into designing more robust generative frameworks for identity-sensitive applications. Practically, ID-Aligner can benefit areas such as personalized advertising and virtual try-on, where faithful identity preservation is crucial. Theoretically, it extends the understanding of feedback mechanisms in image generation and provides a pathway for future research on fine-tuning strategies for generative models.

Future Research Directions

Looking ahead, extending the ID-Aligner framework to more diverse datasets and scenarios is a natural progression for this research. Additionally, exploring other types of feedback signals and their integration into the feedback learning setup for generative models could offer further gains in both performance and flexibility.

Summary

Overall, the ID-Aligner framework represents a significant advancement in identity-preserving text-to-image generation. By effectively utilizing feedback learning, it addresses core issues faced by current methods and sets the stage for more personalized and accurate image generation technologies in the future.

References (42)
  1. Training Diffusion Models with Reinforcement Learning. arXiv:2305.13301 [cs.LG]
  2. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022).
  3. PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models. ([n. d.]).
  4. Taming Transformers for High-Resolution Image Synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr46437.2021.01268
  5. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
  6. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  7. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]
  8. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (Aug 2023).
  9. Composer: Creative and Controllable Image Synthesis with Composable Conditions. (Feb 2023).
  10. Diederik P Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML]
  11. PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding. (Dec 2023).
  12. Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning. (Jul 2023).
  13. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073 [cs.CV]
  14. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023).
  15. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741 [cs.CV]
  16. OpenAI. 2023. Introducing ChatGPT. arXiv:2303.08774 [cs.CL]
  17. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  18. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
  19. JourneyDB: A Benchmark for Generative Image Understanding. arXiv:2307.00716 [cs.CV]
  20. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. ([n. d.]).
  21. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild. arXiv:2305.11147 [cs.CV]
  22. Learning Transferable Visual Models From Natural Language Supervision. arXiv (Feb 2021).
  23. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125 [cs.CV]
  24. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
  25. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation.
  26. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487 [cs.CV]
  27. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2015.7298682
  28. LAION-5B: An open large-scale dataset for training next generation image-text models. ([n. d.]).
  29. InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning. (Apr 2023).
  30. Denoising Diffusion Implicit Models. arXiv:2010.02502 [cs.LG]
  31. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. arXiv preprint arXiv:2211.12572 (2022).
  32. InstantID: Zero-shot Identity-Preserving Generation in Seconds. arXiv:2401.07519 [cs.CV]
  33. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation. (Feb 2023).
  34. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv:2306.09341 [cs.CV]
  35. Human Preference Score: Better Aligning Text-to-Image Models with Human Preference. arXiv:2303.14420 [cs.CV]
  36. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. ([n. d.]).
  37. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. arXiv:2304.05977 [cs.CV]
  38. FaceStudio: Put Your Face Everywhere in Seconds. (Dec 2023).
  39. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV]
  40. UniFL: Improve Stable Diffusion via Unified Feedback Learning. ([n. d.]).
  41. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Processing Letters (Oct 2016), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342
  42. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543 [cs.CV]
Authors (6)
  1. Weifeng Chen
  2. Jiacheng Zhang
  3. Jie Wu
  4. Hefeng Wu
  5. Xuefeng Xiao
  6. Liang Lin