ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning (2404.15449v1)

Published 23 Apr 2024 in cs.CV and cs.AI

Abstract: The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity characteristics of reference portraits accurately, (2) The generated images lack aesthetic appeal especially while enforcing identity retention, and (3) There is a limitation that cannot be compatible with LoRA-based and Adapter-based methods simultaneously. To address these issues, we present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance. To resolve identity features lost, we introduce identity consistency reward fine-tuning to utilize the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. Project Page: https://idaligner.github.io/

Enhancing Identity-Preserving Text-to-Image Generation with ID-Aligner: A General Feedback Learning Framework

Introduction

Recent advancements in diffusion models have significantly impacted text-to-image generation, particularly tasks that must preserve the identity of a reference portrait while following a text prompt. The ID-Aligner framework introduced by Chen et al. addresses critical challenges in identity-preserving text-to-image (ID-T2I) generation with a feedback learning approach. The framework employs identity consistency and aesthetic rewards to fine-tune model outputs, improving identity preservation and image quality across diffusion models including SD1.5 and SDXL.

Key Challenges and ID-Aligner Framework

Existing ID-T2I methods face several challenges: accurately maintaining the identity features of reference portraits, ensuring that generated images retain aesthetic appeal, and the lack of a single approach compatible with both LoRA-based and Adapter-based methods. ID-Aligner confronts these challenges with a dual-strategy framework:

  1. Identity Consistency Reward Fine-Tuning: This component uses feedback from face detection and recognition models to improve identity alignment between generated images and reference portraits. Identity consistency is measured as the cosine similarity between the face embeddings of the generated and reference images; a minimal sketch of this reward follows the list.
  2. Identity Aesthetic Reward Fine-Tuning: To enhance the visual appeal of generated images and overcome the rigidity often exhibited in ID-T2I, this component utilizes human-annotated preference data and automatically constructed feedback on character structure. This reward guides the generation process to produce more aesthetically pleasing images.
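As a rough illustration of the identity consistency reward in item 1, the sketch below assumes the face regions have already been detected and aligned (e.g. by an MTCNN-style detector) and that face_embed is any face recognition backbone producing fixed-length embeddings; all names here are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def identity_consistency_reward(gen_faces: torch.Tensor,
                                ref_faces: torch.Tensor,
                                face_embed) -> torch.Tensor:
    """Cosine similarity between face embeddings of generated and reference faces.

    gen_faces, ref_faces: aligned face crops of shape (B, 3, H, W).
    face_embed: any face recognition backbone mapping images to embeddings;
                it stands in for the FaceNet/ArcFace-style model implied by the paper.
    """
    e_gen = F.normalize(face_embed(gen_faces), dim=-1)  # (B, D) unit vectors
    e_ref = F.normalize(face_embed(ref_faces), dim=-1)
    return (e_gen * e_ref).sum(dim=-1)                  # per-sample cosine similarity

# Toy usage with a stand-in embedding network; a real setup would first run a
# face detector and load a pretrained recognition model instead.
dummy_embed = torch.nn.Sequential(torch.nn.Flatten(),
                                  torch.nn.Linear(3 * 112 * 112, 512))
reward = identity_consistency_reward(torch.randn(4, 3, 112, 112),
                                     torch.randn(4, 3, 112, 112),
                                     dummy_embed)
```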

Implementation and Flexibility

The ID-Aligner framework can be seamlessly integrated into both LoRA and Adapter models, offering a flexible solution that adapts to existing ID-T2I methodologies. Because the feedback fine-tuning framework is agnostic to the underlying adaptation method, the performance improvements are not confined to a single type of diffusion model or approach. The paper further details how the identity and aesthetic rewards are applied as training signals during fine-tuning; a minimal sketch of one such step is given below.
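The following sketch shows one plausible reward fine-tuning step in the spirit of reward feedback learning, assuming diffusers-style unet, vae, and scheduler objects and a caller-supplied reward_fn (identity consistency, an aesthetic score, or a weighted combination). The fixed intermediate timestep and the single-step x0 estimate are simplifying assumptions, not the paper's exact recipe.

```python
import torch

def reward_feedback_step(unet, vae, scheduler, latents, text_emb,
                         reward_fn, optimizer, t_mid: int = 200):
    """One reward fine-tuning step in the spirit of reward feedback learning.

    Assumes diffusers-style `unet`, `vae`, and `scheduler` objects, with only the
    trainable parameters (e.g. LoRA layers or an identity adapter) requiring grad.
    `reward_fn` maps decoded images to a per-sample scalar reward.
    """
    # Noise the clean latents to an intermediate timestep and predict the noise.
    t = torch.full((latents.size(0),), t_mid, device=latents.device, dtype=torch.long)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    eps = unet(noisy, t, encoder_hidden_states=text_emb).sample

    # One-step estimate of x0 under the epsilon-prediction parameterization.
    alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    x0_pred = (noisy - (1.0 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()

    # Decode to pixel space and maximize the reward by minimizing its negative.
    images = vae.decode(x0_pred / vae.config.scaling_factor).sample
    loss = -reward_fn(images).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the reward gradient flows only into whichever parameters are left trainable, the same loop applies unchanged whether those parameters belong to a LoRA or an Adapter, which is consistent with the universality claim above.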

Experimental Validation

Extensive experiments validate the effectiveness of ID-Aligner. These tests cover a range of scenarios and benchmarks, comparing identity preservation and aesthetic quality against existing methods such as IP-Adapter, PhotoMaker, and InstantID. The results show significant improvements in maintaining identity features and in generating visually appealing images.

Theoretical and Practical Implications

The application of feedback learning to ID-T2I not only improves model performance but also offers insights into designing more robust generative frameworks for identity-sensitive applications. Practically, ID-Aligner can benefit areas such as personalized advertising and virtual try-on, where faithful identity preservation is crucial. Theoretically, it extends the understanding of feedback mechanisms in image generation and provides a pathway for future research on fine-tuning strategies for generative models.

Future Research Directions

Looking ahead, extending the ID-Aligner framework to more diverse datasets and scenarios is a natural progression for this research. Additionally, exploring other types of feedback signals and their integration into the feedback learning setup for generative models could offer further gains in both performance and flexibility.

Summary

Overall, the ID-Aligner framework represents a significant advancement in identity-preserving text-to-image generation. By effectively utilizing feedback learning, it addresses core issues faced by current methods and sets the stage for more personalized and accurate image generation technologies in the future.

References (42)
  1. Training Diffusion Models with Reinforcement Learning. arXiv:2305.13301 [cs.LG]
  2. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022).
  3. PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models. ([n. d.]).
  4. Taming Transformers for High-Resolution Image Synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr46437.2021.01268
  5. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
  6. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  7. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]
  8. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (Aug 2023).
  9. Composer: Creative and Controllable Image Synthesis with Composable Conditions. (Feb 2023).
  10. Diederik P Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML]
  11. PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding. (Dec 2023).
  12. Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning. (Jul 2023).
  13. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073 [cs.CV]
  14. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023).
  15. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741 [cs.CV]
  16. OpenAI. 2023. Introducing ChatGPT. arXiv:2303.08774 [cs.CL]
  17. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  18. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
  19. JourneyDB: A Benchmark for Generative Image Understanding. arXiv:2307.00716 [cs.CV]
  20. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. ([n. d.]).
  21. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild. arXiv:2305.11147 [cs.CV]
  22. Learning Transferable Visual Models From Natural Language Supervision. arXiv (Feb 2021).
  23. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125 [cs.CV]
  24. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
  25. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation.
  26. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487 [cs.CV]
  27. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2015.7298682
  28. LAION-5B: An open large-scale dataset for training next generation image-text models. ([n. d.]).
  29. InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning. (Apr 2023).
  30. Denoising Diffusion Implicit Models. arXiv:2010.02502 [cs.LG]
  31. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. arXiv preprint arXiv:2211.12572 (2022).
  32. InstantID: Zero-shot Identity-Preserving Generation in Seconds. arXiv:2401.07519 [cs.CV]
  33. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation. (Feb 2023).
  34. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv:2306.09341 [cs.CV]
  35. Human Preference Score: Better Aligning Text-to-Image Models with Human Preference. arXiv:2303.14420 [cs.CV]
  36. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. ([n. d.]).
  37. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. arXiv:2304.05977 [cs.CV]
  38. FaceStudio: Put Your Face Everywhere in Seconds. (Dec 2023).
  39. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV]
  40. UniFL: Improve Stable Diffusion via Unified Feedback Learning. ([n. d.]).
  41. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Processing Letters (Oct 2016), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342
  42. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543 [cs.CV]
Authors (6)
  1. Weifeng Chen
  2. Jiacheng Zhang
  3. Jie Wu
  4. Hefeng Wu
  5. Xuefeng Xiao
  6. Liang Lin