Face2Diffusion for Fast and Editable Face Personalization (2403.05094v1)

Published 8 Mar 2024 in cs.CV

Abstract: Face personalization aims to insert specific faces, taken from images, into pretrained text-to-image diffusion models. However, it is still challenging for previous methods to preserve both the identity similarity and editability due to overfitting to training samples. In this paper, we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents the overfitting problem and improves editability of encoded faces. F2D consists of the following three novel components: 1) Multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information, which improves the diversity of camera poses. 2) Expression guidance disentangles face expressions from identities and improves the controllability of face expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised, which boosts the text-alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset and diverse prompts demonstrate our method greatly improves the trade-off between the identity- and text-fidelity compared to previous state-of-the-art methods.

Summary

  • The paper introduces Face2Diffusion, a novel method that enhances face personalization by disentangling identity features for improved editability.
  • It employs multi-scale identity encoding, expression guidance, and class-guided denoising regularization to balance identity preservation with text fidelity.
  • Experimental results on FaceForensics++ demonstrate that F2D outperforms state-of-the-art methods across multiple key quality metrics.

An Overview of Face2Diffusion: A Novel Approach for High-Editability Face Personalization

Introduction

Face personalization has attracted significant attention in recent years because of its potential applications across domains such as content creation and digital entertainment. The task is to insert specific faces, captured from reference images, into pretrained text-to-image (T2I) diffusion models. Despite considerable progress, preserving identity similarity while keeping the generated images editable remains a formidable challenge. This paper introduces Face2Diffusion (F2D), a method designed to substantially improve editability in face personalization.

Methodology Overview

F2D is built on the strategic removal of identity-irrelevant information from the training pipeline. This prevents the model from overfitting to training samples and thereby improves the editability of encoded faces. F2D comprises three novel components:

Multi-scale Identity Encoder

This component provides well-disentangled identity features while retaining the benefits of multi-scale information. By classifying identities at multiple feature scales, the proposed encoder increases the diversity of camera poses in generated images without compromising identity fidelity.
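
Below is a minimal PyTorch sketch of the multi-scale fusion idea: features pooled from several depths of a face-recognition backbone are projected and fused into a single identity embedding. The per-scale feature dimensions, the fusion MLP, and the 768-dim word-embedding space are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiScaleIdentityEncoder(nn.Module):
    """Fuse per-scale face-recognition features into one identity embedding."""

    def __init__(self, feat_dims=(64, 128, 256, 512), embed_dim=768):
        super().__init__()
        # One linear projection per scale (dims are placeholders).
        self.projs = nn.ModuleList([nn.Linear(d, embed_dim) for d in feat_dims])
        self.fuse = nn.Sequential(
            nn.Linear(embed_dim * len(feat_dims), embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, feats):
        # feats: list of pooled per-scale features, each of shape (B, d_i).
        tokens = [proj(f) for proj, f in zip(self.projs, feats)]
        return self.fuse(torch.cat(tokens, dim=-1))  # (B, embed_dim)

# Dummy usage with random per-scale features.
feats = [torch.randn(2, d) for d in (64, 128, 256, 512)]
print(MultiScaleIdentityEncoder()(feats).shape)  # torch.Size([2, 768])
```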

Expression Guidance

To disentangle facial expressions from identities, F2D employs expression guidance. This mechanism enables controllable manipulation of facial expressions in generated images and improves the model's ability to follow diverse text prompts.
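
One plausible realization, sketched below under assumed shapes: keep the identity embedding expression-free and inject a separate expression code (for example, 3DMM expression coefficients from an off-the-shelf estimator) as an additional conditioning token for the diffusion model's cross-attention. The 64-dim expression code is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

class ExpressionGuidance(nn.Module):
    """Project an expression code and pair it with the identity token."""

    def __init__(self, expr_dim=64, embed_dim=768):
        super().__init__()
        self.expr_proj = nn.Linear(expr_dim, embed_dim)

    def forward(self, identity_token, expr_code):
        # identity_token: (B, embed_dim); expr_code: (B, expr_dim).
        expr_token = self.expr_proj(expr_code)
        # Stack into two conditioning tokens so identity and expression
        # remain separate inputs to the denoiser's cross-attention.
        return torch.stack([identity_token, expr_token], dim=1)  # (B, 2, D)

cond = ExpressionGuidance()(torch.randn(2, 768), torch.randn(2, 64))
print(cond.shape)  # torch.Size([2, 2, 768])
```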

Class-guided Denoising Regularization

The introduction of class-guided denoising regularization encourages the model to learn specific denoising patterns for faces, in alignment with their super-class word, i.e., "a person." This regularization method directly contributes to improving the text-alignment of backgrounds in the generated images.
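
A minimal sketch of such a regularization term follows, assuming a generic epsilon-prediction UNet interface `unet(x_t, t, cond)`; the paper's exact formulation and weighting may differ. The identity-conditioned prediction is trained both to denoise and to stay close to the prediction conditioned on the super-class prompt.

```python
import torch
import torch.nn.functional as F

def class_guided_loss(unet, x_t, t, noise, id_cond, class_cond, lam=0.1):
    """Denoising loss plus a term tying the identity-conditioned
    prediction to the super-class ("a person") prediction."""
    eps_id = unet(x_t, t, id_cond)
    with torch.no_grad():
        eps_class = unet(x_t, t, class_cond)  # fixed target, no gradient
    denoise = F.mse_loss(eps_id, noise)
    regularize = F.mse_loss(eps_id, eps_class)
    return denoise + lam * regularize

# Dummy usage with a toy stand-in that ignores its conditioning.
unet = lambda x, t, c: x * 0.5
x_t = torch.randn(2, 4, 8, 8)
noise = torch.randn_like(x_t)
t = torch.randint(0, 1000, (2,))
print(class_guided_loss(unet, x_t, t, noise, id_cond=None, class_cond=None).item())
```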

Experimental Results

F2D was evaluated on the FaceForensics++ dataset using a diverse set of prompts. The results indicate a substantial improvement in the trade-off between identity fidelity and text fidelity compared with several state-of-the-art methods. Specifically, F2D ranks in the top three on five of six metrics and excels in the harmonic and geometric means of those metrics, underscoring its superior overall quality in face personalization.
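
For reference, the harmonic and geometric means aggregate per-metric scores as follows; the score values below are made up purely for illustration.

```python
import math

scores = [0.72, 0.65, 0.81]  # e.g., identity similarity, text alignment, ...
harmonic = len(scores) / sum(1 / s for s in scores)
geometric = math.prod(scores) ** (1 / len(scores))
print(f"harmonic={harmonic:.3f}, geometric={geometric:.3f}")
```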

Theoretical and Practical Implications

The proposed Face2Diffusion method opens new avenues for research in face personalization by addressing the critical challenge of editability. It demonstrates the significance of disentangling identity-relevant and identity-irrelevant information in the training process. Practically, F2D holds potential for applications in content creation, where personalized and editable face generation is of paramount importance.

Future Directions

The exploration of F2D lays the groundwork for future developments in AI-driven face personalization. Potential areas for further research include the refinement of identity encoders for enhanced fidelity, the exploration of expression guidance mechanisms for a broader range of emotions, and the advancement of denoising regularization techniques for more contextually relevant background generation.
