Face2Diffusion for Fast and Editable Face Personalization (2403.05094v1)
Abstract: Face personalization aims to insert specific faces, taken from images, into pretrained text-to-image diffusion models. However, previous methods still struggle to preserve both identity similarity and editability because they overfit to training samples. In this paper, we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents overfitting and improves the editability of encoded faces. F2D consists of three novel components: 1) A multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information, which improves the diversity of camera poses. 2) Expression guidance disentangles facial expressions from identities and improves the controllability of facial expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised, which boosts the text alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset with diverse prompts demonstrate that our method greatly improves the trade-off between identity fidelity and text fidelity compared to previous state-of-the-art methods.
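The trade-off between identity fidelity and text fidelity mentioned in the abstract can be made concrete with a small sketch. The abstract does not say how the two quantities are scored, so the code below assumes that each is a cosine similarity between embedding vectors (for example, face-recognition embeddings for identity and image-text embeddings for prompt alignment) and that the two scores are combined with a harmonic mean. Both choices, and all function names, are illustrative assumptions rather than the paper's stated protocol.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def tradeoff_score(identity_score: float, text_score: float) -> float:
    """Harmonic mean of the two scores (an assumed way to combine them).

    The harmonic mean is low whenever either score is low, so a method
    that copies the input face but ignores the prompt is penalized.
    """
    if identity_score <= 0.0 or text_score <= 0.0:
        return 0.0
    return 2.0 * identity_score * text_score / (identity_score + text_score)


if __name__ == "__main__":
    # Toy embeddings standing in for real model outputs (hypothetical values).
    input_face = [0.6, 0.8, 0.0]
    generated_face = [0.8, 0.6, 0.0]
    identity = cosine_similarity(input_face, generated_face)
    text = 0.30  # placeholder prompt-alignment score
    print(f"identity={identity:.2f} text={text:.2f} "
          f"combined={tradeoff_score(identity, text):.2f}")
```

Under these assumptions, an overfitted model that scores high on identity but low on text alignment receives a low combined score, which is the failure mode the abstract attributes to previous methods.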