Improving face generation quality and prompt following with synthetic captions (2405.10864v1)
Abstract: Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the range of objects these models can depict. However, ensuring that the models adhere closely to their text prompts remains a considerable challenge. This issue is particularly pronounced when generating photorealistic images of humans: without significant prompt engineering, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of the captions accompanying the images used to train large-scale diffusion models, which typically prioritize contextual information over details of a person's appearance. In this paper we address this issue by introducing a training-free pipeline that generates accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets, and then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We release our synthetic captions, pretrained checkpoints, and training code.
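The training-free captioning idea can be illustrated with a minimal sketch: run off-the-shelf attribute predictors on a face image, then compose their outputs into a natural-language appearance description. The attribute names, template, and `build_caption` helper below are illustrative assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch: compose an appearance caption from per-face
# attribute predictions (e.g. from off-the-shelf age, gender, hair, and
# expression estimators). The attribute keys and sentence template are
# assumptions for illustration, not the paper's actual caption format.

def build_caption(attrs: dict) -> str:
    """Turn a dict of predicted face attributes into one caption string."""
    age = attrs.get("age")
    gender = attrs.get("gender")
    # Fall back to a generic subject when key attributes are missing.
    subject = f"a {age}-year-old {gender}" if age and gender else "a person"
    parts = [f"A photo of {subject}"]
    if attrs.get("hair_color"):
        parts.append(f"with {attrs['hair_color']} hair")
    if attrs.get("expression"):
        parts.append(f"showing a {attrs['expression']} expression")
    if attrs.get("accessories"):
        parts.append("wearing " + ", ".join(attrs["accessories"]))
    return " ".join(parts) + "."

caption = build_caption({
    "age": 34,
    "gender": "woman",
    "hair_color": "brown",
    "expression": "slight smile",
    "accessories": ["glasses"],
})
print(caption)
```

Captions produced this way can then be paired with their source images to build a fine-tuning dataset for the diffusion model, replacing the context-heavy captions the base model was trained on.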
Authors: Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou