Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation (2306.00914v3)
Abstract: Deep generative models have shown impressive results in generating realistic images of faces. GANs managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. In this paper, we propose a multi-conditioning approach for diffusion models via cross-attention exploiting both attributes and semantic masks to generate high-quality and controllable face images. We also studied the impact of applying perceptual-focused loss weighting into the latent space instead of the pixel space. Our method extends the previous approaches by introducing conditioning on more than one set of features, guaranteeing a more fine-grained control over the generated face images. We evaluate our approach on the CelebA-HQ dataset, and we show that it can generate realistic and diverse samples while allowing for fine-grained control over multiple attributes and semantic regions. Additionally, we perform an ablation study to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.
- Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
- Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11472–11481, 2022.
- Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
- Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018.
- Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- High-fidelity and arbitrary face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16115–16124, 2021.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(47):1–33, 2022.
- Guidedstyle: Attribute knowledge guided style manipulation for semantic face editing. Neural Networks, 145:209–220, 2022.
- Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Score matching model for unbounded data score. arXiv preprint arXiv:2106.05527, 7, 2021.
- Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
- Maskgan: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5549–5558, 2020.
- Controllable text-to-image generation. Advances in Neural Information Processing Systems, 32, 2019.
- Image-to-image translation via hierarchical style disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8639–8648, 2021.
- Stylet2i: Toward compositional and high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18197–18207, 2022.
- Learning to predict layout-to-image conditional convolutions for semantic image synthesis. Advances in Neural Information Processing Systems, 32, 2019.
- Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.
- Wavelet diffusion models are fast and scalable image generators. arXiv preprint arXiv:2211.16152, 2022.
- Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- Dae-gan: Dynamic aspect-aware gan for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13960–13969, 2021.
- Learning residual images for face attribute manipulation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4030–4038, 2017.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Diverse semantic image synthesis via probability distribution modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2021.
- Efficient semantic image synthesis via class-adaptive normalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4852–4866, 2021.
- Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
- Semantic image synthesis via diffusion models. arXiv preprint arXiv:2207.00050, 2022.
- Image synthesis via semantic composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13749–13758, 2021.
- Chen Henry Wu and Fernando De la Torre. Unifying diffusion models’ latent space, with applications to cyclediffusion and guidance. arXiv preprint arXiv:2210.05559, 2022.
- Relgan: Multi-domain image-to-image translation via relative attributes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5914–5922, 2019.
- Tedigan: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2256–2265, 2021.
- Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
- Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024, 2019.
- Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11304–11314, 2022.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
- Toward multimodal image-to-image translation. Advances in neural information processing systems, 30, 2017.
- Sean: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5104–5113, 2020.
- Nico Giambi (1 paper)
- Giuseppe Lisanti (19 papers)