TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training (2309.11923v1)
Abstract: Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation semantically edits parts of a given image according to specified texts. For both of these closely related tasks, the key is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, yet still struggle to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts either random noise or an image as input, corresponding to the two tasks, and, conditioned on the given text, a carefully designed mapping network exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) to produce images at resolutions up to $1024\times1024$, the highest currently achievable for these tasks. Extensive experiments on the Multi-modal CelebA-HQ dataset demonstrate that our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
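The abstract only gestures at how CLIP steers a StyleGAN generator without a discriminator, so below is a minimal sketch of the underlying technique: driving a latent code toward a text prompt with a CLIP similarity loss, with no adversarial objective involved. This is not the authors' TextCLIP mapping network; the `ToyGenerator` module is a hypothetical stand-in for a pretrained StyleGAN generator, and the prompt, learning rate, and step count are illustrative assumptions.

```python
# Minimal sketch: CLIP-guided latent optimization without adversarial training.
# Assumes PyTorch and OpenAI's `clip` package; `ToyGenerator` is a hypothetical
# stand-in for a pretrained StyleGAN generator.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():  # CLIP stays frozen; only the latent is optimized
    p.requires_grad_(False)

class ToyGenerator(nn.Module):
    """Hypothetical stand-in: maps a 512-d latent to a 3x224x224 image in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 3 * 224 * 224)
    def forward(self, w):
        return torch.tanh(self.fc(w)).view(-1, 3, 224, 224)

generator = ToyGenerator().to(device)

# Encode the conditioning text once; the prompt is an illustrative example.
tokens = clip.tokenize(["a smiling woman with blonde hair"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens).float()
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# The latent starts from random noise (the generation task); for manipulation
# one would instead start from the inverted latent of a real image.
w = torch.randn(1, 512, device=device, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)

for step in range(100):
    img = generator(w)                          # candidate image
    img_feat = model.encode_image(img).float()  # CLIP image embedding
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()   # cosine distance to the text
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In TextCLIP itself, a trained mapping network replaces this per-sample optimization, producing the latent in a single forward pass from the text condition; the sketch only illustrates how a CLIP loss can stand in for a discriminator as the training signal.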
- Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4432–4441.
- ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6711–6720.
- Image-based CLIP-guided essence transfer. arXiv preprint arXiv:2110.12427 (2021).
- RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10911–10920.
- Navigating the GAN parameter space for semantic image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3671–3680.
- Editing in style: Uncovering the local semantics of GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5771–5780.
- ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
- CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34 (2021).
- A disentangling invertible interpretation network for explaining latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9223–9232.
- CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021).
- Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
- SegAttnGAN: Text to image generation with segmentation attention. arXiv preprint arXiv:2005.12444 (2020).
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017).
- TextFace: Text-to-Style Mapping based Face Generation and Manipulation. IEEE Transactions on Multimedia (2022).
- CurricularFace: Adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5901–5910.
- Transforming and projecting images into class-conditional generative networks. In European Conference on Computer Vision. Springer, 17–34.
- Talk-to-Edit: Fine-Grained Facial Editing via Dialog. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13799–13808.
- Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.
- Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems 33 (2020), 12104–12114.
- Alias-free generative adversarial networks. Advances in Neural Information Processing Systems 34 (2021).
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
- Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119.
- Controllable text-to-image generation. Advances in Neural Information Processing Systems 32 (2019).
- ManiGAN: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7880–7889.
- Segmentation in style: Unsupervised semantic image segmentation with StyleGAN and CLIP. arXiv preprint arXiv:2107.12518 (2021).
- StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2085–2094.
- MirrorGAN: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1505–1514.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
- Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
- Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 1060–1069.
- Encoding in style: A StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2287–2296.
- DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13960–13969.
- StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis. arXiv preprint arXiv:2111.03133 (2021).
- Semantic and Geometric Unfolding of StyleGAN Latent Space. arXiv preprint arXiv:2107.04481 (2021).
- Multi-caption Text-to-Face Synthesis: Dataset and Algorithm. In Proceedings of the 29th ACM International Conference on Multimedia. 2290–2298.
- DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865 (2020).
- Designing an encoder for StyleGAN image manipulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–14.
- Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
- StyleGAN2 distillation for feed-forward image manipulation. In European Conference on Computer Vision. Springer, 170–186.
- Cycle-consistent inverse GAN for text-to-image synthesis. In Proceedings of the 29th ACM International Conference on Multimedia. 630–638.
- StyleSpace analysis: Disentangled controls for StyleGAN image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12863–12872.
- TediGAN: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2256–2265.
- Towards open-world text-guided face image generation and manipulation. arXiv preprint arXiv:2104.08910 (2021).
- GAN inversion: A survey. arXiv preprint arXiv:2101.05278 (2021).
- AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1316–1324.
- Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 833–842.
- StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 5907–5915.
- StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 8 (2018), 1947–1962.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 586–595.
- Dance Generation with Style Embedding: Learning and Transferring Latent Representations of Dance Styles. arXiv preprint arXiv:2104.14802 (2021).
- Generative adversarial network for text-to-face synthesis and manipulation. In Proceedings of the 29th ACM International Conference on Multimedia. 2940–2944.
- CookGAN: Causality based text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5519–5527.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
- DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5802–5810.
Authors: Xiaozhou You, Jian Zhang