Overview of TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
The research paper "TediGAN: Text-Guided Diverse Face Image Generation and Manipulation" presents a framework for generating and manipulating face images from textual descriptions. The framework, named TediGAN, integrates several components to improve the quality and diversity of multi-modal image synthesis, advancing the state of text-guided face generation.
Key Components of TediGAN
TediGAN is structured around three primary components (illustrative code sketches of each follow this list):
- StyleGAN Inversion Module: This module inverts real images into the latent space of a pretrained StyleGAN, allowing for high-quality image reconstructions and semantically meaningful manipulations. By leveraging a fixed StyleGAN model, TediGAN circumvents the need for paired text-image datasets, while maintaining high photorealism and diversity in the generated images.
- Visual-Linguistic Similarity Learning: This component maps visual and linguistic inputs into a common embedding space within the StyleGAN latent domain, ensuring coherent correspondence between textual and visual attributes. Through this alignment, TediGAN supports precise image editing directed by textual descriptions.
- Instance-Level Optimization: This step preserves identity during manipulation. It applies the attribute edits specified by the input text while keeping text-irrelevant features unchanged, using a regularization term based on an image encoder that keeps the edited code within the semantic domain of the StyleGAN generator.
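To make the inversion module concrete, the sketch below shows a minimal encoder-based GAN inversion setup in PyTorch. The encoder architecture, latent layout, and loss weighting are illustrative assumptions rather than the paper's exact design; the key point from the paper is that the pretrained generator stays fixed while only the encoder is trained so that G(E(x)) reconstructs x.

```python
# Minimal sketch of encoder-based StyleGAN inversion (illustrative; the
# encoder architecture and loss terms are assumptions, not the paper's
# exact design). The pretrained generator G is kept fixed and only the
# encoder is trained so that G(E(x)) reconstructs x.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Maps a face image to a latent code in the generator's W space."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, latent_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

def inversion_loss(G, encoder, image):
    """Reconstruction objective with the generator frozen."""
    w = encoder(image)
    recon = G(w)                      # pretrained StyleGAN, weights fixed
    loss = F.mse_loss(recon, image)   # a perceptual term (e.g. LPIPS) is
    return loss                       # usually added on top in practice
```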
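For the visual-linguistic similarity component, the following sketch projects text features into the same latent space as inverted image codes and pulls matched pairs together. The projection head, the source of `text_features`, the contrastive (InfoNCE-style) formulation, and the temperature are assumptions made for illustration; the paper's exact objective differs in its details.

```python
# Illustrative sketch of visual-linguistic similarity learning: a text
# projector maps sentence features (from any pretrained text encoder)
# into the generator's latent space, and a contrastive loss aligns
# matched (image, text) pairs. Not the paper's exact objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextProjector(nn.Module):
    """Projects text features into the generator's latent space."""
    def __init__(self, text_dim=768, latent_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, text_features):
        return self.proj(text_features)

def similarity_loss(w_image, w_text, temperature=0.07):
    """Pull matched (image, text) codes together, push mismatched apart."""
    w_image = F.normalize(w_image, dim=-1)
    w_text = F.normalize(w_text, dim=-1)
    logits = w_image @ w_text.t() / temperature
    targets = torch.arange(w_image.size(0), device=w_image.device)
    return F.cross_entropy(logits, targets)
```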
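Finally, the instance-level optimization can be read as a test-time refinement loop over the latent code. The sketch below is hedged: the cosine text-matching term stands in for the learned visual-linguistic similarity above, the encoder-based regularizer keeps the edited code near the original inversion to preserve identity, and the step count and weights are illustrative.

```python
# Hedged sketch of instance-level optimization for text-guided manipulation.
# Starting from the inverted code of the input image, the latent is refined
# to match the text while an encoder-based regularizer preserves identity
# and keeps the edit inside the generator's semantic domain. Losses,
# weights, and step count are illustrative assumptions.
import torch
import torch.nn.functional as F

def manipulate(G, encoder, text_projector, image, text_features,
               steps=200, lr=0.01, lambda_reg=1.0):
    w_init = encoder(image).detach()                 # inversion of the input
    w_text = text_projector(text_features).detach()  # text in latent space
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        edited = G(w)
        # text-matching term: current code vs. the text's projected code
        match = 1.0 - F.cosine_similarity(w, w_text, dim=-1).mean()
        # identity / semantic-domain regularizer: re-encoding the edited
        # image should stay close to the original inversion
        reg = F.mse_loss(encoder(edited), w_init)
        loss = match + lambda_reg * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(w).detach()
```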
Empirical Performance and Comparative Analysis
TediGAN demonstrates superior performance across multiple metrics when tested on the newly introduced Multi-Modal CelebA-HQ dataset. The dataset comprises real face images paired with semantic segmentation maps, sketches, and textual descriptions.
- FID and LPIPS Scores: TediGAN outperforms state-of-the-art methods such as AttnGAN, ControlGAN, DF-GAN, and DM-GAN in both image quality (FID, lower is better) and diversity (LPIPS, higher is better); a short metric sketch follows this list.
- Accuracy and Realism: User studies show that images generated by TediGAN are judged both more realistic and more closely aligned with the provided textual descriptions than those of competing methods.
- Resolution and Diversity: TediGAN achieves high-resolution outputs at 1024×1024 pixels, marking a significant improvement over existing text-to-image generation methods, which often suffer from quality degradation at higher resolutions.
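As a concrete reference for how these two automatic metrics are commonly computed, the snippet below uses the torchmetrics package (an assumption on my part; the paper's evaluation may rely on other implementations, and torchmetrics' image extras must be installed). FID compares Inception feature statistics of real and generated batches, while LPIPS between pairs of generated samples serves as a proxy for diversity.

```python
# Hedged sketch of FID and LPIPS evaluation with torchmetrics (illustrative;
# not the paper's original evaluation code). Tensors below are toy stand-ins
# with tiny batch sizes, for illustration only.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid = FrechetInceptionDistance(feature=2048)                    # expects uint8 images by default
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")  # expects floats in [-1, 1]

real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)   # real faces
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)   # generated faces
sample_a = torch.rand(16, 3, 256, 256) * 2 - 1   # two batches generated from
sample_b = torch.rand(16, 3, 256, 256) * 2 - 1   # the same captions

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID  :", fid.compute().item())              # lower = better image quality
print("LPIPS:", lpips(sample_a, sample_b).item())  # higher = more diverse samples
```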
Theoretical and Practical Implications
The approach proposed in TediGAN highlights an important intersection between GAN inversion and multi-modal data processing, suggesting new directions for future research. The effective integration of StyleGAN inversion with textual input mappings extends the potential applications of GANs in creative fields, personalized avatars, and beyond.
On the theoretical front, TediGAN's framework encourages further exploration of disentangled representations and multi-modal embeddings in deep models. Practically, the introduction of the Multi-Modal CelebA-HQ dataset provides a valuable resource for further research and development in facial image synthesis and understanding.
Concluding Remarks
TediGAN provides a compelling solution for text-guided image synthesis, demonstrating both versatility and effectiveness across various tasks. Its unified framework for both generation and manipulation opens pathways for more interactive and user-friendly image editing tools. Future work may focus on further improving the disentanglement of attributes within the StyleGAN latent space and on better handling underrepresented visual elements such as accessories.