Overview of "Towards Open-World Text-Guided Face Image Generation and Manipulation"
The paper presents TediGAN, a unified framework for text-guided image generation and manipulation built on pretrained generative adversarial networks (GANs), with a focus on face images. TediGAN achieves high-resolution synthesis and manipulation from multimodal inputs, such as text, sketches, and semantic labels, without retraining the underlying GAN or relying on task-specific post-processing.
Methodology
Pretrained GAN Utilization
The core of TediGAN's approach relies on a pretrained StyleGAN model, leveraging its latent space for both image generation and manipulation tasks. This pretrained model provides a rich, semantically meaningful space that additional encoders can map other modalities into.
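As a rough illustration of this idea, the sketch below maps an image into StyleGAN's extended W+ space (one 512-dimensional code per synthesis layer). The encoder architecture shown is a minimal stand-in rather than the paper's exact design, and the frozen StyleGAN generator is assumed to be available separately.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Minimal sketch of an image encoder that maps a face image into StyleGAN's
# extended W+ space (one 512-d code per synthesis layer). The real TediGAN
# encoder is more elaborate; this only shows the shape of the mapping.
class WPlusEncoder(nn.Module):
    def __init__(self, num_layers: int = 18, dim: int = 512):
        super().__init__()
        self.num_layers, self.dim = num_layers, dim
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_layers * dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) image -> (B, num_layers, dim) latent codes
        return self.backbone(x).view(-1, self.num_layers, self.dim)

encoder = WPlusEncoder()
w_plus = encoder(torch.randn(1, 3, 256, 256))   # (1, 18, 512)
# A frozen, pretrained StyleGAN generator (not shown here) would then map
# w_plus back to an image, e.g. image = generator.synthesis(w_plus).
```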
Image and Text Encoding
TediGAN introduces two key strategies for aligning multimodal inputs with the pretrained GAN's latent space:
- Trained Text Encoder Approach: This strategy involves a dedicated text encoder trained to project linguistic inputs into the same space as visual inputs. The visual-linguistic similarity module ensures semantically meaningful alignment by leveraging StyleGAN's hierarchical latent structure.
- Pretrained Vision-Language Models: The second strategy incorporates a pretrained image-text matching model, CLIP, to directly optimize latent codes with guidance from text-image similarity scores (a minimal sketch of this optimization follows the list).
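The sketch below illustrates CLIP-guided latent optimization under stated assumptions: `generator` stands in for a frozen StyleGAN synthesis network mapping a W+ code to an image, `initial_w_plus` is an inverted or randomly sampled starting code, and CLIP's input normalization is omitted for brevity. It is a sketch of the general technique, not the paper's exact implementation.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Encode the target text once; it stays fixed during optimization.
text = clip.tokenize(["a smiling young woman with blonde hair"]).to(device)
text_features = clip_model.encode_text(text).detach()

# `initial_w_plus` and `generator` are assumed handles (inversion output and a
# frozen StyleGAN synthesis network); only the latent code receives gradients.
w_plus = initial_w_plus.clone().requires_grad_(True)
optimizer = torch.optim.Adam([w_plus], lr=0.01)

for _ in range(200):
    image = generator(w_plus)                           # assumed frozen StyleGAN
    image_224 = torch.nn.functional.interpolate(image, size=224, mode="bilinear")
    image_features = clip_model.encode_image(image_224) # CLIP normalization omitted
    # Maximize cosine similarity between the generated image and the prompt.
    loss = 1 - torch.cosine_similarity(image_features, text_features).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```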
Instance-Level Optimization
For image manipulation tasks, TediGAN employs an instance-level optimization module that improves identity preservation and enables precise attribute modification. This is achieved by further refining the inverted latent codes under both pixel-level and semantic criteria, as sketched below.
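A minimal sketch of such a refinement loop is shown below, assuming the frozen `generator`, the inverted code `w_inverted`, and the input face `target_image` from earlier. The pixel term is combined here with an LPIPS perceptual term as one plausible semantic criterion; the paper's exact losses and weights may differ.

```python
import torch
import lpips  # perceptual similarity (pip install lpips)

# LPIPS expects images roughly in [-1, 1]; weight 0.8 is illustrative.
perceptual = lpips.LPIPS(net="vgg")

w = w_inverted.clone().requires_grad_(True)   # start from the inverted code
optimizer = torch.optim.Adam([w], lr=0.005)

for _ in range(100):
    recon = generator(w)                      # assumed frozen StyleGAN
    # Pixel-level fidelity to the input image plus a perceptual (LPIPS) term.
    loss = torch.nn.functional.mse_loss(recon, target_image) \
         + 0.8 * perceptual(recon, target_image).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```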
Integration and Control Mechanisms
A style mixing mechanism facilitates diverse outputs and precise attribute control by exploiting the layer-wise semantics of the latent space. This mechanism enables targeted edits driven by textual instructions or other modalities such as sketches; a sketch of layer-wise mixing follows.
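The snippet below shows the general shape of layer-wise mixing in W+ space: coarse layers of the original code are kept so structure and identity are preserved, while later layers are taken from the text-guided code. The specific layer split is illustrative rather than the paper's exact configuration.

```python
import torch

def mix_styles(w_original: torch.Tensor,
               w_text: torch.Tensor,
               layers_to_edit=range(8, 18)) -> torch.Tensor:
    """Both inputs are (B, num_layers, 512) W+ codes; returns the mixed code."""
    w_mixed = w_original.clone()
    idx = list(layers_to_edit)
    # Override only the selected layers with the text-driven styles,
    # leaving the remaining (coarse) layers untouched.
    w_mixed[:, idx] = w_text[:, idx]
    return w_mixed

# Example: keep structure from the inverted face, take fine styles from the
# text-guided code, then synthesize with the frozen generator (assumed):
# edited = generator(mix_styles(w_inverted, w_text))
```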
Experimental Results
The paper also introduces the Multi-Modal CelebA-HQ dataset, which provides a robust benchmark for evaluating TediGAN against state-of-the-art models, including AttnGAN, DM-GAN, and ManiGAN. Quantitative evaluations using metrics such as FID and LPIPS, alongside user studies of realism and text-image coherence, demonstrate TediGAN's superior performance in generating high-quality, consistent, and photo-realistic results; a minimal metric-computation sketch follows.
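For reference, the sketch below computes FID and LPIPS with torchmetrics; the paper's own evaluation scripts may differ, and the random tensors are placeholders for real and generated images (in practice the statistics are accumulated over the full test set).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# FID compares Inception feature statistics of real vs. generated images;
# LPIPS measures perceptual distance between image pairs.
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]
lpips_metric = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

real = torch.rand(8, 3, 299, 299)   # placeholder for real CelebA-HQ images
fake = torch.rand(8, 3, 299, 299)   # placeholder for generated images

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
print("LPIPS:", lpips_metric(fake, real).item())
```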
Practical and Theoretical Implications
Practically, TediGAN offers meaningful advances for applications that require seamless transitions between generation and manipulation at high resolution and under diverse input conditions. Theoretically, it highlights the potential of pretrained GAN models in open-world scenarios and multimodal input settings, pointing toward further development in cross-modal image synthesis.
Future Directions
Future research should address limitations related to the pretrained GAN's inherent biases and the computational cost of instance-specific optimization. Extending the framework to broader image categories and improving real-time efficiency remain promising directions.
Overall, TediGAN presents a compelling approach to unified image synthesis and manipulation, emphasizing the synergy between pretrained models and multimodal inputs. The framework provides insightful pathways for leveraging GANs in generating and modifying images rich in detail and semantic coherence.