The paper "Controllable Text-to-Image Generation" presents a novel method named ControlGAN designed to address the challenge of generating realistic images that can be selectively manipulated based on textual descriptions. The core idea is to introduce a level of control in the text-to-image generation process, allowing for adjustments to specific attributes of the synthesized images in response to modifications in the text, without altering other unrelated parts of the image.
Key Innovations and Methodology
- ControlGAN Architecture:
  - ControlGAN is built on a Generative Adversarial Network (GAN) framework designed specifically to enable control over individual image attributes. The model consists of a multi-stage generator and a discriminator, each with novel components tailored to improve both controllability and image quality.
- Word-Level Spatial and Channel-Wise Attention:
  - A distinctive component of ControlGAN is its generator-side attention. A word-level spatial attention module focuses generation on the image subregions most relevant to each word, while a channel-wise attention module additionally correlates words with channel features, which tend to encode semantic visual attributes. Together, these modules help disentangle the visual attributes described by different words (a sketch of the channel-wise attention appears after this list).
- Word-Level Discriminator:
  - ControlGAN introduces a word-level discriminator that evaluates the correlation between each word in the text and specific regions of the image. This fine-grained feedback improves the generator's ability to modify the visual attributes named in the text while preserving unmodified content (see the discriminator sketch after this list).
- Perceptual Loss:
  - To reduce random variation and maintain semantic consistency, a perceptual loss is incorporated: generated images are compared with reference images in the feature space of a pre-trained deep network rather than in pixel space, so that outputs are both realistic and faithful to the intended descriptions (see the loss sketch after this list).
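To make the channel-wise attention concrete, here is a minimal PyTorch sketch loosely following the paper's description: word embeddings are projected into the flattened spatial space, correlated with every channel, and the resulting attention produces word-attended channel features. The class and layer names (`ChannelWiseAttention`, `perception`) and the single linear projection are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Word-level channel-wise attention (illustrative sketch)."""

    def __init__(self, word_dim: int, spatial_size: int):
        super().__init__()
        # Perception layer: maps each word embedding into the flattened
        # spatial space so words and channels can be correlated directly.
        self.perception = nn.Linear(word_dim, spatial_size)

    def forward(self, v: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # v: (B, C, H*W) flattened visual features from one generator stage
        # w: (B, L, word_dim) word embeddings for a caption of L words
        w_tilde = self.perception(w)                  # (B, L, H*W)
        # Correlation of every channel with every word: (B, C, L)
        scores = torch.bmm(v, w_tilde.transpose(1, 2))
        # Normalize over words: each channel attends to the words most
        # relevant to the visual attribute it encodes.
        alpha = torch.softmax(scores, dim=2)
        # Word-attended channel features, same shape as v: (B, C, H*W)
        return torch.bmm(alpha, w_tilde)
```

The output can then be fused (e.g., added or concatenated) with the spatial-attention features before the next generator stage.

The word-level discriminator's core computation can be sketched in a similar hedged way: region features (e.g., from a CNN image encoder) are matched against projected word embeddings via attention, and each word receives a grounding score. The cosine-similarity scoring and the mean aggregation here are assumptions for illustration; the paper folds word-region correlations into an adversarial loss.

```python
import torch
import torch.nn as nn

class WordLevelDiscriminator(nn.Module):
    """Word-region correlation scorer (illustrative sketch)."""

    def __init__(self, word_dim: int, region_dim: int):
        super().__init__()
        # Align word embeddings with the region feature space.
        self.word_proj = nn.Linear(word_dim, region_dim)

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, region_dim) features for R image regions
        # words:   (B, L, word_dim)   embeddings for L caption words
        w = self.word_proj(words)                     # (B, L, region_dim)
        # Each word attends to the regions it should describe.
        attn = torch.softmax(torch.bmm(w, regions.transpose(1, 2)), dim=2)
        context = torch.bmm(attn, regions)            # (B, L, region_dim)
        # Per-word grounding score in [-1, 1]: how well each word is
        # visually reflected in its attended regions.
        scores = torch.cosine_similarity(w, context, dim=2)  # (B, L)
        # Aggregate to a single correlation score per image-text pair.
        return scores.mean(dim=1)
```

Finally, the perceptual loss amounts to a feature-space distance on a frozen, ImageNet-pretrained VGG-16; the specific layer cut-off (relu4_3 here) and the choice of reference image are assumptions for this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """VGG-16 feature-space distance (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Keep layers up to relu4_3 (indices 0..22 of vgg.features).
        self.features = nn.Sequential(*list(vgg.features)[:23]).eval()
        for p in self.features.parameters():
            p.requires_grad = False  # VGG stays frozen during training

    def forward(self, generated: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, 3, H, W), ImageNet-normalized RGB images.
        return nn.functional.mse_loss(self.features(generated),
                                      self.features(reference))
```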
Experimental Validation
ControlGAN's effectiveness is validated through extensive experiments on two standard benchmarks, the CUB bird dataset and the Microsoft COCO dataset. The results show that the method outperforms state-of-the-art baselines in both image quality and the precision of controlled modifications:
- Image Quality and Alignment: Compared to earlier models such as StackGAN++ and AttnGAN, ControlGAN generates higher-quality images that remain consistent with the textual input even when parts of the text are modified.
- Controllable Manipulation: A key measure is how well the model preserves image content unrelated to the modified text. ControlGAN achieves a substantially lower L2 reconstruction error than competing models, meaning it better retains unmodified content; a minimal way to compute this metric is sketched below.
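As a concrete reading of the metric, the snippet below computes a mean per-image L2 distance between images generated before and after the text edit; the exact pairing and normalization used in the paper's evaluation may differ.

```python
import torch

def l2_reconstruction_error(before: torch.Tensor, after: torch.Tensor) -> float:
    """Mean per-image L2 distance between the image generated from the
    original caption (`before`) and from the edited caption (`after`).
    Lower is better: content unrelated to the text edit is preserved.
    Both tensors: (B, 3, H, W), values in [0, 1]."""
    diff = (before - after).flatten(start_dim=1)  # (B, 3*H*W)
    return diff.norm(p=2, dim=1).mean().item()
```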
Conclusion and Implications
ControlGAN stands out by giving users fine-grained, natural-language control over specific image attributes. This has potential applications in customized image editing, creative design, and any domain that requires detailed, user-guided visual content creation. The combination of word-level attention mechanisms and perceptual constraints paves the way for future work on controllable text-to-image synthesis, making the paper a significant contribution to the field of generative AI.