Semantics Disentangling for Text-to-Image Generation (1904.01480v1)

Published 2 Apr 2019 in cs.CV

Abstract: Synthesizing photo-realistic images from text descriptions is a challenging problem. Previous studies have shown remarkable progress on the visual quality of the generated images. In this paper, we consider semantics from the input text descriptions to help render photo-realistic images. However, diverse linguistic expressions pose challenges in extracting consistent semantics even when they depict the same thing. To this end, we propose a novel photo-realistic text-to-image generation model that implicitly disentangles semantics to fulfill both high-level semantic consistency and low-level semantic diversity. Specifically, we design (1) a Siamese mechanism in the discriminator to learn consistent high-level semantics, and (2) a visual-semantic embedding strategy via semantic-conditioned batch normalization to find diverse low-level semantics. Extensive experiments and ablation studies on the CUB and MS-COCO datasets demonstrate the superiority of the proposed method in comparison to state-of-the-art methods.

Authors (6)
  1. Guojun Yin (19 papers)
  2. Bin Liu (441 papers)
  3. Lu Sheng (63 papers)
  4. Nenghai Yu (173 papers)
  5. Xiaogang Wang (230 papers)
  6. Jing Shao (109 papers)
Citations (170)

Summary

Semantics Disentangling for Text-to-Image Generation: A Detailed Overview

The paper "Semantics Disentangling for Text-to-Image Generation" introduces an innovative framework called Semantics Disentangling Generative Adversarial Network (SD-GAN) aimed at improving the quality and consistency of images generated from text descriptions. This research addresses the inherent challenges posed by diverse linguistic expressions and seeks to enhance both high-level semantic consistency and low-level semantic diversity in text-to-image generation.

Methodology Overview

The SD-GAN leverages a novel integration of a Siamese mechanism with contrastive losses in the discriminator, coupled with a Semantic-Conditioned Batch Normalization (SCBN) in the generator. These components play pivotal roles in disentangling semantics from input text, ensuring that generated images maintain semantic relevance despite variations in textual descriptions.

  • Siamese Mechanism: The Siamese structure in the discriminator distills high-level semantic consistency. Using caption pairs drawn from the same ground-truth image (intra-class) or from different images (inter-class), it keeps generated images consistent even under widely varying phrasings. The discriminator acts as an image comparator, trained with a contrastive loss that minimizes the feature distance between images generated from semantically equivalent descriptions while maximizing it for dissimilar ones (a loss sketch follows this list).
  • Semantic-Conditioned Batch Normalization (SCBN): SCBN strengthens the visual-semantic embedding by injecting sentence-level and word-level text features into the batch normalization layers of the generator. This allows fine-grained modulation of the visual feature maps with the semantics extracted from the text, preserving the low-level semantic diversity needed for detailed image synthesis (see the SCBN sketch after the loss sketch below).
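
The following is a minimal, hedged sketch of the kind of contrastive objective a Siamese discriminator branch can be trained with, written in PyTorch. The names (`contrastive_loss`, `disc_branch`, `same_semantics`) are illustrative assumptions, not the authors' code, and the paper's exact loss formulation and weighting may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, same_semantics, margin=1.0):
    """Generic Siamese contrastive loss (a sketch, not the paper's exact loss).
    feat_a/feat_b: features a shared discriminator branch extracts from two
    generated images. same_semantics: 1.0 when the two source captions describe
    the same ground-truth image (intra-class pair), 0.0 otherwise (inter-class)."""
    dist = F.pairwise_distance(feat_a, feat_b)                      # Euclidean distance per pair
    pull = same_semantics * dist.pow(2)                             # pull intra-class pairs together
    push = (1.0 - same_semantics) * F.relu(margin - dist).pow(2)    # push inter-class pairs apart
    return (pull + push).mean()

# Hypothetical usage: disc_branch is the shared feature extractor of the
# discriminator; img_1/img_2 are images generated from a caption pair.
# loss = contrastive_loss(disc_branch(img_1), disc_branch(img_2), labels)
```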

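Below is a minimal sketch of sentence-level conditional batch normalization in the spirit of SCBN, again in PyTorch. The module and layer names (`SemanticConditionedBN`, `gamma_fc`, `beta_fc`) are assumptions for illustration; the paper additionally describes a word-level variant driven by attention, which this sketch omits.

```python
import torch
import torch.nn as nn

class SemanticConditionedBN(nn.Module):
    """Sketch of conditional batch normalization: the affine scale/shift of
    BatchNorm are predicted from a text embedding instead of being fixed
    learned parameters. Names are illustrative, not the authors' implementation."""
    def __init__(self, num_features, text_dim):
        super().__init__()
        # Parameter-free BN; the text supplies the affine modulation.
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma_fc = nn.Linear(text_dim, num_features)  # per-channel scale offsets
        self.beta_fc = nn.Linear(text_dim, num_features)   # per-channel shifts

    def forward(self, x, text_emb):
        # x: (B, C, H, W) visual features; text_emb: (B, text_dim) sentence embedding
        normalized = self.bn(x)
        gamma = 1.0 + self.gamma_fc(text_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta_fc(text_emb).unsqueeze(-1).unsqueeze(-1)
        return gamma * normalized + beta
```
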
Experimental Results

SD-GAN is evaluated extensively on the CUB and MS-COCO datasets, achieving an Inception Score of 4.67 on CUB and 35.69 on MS-COCO and surpassing previous models such as AttnGAN. These results position SD-GAN as a significant advancement in text-to-image synthesis.
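
For reference, the Inception Score behind these figures is IS = exp(E_x[KL(p(y|x) || p(y))]), computed from the class probabilities a pretrained Inception-v3 network assigns to generated images. The sketch below is a simplified version under stated assumptions (no split averaging, a hypothetical `pred_probs` input), not the evaluation code used in the paper.

```python
import numpy as np

def inception_score(pred_probs, eps=1e-12):
    """pred_probs: (N, num_classes) array of Inception-v3 class probabilities
    for N generated images. Returns exp of the mean per-image KL divergence
    between p(y|x) and the marginal p(y)."""
    marginal = pred_probs.mean(axis=0, keepdims=True)   # p(y)
    kl = (pred_probs * (np.log(pred_probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```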

Furthermore, human evaluations revealed that SD-GAN-generated images are often preferred due to their semantic alignment with text descriptions. Users ranked SD-GAN higher, confirming its effectiveness in producing visually coherent and semantically congruent images.

Implications and Future Directions

The approach proposed in this paper reflects a substantial progression in aligning generated visual content with textual semantics. It opens avenues for pursuing more sophisticated models that can further enhance the fidelity of generated images by successfully interpreting complex linguistic inputs.

The proposed framework holds promise for practical applications in media, advertising, and personalized content creation, where automatic rendering of images from text can be vital. From a theoretical standpoint, the fusion of cross-modal semantic disentangling techniques could inspire further research into multi-modal generative models, potentially leading to breakthroughs in areas like automatic video generation from scripts.

Future research may delve into more complex scenarios involving multi-sentence or paragraph-level text inputs, enabling richer, contextually grounded image synthesis. Additionally, exploring the scalability of this framework to less constrained datasets with greater diversity in image types and descriptions could broaden the applicability of text-to-image models in real-world deployments.

In conclusion, the "Semantics Disentangling for Text-to-Image Generation" paper presents a methodologically robust and empirically validated approach to text-to-image synthesis. By effectively addressing semantic consistency and diversity, SD-GAN sets a new standard in generating visually appealing and semantically accurate images from textual descriptions, providing a foundation for future advancements in the field.