Cross-Modal Contrastive Learning for Text-to-Image Generation (2101.04702v5)

Published 12 Jan 2021 in cs.CV

Abstract: The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but--more importantly--people prefer XMC-GAN by 77.3% for image quality and 74.1% for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images data, establishing a strong benchmark FID score of 26.91.

Cross-Modal Contrastive Learning for Text-to-Image Generation

The paper "Cross-Modal Contrastive Learning for Text-to-Image Generation" introduces the Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN), an approach aimed at improving both the quality and the semantic fidelity of text-to-image synthesis. It addresses the central challenge of the task: generating images that are not only photo-realistic but also highly coherent with their textual descriptions.

Key Contributions

  1. Innovative Contrastive Learning Approach: The paper leverages cross-modal contrastive learning to bridge the semantic gap between the text and image modalities. XMC-GAN maximizes mutual information between text and image pairs through multiple contrastive losses that capture both inter-modality (image to sentence) and intra-modality (image to image) correspondences; a minimal loss sketch appears after this list.
  2. Attentional Self-Modulation Generator: The generator employs self-modulation layers with an attention mechanism over the word embeddings, enforcing strong correspondence between the text and the generated image features. This architecture improves the model's ability to produce detailed, coherent images from the given textual input; a sketch of such a layer also follows the list.
  3. Superior Performance on Challenging Datasets: The model exhibits significant performance improvements across multiple datasets. Notably, on the MS-COCO dataset, XMC-GAN reduced the state-of-the-art Fréchet Inception Distance (FID) from 24.70 to 9.33—a substantial leap forward. Human evaluations further substantiate these improvements, with a majority preference for XMC-GAN's image quality and text alignment over previous models.
  4. Benchmarking on Localized Narratives and Open Images: The paper extends its evaluation to the Localized Narratives and Open Images datasets, which pose additional challenges due to longer and more descriptive captions. XMC-GAN sets a new benchmark on these datasets with an FID of 14.12 for LN-COCO and establishes a strong baseline with an FID of 26.91 for LN-OpenImages.
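
The cross-modal objective can be read as a symmetric InfoNCE-style loss over matching and mismatched pairs within a batch. The sketch below is a minimal illustration under that assumption: the function name, tensor shapes, and temperature value are hypothetical rather than the authors' exact implementation, and the same routine can be reused for the intra-modality (real image vs. generated image) term.

```python
# Minimal sketch of a symmetric InfoNCE-style contrastive loss between
# image and sentence embeddings (illustrative, not the paper's exact code).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """img_emb, txt_emb: (batch, dim) embeddings; matching pairs share a batch index."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# The same function can serve the intra-modality term by passing real and
# generated image embeddings in place of the image/text pair:
# loss_intra = contrastive_loss(real_img_emb, fake_img_emb)
```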

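The attentional self-modulation idea can likewise be sketched as a normalization layer whose per-channel scale and shift are predicted from the noise-plus-sentence condition together with an attention-pooled summary of the word embeddings. The module below is an illustrative approximation with assumed layer names and a single global attention step, rather than the paper's region-specific formulation.

```python
# Illustrative text-conditioned self-modulation layer (assumed structure).
import torch
import torch.nn as nn

class AttentionalSelfModulation(nn.Module):
    def __init__(self, num_channels, cond_dim, word_dim):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        self.query = nn.Linear(cond_dim, word_dim)         # attends over word embeddings
        self.to_gamma = nn.Linear(cond_dim + word_dim, num_channels)
        self.to_beta = nn.Linear(cond_dim + word_dim, num_channels)

    def forward(self, feat, cond, words):
        # feat: (B, C, H, W); cond: (B, cond_dim) = [noise; sentence emb];
        # words: (B, T, word_dim) word embeddings.
        attn = torch.softmax(
            torch.bmm(words, self.query(cond).unsqueeze(-1)).squeeze(-1), dim=1)
        word_ctx = torch.bmm(attn.unsqueeze(1), words).squeeze(1)   # (B, word_dim)
        h = torch.cat([cond, word_ctx], dim=1)
        gamma = self.to_gamma(h).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(h).unsqueeze(-1).unsqueeze(-1)
        # Modulate the normalized feature map with text-dependent scale and shift.
        return (1 + gamma) * self.norm(feat) + beta
```
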
Implications and Future Directions

The research showcases the potential of leveraging contrastive learning within the domain of conditional GANs to capture intricate semantic details and enhance the realism of generated images. This approach not only sets new standards in text-to-image synthesis but also opens avenues for further exploration into cross-modal generative tasks.

The integration of contrastive learning strategies signals a shift towards more sophisticated models that inherently understand and generate complex scenes, paving the way for future advancements in creative AI applications. However, the exploration of more diverse datasets and the impact of additional modalities (e.g., audio) remain compelling areas for future research.

By focusing on mutual information maximization through contrastive learning, the paper highlights a promising direction for achieving higher levels of semantic alignment and image fidelity in generative models. This work could inform the development of more generalized frameworks capable of handling the nuanced interdependencies between different data modalities in multimodal AI tasks.
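
The link between contrastive training and mutual information can be made concrete: InfoNCE-style losses over a batch of N candidates bound mutual information from below, so minimizing the contrastive loss tightens

    I(x; y) \geq \log N - L_{\mathrm{NCE}},

where x and y are the paired representations (for example, image and sentence embeddings). This is a standard property of the InfoNCE loss family rather than a result specific to XMC-GAN.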

Authors (5)
  1. Han Zhang (338 papers)
  2. Jing Yu Koh (18 papers)
  3. Jason Baldridge (45 papers)
  4. Honglak Lee (174 papers)
  5. Yinfei Yang (73 papers)
Citations (333)