
Zero-Shot Text-to-Image Generation (2102.12092v2)

Published 24 Feb 2021 in cs.CV and cs.LG

Abstract: Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

Zero-Shot Text-to-Image Generation

Introduction

This paper, authored by Aditya Ramesh et al., explores a novel approach to text-to-image generation based on an autoregressive transformer model. Unlike traditional methods that rely on specific modeling heuristics or auxiliary information, this research unifies text and image tokens into a single data stream and models them autoregressively. The approach demands considerable scale in both data and compute, and with that scale it becomes competitive with existing domain-specific models when evaluated in zero-shot settings.

Related Work

Early attempts at text-to-image generation approached the problem through various generative models. Mansimov et al. (2015) extended the DRAW model to condition on image captions, while Reed et al. (2016) utilized GANs to enhance image fidelity and generalize to unseen categories. Subsequent improvements included multi-scale generators, attention mechanisms, and the incorporation of side-information, as seen in works by Zhang et al. (2017, 2018) and Xu et al. (2018). More recent methodologies, such as those proposed by Nguyen et al. (2017), leveraged pretrained discriminative models within an energy-based framework to achieve higher image quality.

Methodology

The paper outlines a two-stage training procedure that sidesteps the memory and compute cost of modeling high-resolution images directly at the pixel level.

  1. Stage One: Learning the Visual Codebook
    • A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image into a 32×32 grid of image tokens, each drawn from a vocabulary of 8192 entries to minimize information loss.
    • The dVAE training maximizes the evidence lower bound (ELB) using the Adam optimizer. Certain hyperparameters, including the gumbel-softmax relaxation temperature and the step size, are annealed following a cosine schedule to ensure stable training. A minimal sketch of this tokenization step appears after this list.
  2. Stage Two: Learning the Prior
    • Approximately 250 million text-image pairs are collected and used to train a 12-billion parameter sparse transformer model.
    • Each image is encoded into 1024 image tokens, which are concatenated with up to 256 BPE-encoded text tokens and modeled jointly as a single autoregressive sequence.
    • The model uses different self-attention masks, including row, column, and convolutional masks, to handle text-to-text, image-to-text, and image-to-image attention (a simplified sketch of the combined token stream and its masking follows this list).
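
The following is a minimal, illustrative PyTorch sketch of the Stage One tokenization idea: an encoder maps a 256×256 image to a 32×32 grid of logits over an 8192-entry codebook, discretized with the gumbel-softmax relaxation whose temperature is annealed during training. The class name ToyDVAETokenizer and the shallow three-layer encoder are hypothetical simplifications, not the paper's architecture (which also trains a decoder to reconstruct the image from the tokens).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDVAETokenizer(nn.Module):
    """Illustrative (not the paper's) encoder mapping a 256x256 RGB image
    to a 32x32 grid of discrete tokens from an 8192-entry codebook."""

    def __init__(self, vocab_size=8192, hidden=64):
        super().__init__()
        # Three stride-2 convolutions give the 8x spatial downsampling (256 -> 32).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, vocab_size, 4, stride=2, padding=1),
        )

    def forward(self, images, temperature=1.0, hard=False):
        logits = self.encoder(images)                  # (B, 8192, 32, 32)
        # Gumbel-softmax relaxation over the codebook; in the paper the
        # temperature is annealed toward a small value on a cosine schedule.
        soft_one_hot = F.gumbel_softmax(logits, tau=temperature, hard=hard, dim=1)
        tokens = soft_one_hot.argmax(dim=1)            # (B, 32, 32) integer token ids
        return tokens

tokenizer = ToyDVAETokenizer()
fake_batch = torch.randn(2, 3, 256, 256)
tokens = tokenizer(fake_batch, temperature=0.5)
print(tokens.shape)  # torch.Size([2, 32, 32])
```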

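Below is a small sketch, under similarly illustrative assumptions, of how Stage Two's single data stream can be assembled: up to 256 text tokens are padded and prepended to the 1024 flattened image tokens, and a dense causal mask stands in for the paper's sparse row, column, and convolutional attention masks. The helper build_stream and the token ids are hypothetical.

```python
import torch

TEXT_LEN, IMAGE_LEN = 256, 1024          # token budget described in the paper
SEQ_LEN = TEXT_LEN + IMAGE_LEN           # 1280 positions modeled autoregressively

def build_stream(text_tokens, image_tokens, pad_id=0):
    """Concatenate (padded) text tokens with the 32*32 = 1024 image tokens
    into one left-to-right stream. Illustrative helper, not the paper's code."""
    padded_text = torch.full((TEXT_LEN,), pad_id, dtype=torch.long)
    padded_text[: text_tokens.numel()] = text_tokens
    return torch.cat([padded_text, image_tokens.flatten()])

# A dense causal mask stands in for the paper's sparse row/column/convolutional
# attention masks: each position attends only to earlier positions, so every
# image position can attend to all text positions.
causal_mask = torch.tril(torch.ones(SEQ_LEN, SEQ_LEN)).bool()

text = torch.randint(1, 16384, (40,))     # 40 BPE text tokens (hypothetical ids)
image = torch.randint(0, 8192, (32, 32))  # dVAE token grid from Stage One
stream = build_stream(text, image)
print(stream.shape, causal_mask.shape)    # torch.Size([1280]) torch.Size([1280, 1280])
```
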
Results and Evaluation

The paper reports strong zero-shot performance on the MS-COCO dataset. Comparison with previous models yields the following quantitative results:

  • The model achieves a Fréchet Inception Distance (FID) score within 2 points of the best prior approaches, despite never being trained on MS-COCO's captions (the metric's definition is sketched after this list).
  • In human evaluations, the generated images were preferred over prior models 90% of the time for realism and 93% for matching captions.
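
For reference, FID is the standard Fréchet distance between Gaussian fits to Inception features of real and generated images. The sketch below shows the generic computation, not the paper's evaluation code, with small random arrays standing in for the 2048-dimensional Inception-v3 activations used in practice.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_gen):
    """Generic FID between two (N, D) arrays of Inception features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Usage with random stand-in features (real evaluations use Inception-v3 activations):
rng = np.random.default_rng(0)
fid = frechet_inception_distance(rng.normal(size=(500, 64)),
                                 rng.normal(size=(500, 64)))
print(round(fid, 2))
```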

Moreover, the paper highlights the model's capacity for complex tasks like zero-shot image-to-image translation, a feature traditionally requiring dedicated architectures. This capability emphasizes the versatility and potential scalability benefits of the proposed method.

Implications and Future Work

The findings carry significant theoretical and practical implications.

  • Theoretical Implications:
    • The integration of text and image tokens into a unified stream simplifies the model architecture, possibly inspiring future research to extend this approach to other multimodal tasks.
    • The model's capacity for combinatorial and abstract generalization hints at its potential for higher-level cognitive tasks.
  • Practical Implications:
    • The model's zero-shot performance, achieved without fine-tuning on target datasets such as MS-COCO, suggests substantial efficiency gains for real-world applications.
    • This approach could reduce the need for large, labeled datasets, making advanced generative models more accessible to different domains with limited data.

Future Developments

  • Fine-tuning remains a promising direction, especially for specialized distributions on which the paper reports weaker zero-shot performance.
  • Further research could investigate the scalability limits and potential optimizations to mitigate the computational demands of large autoregressive transformers.

Conclusion

This paper demonstrates a pioneering advancement in text-to-image generation via a transformer-based approach, capable of performing competitively in a zero-shot setting. The amalgamation of text and image tokens and the model's adaptability point toward a productive avenue for future research in large-scale multimodal learning. While challenges remain in computational efficiency, the implications for both practical applications and theoretical advancements are profound, underscoring the immense potential of scaling contemporary neural architectures.

While the paper underscores the necessity for large datasets and extensive computational resources, the demonstrated performance gains and the emergent capabilities suggest that scaling could be a pivotal factor in further advancing text-to-image generative tasks.

Authors (8)
  1. Aditya Ramesh (21 papers)
  2. Mikhail Pavlov (15 papers)
  3. Gabriel Goh (7 papers)
  4. Scott Gray (11 papers)
  5. Chelsea Voss (4 papers)
  6. Alec Radford (22 papers)
  7. Mark Chen (15 papers)
  8. Ilya Sutskever (58 papers)
Citations (4,186)