CogView: Mastering Text-to-Image Generation via Transformers
The paper "CogView: Mastering Text-to-Image Generation via Transformers" introduces CogView, a 4-billion-parameter Transformer model designed to enhance text-to-image generation, a challenging task within the field of machine learning that requires robust generative models alongside cross-modal understanding. The work builds upon recent advancements in auto-regressive generative models and the Vector Quantized Variational AutoEncoders (VQ-VAE) framework to achieve superior results over previous models, including GAN-based approaches and DALL-E.
Methodology
CogView combines a Transformer architecture with a VQ-VAE tokenizer that compresses images into sequences of discrete tokens. The model is trained on a large dataset of 30 million high-quality Chinese text-image pairs. The training process introduces techniques such as Precision Bottleneck Relaxation (PB-Relax) and Sandwich LayerNorm (Sandwich-LN) to stabilize training, which is notably difficult owing to the heterogeneity of the data and the numerical-precision issues inherent in large-scale models.
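To make the tokenization step concrete, the following is a minimal PyTorch sketch of how a text prompt and a VQ-VAE-encoded image can be packed into a single token sequence for autoregressive training. The vocabulary sizes, the separator token, and the helper names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Sketch of the single-stream input a CogView-style model consumes:
# text tokens and image tokens are concatenated into one sequence and
# the Transformer is trained autoregressively over the whole thing.
# Vocabulary sizes and the separator id below are assumptions.

TEXT_VOCAB = 50_000              # e.g. a text tokenizer vocabulary (assumed size)
IMAGE_VOCAB = 8_192              # e.g. the VQ-VAE codebook size (assumed)
BOI = TEXT_VOCAB + IMAGE_VOCAB   # hypothetical "begin-of-image" separator id


def build_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens, a separator, and flattened image codes.

    text_ids:    (T,) token ids from the text tokenizer
    image_codes: (H, W) discrete codes from the VQ-VAE encoder (e.g. 32x32)
    """
    image_ids = image_codes.flatten() + TEXT_VOCAB   # shift codes into a joint vocabulary
    sep = torch.tensor([BOI], dtype=text_ids.dtype)
    return torch.cat([text_ids, sep, image_ids])


def autoregressive_loss(logits: torch.Tensor, seq: torch.Tensor) -> torch.Tensor:
    """Next-token prediction over the joint sequence (logits: (L, V), seq: (L,))."""
    return nn.functional.cross_entropy(logits[:-1], seq[1:])
```

At generation time the model would be fed only the text tokens and the separator, and the image tokens would be sampled one at a time before being decoded back to pixels by the VQ-VAE decoder.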
Key Findings
- Performance Metrics: CogView achieves the best Fréchet Inception Distance (FID) on the MS COCO dataset among compared models, indicating higher image generation quality than its predecessors, including DALL-E and GAN-based approaches.
- Versatility in Downstream Tasks: Beyond text-to-image generation, the model adapts to tasks such as style learning, super-resolution, image captioning, and text-image reranking, largely through task-specific finetuning.
- Novel Techniques for Stability: PB-Relax and Sandwich-LN substantially stabilize the training of large Transformers by preventing the value overflows that otherwise surface as NaN losses; a sketch of Sandwich-LN follows this list.
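As one illustration of the stabilization idea, below is a minimal PyTorch sketch of a Transformer block with Sandwich LayerNorm: each residual branch is wrapped in a LayerNorm both before and after the sublayer, so the values added back to the residual stream stay bounded. The block width, head count, and sublayer details are illustrative assumptions, and PB-Relax (which rescales attention scores to avoid overflow) is not shown.

```python
import torch
import torch.nn as nn


class SandwichBlock(nn.Module):
    """Transformer block with Sandwich LayerNorm (illustrative sketch)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # One LayerNorm before and one after each residual branch.
        self.ln_in_attn, self.ln_out_attn = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ln_in_ffn, self.ln_out_ffn = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch: pre-LN -> attention -> post-LN -> residual add.
        h = self.ln_in_attn(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.ln_out_attn(h)
        # Feed-forward branch with the same sandwich structure.
        x = x + self.ln_out_ffn(self.ffn(self.ln_in_ffn(x)))
        return x
```

The extra output LayerNorms are what distinguish this structure from a standard pre-LN block: they keep the magnitude of each branch's contribution in check, which is the property the paper exploits to avoid overflow during mixed-precision training.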
Implications and Future Directions
The success of CogView in text-to-image generation has noteworthy implications for both practical applications and theoretical advancements. In practical terms, the model can be utilized for style-based image synthesis, industrial fashion design, and enhancement of image resolution, offering tools for creative industries and commercial enterprises. Theoretically, the exploration of stability techniques like Sandwich-LN might inspire further research into efficient training of large-scale generative models, potentially paving the way for even more powerful AI architectures.
Looking forward, challenges remain, notably the high computational cost of autoregressive generation and the blurriness introduced by VQ-VAE compression. Reducing this cost and these artifacts without compromising output quality will be a critical area of research.
Conclusion
CogView represents a substantial advancement in text-to-image generation by integrating the strengths of modern Transformer architectures with novel stabilization techniques. Its ability to outperform existing models on complex datasets like MS COCO and to adapt to various downstream tasks positions it as a significant contribution to AI-driven generative art and beyond, with applications extending into multiple creative and commercial domains. Nonetheless, ongoing work will need to focus on improving its computational efficiency and reducing the blurriness of generated outputs.