Vector-Quantized Image Modeling with Improved VQGAN
The paper "Vector-quantized Image Modeling with Improved VQGAN" offers a thorough investigation into the field of image modeling using a two-stage approach known as Vector-quantized Image Modeling (VIM). This research is particularly notable for its strategic advancements over the traditional VQGAN framework. Such advancements aim to improve both the efficiency and quality of vector-quantized image tasks through architectural innovations and refined codebook learning strategies.
Core Contributions
The central idea is to enhance the existing VQGAN model by using Vision Transformers (ViTs) for both encoding and decoding, yielding a model termed ViT-VQGAN. This swap exploits the ViT's favorable scaling with data and compute, compared to conventional convolutional networks. The paper introduces several improvements over the baseline VQGAN:
- ViT-Enhanced Architecture: Replacing the CNN encoder and decoder of VQGAN with ViT blocks substantially improves throughput and image reconstruction quality.
- Enhanced Codebook Learning: The paper improves codebook usage via factorized codes (nearest-neighbor lookup in a low-dimensional projected space) and ℓ2-normalized codes, which stabilize training and increase the diversity of the quantized representations (see the quantizer sketch after this list).
- Diverse and Effective Loss Functions: Training combines a logit-Laplace loss, an ℓ2 loss, a perceptual loss, and an adversarial loss, with weights tuned through hyper-parameter sweeps on datasets such as ImageNet, CelebA-HQ, and FFHQ (a loss sketch follows the list).
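A minimal PyTorch sketch of the quantization step described above: encoder outputs are projected into a low-dimensional lookup space (factorized codes), and both the projected latents and the codebook entries are ℓ2-normalized before nearest-neighbor lookup. Module names, dimensions, and the straight-through trick shown here are illustrative assumptions, not the authors' implementation; the usual codebook and commitment losses are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedL2Quantizer(nn.Module):
    """Nearest-neighbor quantizer with factorized, l2-normalized codes (sketch)."""

    def __init__(self, num_codes=8192, code_dim=32, encoder_dim=768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.project_in = nn.Linear(encoder_dim, code_dim)    # factorization: 768 -> 32
        self.project_out = nn.Linear(code_dim, encoder_dim)   # back up for the decoder

    def forward(self, z_e):
        # z_e: encoder outputs of shape (batch, num_tokens, encoder_dim)
        z = F.normalize(self.project_in(z_e), dim=-1)         # l2-normalize latents
        codes = F.normalize(self.codebook.weight, dim=-1)     # l2-normalize codebook
        sim = z @ codes.t()                                   # cosine similarity to every code
        ids = sim.argmax(dim=-1)                              # (batch, num_tokens) token ids
        z_q = F.normalize(self.codebook(ids), dim=-1)
        z_q = z + (z_q - z).detach()                          # straight-through gradient to the encoder
        return self.project_out(z_q), ids

# Usage on dummy encoder outputs (e.g. a 32x32 token grid from a ViT encoder).
quantizer = FactorizedL2Quantizer()
z_q, token_ids = quantizer(torch.randn(2, 32 * 32, 768))
print(z_q.shape, token_ids.shape)  # torch.Size([2, 1024, 768]) torch.Size([2, 1024])
```

Because both the latents and the codes are unit-normalized, picking the highest cosine similarity coincides with picking the Euclidean nearest neighbor, which is part of what keeps codebook usage high.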
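The loss combination can likewise be sketched as a simple weighted sum. The weights and the stand-in implementations of the perceptual and logit-Laplace terms below are placeholders (plain pixel losses), not the paper's exact formulation, which uses a feature-based perceptual loss and a logit-Laplace reconstruction term with dataset-specific weightings.

```python
import torch
import torch.nn.functional as F

def hinge_g_loss(fake_logits):
    """Generator-side hinge loss, one common choice for the adversarial term."""
    return -fake_logits.mean()

def perceptual_loss(recon, target):
    # Stand-in for a feature-space (e.g. VGG-based) perceptual loss;
    # a plain l1 keeps the sketch self-contained.
    return F.l1_loss(recon, target)

def logit_laplace_loss(recon, target):
    # Stand-in for the logit-Laplace reconstruction term; approximated by l1 here.
    return F.l1_loss(recon, target)

def stage1_loss(recon, target, fake_logits,
                w_l2=1.0, w_perc=0.1, w_lap=0.1, w_adv=0.1):
    """Weighted sum of reconstruction and adversarial terms (illustrative weights)."""
    return (w_l2 * F.mse_loss(recon, target)
            + w_perc * perceptual_loss(recon, target)
            + w_lap * logit_laplace_loss(recon, target)
            + w_adv * hinge_g_loss(fake_logits))

# Dummy usage: a reconstruction/target pair and discriminator logits for the reconstruction.
recon, target = torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256)
print(stage1_loss(recon, target, torch.randn(2, 1)).item())
```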
The experiments yield strong quantitative results, clearly surpassing previous models such as the vanilla VQGAN on standard metrics like Inception Score (IS) and Fréchet Inception Distance (FID). For class-conditional synthesis on ImageNet, the improved model reaches an IS of 175.1 and an FID of 4.17, versus an IS of 70.6 and an FID of 17.04 for the baseline, a substantial uplift attributable to the improvements above.
Implications and Future Directions
The advancements introduced in this work emphasize the potential of ViT-based architectures for tasks traditionally dominated by CNNs, particularly high-resolution image synthesis and understanding. The throughput gains of ViT-VQGAN reduce the computational cost of image tokenization, making larger and more complex image datasets and modeling scenarios tractable.
Moreover, the insights gained on codebook learning could spur further research into optimizing quantized representations for other visual tasks, potentially extending to video modeling or 3D image synthesis. The paper's linear-probe experiments also indicate that the learned representations transfer well to discriminative tasks, suggesting the approach is useful as an unsupervised pretraining method beyond generation (a minimal probe sketch follows).
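For reference, a linear probe simply trains a linear classifier on features from a frozen pretrained model. The sketch below uses random tensors as stand-ins for those features, with hypothetical sizes; in the paper, the features come from the pretrained token model rather than being generated this way.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; in practice the features would come from a frozen pretrained backbone.
feature_dim, num_classes = 1024, 1000
probe = nn.Linear(feature_dim, num_classes)          # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(features, labels):
    """One training step; features are detached so the backbone stays frozen."""
    loss = nn.functional.cross_entropy(probe(features.detach()), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for frozen-backbone features and class labels.
features, labels = torch.randn(8, feature_dim), torch.randint(0, num_classes, (8,))
print(probe_step(features, labels))
```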
While the work makes substantial strides in improving image modeling efficacy, it also underscores the intrinsic challenges linked to biases within widely-used datasets such as CelebA-HQ and ImageNet. The development of more controlled and ethical datasets remains an important avenue for future exploration to mitigate biases influencing model outcomes.
Conclusion
This paper presents a significant step forward in the domain of vector-quantized image modeling. By employing transformer models within the VQGAN framework, the researchers have successfully demonstrated enhancements in image quality, efficiency, and representation capabilities. Future research based on these findings could focus on extending these improvements to broader applications and addressing ethical concerns in dataset and model biases.