Generating Diverse High-Fidelity Images with VQ-VAE-2
This paper explores the application of Vector Quantized Variational AutoEncoders (VQ-VAE) to large-scale image generation. By enhancing and scaling the autoregressive priors used in VQ-VAE, the authors achieve greater coherence and fidelity in generated images than previously possible. Straightforward feed-forward encoder and decoder networks make VQ-VAE models well suited to scenarios where encoding and decoding speed is critical. More importantly, sampling in a compressed latent space rather than pixel space makes generating large images roughly an order of magnitude faster.
Introduction
The paper begins by situating deep generative models in context: they have improved substantially thanks to advances in architecture and computational power, with applications spanning super-resolution, domain editing, artistic manipulation, and generative tasks in text, speech, and music.
Generative models are primarily divided into likelihood-based methods (e.g., VAEs, flow-based models, autoregressive models) and implicit generative models such as GANs. GANs have shown remarkable results in high-quality, high-resolution image generation, but they suffer from issues such as mode collapse and the difficulty of evaluating them with meaningful metrics. Likelihood-based methods, by contrast, cover the data distribution more fully and are easier to evaluate, but maximizing likelihood directly in pixel space spends model capacity on imperceptible high-frequency details, which can hurt sample quality.
Model Architecture
VQ-VAE models encode images into a discrete latent space by vector-quantizing intermediate autoencoder representations. The priors over these discrete representations are modeled with PixelCNN networks augmented with self-attention, in the style of PixelSNAIL.
VQ-VAE Mechanism
The VQ-VAE framework can be likened to a communication system involving an encoder and a decoder with a shared codebook. The encoder maps input data to continuous latent vectors, which are quantized by finding the closest prototype vector in the codebook; the decoder reconstructs the input from these quantized vectors. Training minimizes the reconstruction error together with a codebook loss, which pulls the prototype vectors toward the encoder outputs, and a commitment loss, which keeps the encoder outputs close to their assigned prototypes. Gradients pass through the non-differentiable quantization step via a straight-through estimator, and exponential moving average (EMA) updates of the codebook help stabilize optimization.
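The quantization step and the EMA codebook update can be sketched in a few lines of numpy. This is an illustrative simplification, not the paper's implementation: the gradient flow (straight-through estimator) is omitted, and all array shapes are hypothetical.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook prototype.

    z_e: (N, D) encoder outputs; codebook: (K, D) prototype vectors.
    Returns the quantized vectors and the chosen codebook indices.
    """
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def vq_losses(z_e, z_q, beta=0.25):
    """Codebook loss + commitment loss (stop-gradients omitted in this sketch)."""
    codebook_loss = ((z_q - z_e) ** 2).mean()
    commitment_loss = beta * ((z_e - z_q) ** 2).mean()
    return codebook_loss + commitment_loss

def ema_update(codebook, counts, sums, z_e, idx, decay=0.99):
    """EMA codebook update: track per-code usage counts and summed inputs,
    then set each prototype to the running mean of the vectors assigned to it."""
    K = codebook.shape[0]
    onehot = np.eye(K)[idx]                              # (N, K) assignments
    counts[:] = decay * counts + (1 - decay) * onehot.sum(axis=0)
    sums[:] = decay * sums + (1 - decay) * onehot.T @ z_e
    codebook[:] = sums / np.maximum(counts[:, None], 1e-5)
```

In practice the EMA update replaces the codebook-loss gradient step, leaving only the reconstruction and commitment terms in the backpropagated objective.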
Hierarchical VQ-VAE
The paper introduces a hierarchical VQ-VAE to model large images effectively. Two levels of latent hierarchies separate global information (e.g., shapes) and local information (e.g., textures). The encoder transforms the image into lower resolution latent maps, which are then passed to a multi-scale feed-forward decoder for reconstruction.
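The shape bookkeeping of the two-level hierarchy can be illustrated with a toy sketch. The downsampling factors below match the paper's 256x256 ImageNet setup (a 64x64 bottom latent map and a 32x32 top latent map), but the use of mean pooling in place of the learned convolutional encoders is purely for illustration.

```python
import numpy as np

def encode_hierarchy(image, bottom_stride=4, top_stride=2):
    """Toy two-level encoder: downsample by mean pooling instead of learned
    convolutions. A 256x256 image yields a 64x64 bottom map (local detail)
    and a 32x32 top map (global structure), as in the paper's setup."""
    H, W, C = image.shape
    bottom = image.reshape(H // bottom_stride, bottom_stride,
                           W // bottom_stride, bottom_stride, C).mean(axis=(1, 3))
    h, w, _ = bottom.shape
    top = bottom.reshape(h // top_stride, top_stride,
                         w // top_stride, top_stride, C).mean(axis=(1, 3))
    return top, bottom
```

Both latent maps would then be quantized against their own codebooks, and the decoder would receive both to reconstruct the image.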
PixelCNN Priors
To enhance image generation, priors over the latent codes are modeled using PixelCNN networks. The top-level prior captures global structures using layers with self-attention, while the bottom-level prior handles local details. This decomposition allows training larger models and distributing computational resources effectively, leading to realistic sample reconstruction.
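The sampling procedure these priors enable is ancestral: each discrete code is drawn conditioned on all previously drawn codes. A minimal sketch, with a stand-in `logits_fn` in place of a trained PixelCNN:

```python
import numpy as np

def sample_autoregressive(logits_fn, length, num_codes, rng):
    """Ancestral sampling over a flattened grid of discrete codes.

    logits_fn(prefix) stands in for a trained PixelCNN prior: it maps the
    codes drawn so far to unnormalized log-probabilities over the next code.
    """
    codes = np.zeros(length, dtype=int)
    for i in range(length):
        logits = logits_fn(codes[:i])        # (num_codes,)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        codes[i] = rng.choice(num_codes, p=probs)
    return codes
```

In the full pipeline, the top-level codes are sampled first, the bottom-level prior is sampled conditioned on them, and the decoder maps both code maps back to pixels.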
Experimental Insights
Quantitative Evaluation
The paper presents an array of metrics for evaluating model performance, including Negative Log-Likelihood (NLL), Mean Squared Error (MSE), Precision-Recall, Classification Accuracy Score (CAS), and the widely used Inception Score (IS) and Fréchet Inception Distance (FID). The model effectively balances sample quality and diversity. NLL values for the VQ-VAE priors show minimal differences between training and validation sets, indicating robust generalization without overfitting.
High-Resolution Image Generation
VQ-VAE-2's capabilities are tested on high-resolution face images from the FFHQ dataset at 1024x1024 resolution. The model successfully captures long-range dependencies, maintaining consistent attributes like eye color, which is challenging due to the large spatial distances.
Comparison with State-of-the-Art Methods
By benchmarking against BigGAN, the proposed hierarchical VQ-VAE demonstrates competitive generation quality and greater diversity in various classes. Precision-Recall metrics confirm the model's adeptness at covering data distribution modes while maintaining high sample quality.
Conclusion
This paper emphasizes VQ-VAE-2's efficacy for generating high-resolution images with improved fidelity and diversity. The hierarchical latent structure and powerful prior models enhance VQ-VAE's overall performance, underscoring its potential for diverse applications in image generation tasks. Future work might explore further scaling and enhancing priors or adapting the approach to other data modalities and tasks, paving the way for more versatile and efficient generative models.