Hierarchical Variational Autoencoder Network Architecture and Training Strategies
The research paper presents a modified hierarchical Variational Autoencoder (VAE) architecture, with an emphasis on balancing network complexity and training efficiency. The proposed model combines a deterministic encoder, a stochastic inference network, and a generator/prior network, structured to process data efficiently across multiple resolutions.
Network Architecture
The architecture follows a hierarchical top-down approach in the spirit of prior work by Sonderby et al. and Kingma et al., aiming to improve the practicality and efficiency of hierarchical VAEs. The encoder accepts data at the full input resolution and progressively downscales it through convolution operations to a 1x1 resolution; the generator network mirrors this path, upsampling from the 1x1 representation back to the original resolution.
Key components of the architecture include:
- Encoder: This module builds activations at each resolution from bottleneck residual blocks. Using 3x3 convolutions and GELU nonlinearities, it downscales the data while preserving the features needed for inference (see the first sketch following this list).
- Inference and Generator Networks: These networks share parameters to reduce model complexity. Each level of the top-down path produces both the prior and the approximate posterior for that level's latent variables, which helps the model learn expressive latent representations.
- Parameter Sharing: A key feature of the architecture is the re-use of posterior-network parameters to produce the prior distributions, a technique that improves parameter efficiency and can boost performance (illustrated in the second sketch following this list).
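The encoder bullet above can be made concrete with a minimal PyTorch-style sketch of a bottleneck residual block built from 3x3 convolutions and GELU activations. The channel counts, bottleneck ratio, and pooling choice are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """Bottleneck residual block: squeeze channels with a 1x1 convolution,
    process with 3x3 convolutions and GELU, expand back, and add the skip."""
    def __init__(self, channels: int, bottleneck_ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * bottleneck_ratio))
        self.block = nn.Sequential(
            nn.GELU(),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)

# Between resolutions the encoder can downscale with, e.g., nn.AvgPool2d(2),
# repeating until the feature map reaches a 1x1 spatial resolution.
```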
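The parameter-sharing idea in the last two bullets can be sketched as a top-down block whose shared trunk feeds both a prior head and a posterior head, with only the posterior additionally conditioned on the bottom-up encoder activation. The block layout, head shapes, and conditioning below are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TopDownBlock(nn.Module):
    """One level of the top-down path. The trunk parameters are reused by both
    the prior and the posterior; only the posterior sees the encoder features."""
    def __init__(self, channels: int, z_dim: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.GELU(), nn.Conv2d(channels, channels, 3, padding=1))
        self.prior_head = nn.Conv2d(channels, 2 * z_dim, 1)          # mean, log-std
        self.posterior_head = nn.Conv2d(2 * channels, 2 * z_dim, 1)  # conditioned on encoder
        self.project_z = nn.Conv2d(z_dim, channels, 1)

    def forward(self, h_topdown: torch.Tensor, h_encoder: torch.Tensor = None):
        shared = self.trunk(h_topdown)  # computation shared by prior and posterior
        p_mu, p_logstd = self.prior_head(shared).chunk(2, dim=1)
        if h_encoder is not None:
            # Inference path: posterior conditioned on shared features + encoder activation.
            q_mu, q_logstd = self.posterior_head(
                torch.cat([shared, h_encoder], dim=1)).chunk(2, dim=1)
            z = q_mu + torch.exp(q_logstd) * torch.randn_like(q_mu)
            posterior = (q_mu, q_logstd)
        else:
            # Generation path: sample the latent from the prior.
            z = p_mu + torch.exp(p_logstd) * torch.randn_like(p_mu)
            posterior = None
        h_next = h_topdown + shared + self.project_z(z)
        return h_next, (p_mu, p_logstd), posterior
```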
Training Procedure
The paper outlines a sophisticated training methodology, focusing on stability and convergence:
- Loss Adjustment: The KL-divergence term of the loss is modified so that, early in training, the approximate posterior is regularized against a standard normal distribution. This adjustment stabilizes early parameter updates, shifting the focus toward convergence as training progresses (see the first sketch following this list).
- Optimization Strategies: Adam and AdamW optimizers are used, alongside techniques such as softplus activations and gradient-norm clipping to manage potential instabilities.
- Skip Gradient Updates: A pragmatic choice is to skip gradient updates whose gradient norm is excessively large, keeping training smooth (see the second sketch following this list).
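One way to read the loss adjustment above is to compute the posterior's KL divergence against a standard normal early in training and against the learned prior afterwards. The sketch below assumes diagonal Gaussians and an illustrative step-based switch; the paper's exact schedule is not reproduced here.

```python
import torch

def gaussian_kl(q_mu, q_logstd, p_mu, p_logstd):
    """Elementwise KL( N(q_mu, q_std^2) || N(p_mu, p_std^2) ) for diagonal Gaussians."""
    q_var, p_var = torch.exp(2 * q_logstd), torch.exp(2 * p_logstd)
    return p_logstd - q_logstd + (q_var + (q_mu - p_mu) ** 2) / (2 * p_var) - 0.5

def kl_term(q_mu, q_logstd, p_mu, p_logstd, step: int, warmup_steps: int = 10_000):
    """Regularize the posterior toward N(0, I) early on, then toward the learned prior."""
    if step < warmup_steps:
        zeros = torch.zeros_like(q_mu)
        return gaussian_kl(q_mu, q_logstd, zeros, zeros).sum()
    return gaussian_kl(q_mu, q_logstd, p_mu, p_logstd).sum()
```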
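The optimization and skip-update points can be combined into a single training step: clip the global gradient norm, and skip the optimizer step entirely when the pre-clipping norm is excessively large or non-finite. The thresholds and the AdamW usage below are illustrative placeholders, not the paper's reported values.

```python
import torch
from torch.nn.utils import clip_grad_norm_

def training_step(model, optimizer, loss, clip_norm: float = 100.0, skip_norm: float = 400.0):
    """Backpropagate, clip the global gradient norm, and skip overly large updates."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    total_norm = clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    if torch.isfinite(total_norm) and total_norm < skip_norm:
        optimizer.step()      # apply the (possibly clipped) update
        return True
    return False              # gradient too large or non-finite: skip this update

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```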
The networks are evaluated on the CIFAR-10, ImageNet-32, and ImageNet-64 datasets. An important methodological choice is the use of Polyak averaging of the training weights during evaluation, which improves stability (a minimal sketch follows).
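Polyak averaging can be implemented as an exponential moving average of the model weights that is updated after every optimizer step and used at evaluation time. The decay value below is a common choice, not the paper's reported setting.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999):
    """Polyak averaging: ema_weights <- decay * ema_weights + (1 - decay) * current_weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_model = copy.deepcopy(model)   # updated with update_ema(...) after each step,
#                                    # and evaluated in place of `model` at test time.
```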
Hyperparameter Specification
The paper meticulously documents the hyperparameters employed in their experiments, detailing configurations for different datasets:
- For CIFAR-10, the configuration includes a network width of 384, blocks arranged from high to low resolution, and training for 650 epochs.
- The ImageNet configurations demonstrate scalability, with network widths of up to 1024 and longer training runs on multiple GPUs, highlighting the compute-intensive nature of these larger datasets.
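As a simple illustration of how these per-dataset settings might be organized, the sketch below wraps them in a small configuration object. Only the width and epoch values quoted above are taken from the text; the remaining fields one would normally include (block counts per resolution, batch size, learning rate) are intentionally omitted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HVAEConfig:
    dataset: str
    width: int                    # channel width of the network
    epochs: Optional[int] = None  # training epochs, where quoted above

cifar10 = HVAEConfig(dataset="CIFAR-10", width=384, epochs=650)
imagenet64 = HVAEConfig(dataset="ImageNet-64", width=1024)  # width reported "up to 1024"
```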
Implications and Future Directions
The proposed architecture and training approach strike a balance between complexity and expressiveness in hierarchical VAEs. By focusing on parameter sharing and structured multi-resolution processing, the model offers a scalable approach to complex datasets. Future research could explore further improvements to parameter efficiency or alternative nonlinear activation functions to enhance robustness. In practice, these methods are relevant to domains that require scalable generative modeling, such as image synthesis and anomaly detection.
In conclusion, the paper contributes to the ongoing work on hierarchical VAEs by proposing an architecture that judiciously integrates its components for efficient scaling and training, and it underlines the importance of attending to both architectural design and training methodology.