NVAE: A Deep Hierarchical Variational Autoencoder (2007.03898v3)

Published 8 Jul 2020 in stat.ML, cs.CV, and cs.LG

Abstract: Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels. The source code is available at https://github.com/NVlabs/NVAE .

Authors (2)
  1. Arash Vahdat (69 papers)
  2. Jan Kautz (215 papers)
Citations (827)

Summary

Nouveau VAE: A Deep Hierarchical Variational Autoencoder

Introduction

Variational Autoencoders (VAEs) have established themselves as significant players in the field of deep generative models. However, despite advantages such as efficient sampling and readily accessible encoding networks, VAEs frequently underperform alternatives like normalizing flows and autoregressive models. This paper, "NVAE: A Deep Hierarchical Variational Autoencoder" by Vahdat and Kautz, departs from the statistical improvements that are the focus of most VAE research, pivoting instead to the design of neural architectures aimed at enhancing VAE performance on image generation tasks.

Architectural Innovation and Methods

NVAE is a deep hierarchical VAE whose generative model is built on depthwise separable convolutions and batch normalization (BN). Unlike prior VAE models, its training is stabilized by spectral regularization, and it uses a residual parameterization of the Normal distributions in its approximate posterior, which mitigates the training instabilities inherent to deep hierarchical structures.
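
As a rough illustration of how these pieces fit together, the following is a minimal PyTorch sketch of a generative residual cell in this style. It is not the authors' exact cell: the class name, expansion factor, and kernel size are assumptions, and the paper's squeeze-and-excitation layer is omitted.

```python
import torch
import torch.nn as nn

class DepthwiseResidualCell(nn.Module):
    """Sketch of an NVAE-style generative residual cell: a 1x1 channel
    expansion, a depthwise 5x5 convolution, and a 1x1 projection,
    interleaved with BN and Swish, added back to the input."""

    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.SiLU(),  # Swish activation, paired with BN as in the paper
            nn.Conv2d(hidden, hidden, kernel_size=5, padding=2,
                      groups=hidden, bias=False),  # depthwise 5x5 convolution
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # 1x1 projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)
```

Because the depthwise convolution operates on the expanded channels, such a cell widens its receptive field with a large kernel while the parameter count remains dominated by the cheap 1x1 convolutions.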

Key architectural innovations of NVAE include:

  1. Depthwise Separable Convolutions: These convolutions rapidly expand the network's receptive field without substantially increasing the parameter count.
  2. Batch Normalization: Contrary to previous state-of-the-art VAE models that avoid BN due to its instability during evaluation, NVAE demonstrates that careful tuning of BN parameters is critical to achieving stable and performant training.
  3. Residual Normal Distributions: These parameterize each approximate posterior as a residual relative to the corresponding prior, which simplifies the Kullback-Leibler divergence term and thereby enhances training stability.
  4. Spectral Regularization: A penalty on the largest singular value of each layer's weight matrix, which bounds the network's smoothness (its Lipschitz constant) and significantly improves training stability for deep hierarchical VAEs; a sketch of both this penalty and the residual parameterization follows the list.
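
To make items 3 and 4 concrete, here is a minimal sketch. The function names and the log-sigma convention are assumptions and may differ from the released code.

```python
import torch
import torch.nn.functional as F

def residual_normal_kl(mu_p, log_sig_p, d_mu, d_log_sig):
    """Elementwise KL(q || p) when the posterior is parameterized relative
    to the prior: p = N(mu_p, sig_p), q = N(mu_p + d_mu, sig_p * d_sig).
    The closed form depends only on the residual terms (scaled by sig_p):
        0.5 * (d_mu^2 / sig_p^2 + d_sig^2 - 1) - log d_sig,
    so a near-zero residual gives near-zero KL, which eases optimization."""
    sig_p = torch.exp(log_sig_p)
    return 0.5 * ((d_mu / sig_p) ** 2 + torch.exp(2.0 * d_log_sig) - 1.0) - d_log_sig

def spectral_penalty(weight, n_iters: int = 4):
    """Power-iteration estimate of the largest singular value of a conv or
    linear weight, flattened to a matrix. The iteration count is arbitrary."""
    w = weight.reshape(weight.shape[0], -1)
    u = F.normalize(torch.randn(w.shape[0], device=w.device), dim=0)
    for _ in range(n_iters):
        v = F.normalize(w.t() @ u, dim=0)   # right singular vector estimate
        u = F.normalize(w @ v, dim=0)       # left singular vector estimate
    return torch.dot(u, w @ v)
```

In training, the elementwise KL would be summed over latent dimensions and groups, and the spectral penalty would be accumulated over layers and added to the negative ELBO with a coefficient lambda.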

Experimental Validation

NVAE achieves competitive performance across multiple datasets, including MNIST, CIFAR-10, CelebA 64, and CelebA HQ. In particular, the model attains state-of-the-art results among non-autoregressive likelihood-based generative models, reducing the bits-per-dimension metric on CIFAR-10 to 2.91 and surpassing the previous benchmark of 2.98. Moreover, NVAE scales to generating high-quality images at resolutions up to 256×256 pixels, a notable accomplishment for VAEs.
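
For reference, bits per dimension is just the test negative log-likelihood rescaled so that scores are comparable across image sizes. A minimal conversion helper (the function name is mine):

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood measured in nats into
    bits per dimension; for CIFAR-10, num_dims = 3 * 32 * 32 = 3072."""
    return nll_nats / (num_dims * math.log(2.0))
```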

Ablative Analyses

Several ablation studies further elucidate the contributions of individual components:

  • Normalization and Activation Functions: Experiments reveal that BN combined with the Swish activation function outperforms alternative configurations like weight normalization (WN) with ELU.
  • Residual Cells: Evaluating where depthwise separable cells are placed, the paper finds a significant advantage to using them in the generative model but not in the bottom-up encoder.
  • Residual Distributions: Introducing residual distributions improves the encoder's effectiveness at keeping the KL divergence manageable, thereby improving test log-likelihood scores.

Implications and Future Work

The implications of NVAE are both practical and theoretical. Practically, its architectural advances enable the generation of high-quality, high-resolution images, pushing the boundaries of what VAEs can achieve. Theoretically, the research underscores the underappreciated importance of network architecture in overcoming the statistical challenges traditionally associated with VAEs.

NVAE opens avenues for future developments such as integrating more complex flows or exploring neural architecture search techniques to automate the design process. Additionally, the effect of batch normalization on model performance warrants deeper investigation.

Conclusion

NVAE represents a significant stride in VAE research by demonstrating that architectural design can substantially enhance VAE performance. This focus on network architecture, coupled with innovative solutions for training stability and scalability, positions NVAE as a key benchmark in the ongoing development of deep generative models. The paper makes a compelling case for redirecting some research effort toward neural architecture design for VAEs, potentially unlocking new capabilities and applications within deep generative learning.