
VQ-VAE-2: Multi-Scale Hierarchical Generative Models

Updated 2 June 2026
  • VQ-VAE-2 is a multi-scale hierarchical generative model that applies vector quantization to continuous latent spaces for efficient, high-fidelity image reconstruction.
  • It leverages multiple quantization layers and autoregressive priors to capture both global structures and local details in complex data distributions.
  • Advancements such as improved codebook utilization and Bayesian training regimes mitigate collapse and boost sample diversity, making it competitive with GANs.

VQ-VAE-2 (the second-generation Vector Quantized Variational Autoencoder) is a multi-scale hierarchical generative model that applies vector quantization to continuous latent representations within a variational autoencoding framework. Designed to capture both global and local dependencies in data, it combines discrete latent variables with autoregressive priors for high-fidelity image generation and efficient encoding and decoding. The method builds on the original VQ-VAE architecture but introduces multiple quantization layers at distinct spatial scales, enabling reconstruction of complex data distributions with improved mode coverage and sample diversity compared to earlier VAEs and GANs (Razavi et al., 2019).

1. Multi-Scale Hierarchical Architecture

VQ-VAE-2 implements a hierarchical encoder-decoder structure. For image data (e.g., ImageNet 256×256), it utilizes two or more quantization layers operating at progressively coarser spatial resolutions. The process is as follows (Razavi et al., 2019, Takida et al., 2023):

  • The input $x$ is processed by a bottom-up stack of convolutional and pooling layers to generate feature maps:
    • Bottom-level map: $h_{\text{bot}}$ (e.g., $64\times64$ for ImageNet 256)
    • Top-level map: $h_{\text{top}}$ (e.g., $32\times32$)
  • Each level yields a continuous embedding, which is quantized via a codebook:

$$z^l_q(i) = e^l_{k}, \quad k = \arg\min_j \|z^l_e(i) - e^l_j\|_2^2, \quad\text{where } l\in\{\text{bot},\text{top}\}$$

Here $z^l_e(i)$ is the continuous embedding at position $i$ of level $l$, $z^l_q(i)$ is its quantized counterpart, and $e^l_j$ denotes the $j$-th entry in codebook $E_l$ of size $K_l$; all entries of a codebook share one fixed embedding dimension.

  • The decoder reconstructs the input $x$ by combining quantized embeddings from all layers:
    • A coarse reconstruction is first formed from $z^{\text{top}}_q$, then $z^{\text{bot}}_q$ is injected (e.g., via concatenation) before the final reconstruction $\hat{x}$.
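The nearest-codebook rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the codebook values, the 2×2 feature map, and the helper names `squared_distance`, `quantize`, and `quantize_map` are all made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z_e, codebook):
    """Return (index, embedding) of the codebook entry nearest to z_e."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


def quantize_map(feature_map, codebook):
    """Quantize every position of a 2-D grid of vectors into code indices."""
    return [[quantize(v, codebook)[0] for v in row] for row in feature_map]


# Toy codebook with K = 3 entries of dimension 2 (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Toy 2x2 map of continuous encoder outputs z_e(i).
feature_map = [
    [[0.1, -0.2], [0.9, 1.2]],
    [[-0.8, 1.7], [0.6, 0.5]],
]

indices = quantize_map(feature_map, codebook)
print(indices)  # [[0, 1], [2, 1]]
```

In a two-level model the same routine would run once per level, each with its own codebook; only the integer index maps need to be stored or modeled by the prior.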

In three-level hierarchies (e.g., FFHQ 1024×1024), the procedure is recursively extended to lower-resolution code maps.
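As a rough check on the two-level ImageNet configuration described above, the arithmetic below counts how many discrete indices represent one 256×256 image. It assumes 8-bit RGB input and a codebook of 512 entries at both levels (the size listed in the perplexity table later in this article); the variable names are illustrative.

```python
import math

image_side, channels, bits_per_channel = 256, 3, 8
bottom_side, top_side = 64, 32
codebook_size = 512  # assumption: same codebook size at both levels

# Spatial downsampling factor of each latent map relative to the image.
bottom_factor = image_side // bottom_side  # 4
top_factor = image_side // top_side        # 8

# Bits needed to address one codebook entry.
bits_per_index = math.ceil(math.log2(codebook_size))  # 9

num_indices = bottom_side ** 2 + top_side ** 2  # 4096 + 1024
latent_bits = num_indices * bits_per_index
pixel_bits = image_side ** 2 * channels * bits_per_channel

ratio = pixel_bits / latent_bits
print(num_indices, latent_bits, pixel_bits, round(ratio, 1))
# 5120 46080 1572864 34.1
```

Under these assumptions the discrete representation is roughly 34 times smaller than the raw pixels, which is why fitting an autoregressive prior over the code maps is far cheaper than modeling pixels directly.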

2. Vector Quantization, Loss Functions, and Training

Each encoder output is quantized to the nearest codebook vector, yielding discrete latents. The model is trained using losses at every scale (Razavi et al., 2019, Takida et al., 2023):

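  • Reconstruction loss:

$$\mathcal{L}_{\text{rec}} = \|x - \hat{x}\|_2^2$$

  • Codebook update ("push" loss) and commitment loss, per level $l$ and position $i$:

$$\mathcal{L}^l_{\text{codebook}}(i) = \|\mathrm{sg}[z^l_e(i)] - z^l_q(i)\|_2^2$$

$$\mathcal{L}^l_{\text{commit}}(i) = \beta\, \|z^l_e(i) - \mathrm{sg}[z^l_q(i)]\|_2^2$$

where $\hat{x}$ is the decoder output, $\beta$ is the commitment weight, and $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.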
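To make the loss terms concrete, the sketch below evaluates them on toy numbers, following the standard VQ-VAE formulation with a commitment weight (here called `beta`). Plain Python has no automatic differentiation, so the stop-gradient only appears in comments: it changes which parameters receive gradients, not the numerical value of each term. All inputs and the function name `vq_losses` are illustrative assumptions.

```python
def sq_norm(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_losses(x, x_hat, z_e, z_q, beta):
    """Return (reconstruction, codebook, commitment, total) loss values.

    z_e and z_q are lists of vectors, one per spatial position.
    """
    rec = sq_norm(x, x_hat)
    # Codebook ("push") loss: the gradient flows only into the codebook,
    # because the encoder output is wrapped in stop-gradient.
    codebook = sum(sq_norm(e, q) for e, q in zip(z_e, z_q))
    # Commitment loss: same distance, but the gradient flows only into
    # the encoder, because the codebook vector is wrapped in stop-gradient.
    commit = beta * sum(sq_norm(e, q) for e, q in zip(z_e, z_q))
    return rec, codebook, commit, rec + codebook + commit


x, x_hat = [1.0, 2.0], [0.5, 2.5]  # toy input and reconstruction
z_e = [[0.0, 1.0]]                 # one latent position, encoder output
z_q = [[0.0, 0.5]]                 # its nearest codebook vector

rec, codebook_loss, commit, total = vq_losses(x, x_hat, z_e, z_q, beta=0.25)
print(rec, codebook_loss, commit, total)  # 0.5 0.25 0.0625 0.8125
```

The codebook and commitment terms differ only by the factor `beta` in value; what distinguishes them in training is the side of the distance that is held fixed.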

The total objective for VQ-VAE-2 is a sum of these terms over all levels and spatial locations. In practice, exponential moving average (EMA) updates are often used for codebooks to improve stability.
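The EMA codebook update mentioned above can be sketched as follows. This is a minimal version of the commonly used scheme, in which each code keeps a running count and a running sum of the encoder outputs assigned to it; the decay `gamma`, the guard `eps`, and the function name `ema_update` are assumptions made for illustration.

```python
def ema_update(counts, sums, assignments, encodings, gamma=0.99, eps=1e-5):
    """One EMA step; returns updated (counts, sums, codebook).

    counts[k]   running number of encoder outputs assigned to code k
    sums[k]     running sum of those encoder outputs
    assignments code index chosen for each encoder output in the batch
    encodings   the encoder outputs themselves
    """
    num_codes, dim = len(counts), len(sums[0])
    batch_counts = [0] * num_codes
    batch_sums = [[0.0] * dim for _ in range(num_codes)]
    for k, z in zip(assignments, encodings):
        batch_counts[k] += 1
        for d in range(dim):
            batch_sums[k][d] += z[d]

    new_counts = [gamma * c + (1 - gamma) * n for c, n in zip(counts, batch_counts)]
    new_sums = [
        [gamma * s + (1 - gamma) * b for s, b in zip(s_row, b_row)]
        for s_row, b_row in zip(sums, batch_sums)
    ]
    # Each code becomes the running mean of the vectors assigned to it.
    codebook = [
        [s / max(c, eps) for s in s_row] for s_row, c in zip(new_sums, new_counts)
    ]
    return new_counts, new_sums, codebook


# Two 1-D codes at 0.0 and 2.0; both batch vectors are assigned to code 0.
counts, sums = [1.0, 1.0], [[0.0], [2.0]]
new_counts, new_sums, codebook = ema_update(
    counts, sums, assignments=[0, 0], encodings=[[1.0], [1.0]], gamma=0.5
)
print(new_counts, codebook)
```

In this toy step code 0 moves toward the encoder outputs it was assigned, while code 1 receives no assignments: its position does not change and its count decays. A code that is never visited therefore never moves, which is one route to the collapse discussed in Section 4.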

VQ-VAE-2 codebooks commonly use $K = 512$ embeddings of $D = 64$ dimensions per level, with commitment weight $\beta = 0.25$.

3. Autoregressive Priors and Sampling

The generative capability of VQ-VAE-2 is achieved by fitting autoregressive priors (PixelCNN or PixelSNAIL) over the discrete latent maps (Razavi et al., 2019):

  • The top-level code map prior $p(z^{\text{top}})$ is modeled by a PixelSNAIL or PixelCNN with gated convolutions and self-attention on the low-resolution space.
  • The bottom-level code map prior $p(z^{\text{bot}} \mid z^{\text{top}})$ is a conditional PixelCNN, conditioned on the top-level code map.
  • The full prior factorizes as

$$p(z^{\text{top}}, z^{\text{bot}}) = p(z^{\text{top}})\, p(z^{\text{bot}} \mid z^{\text{top}})$$

  • After training, new samples are generated by sequentially sampling the top code map, then the bottom (conditioned), followed by one-shot feed-forward decoding to an image $x$.
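The two-stage sampling order can be illustrated with toy stand-ins for the learned priors. The functions `toy_top_prior` and `toy_bottom_prior` below are hypothetical placeholders that return hand-written categorical distributions; in the real model these would be the PixelSNAIL and conditional PixelCNN networks, and a decoder would then map the sampled code maps to pixels.

```python
import random

K = 4  # toy codebook size


def toy_top_prior(previous):
    """Hypothetical stand-in for the top prior: favors repeating the last code."""
    probs = [1.0] * K
    if previous:
        probs[previous[-1]] += 2.0
    total = sum(probs)
    return [p / total for p in probs]


def toy_bottom_prior(previous, top_code):
    """Hypothetical stand-in for the conditional bottom prior.

    It favors the code of the top-level cell covering this position;
    `previous` is accepted to mirror the autoregressive interface.
    """
    probs = [1.0] * K
    probs[top_code] += 4.0
    total = sum(probs)
    return [p / total for p in probs]


def sample_maps(top_side, bottom_side, rng):
    """Sample the top map in raster order, then the bottom map given the top."""
    scale = bottom_side // top_side
    top = []
    for _ in range(top_side * top_side):
        top.append(rng.choices(range(K), weights=toy_top_prior(top))[0])
    bottom = []
    for pos in range(bottom_side * bottom_side):
        row, col = divmod(pos, bottom_side)
        parent = top[(row // scale) * top_side + (col // scale)]
        weights = toy_bottom_prior(bottom, parent)
        bottom.append(rng.choices(range(K), weights=weights)[0])
    return top, bottom


rng = random.Random(0)
top, bottom = sample_maps(top_side=2, bottom_side=4, rng=rng)
print(len(top), len(bottom))  # 4 16
```

Only the code maps are sampled one position at a time; the decoder pass that follows is a single feed-forward computation.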

This design substantially accelerates sampling compared to pixel-space autoregressive models and allows capturing both global and local image structures.

4. Codebook Collapse and Hierarchical Extensions

A central issue in hierarchical VQ-VAE models is codebook (or layer) collapse, where a large fraction of codes are never utilized, especially at higher layers (Takida et al., 2023). The causes include:

  • Deterministic quantization yielding zero gradient to unused codes.
  • Higher layers being underutilized if lower layers suffice for reconstruction.
  • EMA updates failing if codewords are not sufficiently visited.

Perplexity of codebook usage is an empirical indicator: for VQ-VAE-2 on FFHQ (top layer), perplexity is only a small fraction of the codebook size (≈ 24 for a 512-entry codebook in the comparison reported by Takida et al., 2023), signifying severe collapse.
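Codebook perplexity is the exponential of the entropy of the empirical code-usage distribution, so it ranges from 1 (a single code used) up to the codebook size (all codes used equally). A minimal sketch, with the function name `codebook_perplexity` and the toy index lists chosen for illustration:

```python
import math
from collections import Counter


def codebook_perplexity(indices):
    """exp(entropy) of the empirical distribution over used code indices."""
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


uniform = [0, 1, 2, 3] * 25        # four codes used equally often
collapsed = [0] * 97 + [1, 2, 3]   # one code dominates

print(codebook_perplexity(uniform))
print(codebook_perplexity(collapsed))
```

The first list gives a perplexity of about 4, matching the number of codes in use, while the second stays close to 1 even though four distinct codes appear, which is the signature of collapse.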

To mitigate collapse, recent work has investigated fully Bayesian training regimes such as HQ-VAE, which introduces a stochastic quantizer and entropy regularization terms. These methods stochastically sample from all codewords early in training and maintain higher codebook utilization, improving reconstruction metrics and eliminating the need for EMA codebook resets (Takida et al., 2023).
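A stochastic quantizer can be sketched by sampling a code from a categorical distribution instead of always taking the nearest one. The form below, with probabilities proportional to exp(-distance / (2·variance)), is an assumption chosen for illustration; this article does not spell out HQ-VAE's exact parameterization, and the names `code_probabilities`, `stochastic_quantize`, and `sigma2` are hypothetical.

```python
import math
import random


def code_probabilities(z_e, codebook, sigma2):
    """Softmax over negative scaled squared distances to each code."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(z_e, e)) / (2.0 * sigma2)
        for e in codebook
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def stochastic_quantize(z_e, codebook, sigma2, rng):
    """Sample a code index from the categorical distribution above."""
    probs = code_probabilities(z_e, codebook, sigma2)
    return rng.choices(range(len(codebook)), weights=probs)[0]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z_e = [0.1, 0.0]

early = code_probabilities(z_e, codebook, sigma2=100.0)  # large variance
late = code_probabilities(z_e, codebook, sigma2=0.01)    # small variance
print(early)
print(late)

rng = random.Random(0)
print(stochastic_quantize(z_e, codebook, sigma2=100.0, rng=rng))
```

With a large variance the distribution is nearly uniform, so every code is visited and updated; as the variance shrinks the quantizer approaches the deterministic nearest-neighbor rule.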

5. Empirical Performance and Use Cases

VQ-VAE-2 demonstrates high-fidelity generation and reconstruction across large-scale datasets (Razavi et al., 2019):

  • On ImageNet 256×256:
    • MSE ≈ 0.005.
    • Negative log-likelihood of prior ≈ 3.40 bits/dim.
    • FID ≈ 30, IS ≈ 48 without sampling tricks; FID → 10, IS → 60 with classifier-based rejection sampling; CAS surpasses BigGAN-deep.
  • On FFHQ (1024×1024): globally coherent face samples with long-range feature consistency.
  • Generates diversity across modes, especially in classes where GANs commonly fail (e.g., low-density categories).
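The classifier-based rejection step referred to in the ImageNet results can be sketched as ranking samples by a classifier's score for the intended class and keeping the top fraction. The scores below are made-up numbers, and `reject_by_classifier` and `keep_fraction` are hypothetical names; the exact procedure and thresholds used in the paper are not stated in this article.

```python
def reject_by_classifier(samples, scores, keep_fraction):
    """Keep the highest-scoring fraction of samples (at least one)."""
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [sample for _, sample in ranked[:n_keep]]


# Toy sample identifiers and classifier probabilities for the target class.
samples = ["a", "b", "c", "d"]
scores = [0.2, 0.9, 0.5, 0.7]

kept = reject_by_classifier(samples, scores, keep_fraction=0.5)
print(kept)  # ['b', 'd']
```

A lower `keep_fraction` trades diversity for fidelity, which is consistent with the metric shifts quoted above when rejection is enabled.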

Encoding and decoding are efficient due to convolutional architectures; sampling from the autoregressive prior in the discrete latent space is orders of magnitude faster than in pixel space (e.g., <10 ms for 256×256 on V100 GPUs).

6. Advances Beyond VQ-VAE-2

Subsequent research has targeted the expressivity and efficiency of hierarchical discrete autoencoders:

  • HR-VQVAE introduces residual quantization hierarchies, where each layer encodes the residuals unmodeled by lower layers and links codebooks hierarchically. This prevents codebook collapse, reduces decoding search time (fewer codeword comparisons per position, since each layer searches only the codewords linked to the code chosen at the previous layer), and enables better codeword utilization. HR-VQVAE achieves faster reconstruction and improved FID/MSE relative to VQ-VAE-2 (Adiban et al., 2022).
  • HQ-VAE casts hierarchical vector quantized autoencoders within a Bayesian training paradigm, employing stochastic quantization and entropy-based regularization for robust codebook usage. This approach yields significant improvements in both codebook perplexity and perceptual fidelity, extending applicability to other modalities such as audio (Takida et al., 2023).
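The residual quantization idea behind HR-VQVAE can be illustrated with a generic sketch: each layer quantizes what the previous layers left unexplained, and the approximation is the sum of the chosen codes. This simplified version uses independent codebooks and omits the hierarchical linking between codebooks that HR-VQVAE adds; the codebook values and the names `nearest` and `residual_quantize` are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(v, codebook):
    """Index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))


def residual_quantize(z, codebooks):
    """Quantize z layer by layer; each layer encodes the remaining residual."""
    residual = list(z)
    approx = [0.0] * len(z)
    indices = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        indices.append(k)
        approx = [a + e for a, e in zip(approx, codebook[k])]
        residual = [r - e for r, e in zip(residual, codebook[k])]
    return indices, approx


coarse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
fine = [[0.25, -0.5], [-0.25, 0.5], [0.0, 0.0]]
z = [1.3, -0.4]

idx_one, approx_one = residual_quantize(z, [coarse])
idx_two, approx_two = residual_quantize(z, [coarse, fine])
print(idx_two, approx_two)
print(sq_dist(z, approx_one), sq_dist(z, approx_two))
```

In this toy case the second layer reduces the squared error from about 0.25 to about 0.0125, showing how added layers refine detail without enlarging any single codebook.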

A table comparing representative codebook perplexities is provided below (Takida et al., 2023):

| Model    | Top-layer codebook perplexity | Codebook size |
|----------|-------------------------------|---------------|
| VQ-VAE-2 | ≈ 24                          | 512           |
| SQ-VAE-2 | ≈ 126                         | 512           |

7. Limitations and Theoretical Insights

The theoretical properties and architectural choices of VQ-VAE-2 and its successors offer several insights:

  • Multi-scale discrete representations disentangle global and local structures, with higher layers modeling coarse geometry and lower layers refining detail (Razavi et al., 2019, Adiban et al., 2022).
  • Codebook collapse limits how much of the codebook is effectively used unless regularized Bayesian or residual approaches are adopted (Takida et al., 2023, Adiban et al., 2022).
  • Residual and hierarchically linked quantization enable increased codebook capacity without collapse, as observed in HR-VQVAE experiments (Adiban et al., 2022).
  • Improved codebook usage yields better generative diversity, reduced FID, and enhanced perceptual metrics, while increasing robustness to overfitting and redundant encoding (Adiban et al., 2022, Takida et al., 2023).

A plausible implication is that future extensions of VQ-VAE architectures will increasingly exploit hierarchical residual encoding and stochastic training objectives to maximize codebook expressivity for both vision and non-vision modalities.
