VQ-VAE-2: Multi-Scale Hierarchical Generative Models

Updated 2 June 2026

VQ-VAE-2 is a multi-scale hierarchical generative model that applies vector quantization to continuous latent spaces for efficient, high-fidelity image reconstruction.
It leverages multiple quantization layers and autoregressive priors to capture both global structures and local details in complex data distributions.
Advancements such as improved codebook utilization and Bayesian training regimes mitigate collapse and boost sample diversity, making it competitive with GANs.

VQ-VAE-2, the second-generation Vector Quantized Variational Autoencoder, is a multi-scale hierarchical generative model that applies vector quantization to continuous latent representations within a variational autoencoding framework. Designed to capture both global and local dependencies in data, it uses discrete latent variables and autoregressive priors for high-fidelity image generation and efficient encoding and decoding. It builds on the original VQ-VAE architecture but introduces multiple quantization layers at distinct spatial scales, which improves mode coverage and sample diversity relative to earlier VAEs and GANs (Razavi et al., 2019).

1. Multi-Scale Hierarchical Architecture

VQ-VAE-2 implements a hierarchical encoder-decoder structure. For image data (e.g., ImageNet 256×256), it utilizes two or more quantization layers operating at progressively coarser spatial resolutions. The process is as follows (Razavi et al., 2019, Takida et al., 2023):

The input $x$ is processed by a bottom-up stack of convolutional and pooling layers to generate feature maps:
- Bottom-level map: $h_{\text{bot}}$ (e.g., $64\times64$ for ImageNet 256)
- Top-level map: $h_{\text{top}}$ (e.g., $32\times32$)
Each level yields a continuous embedding $z^l_e$, which is quantized by nearest-neighbor lookup in a codebook:

$z^l_q(i) = e^l_k, \quad k = \arg\min_j \|z^l_e(i) - e^l_j\|_2^2, \quad l\in\{\text{bot},\text{top}\}$

where $i$ indexes spatial positions and $e^l_j$ denotes the $j$-th entry in codebook $E_l$ of size $K_l$ with embedding dimension $D$.
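The nearest-neighbor lookup can be illustrated with a minimal sketch in plain Python. The function name `quantize` and the toy codebook values are assumptions made for illustration; a practical implementation would batch this over all spatial positions with a tensor library.

```python
def quantize(z_e, codebook):
    """Return (index, codeword) of the codebook entry nearest to z_e.

    z_e: one encoder output vector (list of floats).
    codebook: list of codewords, each a list of floats of the same length.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # arg min over j of the squared Euclidean distance to each codeword
    index = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return index, codebook[index]


# Toy codebook with 3 entries of dimension 2 (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, z_q = quantize([0.9, 1.2], codebook)
```

Here the encoder output `[0.9, 1.2]` is mapped to the second codeword, since its squared distance (0.05) is the smallest of the three.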

The decoder reconstructs $\hat{x}$ by combining quantized embeddings from all layers:
- A coarse reconstruction is first formed from $z^{\text{top}}_q$, then $z^{\text{bot}}_q$ is injected (e.g., via concatenation) before the final reconstruction $\hat{x}$.

In three-level hierarchies (e.g., FFHQ 1024×1024), the procedure is recursively extended to lower-resolution code maps.
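The relationship between image size and code-map size can be made concrete with a short sketch. It assumes that the bottom level downsamples the image by a fixed stride and that each higher level halves the resolution again; the helper name `latent_map_sizes` and the stride values are illustrative assumptions, chosen so that the two-level case matches the $64\times64$ and $32\times32$ maps quoted above.

```python
def latent_map_sizes(image_size, first_stride, num_levels):
    """Side length of each code map, bottom level first."""
    sizes = []
    side = image_size // first_stride
    for _ in range(num_levels):
        sizes.append(side)
        side //= 2  # assumed: each higher level halves the resolution
    return sizes


two_level = latent_map_sizes(256, first_stride=4, num_levels=2)     # [64, 32]
three_level = latent_map_sizes(1024, first_stride=8, num_levels=3)  # stride 8 is an assumption
```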

2. Vector Quantization, Loss Functions, and Training

Each encoder output is quantized to the nearest codebook vector, yielding discrete latents. The model is trained using losses at every scale (Razavi et al., 2019, Takida et al., 2023):

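Reconstruction loss:

$\mathcal{L}_{\text{rec}} = \|x - \hat{x}\|_2^2$

Codebook update ("push" loss):

$\mathcal{L}^l_{\text{codebook}} = \|\mathrm{sg}[z^l_e(i)] - e^l_k\|_2^2$

Commitment loss (encouraging encoder outputs to stay close to their assigned codebook vectors):

$\mathcal{L}^l_{\text{commit}} = \beta\,\|z^l_e(i) - \mathrm{sg}[e^l_k]\|_2^2$

where $\hat{x}$ is the decoder output, $e^l_k$ is the codebook vector selected for $z^l_e(i)$, $\beta$ is the commitment weight, and $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.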
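As a numerical illustration, the sketch below evaluates the three terms for one input and one latent position. It is written in plain Python without automatic differentiation, so the stop-gradient has no effect on the values: the codebook and commitment terms differ only in which quantity would receive the gradient, and in the commitment weight. The function name `vq_loss_terms`, the default `beta=0.25`, and the toy vectors are assumptions.

```python
def squared_error(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_loss_terms(x, x_hat, z_e, e, beta=0.25):
    """Values of the loss terms for one input and one latent position."""
    reconstruction = squared_error(x, x_hat)
    # ||sg[z_e] - e||^2: in a framework with autodiff, only e is updated.
    codebook = squared_error(z_e, e)
    # beta * ||z_e - sg[e]||^2: in a framework with autodiff, only z_e is updated.
    commitment = beta * squared_error(z_e, e)
    return reconstruction, codebook, commitment


rec, cb, commit = vq_loss_terms(
    x=[1.0, 0.0], x_hat=[0.5, 0.0], z_e=[0.9, 1.2], e=[1.0, 1.0]
)
total = rec + cb + commit
```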

The total objective for VQ-VAE-2 is a sum of these terms over all levels and spatial locations. In practice, exponential moving average (EMA) updates are often used for codebooks to improve stability.
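A minimal sketch of an EMA codebook update follows, assuming the common formulation in which each codeword tracks the running mean of the encoder outputs assigned to it. The function name `ema_codebook_update`, the default decay of 0.99, and the toy values are assumptions, not details taken from the cited papers.

```python
def ema_codebook_update(counts, sums, assignments, decay=0.99):
    """One EMA step for a codebook.

    counts[j]: running (EMA) count of encoder outputs assigned to code j.
    sums[j]: running (EMA) sum of those encoder outputs.
    assignments: (code index, encoder output) pairs from the current batch.
    Returns the updated counts, sums, and codebook.
    """
    num_codes, dim = len(counts), len(sums[0])
    batch_counts = [0] * num_codes
    batch_sums = [[0.0] * dim for _ in range(num_codes)]
    for j, z_e in assignments:
        batch_counts[j] += 1
        batch_sums[j] = [s + z for s, z in zip(batch_sums[j], z_e)]

    new_counts = [decay * c + (1 - decay) * b for c, b in zip(counts, batch_counts)]
    new_sums = [
        [decay * s + (1 - decay) * b for s, b in zip(sums[j], batch_sums[j])]
        for j in range(num_codes)
    ]
    # Each codeword becomes the running mean of the encoder outputs assigned to it.
    codebook = [[s / max(n, 1e-12) for s in new_sums[j]] for j, n in enumerate(new_counts)]
    return new_counts, new_sums, codebook


counts = [1.0, 1.0]
sums = [[0.0, 0.0], [1.0, 1.0]]
counts, sums, codebook = ema_codebook_update(
    counts, sums, assignments=[(1, [2.0, 2.0])], decay=0.5
)
```

In this toy step, code 1 moves toward the new encoder output, while code 0 receives no assignments: its count and sum shrink by the same factor, so its codeword does not move.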

VQ-VAE-2 codebooks commonly use $K = 512$ embeddings of $D = 64$ dimensions per level, with commitment weight $\beta = 0.25$.

3. Autoregressive Priors and Sampling

The generative capability of VQ-VAE-2 is achieved by fitting autoregressive priors (PixelCNN or PixelSnail) over the discrete latent maps (Razavi et al., 2019):

The top-level code map prior $p(z^{\text{top}}_q)$ is modeled by a PixelSnail or PixelCNN with gated convolutions and self-attention on the low-resolution space.
The bottom-level code map prior $p(z^{\text{bot}}_q \mid z^{\text{top}}_q)$ is a conditional PixelCNN, conditioned on the top-level code map.
The full prior factorizes as

$p(z^{\text{top}}_q, z^{\text{bot}}_q) = p(z^{\text{top}}_q)\, p(z^{\text{bot}}_q \mid z^{\text{top}}_q)$

After training, new samples are generated by sequentially sampling the top code map, then the bottom code map conditioned on it, followed by one-shot feed-forward decoding to an image $\hat{x}$.
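The two-stage sampling order can be sketched as follows. The priors here are toy stand-ins (a uniform top prior and a bottom prior that simply favors the code of its parent top-level cell), not PixelCNN or PixelSnail networks; all names, map sizes, and probabilities are hypothetical, and the final decoding pass is omitted.

```python
import random


def sample_code_map(height, width, num_codes, next_code_probs, rng):
    """Raster-scan sampling: each code is drawn given the codes sampled so far."""
    code_map = [[None] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            probs = next_code_probs(code_map, row, col)
            code_map[row][col] = rng.choices(range(num_codes), weights=probs)[0]
    return code_map


K = 4  # toy codebook size


def toy_top_prior(partial_map, row, col):
    # Stand-in for the top prior: uniform, ignores the context.
    return [1.0 / K] * K


def make_toy_bottom_prior(top_map):
    # Stand-in for the conditional bottom prior: favors the parent top code.
    def prior(partial_map, row, col):
        parent = top_map[row // 2][col // 2]
        return [0.7 if k == parent else 0.1 for k in range(K)]
    return prior


rng = random.Random(0)
top = sample_code_map(2, 2, K, toy_top_prior, rng)                    # step 1: top map
bottom = sample_code_map(4, 4, K, make_toy_bottom_prior(top), rng)    # step 2: bottom given top
```

In a full model, `top` and `bottom` would index the respective codebooks, and the resulting embeddings would be passed once through the decoder.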

This design substantially accelerates sampling compared to pixel-space autoregressive models and allows capturing both global and local image structures.

4. Codebook Collapse and Hierarchical Extensions

A central issue in hierarchical VQ-VAE models is codebook (or layer) collapse, where a large fraction of codes are never utilized, especially at higher layers (Takida et al., 2023). The causes include:

- Deterministic quantization yielding zero gradient to unused codes.
- Higher layers being underutilized if lower layers suffice for reconstruction.
- EMA updates failing if codewords are not sufficiently visited.

Perplexity of codebook usage is an empirical indicator: for VQ-VAE-2 on FFHQ (top layer), perplexity is $\approx 24$ for a codebook of 512 entries, signifying severe collapse (Takida et al., 2023).
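Codebook perplexity is commonly computed as the exponential of the entropy of the empirical code-usage distribution, so it ranges from 1 (a single code used) to the codebook size (all codes used equally). The sketch below assumes that definition; the function name `codebook_perplexity` is illustrative.

```python
import math
from collections import Counter


def codebook_perplexity(indices):
    """exp(entropy) of the empirical distribution over selected code indices."""
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


uniform_usage = codebook_perplexity(list(range(512)))   # every code used once
collapsed_usage = codebook_perplexity([7] * 1000)       # a single code used
```

Uniform usage of a 512-entry codebook gives a perplexity of about 512, while complete collapse onto one code gives 1.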

To mitigate collapse, recent work has investigated fully Bayesian training regimes such as HQ-VAE, which introduces a stochastic quantizer and entropy regularization terms. These methods stochastically sample from all codewords early in training and maintain higher codebook utilization, improving reconstruction metrics and eliminating the need for EMA codebook resets (Takida et al., 2023).
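The idea of a stochastic quantizer can be sketched as sampling a code from a categorical distribution whose probabilities decrease with distance to each codeword. The softmax-over-negative-squared-distance form and the `variance` parameter below are assumptions chosen for illustration; the exact parameterization used in HQ-VAE may differ.

```python
import math
import random


def stochastic_code_probs(z_e, codebook, variance):
    """Categorical probabilities proportional to exp(-||z_e - e_j||^2 / (2 * variance))."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(z_e, e)) for e in codebook]
    logits = [-d / (2.0 * variance) for d in sq_dists]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def stochastic_quantize(z_e, codebook, variance, rng):
    probs = stochastic_code_probs(z_e, codebook, variance)
    return rng.choices(range(len(codebook)), weights=probs)[0]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
early = stochastic_code_probs([0.9, 1.2], codebook, variance=100.0)  # near-uniform
late = stochastic_code_probs([0.9, 1.2], codebook, variance=0.01)    # near-deterministic
code = stochastic_quantize([0.9, 1.2], codebook, variance=100.0, rng=random.Random(0))
```

With a large variance every codeword has a realistic chance of being selected, which matches the description of sampling from all codewords early in training; as the variance shrinks, the quantizer approaches deterministic nearest-neighbor assignment.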

5. Empirical Performance and Use Cases

VQ-VAE-2 demonstrates high-fidelity generation and reconstruction across large-scale datasets (Razavi et al., 2019):

On ImageNet 256×256:
- MSE ≈ 0.005.
- Negative log-likelihood of prior ≈3.40 bits/dim.
- FID ≈ 30 and IS ≈ 48 without sampling tricks; FID approaches 10 and IS approaches 60 with classifier-based rejection sampling; the Classification Accuracy Score (CAS) surpasses that of BigGAN-deep.
On FFHQ (1024×1024): globally coherent face samples with long-range feature consistency.
Samples cover diverse modes, including classes where GANs commonly fail (e.g., low-density categories).

Encoding and decoding are efficient due to convolutional architectures; sampling from the autoregressive prior in the discrete latent space is orders of magnitude faster than in pixel space (e.g., <10 ms for 256×256 on V100 GPUs).
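Part of the speed-up can be seen by counting sequential sampling steps alone. The sketch below assumes the $32\times32$ and $64\times64$ code maps quoted earlier for ImageNet 256 and a pixel-space model that samples every color channel of every pixel in sequence; it ignores differences in per-step cost, so it is only a rough indication.

```python
def autoregressive_steps(*map_shapes):
    """Total number of sequentially sampled variables across the given maps."""
    total = 0
    for shape in map_shapes:
        steps = 1
        for side in shape:
            steps *= side
        total += steps
    return total


latent_steps = autoregressive_steps((32, 32), (64, 64))  # 1024 + 4096 = 5120
pixel_steps = autoregressive_steps((256, 256, 3))        # 196608
ratio = pixel_steps / latent_steps                       # 38.4
```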

6. Advances Beyond VQ-VAE-2

Subsequent research has targeted the expressivity and efficiency of hierarchical discrete autoencoders:

HR-VQVAE introduces residual quantization hierarchies, where each layer encodes the residuals left unmodeled by lower layers and codebooks are linked hierarchically. This prevents codebook collapse, reduces decoding search time (each position is compared only against the codewords linked to the previous layer's selection rather than against every codeword), and enables better codeword utilization. HR-VQVAE achieves faster reconstruction and improved FID/MSE relative to VQ-VAE-2 (Adiban et al., 2022).
HQ-VAE casts hierarchical vector quantized autoencoders within a Bayesian training paradigm, employing stochastic quantization and entropy-based regularization for robust codebook usage. This approach yields significant improvements in both codebook perplexity and perceptual fidelity, extending applicability to other modalities such as audio (Takida et al., 2023).
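Residual quantization can be illustrated with a generic sketch in which each layer quantizes whatever the previous layers left unexplained. This version uses one independent codebook per layer; it does not implement the hierarchically linked codebooks of HR-VQVAE, and the function names and toy values are assumptions.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((v - c) ** 2 for v, c in zip(vec, codebook[j])),
    )


def residual_quantize(z_e, codebooks):
    """Quantize z_e layer by layer; each layer encodes the remaining residual."""
    residual = list(z_e)
    approximation = [0.0] * len(z_e)
    indices = []
    for codebook in codebooks:
        j = nearest(residual, codebook)
        indices.append(j)
        approximation = [a + c for a, c in zip(approximation, codebook[j])]
        residual = [r - c for r, c in zip(residual, codebook[j])]
    return indices, approximation


coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [-0.1, 0.2], [0.1, -0.2]]
indices, approximation = residual_quantize([0.9, 1.2], [coarse, fine])
```

In this toy case the coarse layer selects `[1.0, 1.0]` and the fine layer selects `[-0.1, 0.2]`, so the summed codewords recover the input up to floating-point error.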

The table below compares representative top-layer codebook perplexities for VQ-VAE-2 and SQ-VAE-2, an instance of HQ-VAE (Takida et al., 2023):

Model	Codebook (Top Layer) Perplexity	Codebook Size
VQ-VAE-2	≈ 24	512
SQ-VAE-2	≈ 126	512
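
As a rough reading of these figures, perplexity can be treated as an effective number of equally used codes, so dividing it by the codebook size gives an approximate utilization fraction. This interpretation is an assumption made here for illustration, not a quantity reported in the source:

$\frac{24}{512} \approx 0.047, \qquad \frac{126}{512} \approx 0.246$

Under this reading, VQ-VAE-2 effectively uses about 5% of its top-layer codebook and SQ-VAE-2 about 25%, a ratio of $126/24 = 5.25$.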

7. Limitations and Theoretical Insights

The theoretical properties and architectural choices of VQ-VAE-2 and its successors offer several insights:

- Multi-scale discrete representations disentangle global and local structures, with higher layers modeling coarse geometry and lower layers refining detail (Razavi et al., 2019, Adiban et al., 2022).
- Codebook collapse limits codebook utilization unless regularized Bayesian or residual approaches are adopted (Takida et al., 2023, Adiban et al., 2022).
- Residual and hierarchically linked quantization enable increased codebook capacity without collapse, as observed in HR-VQVAE experiments (Adiban et al., 2022).
- Improved codebook usage yields better generative diversity, reduced FID, and enhanced perceptual metrics, while increasing robustness to overfitting and redundant encoding (Adiban et al., 2022, Takida et al., 2023).

A plausible implication is that future extensions of VQ-VAE architectures will increasingly exploit hierarchical residual encoding and stochastic training objectives to maximize codebook expressivity for both vision and non-vision modalities.