HR-VQVAE: Hierarchical Residual VQVAE

Updated 3 June 2026

The paper introduces a multi-level residual quantization mechanism that minimizes quantization error and prevents codebook collapse.
It employs a hierarchical structure where each level encodes the residuals of previous stages, reducing decoding complexity from O(m^n) to O(nm).
Empirical results demonstrate superior reconstruction fidelity and faster decoding in image and video generation compared to earlier VQ-VAE models.

The Hierarchical Residual Learning Vector Quantized Variational Autoencoder (HR-VQVAE) is a multi-level vector-quantized generative model that produces high-fidelity discrete representations by applying hierarchical quantization to the residuals of the encoding at each layer. This approach addresses quantization error, codebook collapse, and inefficiencies in search and decoding speed encountered by previous VQ-VAE methods. HR-VQVAE has been validated in both image generation/reconstruction and as a building block for spatiotemporal generative models for video prediction, yielding state-of-the-art performance under constrained model complexity (Adiban et al., 2022, Adiban et al., 2023).

1. Model Architecture

HR-VQVAE generalizes the traditional VQ-VAE by introducing a stack of residual quantization stages, each operating on the residual error left by previous stages. The core architectural components are:

Encoder $E_\theta(\cdot)$ : A convolutional (or convolution plus downsampling) network mapping input $\mathbf{x} \in \mathbb{R}^{H_I \times W_I \times D_I}$ to a dense continuous latent tensor $z = E_\theta(\mathbf{x}) \in \mathbb{R}^{H \times W \times D}$ (Adiban et al., 2023).
Hierarchical Residual Quantizer: Consists of $n$ quantization levels. At level $i$ , a family of $M^{i-1}$ codebooks $C_i$ (each size $M$ ) is available, indexed by the code chosen at level $i-1$ (Adiban et al., 2023); each codeword lies in $\mathbb{R}^D$ . Only one codebook out of the hierarchy is active per spatial location, determined by prior assignments (Adiban et al., 2022).
Residual Computation: The process begins with the residual $r_0 = z$. At each level $i$, the nearest codeword $e_i(h, w)$ for each spatial location $(h, w)$ is selected from the chosen codebook, and the residual is updated as $r_i = r_{i-1} - e_i$ (Adiban et al., 2023).
Decoder $D_\phi(\cdot)$: Receives the sum of quantized embeddings across all levels, $z_q = \sum_{i=1}^{n} e_i$, and reconstructs the output $\hat{\mathbf{x}} = D_\phi(z_q)$ (Adiban et al., 2023).

The hierarchical structure forces each quantizer to encode only information not already explained by the previous stages, ensuring non-redundant representations (Adiban et al., 2022).

2. Hierarchical Residual Quantization Process

The HR-VQVAE quantization process is recursively defined over $n$ layers:

Initialization: $r_0 = z = E_\theta(\mathbf{x})$.
At each layer $i = 1, \dots, n$:
- Compute embedding: $\xi_i = r_{i-1}$, the residual left by the previous layers (Adiban et al., 2022).
- Quantize $\xi_i$ with the active codebook (selected via previous assignments): $e_i(h, w) = \arg\min_{c \in C_i} \lVert \xi_i(h, w) - c \rVert_2$ for each spatial location $(h, w)$.
- Update residual: $r_i = r_{i-1} - e_i$.
Decode: Combine quantized codes, $z_q = \sum_{i=1}^{n} e_i$, and reconstruct $\hat{\mathbf{x}} = D_\phi(z_q)$ with the decoder $D_\phi$.
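
The recursion above can be sketched for a single spatial location in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `build_codebooks` and `quantize_location`, the random codebook initialisation, and the shrinking scale per level are all hypothetical choices made for the example. Codebooks at each level are keyed by the tuple of indices chosen at earlier levels, so exactly one codebook of size $M$ is active per level.

```python
import itertools
import random


def build_codebooks(n_levels, M, D, seed=0):
    """Random hierarchical codebooks (illustrative initialisation only).

    Level i (0-indexed) holds M**i codebooks of M codewords each,
    keyed by the tuple of indices selected at the earlier levels.
    """
    rng = random.Random(seed)
    levels = []
    for i in range(n_levels):
        scale = 0.5 ** i  # assumption: deeper levels model smaller residuals
        level = {}
        for path in itertools.product(range(M), repeat=i):
            level[path] = [
                [rng.gauss(0.0, scale) for _ in range(D)] for _ in range(M)
            ]
        levels.append(level)
    return levels


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_location(z, levels):
    """Quantize one D-dimensional latent vector level by level."""
    residual = list(z)          # r_0 = z
    path = ()                   # indices chosen so far
    z_q = [0.0] * len(z)        # running sum of selected codewords
    for level in levels:
        codebook = level[path]  # the single active codebook at this level
        k = min(range(len(codebook)),
                key=lambda j: sq_dist(residual, codebook[j]))
        e = codebook[k]
        z_q = [q + c for q, c in zip(z_q, e)]
        residual = [r - c for r, c in zip(residual, e)]  # r_i = r_{i-1} - e_i
        path = path + (k,)
    return path, z_q, residual


levels = build_codebooks(n_levels=3, M=4, D=2)
indices, z_q, residual = quantize_location([0.7, -0.3], levels)
print(indices, z_q, residual)
```

By construction, the sum of selected codewords plus the final residual recovers the input vector (up to floating-point error), which is a convenient sanity check when experimenting with the recursion.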

Only a single codebook of size $M$ is searched per location per layer, yielding $O(nM)$ decoding complexity instead of the $O(M^n)$ required by a flat codebook (Adiban et al., 2022).
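
To make the gap concrete, the short sketch below counts codeword comparisons per location for a hierarchical search versus a flat codebook with the same number of index patterns. The function name `search_costs` and the example values of the number of levels and codebook size are illustrative assumptions, not settings from the papers.

```python
def search_costs(n_levels, M):
    """Codeword comparisons per spatial location.

    hierarchical: one codebook of size M searched at each of n_levels levels.
    flat: a single codebook enumerating all M**n_levels index patterns.
    """
    hierarchical = n_levels * M
    flat = M ** n_levels
    return hierarchical, flat


for n_levels, M in [(2, 8), (3, 16), (3, 512)]:
    hierarchical, flat = search_costs(n_levels, M)
    print(f"n={n_levels}, M={M}: hierarchical={hierarchical}, flat={flat}")
```

The hierarchical count grows linearly in both the number of levels and the codebook size, while the flat count grows exponentially in the number of levels.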

3. Objective Function and Training

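The loss for HR-VQVAE for one sample $\mathbf{x}$ is:

$$\mathcal{L}(\mathbf{x}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \sum_{i=1}^{n} \Big( \lVert \mathrm{sg}[\xi_i] - e_i \rVert_2^2 + \beta \, \lVert \xi_i - \mathrm{sg}[e_i] \rVert_2^2 \Big)$$

where $\hat{\mathbf{x}}$ is the reconstruction, $\xi_i$ is the embedding quantized at layer $i$, and $e_i$ is its selected codeword.

Reconstruction loss: Forces fidelity between input and reconstruction.
Codebook loss for each layer: Shrinks codewords towards encoder outputs.
Commitment loss (with weighting $\beta$): Forces encoder outputs towards the chosen codeword, promoting assignment stability.
$\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.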

This structured residual loss distribution prevents codebook collapse and enforces that later stages encode only complementary information (Adiban et al., 2023, Adiban et al., 2022).
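
As a rough illustration of how the three terms combine, the sketch below evaluates the forward value of a loss with one reconstruction term plus per-layer codebook and commitment terms. It is a sketch under assumptions: the function name `hr_vq_loss` and the default `beta=0.25` are illustrative choices, not values taken from this article. The stop-gradient operator only changes which tensor receives gradients, so it has no effect on the forward value; in an autodiff framework one would apply something like PyTorch's `detach()` or `jax.lax.stop_gradient` to the appropriate argument of each term.

```python
def sq_norm(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hr_vq_loss(x, x_hat, embeddings, codewords, beta=0.25):
    """Forward value of the loss for one sample.

    embeddings[i] is the vector fed to the quantizer at layer i and
    codewords[i] is the codeword selected for it. beta is the commitment
    weight (0.25 is an illustrative default, not from the article).
    """
    recon = sq_norm(x, x_hat)
    # Codebook term: gradients would flow to the codewords only.
    codebook = sum(sq_norm(xi, e) for xi, e in zip(embeddings, codewords))
    # Commitment term: same forward value, gradients would flow to the encoder.
    commitment = beta * codebook
    return {
        "recon": recon,
        "codebook": codebook,
        "commitment": commitment,
        "total": recon + codebook + commitment,
    }


terms = hr_vq_loss(
    x=[1.0, 0.0],
    x_hat=[0.5, 0.0],
    embeddings=[[1.0, 1.0]],
    codewords=[[0.0, 1.0]],
    beta=0.5,
)
print(terms)
```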

4. Practical Implementation and Hyperparameters

A typical three-layer HR-VQVAE employs:

Latent map size $H \times W$ (for images of size $H_I \times W_I$)
Codewords per layer: $M$, $M^2$, $M^3$ (that is, $M^{i-1}$ codebooks of size $M$ at layer $i$)
Codeword dimension $D$
Commitment weight $\beta$
Deep encoder/decoder networks with hidden channels (e.g., 128) and residual blocks per layer (e.g., 64)
Adam optimizer (initial learning rate as given in Adiban et al., 2022) with Polyak EMA decay 0.9 (Adiban et al., 2022)

Training steps vary by dataset, with separate budgets for FFHQ and ImageNet. For generation, an autoregressive PixelCNN prior can be used over discrete latent codes (Adiban et al., 2022).

5. Empirical Performance and Analysis

HR-VQVAE achieves the lowest MSE on all four datasets and the lowest FID on all but CIFAR10, where VQ-VAE-2 is marginally better (18.03 vs 18.11):

Dataset	VQ-VAE	VQ-VAE-2	HR-VQVAE
FFHQ 256×256	2.86 / 0.00298	1.92 / 0.00195	1.26 / 0.00163
ImageNet 128×128	3.66 / 0.00055	2.94 / 0.00039	2.28 / 0.00027
CIFAR10 32×32	21.65 / 0.00092	18.03 / 0.00068	18.11 / 0.00041
MNIST 28×28	7.90 / 0.00041	6.70 / 0.00025	6.10 / 0.00011

(Table: FID↓ / MSE↓ for held-out test samples (Adiban et al., 2022).)
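
The relative gains over VQ-VAE-2 can be read off the table with a few lines of arithmetic. The sketch below copies the table values verbatim; the helper name `relative_reduction` is an illustrative choice. A positive percentage means HR-VQVAE is lower (better) than VQ-VAE-2, and a negative one means it is higher.

```python
# (FID, MSE) pairs copied from the table above.
results = {
    "FFHQ 256x256": {"VQ-VAE": (2.86, 0.00298), "VQ-VAE-2": (1.92, 0.00195), "HR-VQVAE": (1.26, 0.00163)},
    "ImageNet 128x128": {"VQ-VAE": (3.66, 0.00055), "VQ-VAE-2": (2.94, 0.00039), "HR-VQVAE": (2.28, 0.00027)},
    "CIFAR10 32x32": {"VQ-VAE": (21.65, 0.00092), "VQ-VAE-2": (18.03, 0.00068), "HR-VQVAE": (18.11, 0.00041)},
    "MNIST 28x28": {"VQ-VAE": (7.90, 0.00041), "VQ-VAE-2": (6.70, 0.00025), "HR-VQVAE": (6.10, 0.00011)},
}


def relative_reduction(baseline, value):
    """Percentage by which value is lower than baseline (negative if higher)."""
    return 100.0 * (baseline - value) / baseline


for dataset, row in results.items():
    fid_base, mse_base = row["VQ-VAE-2"]
    fid_hr, mse_hr = row["HR-VQVAE"]
    print(
        f"{dataset}: FID {relative_reduction(fid_base, fid_hr):+.1f}%, "
        f"MSE {relative_reduction(mse_base, mse_hr):+.1f}%"
    )
```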

HR-VQVAE reconstructions preserve finer details than baselines and exhibit moderate MSE improvement.
Decoding speed is substantially improved compared with VQ-VAE (Adiban et al., 2022).
For high codebook cardinality, VQ-VAE and VQ-VAE-2 experience codebook collapse (increasing MSE); HR-VQVAE avoids this failure mode (Adiban et al., 2022).
In video prediction, the S-HR-VQVAE framework, which incorporates HR-VQVAE and an autoregressive spatiotemporal predictive model (AST-PM), achieves state-of-the-art quantitative and qualitative results on challenging benchmarks, including KTH Human Action and Human3.6M, with model size substantially reduced compared to alternatives (Adiban et al., 2023).

6. Advantages, Limitations, and Distinctions

Advantages:

Elimination of codebook collapse: Each residual quantizer encodes information not already represented, maximizing codebook utilization (Adiban et al., 2022).
Fast decoding: Hierarchical structure enables $O(nM)$ decoding per location, giving a substantial speedup compared to prior art (Adiban et al., 2022).
Improved quality at fixed bit rate: Hierarchical coarse-to-fine coding leads to lower quantization error (higher PSNR/SSIM) (Adiban et al., 2023).
Superior gradient propagation: Each VQ module fits a reduced, decorrelated residual, improving optimization stability (Adiban et al., 2023).
Easy scaling of codebook capacity: Supports very large total numbers of codes ($M^n$ effective patterns), without memory or utilization bottlenecks.

Limitations:

Hyperparameter sensitivity: Requires careful selection of the number of levels $n$, branch factor $M$, commitment weights $\beta$, and residual encoder structures (Adiban et al., 2022).
Increased memory: Storage for $M^i$ codewords at layer $i$ may become prohibitive at very high dimension/depth (Adiban et al., 2022).
Training cost: The multi-stage setup increases training time, although decoding is vastly accelerated (Adiban et al., 2022).
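
The memory limitation can be quantified with a small calculation. The sketch below assumes the layout described in the architecture section (at level $i$, $M^{i-1}$ codebooks of $M$ codewords, each of dimension $D$) and assumes 4-byte floating-point storage; the function name `codebook_storage` and the example sizes are illustrative, not settings reported by the authors.

```python
def codebook_storage(n_levels, M, D, bytes_per_value=4):
    """Codeword counts per level and total codebook storage in bytes.

    Level i (1-indexed) holds M**(i-1) codebooks of M codewords,
    i.e. M**i codewords of dimension D. bytes_per_value=4 assumes float32.
    """
    codewords_per_level = [M ** i for i in range(1, n_levels + 1)]
    total_values = sum(codewords_per_level) * D
    return codewords_per_level, total_values * bytes_per_value


for n_levels, M, D in [(3, 8, 4), (3, 64, 64), (4, 64, 64)]:
    counts, total_bytes = codebook_storage(n_levels, M, D)
    print(f"n={n_levels}, M={M}, D={D}: {counts} codewords, "
          f"{total_bytes / 2**20:.2f} MiB")
```

Because the deepest level dominates the total, adding one more level multiplies storage by roughly a factor of $M$, which is why depth and branch factor need to be chosen together.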

Distinctions from VQ-VAE and VQ-VAE-2:

VQ-VAE uses a flat latent grid and single codebook; VQ-VAE-2 employs a hierarchy but does not quantize true residuals at each stage, leading to redundancy and codebook underutilization.
HR-VQVAE leverages strict residual decomposition, hierarchical codebook selection, and composite decoding, resulting in higher effective code capacity and lower distortion (Adiban et al., 2022).

7. Applications and Extensions

HR-VQVAE underpins high-fidelity image reconstruction, efficient image generation with autoregressive neural priors (e.g., PixelCNN), and scalable spatiotemporal modeling in video generation tasks. Its integration as the perceptual backbone in S-HR-VQVAE demonstrates superior performance for video prediction across multidomain datasets by enabling both compact, high-capacity spatial representations and efficient entropy coding (Adiban et al., 2023).

A notable implication is the potential of HR-VQVAE architectures to serve as a standard discrete encoding backbone for low-latency, high-quality image, video, or sequential data compression and generation pipelines. The coarse-to-fine residual quantization and architectural modularity lend themselves to combination with transformer-based or diffusion-based temporal models and autoregressive priors.

References:

(Adiban et al., 2023) "S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction"
(Adiban et al., 2022) "Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation"
