
HR-VQVAE: Hierarchical Residual VQVAE

Updated 3 June 2026
  • The paper introduces a multi-level residual quantization mechanism that minimizes quantization error and prevents codebook collapse.
  • It employs a hierarchical structure where each level encodes the residuals of previous stages, reducing decoding complexity from O(m^n) to O(nm).
  • Empirical results demonstrate superior reconstruction fidelity and faster decoding in image and video generation compared to earlier VQ-VAE models.

The Hierarchical Residual Learning Vector Quantized Variational Autoencoder (HR-VQVAE) is a multi-level vector-quantized generative model that produces high-fidelity discrete representations by applying hierarchical quantization to the residuals of the encoding at each layer. This approach addresses quantization error, codebook collapse, and inefficiencies in search and decoding speed encountered by previous VQ-VAE methods. HR-VQVAE has been validated in both image generation/reconstruction and as a building block for spatiotemporal generative models for video prediction, yielding state-of-the-art performance under constrained model complexity (Adiban et al., 2022, Adiban et al., 2023).

1. Model Architecture

HR-VQVAE generalizes the traditional VQ-VAE by introducing a stack of residual quantization stages, each operating on the residual error left by previous stages. The core architectural components are:

  • Encoder $E_\theta(\cdot)$: A convolutional (or convolution plus downsampling) network mapping input $\mathbf{x} \in \mathbb{R}^{H_I \times W_I \times D_I}$ to a dense continuous latent tensor $z = E_\theta(\mathbf{x}) \in \mathbb{R}^{H \times W \times D}$ (Adiban et al., 2023).
  • Hierarchical Residual Quantizer: Consists of $n$ quantization levels. At level $i$, a family of $M^{i-1}$ codebooks $C_i$ (each of size $M$) is available, indexed by the code chosen at level $i-1$ (Adiban et al., 2023); each codeword lies in $\mathbb{R}^D$. Only one codebook out of the hierarchy is active per spatial location, determined by prior assignments (Adiban et al., 2022).
  • Residual Computation: The process begins with the initial residual $r_0 = z$. At each level $i$, the nearest codeword $e_i(h, w)$ for each spatial location $(h, w)$ is selected from the chosen codebook, and the residual is updated as $r_i = r_{i-1} - e_i$ (Adiban et al., 2023).
  • Decoder $D_\phi(\cdot)$: Receives the sum of quantized embeddings across all levels, $z_q = \sum_{i=1}^{n} e_i$, and reconstructs the output $\hat{\mathbf{x}} = D_\phi(z_q)$ (Adiban et al., 2023).

The hierarchical structure forces each quantizer to encode only information not already explained by the previous stages, ensuring non-redundant representations (Adiban et al., 2022).
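
The codebook hierarchy can be pictured as a tree: level $i$ holds $M^{i-1}$ codebooks, and the codes chosen at earlier levels decide which one is active. The following is a minimal Python sketch of one possible indexing scheme. The function name `active_codebook_index` and the base-$M$ flattening of the code path are illustrative assumptions, not details taken from the papers.

```python
def active_codebook_index(prev_codes, M):
    """Flat index of the active codebook at level len(prev_codes) + 1.

    prev_codes: codes chosen at the earlier levels, each in range(M).
    Assumption (not from the papers): the M**(i-1) codebooks of level i
    are stored in base-M order of the path of earlier codes.
    """
    index = 0
    for code in prev_codes:
        if not 0 <= code < M:
            raise ValueError("code out of range")
        index = index * M + code
    return index


M = 4
# Level 1 has a single codebook, reached by the empty path.
print(active_codebook_index([], M))       # 0
# Level 3 has M**2 = 16 codebooks; path (2, 3) selects number 2*4 + 3.
print(active_codebook_index([2, 3], M))   # 11
```

With this layout, every distinct path of earlier codes maps to a distinct codebook, so the number of codebooks at level $i$ is exactly $M^{i-1}$.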

2. Hierarchical Residual Quantization Process

The HR-VQVAE quantization process is recursively defined over $n$ layers:

  1. Initialization: $r_0 = z = E_\theta(\mathbf{x})$.
  2. At each layer $i = 1, \dots, n$:
    • Compute embedding: take the residual left by the previous layer, $r_{i-1}$, as the embedding to quantize (Adiban et al., 2022).
    • Quantize $r_{i-1}$ with the active codebook (selected via previous assignments): $e_i = \arg\min_{c} \lVert r_{i-1} - c \rVert_2$, where $c$ ranges over the codewords of the active codebook.
    • Update residual: $r_i = r_{i-1} - e_i$.
  3. Decode: Combine quantized codes, $z_q = \sum_{i=1}^{n} e_i$, and reconstruct $\hat{\mathbf{x}} = D_\phi(z_q)$ with the decoder $D_\phi$.

Only a single codebook of size $M$ is searched per location per layer, yielding $O(nM)$ decoding complexity instead of the $O(M^n)$ required by a flat codebook (Adiban et al., 2022).
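
The recursion above can be sketched for a single spatial location in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names `nearest` and `hr_quantize`, the random codebooks, and the path-based choice of the active codebook are all hypothetical.

```python
import random


def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)))


def hr_quantize(z, codebooks, M):
    """Quantize one latent vector z through len(codebooks) residual levels.

    codebooks[i] is a list of M**i codebooks (level i + 1 in the text),
    each a list of M codewords. Returns the chosen codes, the summed
    quantized vector, and the final residual.
    """
    residual = list(z)
    z_q = [0.0] * len(z)
    codes, active = [], 0
    for level in codebooks:
        book = level[active]            # only one codebook is searched
        k = nearest(book, residual)
        codes.append(k)
        z_q = [q + c for q, c in zip(z_q, book[k])]
        residual = [r - c for r, c in zip(residual, book[k])]
        active = active * M + k         # assumed path-based indexing
    return codes, z_q, residual


random.seed(0)
n, M, D = 3, 4, 8
# Random codebooks whose scale shrinks with depth (an illustrative choice).
codebooks = [[[[random.gauss(0, 0.5 ** i) for _ in range(D)]
               for _ in range(M)]
              for _ in range(M ** i)]
             for i in range(n)]
z = [random.gauss(0, 1) for _ in range(D)]
codes, z_q, residual = hr_quantize(z, codebooks, M)
print(codes)
```

Each level performs exactly $M$ distance computations, so one location costs $nM$ comparisons in total, and the sum of the quantized vector and the final residual recovers the original latent vector.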

3. Objective Function and Training

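The loss for HR-VQVAE for one sample $\mathbf{x}$ is:

$$\mathcal{L}(\mathbf{x}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \sum_{i=1}^{n} \left( \lVert \mathrm{sg}[r_{i-1}] - e_i \rVert_2^2 + \beta \, \lVert r_{i-1} - \mathrm{sg}[e_i] \rVert_2^2 \right)$$

Here $\hat{\mathbf{x}}$ is the reconstruction, $r_{i-1}$ is the residual quantized at layer $i$ (with $r_0$ the encoder output), and $e_i$ is the codeword selected at that layer.

  • Reconstruction loss: Forces fidelity between input and reconstruction.
  • Codebook loss for each layer: Pulls codewords towards encoder outputs.
  • Commitment loss (with weighting $\beta$): Forces encoder outputs towards the chosen codeword, promoting assignment stability.
  • $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.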

This structured residual loss distribution prevents codebook collapse and enforces that later stages encode only complementary information (Adiban et al., 2023, Adiban et al., 2022).
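
The three terms listed above can be evaluated numerically for a single location, as in the sketch below. It is a forward-only illustration: it assumes squared Euclidean distances and a single commitment weight `beta`, and the function names are hypothetical. Because the stop-gradient operator changes only how gradients flow, the codebook and commitment terms take the same numerical value here apart from the weighting.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hr_vqvae_loss(x, x_hat, layer_inputs, codewords, beta):
    """Forward value of a VQ-VAE-style loss summed over residual layers.

    layer_inputs[i]: the vector quantized at layer i (the residual).
    codewords[i]:    the codeword selected for it.
    Returns the total and the three individual terms.
    """
    recon = sq_dist(x, x_hat)
    codebook = sum(sq_dist(r, e) for r, e in zip(layer_inputs, codewords))
    commit = beta * codebook   # same distance; differs only in gradient flow
    return recon + codebook + commit, (recon, codebook, commit)


total, terms = hr_vqvae_loss(
    x=[1.0, 0.0],
    x_hat=[0.5, 0.0],
    layer_inputs=[[1.0, 1.0], [0.5, 0.0]],
    codewords=[[0.5, 1.0], [0.5, 0.5]],
    beta=0.25,
)
print(total, terms)   # 0.875 (0.25, 0.5, 0.125)
```

In a framework with automatic differentiation, the two per-layer terms would be written separately with a stop-gradient on opposite arguments, so that one updates the codebook and the other updates the encoder.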

4. Practical Implementation and Hyperparameters

A typical three-layer HR-VQVAE configuration specifies:

  • Latent map size $H \times W$ (for images of size $H_I \times W_I$)
  • A codebook size for each of the three layers
  • Codeword dimension $D$
  • Commitment weight $\beta$
  • Deep encoder/decoder networks with hidden channels (e.g., 128) and residual blocks per layer (e.g., 64)
  • Adam optimizer, Polyak EMA decay 0.9 (Adiban et al., 2022)

The number of training steps varies by dataset (for example, between FFHQ and ImageNet). For generation, an autoregressive PixelCNN prior can be used over the discrete latent codes (Adiban et al., 2022).

5. Empirical Performance and Analysis

HR-VQVAE achieves the lowest MSE on every dataset and the lowest FID on all but CIFAR10, where VQ-VAE-2 is marginally ahead (18.03 vs 18.11):

Dataset | VQ-VAE | VQ-VAE-2 | HR-VQVAE
FFHQ 256×256 | 2.86 / 0.00298 | 1.92 / 0.00195 | 1.26 / 0.00163
ImageNet 128×128 | 3.66 / 0.00055 | 2.94 / 0.00039 | 2.28 / 0.00027
CIFAR10 32×32 | 21.65 / 0.00092 | 18.03 / 0.00068 | 18.11 / 0.00041
MNIST 28×28 | 7.90 / 0.00041 | 6.70 / 0.00025 | 6.10 / 0.00011

(Table: each cell reports FID↓ / MSE↓ on held-out test samples (Adiban et al., 2022).)

  • HR-VQVAE reconstructions preserve finer details than baselines and exhibit moderate MSE improvement.
  • Decoding speed is substantially improved relative to VQ-VAE.
  • For high codebook cardinality, VQ-VAE and VQ-VAE-2 experience codebook collapse (increasing MSE); HR-VQVAE avoids this failure mode (Adiban et al., 2022).
  • In video prediction, the S-HR-VQVAE framework, which incorporates HR-VQVAE and an autoregressive spatiotemporal predictive model (AST-PM), achieves state-of-the-art quantitative and qualitative results on challenging benchmarks, including KTH Human Action and Human3.6M, with model size substantially reduced compared to alternatives (Adiban et al., 2023).
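
To make the size of the reconstruction gains concrete, the short script below computes the relative reduction of HR-VQVAE over VQ-VAE-2 for each dataset. The numbers are copied from the table above. The helper name `relative_reduction` is illustrative, and a negative value means HR-VQVAE is worse on that metric.

```python
# (FID, MSE) per model, copied from the results table.
results = {
    "FFHQ 256x256": {"VQ-VAE-2": (1.92, 0.00195), "HR-VQVAE": (1.26, 0.00163)},
    "ImageNet 128x128": {"VQ-VAE-2": (2.94, 0.00039), "HR-VQVAE": (2.28, 0.00027)},
    "CIFAR10 32x32": {"VQ-VAE-2": (18.03, 0.00068), "HR-VQVAE": (18.11, 0.00041)},
    "MNIST 28x28": {"VQ-VAE-2": (6.70, 0.00025), "HR-VQVAE": (6.10, 0.00011)},
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


for dataset, row in results.items():
    fid = relative_reduction(row["VQ-VAE-2"][0], row["HR-VQVAE"][0])
    mse = relative_reduction(row["VQ-VAE-2"][1], row["HR-VQVAE"][1])
    print(f"{dataset}: FID {fid:+.1f}%, MSE {mse:+.1f}%")
```

The MSE reduction is positive on all four datasets, while the FID reduction is slightly negative on CIFAR10.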

6. Advantages, Limitations, and Distinctions

Advantages:

  • Elimination of codebook collapse: Each residual quantizer encodes information not already represented, maximizing codebook utilization (Adiban et al., 2022).
  • Fast decoding: Hierarchical structure enables $O(nM)$ decoding per location, a substantial speedup compared to prior art (Adiban et al., 2022).
  • Improved quality at fixed bit rate: Hierarchical coarse-to-fine coding leads to lower quantization error (higher PSNR/SSIM) (Adiban et al., 2023).
  • Superior gradient propagation: Each VQ module fits a reduced, decorrelated residual, improving optimization stability (Adiban et al., 2023).
  • Easy scaling of codebook capacity: Supports very large total numbers of codes ($M^n$ effective patterns), without memory or utilization bottlenecks.
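
The gap between searching one codebook of size $M$ at each of $n$ levels and searching a flat codebook with the same number of effective patterns grows quickly. The sketch below counts codeword comparisons per location. The $(n, M)$ pairs are arbitrary illustrative values, not settings reported in the papers.

```python
def hierarchical_comparisons(n, M):
    """One codebook of size M is searched at each of n levels."""
    return n * M


def flat_comparisons(n, M):
    """A flat codebook covering the same M**n effective patterns."""
    return M ** n


for n, M in [(2, 16), (3, 16), (3, 64)]:
    print(n, M, hierarchical_comparisons(n, M), flat_comparisons(n, M))
# 2 16 32 256
# 3 16 48 4096
# 3 64 192 262144
```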

Limitations:

  • Hyperparameter sensitivity: Requires careful selection of the number of levels $n$, branch factor $M$, commitment weights $\beta$, and residual encoder structures (Adiban et al., 2022).
  • Increased memory: Storage for $M^i$ codewords at layer $i$ may become prohibitive at very high dimension/depth (Adiban et al., 2022).
  • Training cost: The multi-stage setup increases training time, although decoding is vastly accelerated (Adiban et al., 2022).
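
The memory limitation follows from the codebook counts: with $M^{i-1}$ codebooks of size $M$ at level $i$, that level stores $M^i$ codewords. The sketch below is a rough estimate of the resulting storage. It assumes 4-byte floating-point values, and the values of $n$, $M$, and $D$ are illustrative rather than taken from the papers.

```python
def codewords_per_layer(n, M):
    """Level i holds M**(i-1) codebooks of M codewords, i.e. M**i codewords."""
    return [M ** i for i in range(1, n + 1)]


def codebook_bytes(n, M, D, bytes_per_value=4):
    """Total storage for all codewords of dimension D (float32 assumed)."""
    return sum(codewords_per_layer(n, M)) * D * bytes_per_value


print(codewords_per_layer(3, 16))   # [16, 256, 4096]
print(codebook_bytes(3, 16, 64))    # 1118208
```

The last level dominates the total, so each additional level multiplies the storage by roughly $M$.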

Distinctions from VQ-VAE and VQ-VAE-2:

  • VQ-VAE uses a flat latent grid and single codebook; VQ-VAE-2 employs a hierarchy but does not quantize true residuals at each stage, leading to redundancy and codebook underutilization.
  • HR-VQVAE leverages strict residual decomposition, hierarchical codebook selection, and composite decoding, resulting in higher effective code capacity and lower distortion (Adiban et al., 2022).

7. Applications and Extensions

HR-VQVAE underpins high-fidelity image reconstruction, efficient image generation with autoregressive neural priors (e.g., PixelCNN), and scalable spatiotemporal modeling in video generation tasks. Its integration as the perceptual backbone in S-HR-VQVAE demonstrates superior performance for video prediction across multidomain datasets by enabling both compact, high-capacity spatial representations and efficient entropy coding (Adiban et al., 2023).

A notable implication is the potential of HR-VQVAE architectures to serve as a standard discrete encoding backbone for low-latency, high-quality image, video, or sequential data compression and generation pipelines. The coarse-to-fine residual quantization and architectural modularity lend themselves to combination with transformer-based or diffusion-based temporal models and autoregressive priors.


References:

  • (Adiban et al., 2023) "S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction"
  • (Adiban et al., 2022) "Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation"
