RQ-VAE: Residual Quantization VAE
- RQ-VAE is a discrete latent variable model that recursively quantizes residual errors to build a virtual codebook with exponential capacity.
- It sequentially refines quantization via multiple stages, reducing autoregressive complexity while enhancing reconstruction quality.
- It enables efficient generative modeling in applications such as high-resolution image synthesis and 3D motion generation, addressing common VQ-VAE limitations.
Residual Quantization Variational Autoencoder (RQ-VAE) is a class of discrete latent variable models that implements hierarchical or multi-stage vector quantization (VQ) within a variational autoencoding framework. The primary goal of RQ-VAE is to achieve precise and compact discrete representations of high-dimensional data—such as images or sequences—by recursively quantizing the residual error of previous quantizations. This approach enables efficient autoregressive (AR) modeling with shorter code sequences and smaller codebooks while maintaining high representational capacity and fidelity, solving key limitations inherent to single-stage VQ-VAEs. RQ-VAE has been successfully extended and analyzed in diverse domains, including high-resolution image synthesis, 3D motion generation, and as a building block for hierarchical Bayesian VAEs.
1. RQ-VAE Architecture and Quantization Mechanism
RQ-VAE replaces standard single-level vector quantization with a stack of $D$ residual quantizers applied in a coarse-to-fine manner. The canonical architecture consists of the following components:
- Encoder $E$: Encodes an input $x$ to a feature map $z = E(x) \in \mathbb{R}^{H \times W \times d}$, where $H \times W$ is typically a spatial downsampling of the input resolution.
- Residual Quantizers: At each spatial location, a shared codebook $\mathcal{C} = \{e_k\}_{k=1}^{K}$ is used. The quantization proceeds in $D$ stages (a code sketch follows below):
  - Initialize the residual $r_0 = z$.
  - For stage $d = 1, \dots, D$: select the nearest code $c_d = \arg\min_k \lVert r_{d-1} - e_k \rVert_2$ and update the residual $r_d = r_{d-1} - e_{c_d}$.
  - The final quantized latent is $\hat{z} = \sum_{d=1}^{D} e_{c_d}$.
  - The stack of $D$ codes $(c_1, \dots, c_D)$ per location encodes up to $K^D$ virtual clusters.
- Decoder $G$: Mirrors the encoder and reconstructs the input as $\hat{x} = G(\hat{z})$, where $\hat{z}$ aggregates all quantized increments per spatial position.
This residual quantization can be implemented with shared codebooks across depths (Lee et al., 2022) or with layer-specific codebooks and hierarchical conditional linking (Adiban et al., 2022, Liu et al., 3 Nov 2025), depending on the variant and task.
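Below is a minimal PyTorch sketch of the shared-codebook variant; the class name, tensor layout, and default hyperparameters are illustrative rather than a reference implementation from the cited papers.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Minimal residual quantizer with one codebook shared across D depths."""

    def __init__(self, codebook_size: int = 2048, dim: int = 256, depth: int = 4):
        super().__init__()
        self.depth = depth
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, positions, dim) encoder features, flattened over H x W.
        residual, quantized, codes = z, torch.zeros_like(z), []
        for _ in range(self.depth):
            # Squared L2 distance from the current residual to every code.
            w = self.codebook.weight
            dist = (residual.pow(2).sum(-1, keepdim=True)
                    - 2 * residual @ w.t()
                    + w.pow(2).sum(-1))             # (B, P, K)
            idx = dist.argmin(-1)                    # nearest code per position
            e = self.codebook(idx)                   # (B, P, dim)
            quantized = quantized + e                # accumulate coarse-to-fine
            residual = residual - e                  # quantize what is left
            codes.append(idx)
        # Straight-through estimator: copy gradients from z-hat back to z.
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, -1)     # codes: (B, P, D)
```

A typical call is `quantized, codes = quantizer(z.flatten(2).transpose(1, 2))` on a convolutional feature map; the straight-through copy in the last line is what lets reconstruction gradients reach the encoder despite the non-differentiable code lookup.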
2. Training Objectives and Mathematical Formulation
RQ-VAE models are trained by optimizing a combination of reconstruction and commitment objectives, with optional guidance from adversarial and perceptual losses. The key components are:
- Reconstruction Loss:
  $$\mathcal{L}_{\mathrm{recon}} = \lVert x - \hat{x} \rVert_2^2,$$
  where $\hat{x} = G(\hat{z})$ is the reconstructed output.
- Commitment Loss: To ensure encoder latents remain close to the quantized codes,
  $$\mathcal{L}_{\mathrm{commit}} = \sum_{d=1}^{D} \bigl\lVert z - \mathrm{sg}\bigl[\hat{z}^{(d)}\bigr] \bigr\rVert_2^2, \qquad \hat{z}^{(d)} = \sum_{d'=1}^{d} e_{c_{d'}},$$
  where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.

The total training loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \beta\,\mathcal{L}_{\mathrm{commit}},$$
where $\beta$ is a hyperparameter (typically $1.0$; Lee et al., 2022).
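A compact sketch of this objective in PyTorch, assuming the quantizer also exposes the per-depth partial sums $\hat{z}^{(d)}$ (the `partial_sums` argument and function name are illustrative):

```python
import torch.nn.functional as F


def rqvae_loss(x, x_hat, z, partial_sums, beta: float = 1.0):
    """Reconstruction plus depth-summed commitment loss.

    partial_sums: list of D tensors, the cumulative quantized latent
    up to each depth d, each with the same shape as z.
    """
    recon = F.mse_loss(x_hat, x)
    # Stop-gradient on the quantized side, as in the commitment term above.
    commit = sum(F.mse_loss(z, q.detach()) for q in partial_sums)
    return recon + beta * commit
```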
Alternative and Bayesian Training Objectives:
Within HQ-VAE, RQ-VAE is a special case where only residual-refinement layers are used, allowing a fully probabilistic training scheme via maximization of the evidence lower bound (ELBO). This approach introduces stochastic dequantized auxiliaries and entropy-based regularizers to improve codebook usage and stability (Takida et al., 2023).
3. Comparison with Single-Stage and Hierarchical VQ-VAEs
Standard VQ-VAE selects a single nearest code for each encoder output, limiting flexibility. To maintain fidelity at low spatial resolution, the codebook must grow exponentially, causing codebook collapse, memory issues, and inefficient code assignment. RQ-VAE, by applying $D$ successive code selections from a codebook of size $K$, offers a virtual codebook of size $K^D$ without additional parameters and allows the effective code sequence length in AR models to be greatly shortened.
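For concreteness, take the FFHQ-scale configuration from the hyperparameter table below: with $K = 2{,}048$ and $D = 4$,
$$K^D = 2{,}048^4 = 2^{44} \approx 1.76 \times 10^{13},$$
a virtual-cluster count that a single-stage VQ-VAE could only match with an impractically large explicit codebook.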
Variants such as HR-VQVAE (Adiban et al., 2022) extend RQ-VAE with hierarchical codebook structures and conditional linking, with each residual quantization layer focusing on the information left unexplained by previous layers. Each codebook at a given depth comprises sub-codebooks indexed by the codes chosen at the previous depth, further improving codebook utilization and combating collapse.
In HQ-VAE (Takida et al., 2023), RQ-VAE falls out as the purely residual-refinement pathway, taking advantage of unified variational-Bayesian training and learned quantization variances.
4. Practical Implementation Details and Hyperparameters
Typical configuration (as established in generative image and motion modeling studies (Lee et al., 2022, Liu et al., 3 Nov 2025)):
| Component | Parameter | Typical Value |
|---|---|---|
| Codebook size | $K$ | 16,384 (ImageNet); 2,048 (FFHQ) |
| Depth (# quantization stages) | $D$ | 2–8; 4 in most AR image models |
| Embedding dimension | $d$ | 256 |
| Spatial latent resolution | $H \times W$ | 8×8 (for 256×256 images) |
| Optimizer | Adam, $(\beta_1, \beta_2)$ | (0.5, 0.9) |
| Learning rate | — | — |
| Batch size | — | 128 (ImageNet) |
| Commitment loss weight | $\beta$ | 1.0 |
Codebook maintenance:
Exponential moving average (EMA) updates or variational Bayes with entropy regularization prevent codebook collapse and ensure uniform utilization (Lee et al., 2022, Takida et al., 2023).
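For concreteness, a minimal sketch of the standard EMA codebook update for a single quantization depth follows; buffer names and the Laplace-smoothing constant are illustrative, not taken from the cited implementations.

```python
import torch


@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_avg, flat_z, idx,
                        decay: float = 0.99, eps: float = 1e-5):
    """Standard EMA update: move each code toward the running mean of the
    encoder outputs assigned to it (flat_z: (N, dim), idx: (N,))."""
    K, dim = codebook.shape
    onehot = torch.zeros(flat_z.size(0), K, device=flat_z.device)
    onehot.scatter_(1, idx.unsqueeze(1), 1.0)
    # Per-code assignment counts and summed features for this batch.
    counts = onehot.sum(0)                    # (K,)
    sums = onehot.t() @ flat_z                # (K, dim)
    cluster_size.mul_(decay).add_(counts, alpha=1 - decay)
    embed_avg.mul_(decay).add_(sums, alpha=1 - decay)
    # Laplace smoothing avoids division by zero for rarely used codes.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook.copy_(embed_avg / smoothed.unsqueeze(1))
```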
5. Empirical Performance and Applications
RQ-VAE achieves lower distortion at reduced code lengths and enables efficient AR generation. On ImageNet at $256\times256$ resolution ($K = 16{,}384$, $D = 4$), RQ-VAE matches or surpasses VQ-GAN's fidelity with a code map of half the side length ($8\times8$ vs. $16\times16$), substantially reducing AR modeling complexity (Lee et al., 2022). Empirical comparisons show:
| Model | Latent map | rFID (ImageNet) | Decoding speed improvement |
|---|---|---|---|
| VQ-GAN | 16×16×1 | 4.3 | — |
| RQ-VAE | 8×8×4 | 4.7 | faster AR prior |
| RQ-VAE | 8×8×8 | 2.7 | — |
In HR-VQVAE (Adiban et al., 2022), residual quantization yields improved FID, MSE, and SSIM over VQ-VAE and VQ-VAE-2 with three residual layers, and reduces decoding time by up to an order of magnitude.
RQ-VAE has also been applied to non-image domains, notably 3D human motion generation (Liu et al., 3 Nov 2025). Here, each motion sequence is encoded, residual-quantized across layers (with optional multi-scale token preservation strategies), and decoded with high fidelity. Increasing the depth $D$ systematically improves reconstruction FID, demonstrating the additive effect of residual quantization depth.
6. Extensions, Hierarchical Schemes, and Bayesian Variants
RQ-VAE serves as a building block for more general hierarchical quantization architectures (e.g., HQ-VAE (Takida et al., 2023)), in which both residual refinement and multi-resolution feature injection can be mixed. The Bayesian HQ-VAE introduces stochastic dequantization, entropy control, and automatic variance annealing to maximize ELBO, removing the need for heuristics such as EMA codebook updates or stop-gradient operations.
MTPS (Multi-scale Token Preservation Strategy), introduced in MoSa (Liu et al., 3 Nov 2025), leverages RQ-VAE's hierarchical quantization to efficiently downsample and quantize at successive scales, significantly reducing autoregressive modeling costs in generative transformers for motion synthesis.
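MoSa's exact procedure is not reproduced here; the sketch below shows a generic scale-wise residual quantization loop of the kind such multi-scale schemes build on (the function name, the linear resampling, and the 1D motion layout are all assumptions):

```python
import torch
import torch.nn.functional as F


def multiscale_residual_quantize(z, quantizers, scales):
    """Generic multi-scale residual quantization sketch (not MoSa's code).

    z: (B, C, T) motion feature map; scales: increasing temporal lengths;
    quantizers: one callable per scale, returning (quantized, codes).
    """
    residual = z
    approx = torch.zeros_like(z)
    all_codes = []
    for quantize, s in zip(quantizers, scales):
        # Quantize a coarse (downsampled) view of the current residual...
        coarse = F.interpolate(residual, size=s, mode="linear",
                               align_corners=False)
        q_coarse, codes = quantize(coarse)
        # ...then upsample and subtract, so finer scales see what is left.
        q_full = F.interpolate(q_coarse, size=z.size(-1), mode="linear",
                               align_corners=False)
        approx = approx + q_full
        residual = residual - q_full
        all_codes.append(codes)
    return approx, all_codes
```

Because coarse scales emit far fewer tokens than the full-resolution sequence, the AR prior's context length shrinks accordingly, which is the cost reduction the paragraph above describes.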
7. Limitations and Considerations
RQ-VAE's advantages depend critically on residual quantization depth, codebook update strategies, and commitment/entropy regularization. While stacking quantization layers increases representational capacity, practical performance saturates or may degrade if codebooks are too small or depth too high without sufficient data or appropriate regularization. Bayesian variants provide systematic methods to avoid codebook collapse, but may introduce additional computational overhead from sampling and entropy terms. In hierarchical variants, proper management of codebook trees, residual targets, and parameter scaling is essential.
References
- D. Lee et al., "Autoregressive Image Generation using Residual Quantization" (Lee et al., 2022)
- Y. Takida et al., "HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes" (Takida et al., 2023)
- M. Adiban et al., "Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation" (Adiban et al., 2022)
- Liu et al., "MoSa: Motion Generation with Scalable Autoregressive Modeling" (Liu et al., 3 Nov 2025)