Residual-Quantized VAE (RQ-VAE)
- The paper demonstrates how integrating residual quantization into VQ-VAE exponentially increases the effective representational capacity (K^D) while mitigating codebook collapse.
- RQ-VAE employs a multi-stage residual quantization process using a shared codebook and progressive commitment loss to accurately approximate image features with shorter code sequences.
- Empirical results reveal that RQ-VAE achieves faster autoregressive sampling and competitive reconstruction fidelity on benchmarks like FFHQ and ImageNet, enabling efficient high-resolution generative modeling.
Residual-Quantized VAE (RQ-VAE) is a deep generative model that integrates residual quantization—a classical technique from signal compression—into the vector-quantized variational autoencoder (VQ-VAE) framework. It addresses several key challenges in high-fidelity image compression and autoregressive generation by enabling efficient, high-capacity discrete representations. The model is most influential as part of two-stage autoregressive generative frameworks, where an RQ-VAE is paired with an autoregressive model trained over discrete code sequences.
1. Mathematical Formulation and Quantization Process
Let $E$ and $G$ denote the encoder and decoder networks, respectively. Let $X$ be the input image, downsampled by $E$ to an encoded feature tensor $Z \in \mathbb{R}^{H \times W \times n_z}$. RQ-VAE replaces the single-step vector quantization (VQ) module with a multi-stage residual quantization procedure using a shared codebook $\mathcal{C} = \{e(k)\}_{k=1}^{K}$ of size $K$.
Given quantization depth $D$, the quantization of each $n_z$-dimensional feature vector $z$ at spatial location $(h, w)$ proceeds recursively:
- Set $r_0 = z$.
- For $d = 1, \dots, D$: select $k_d = \arg\min_{k} \lVert r_{d-1} - e(k) \rVert_2^2$ and update the residual $r_d = r_{d-1} - e(k_d)$.
- The quantized representation is the code stack $(k_1, \dots, k_D)$ and the reconstructed vector is $\hat{z} = \sum_{d=1}^{D} e(k_d)$.
This residual quantization provides an exponentially larger effective representational capacity ($K^D$) at a given codebook size $K$ and code map resolution, as opposed to increasing $K$ alone in standard VQ.
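To make this procedure concrete, the following PyTorch-style sketch implements the recursion above under simplified assumptions; the function name `residual_quantize` and the toy shapes are illustrative rather than drawn from the authors' implementation, and codebook learning (commonly handled with exponential-moving-average updates in VQ-VAEs) is omitted.

```python
import torch

def residual_quantize(z, codebook, depth):
    """Quantize feature vectors z of shape (..., n_z) into `depth` stacked codes.

    codebook: (K, n_z) tensor holding the shared code vectors e(k).
    Returns code indices of shape (..., depth) and the reconstruction sum_d e(k_d).
    """
    residual = z
    z_hat = torch.zeros_like(z)
    codes = []
    for _ in range(depth):
        # Nearest code vector to the current residual (L2 distance).
        dists = torch.cdist(residual.reshape(-1, z.shape[-1]), codebook)
        k = dists.argmin(dim=-1).reshape(z.shape[:-1])
        e_k = codebook[k]              # selected code vectors e(k_d)
        z_hat = z_hat + e_k            # partial reconstruction \hat{z}^{(d)}
        residual = residual - e_k      # r_d = r_{d-1} - e(k_d)
        codes.append(k)
    return torch.stack(codes, dim=-1), z_hat

# Toy usage: an 8x8 feature map with n_z = 4, a codebook of K = 16 vectors, D = 3.
z = torch.randn(8, 8, 4)
codebook = torch.randn(16, 4)
codes, z_hat = residual_quantize(z, codebook, depth=3)
print(codes.shape, z_hat.shape)  # torch.Size([8, 8, 3]) torch.Size([8, 8, 4])
```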
2. Training Objective and Architecture
Training is carried out with a VQ-VAE-style loss, modified to support residual quantization. The optimization objective is
$$\mathcal{L} = \lVert X - G(\hat{Z}) \rVert_2^2 + \beta \sum_{d=1}^{D} \lVert Z - \mathrm{sg}\big[\hat{Z}^{(d)}\big] \rVert_2^2,$$
where $\hat{Z}^{(d)}$ is the quantized feature map after $d$ quantization steps (so $\hat{Z} = \hat{Z}^{(D)}$) and $\mathrm{sg}[\cdot]$ denotes stop-gradient. The commitment loss is summed across all depths, progressively encouraging the encoder to generate features that are well approximated by the cumulative sum of code vectors. In practice, adversarial and perceptual losses (as in VQ-GAN) may also be added to improve perceptual quality.
Notably, the same codebook is used for all steps, distinguishing RQ-VAE from product quantization or VQ-VAE-2 variants with independent codebooks per depth or layer.
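A minimal sketch of this objective is given below, assuming the cumulative quantizations $\hat{Z}^{(1)}, \dots, \hat{Z}^{(D)}$ have already been computed (for instance by accumulating the partial sums inside `residual_quantize` above); the straight-through estimator passes decoder gradients to the encoder, and codebook updates as well as the optional adversarial and perceptual terms are left out.

```python
import torch

def rqvae_loss(x, z, partials, decoder, beta=0.25):
    """x: input image; z: encoder output E(X);
    partials: cumulative quantizations [Z_hat^(1), ..., Z_hat^(D)] of z."""
    z_hat = partials[-1]
    # Straight-through estimator: the decoder sees the quantized features,
    # while gradients reach the encoder as if quantization were the identity.
    z_hat_st = z + (z_hat - z).detach()
    recon = torch.mean((x - decoder(z_hat_st)) ** 2)
    # Commitment loss summed over all depths d = 1..D (code vectors stop-gradiented).
    commit = sum(torch.mean((z - zd.detach()) ** 2) for zd in partials)
    return recon + beta * commit

# Toy check with an identity "decoder" and random stand-ins for the partial sums.
z = torch.randn(2, 8, 8, 4, requires_grad=True)
partials = [0.5 * z.detach(), 0.8 * z.detach(), z.detach()]
loss = rqvae_loss(x=z.detach(), z=z, partials=partials, decoder=lambda t: t)
loss.backward()
```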
3. Compression and Rate-Distortion Properties
RQ-VAE directly addresses issues arising from the rate-distortion tradeoff that constrain prior VQ-based generative models:
- Code sequence length: In standard VQ-VAE, reducing the code map resolution from $16 \times 16$ to $8 \times 8$ for $256 \times 256$ images would require growing the codebook from $K$ to roughly $K^4$ entries to maintain reconstruction fidelity, which is impractical due to codebook collapse and inefficiency.
- Expressivity: With RQ-VAE, feature vectors are quantized as stacks of $D$ code indices, enabling the use of moderate codebook sizes (e.g., $K = 16{,}384$) while still realizing up to $K^D$ distinct clusterings per vector.
- Coarse-to-fine approximation: Code usage exhibits a clear hierarchy, with the first few depths capturing coarse structure and subsequent depths refining details.
At inference, the codes at each spatial location can be entropy-coded, fitting into standard compression workflows, and the parallelizable architecture supports fast GPU-based encoding and decoding.
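A back-of-the-envelope comparison makes these tradeoffs concrete; the values below ($8 \times 8 \times 4$ RQ codes with $K = 16{,}384$ against a flat $16 \times 16$ VQ map for a $256 \times 256$ image) are assumed example settings rather than a benchmark result.

```python
import math

K, D = 16_384, 4                  # assumed example values
rq_hw, vq_hw = (8, 8), (16, 16)   # RQ vs. flat-VQ code-map resolutions

rq_positions = rq_hw[0] * rq_hw[1]
vq_positions = vq_hw[0] * vq_hw[1]

# Distinct quantization outcomes available per feature vector.
print(f"flat VQ clusters per vector: {K:,}")
print(f"RQ clusters per vector (K^D): {K**D:.2e}")

# Raw code length before entropy coding (log2 K bits per index).
bits_rq = rq_positions * D * math.log2(K)
bits_vq = vq_positions * math.log2(K)
print(f"raw bits/image: RQ {bits_rq:.0f} vs. flat VQ {bits_vq:.0f}")

# Bits per pixel for a 256x256 image, before entropy coding.
print(f"RQ bits/pixel: {bits_rq / (256 * 256):.3f}")
```

Under these assumed settings the raw rate equals that of the flat $16 \times 16$ map at the same $K$, but each spatial vector can fall into one of $K^D$ composite clusters, and the spatial code map handed to the AR prior is four times smaller.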
4. Computational and Practical Impact in AR Generation
RQ-VAE has had significant impact on the feasibility of high-resolution, autoregressive (AR) image generation with transformers (Lee et al., 2022):
- Sequence length reduction: By decreasing the code map's spatial dimensions (e.g., $8 \times 8$ vs. $16 \times 16$) and using stacked codes, the total code sequence length for AR modeling is drastically shorter. This reduces the quadratic computational and memory overhead in AR transformers, directly enabling large-batch, high-resolution experiments (see the arithmetic after this list).
- Training stability: The use of a fixed, shared codebook across all depths guards against codebook collapse, increasing code utilization and improving learning dynamics relative to flat VQ with a very large $K$.
- Fidelity-efficiency tradeoff: On benchmarks such as FFHQ and ImageNet, RQ-VAE with $8 \times 8 \times 4$ codes and $K = 16{,}384$ achieves competitive or superior FID compared to VQ-GAN using $16 \times 16$ codes, without exponential increases in codebook size.
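The sequence-length arithmetic behind these gains is simple; the figures below use the same assumed $16 \times 16$ vs. $8 \times 8$ code maps and give only a rough upper bound on the savings, since the paired RQ-Transformer additionally factorizes the spatial and depth dimensions.

```python
# Rough attention-cost comparison for AR priors over the code maps above.
vq_len = 16 * 16   # flat VQ-GAN code map: one code per position
rq_len = 8 * 8     # RQ code map: spatial positions (D codes handled per position)

# Self-attention cost grows quadratically with sequence length.
print(f"sequence length: {vq_len} -> {rq_len} ({vq_len // rq_len}x shorter)")
print(f"pairwise-attention cost ratio: {(vq_len // rq_len) ** 2}x")
```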
5. Relationship to Hierarchical and Bayesian Extensions
RQ-VAE can be interpreted as a deterministic, residual quantization model within a larger spectrum of hierarchical discrete generative models. It differs from:
- Quantized Hierarchical VAEs (Duan et al., 2022), which use multiple levels of hierarchical latent variables with joint quantization-aware priors and posteriors and allow fully parallel encoding and decoding. In contrast, RQ-VAE applies quantization sequentially to the residuals and is more naturally serial in structure.
- Soft quantization and Bayesian extensions: Subsequent work (HQ-VAE (Takida et al., 2023)) generalizes RQ-VAE by introducing stochastic quantization in a principled variational Bayes setting, addressing codebook and layer collapse and improving stability by learning an explicit posterior over discrete indices. This suggests that stochastic, variational approaches can further enhance representational robustness, particularly for deeper hierarchies or hybrid architectures.
6. Empirical Results and Comparisons
RQ-VAE consistently outperforms traditional VQ-VAE and competitive VQ-GAN configurations in AR generative modeling benchmarks for high-resolution images. Key findings:
- Reconstruction FID: At $8 \times 8$ code resolution with $D = 4$ and $K = 16{,}384$, rFID matches or improves upon that of state-of-the-art VQ-GAN models using $16 \times 16$ code maps.
- Sampling Speed: Due to code sequence length reduction, paired AR transformers ("RQ-Transformer") operate 4–7x faster in sampling compared to VQ-GAN transformers under practical settings.
- Global context: Shorter sequences enable AR models to more fully leverage global context, improving image quality, especially on unconstrained or highly diverse datasets.
- Code usage: The model demonstrates thorough codebook utilization across depths, mitigating codebook collapse.
| Aspect | RQ-VAE | Standard VQ-VAE |
|---|---|---|
| Quantization | Residual (stacked); sequential on residuals | Flat; one-step |
| Codebook usage | Shared across depths; high utilization at moderate $K$ | Collapse-prone at large $K$ |
| AR sequence length | Short (low-resolution code map; $D$ codes per location) | Longer (flattened high-resolution map) |
| Expressivity | $K^D$ code points per vector | $K$ code points per vector |
| Compression workflow | Entropy coding of stacked codes | Entropy coding of flat codes |
| Empirical FID/speed | Superior at high resolution, faster sampling | Inferior for high-res |
7. Connections to Related Models and Methodologies
RQ-VAE is closely related to residual vector quantization (RQ) (Ferdowsi et al., 2017), product quantization, and hierarchical VQ architectures but is distinguished by:
- Its fully neural, end-to-end optimization of the encoder, decoder, and codebook,
- Use of a shared codebook (not layer-specific),
- Hard quantization at each residual with deterministic assignments (subsequent extensions (Takida et al., 2023) introduce stochastic/Bayesian quantization),
- Its target application to scalable, high-capacity discrete AR generative models.
Subsequent advancements have focused on introducing principled stochasticity and variational inference (HQ-VAE), alternative codebook structures (tree, hierarchical, or hyperbolic geometry (Piękos et al., 2025)), and combining residual quantization with other structural innovations for further improvements.
RQ-VAE thus constitutes a methodologically robust and empirically validated solution to the longstanding issues of rate-distortion scaling and codebook utilization in discrete generative modeling, forming the backbone of fast, high-resolution autoregressive image generators. Its significance extends to efficient compression and to generative modeling of images and other data modalities, and it serves as a bridge to more advanced hierarchical discrete representation learning frameworks.