RQ-Transformer: Efficient AR Generation

Updated 6 November 2025
  • RQ-Transformer is a class of autoregressive models that uses stacked residual quantization to represent images as compact code stacks for high-resolution generation.
  • It integrates a two-stage framework with an RQ-VAE for code assignment and spatial plus depth transformers to predict codes in a coarse-to-fine order.
  • Empirical results show significant efficiency improvements, including up to 7.3× faster sampling and superior FID scores compared to previous VQ-based methods.

RQ-Transformer refers to a class of autoregressive models that use residual quantization (RQ) for efficient, high-fidelity generation, particularly of high-resolution images. The approach hinges on representing images as stacked maps of discrete codes, enabling drastically shorter code sequences for autoregressive modeling. The architecture is distinguished by a two-stage framework: a Residual-Quantized VAE (RQ-VAE) assigns the code stacks, and a specialized transformer then models them spatially and in depth.

1. Residual Quantization and RQ-VAE Fundamentals

Residual quantization (RQ) generalizes classical vector quantization (VQ) by recursively quantizing the residual of a latent vector, thus decomposing it into a stack of discrete codes. Given a feature vector $z \in \mathbb{R}^{n_z}$, residual quantization produces a sequence $\{k_1,\dots,k_D\}$ as follows:

$$k_d = Q(r_{d-1}; C), \qquad r_d = r_{d-1} - e(k_d), \qquad r_0 = z,$$

where $Q(z; C)$ selects the nearest code in the codebook $C$, and $e(k)$ is the code-embedding map for index $k$. The quantized vector at depth $d$ is given by $\hat{z}^{(d)} = \sum_{i=1}^{d} e(k_i)$.
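
The following is a minimal NumPy sketch of this recursion, assuming a single codebook shared across depths (as in RQ-VAE); the function and variable names are illustrative rather than taken from a released implementation.

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Residually quantize a feature vector z into a stack of `depth` code indices.

    codebook: (K, n_z) array of code embeddings, shared across depths.
    Returns the code indices and the cumulative quantized vector z_hat.
    """
    residual = z.copy()
    codes = []
    z_hat = np.zeros_like(z)
    for _ in range(depth):
        # nearest-neighbour code for the current residual
        k = int(np.argmin(np.sum((codebook - residual) ** 2, axis=1)))
        codes.append(k)
        z_hat += codebook[k]
        residual = residual - codebook[k]
    return codes, z_hat

# toy usage: the residual error shrinks as depth grows
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # K = 16 codes, n_z = 8
z = rng.normal(size=8)
codes, z_hat = residual_quantize(z, codebook, depth=4)
print(codes, np.linalg.norm(z - z_hat))
```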

In RQ-VAE, an input image is mapped to a feature representation and quantized depth-wise at each spatial position, yielding a stack of $D$ codes per position (e.g., for a $256 \times 256$ image, a typical implementation uses an $8 \times 8$ grid with depth $D = 4$, i.e., $8 \times 8 \times 4$ codes).
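
Extending the same idea across space, the sketch below reuses the hypothetical `residual_quantize` helper above to quantize every position of a toy $8 \times 8$ feature map to depth 4, producing the $8 \times 8 \times 4$ code stack described here; the sizes and random codebook are purely illustrative.

```python
import numpy as np

# assumes residual_quantize from the previous sketch is in scope
H, W, n_z, D = 8, 8, 8, 4
rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, n_z))
feature_map = rng.normal(size=(H, W, n_z))   # stand-in for the RQ-VAE encoder output

code_stack = np.zeros((H, W, D), dtype=np.int64)
for i in range(H):
    for j in range(W):
        codes, _ = residual_quantize(feature_map[i, j], codebook, depth=D)
        code_stack[i, j] = codes

print(code_stack.shape)  # (8, 8, 4): D codes per spatial position
```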

2. RQ-Transformer Architecture: Two-Level Autoregression

The RQ-Transformer introduces a two-level autoregressive model:

  • Spatial Transformer processes sequences of spatial positions (raster scan order on the quantized feature map), generating global image context vectors.
  • Depth Transformer predicts the sequence of $D$ codes (the code stack) at each spatial position, in coarse-to-fine order, conditioned on preceding spatial positions and previously generated codes within the stack.

The joint probability of an entire code sequence is factorized as

$$p(S) = \prod_{t=1}^{T} \prod_{d=1}^{D} p\left(S_{td} \mid S_{<t,\cdot},\ S_{t,<d}\right),$$

where $T$ is the number of spatial positions ($H \times W$), $D$ is the quantization depth, and $S_{td}$ is the code index at position $t$ and depth $d$.

Inputs to each transformer are constructed as:

  • Spatial transformer: $u_t = \mathrm{PE}_T(t) + \sum_{d=1}^{D} e(S_{t-1,d})$
  • Depth transformer: $v_{td} = \mathrm{PE}_D(d) + \sum_{d'=1}^{d-1} e(S_{td'})$

Context vectors from the spatial transformer, $h_t$, are used by the depth transformer, which outputs predicted code distributions for each stack position.
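
To make the two-level data flow concrete, here is a minimal NumPy sketch of the decoding loop, with trivial stand-ins for the two transformers (a causal mean for the spatial model and a linear readout for the depth model); the toy sizes and helper names are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

T, D, K, d_model = 64, 4, 16, 32           # 8x8 positions, depth 4, 16 codes (toy sizes)
code_emb = rng.normal(size=(K, d_model))    # e(k): shared code-embedding table
pe_T = rng.normal(size=(T, d_model))        # spatial positional embeddings PE_T
pe_D = rng.normal(size=(D, d_model))        # depth positional embeddings PE_D

def spatial_transformer(us):
    # stand-in for the causal spatial transformer: context h_t from u_1..u_t
    return np.mean(us, axis=0)

def depth_transformer(h_t, v_td):
    # stand-in for the depth transformer: logits over the K codes from h_t and v_{t,d}
    return (h_t + v_td) @ code_emb.T

S = np.zeros((T, D), dtype=np.int64)        # generated code stack, raster-scan order
us = []
for t in range(T):
    # u_t = PE_T(t) + sum_d e(S_{t-1,d}); a learned start token would replace t == 0
    prev = np.zeros(d_model) if t == 0 else code_emb[S[t - 1]].sum(axis=0)
    us.append(pe_T[t] + prev)
    h_t = spatial_transformer(np.stack(us))
    for d in range(D):
        # v_{t,d} = PE_D(d) + sum_{d' < d} e(S_{t,d'})
        v_td = pe_D[d] + code_emb[S[t, :d]].sum(axis=0)
        logits = depth_transformer(h_t, v_td)
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        S[t, d] = rng.choice(K, p=probs)    # sample the next code, coarse to fine

print(S.shape)  # (64, 4): one code stack per spatial position
```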

3. Rate-Distortion Trade-off and Computational Efficiency

Standard VQ-based autoregressive models cannot achieve both short sequence length and high-fidelity generation because of codebook-size constraints. RQ-Transformer addresses this by stacking $D$ codes per position: the RQ-VAE can then down-sample the feature map aggressively, keeping the spatial sequence length $T$ small while the depth codes preserve reconstruction fidelity. This short, information-rich sequence facilitates transformer self-attention over long-range code interactions without prohibitive cost.

Computational complexity:

  • VQ-Transformer: $O(N T^2 D^2)$
  • RQ-Transformer: $O(N T^2 + N T D^2)$ (the reduced complexity stems from separating spatial and depth-wise code modeling; a concrete count follows below)
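
The asymptotic gap can be made concrete with a quick count of the dominant attention terms for an illustrative configuration ($T = 64$, $D = 4$); the shared factor of $N$ layers cancels in the ratio, and constants and feed-forward cost are ignored.

```python
# Illustrative attention-cost comparison (dominant terms only, constants ignored)
T, D = 64, 4                      # e.g., an 8x8 code map with depth 4

vq_cost = (T * D) ** 2            # one transformer over all T*D codes: O(T^2 D^2) per layer
rq_cost = T ** 2 + T * D ** 2     # spatial O(T^2) plus depth O(T D^2) per layer

print(vq_cost, rq_cost, vq_cost / rq_cost)   # 65536 5120 12.8
```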

These design choices allow larger batch sizes and faster sampling. Empirical results demonstrate up to 7.3× faster image sampling than VQ-GAN-Transformer at comparable scale, and negligible degradation of perceptual quality.

4. Generation Procedure and AR Modeling

Generation involves raster scanning spatial positions and, at each position, sequentially decoding the code stack from coarse to fine. For each code $S_{td}$, the conditional probability is modeled based on all previous spatial positions and the prior depth codes at the current location. The negative log-likelihood loss for training is

$$L_{AR} = \mathbb{E}_{S}\, \mathbb{E}_{t,d}\left[-\log p\left(S_{td} \mid S_{<t,\cdot},\ S_{t,<d}\right)\right].$$
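
As a sketch of how this objective can be evaluated, the snippet below computes the average cross-entropy over all $(t, d)$ code slots, assuming the model produces logits of shape $(T, D, K)$ for a ground-truth code stack of shape $(T, D)$; this is generic NumPy, not the authors' training code.

```python
import numpy as np

def ar_nll(logits, codes):
    """Average negative log-likelihood over all (t, d) code slots.

    logits: (T, D, K) unnormalised scores predicted autoregressively.
    codes:  (T, D) integer ground-truth code stack from the RQ-VAE.
    """
    T, D, K = logits.shape
    # log-softmax over the code dimension
    log_probs = logits - logits.max(axis=-1, keepdims=True)
    log_probs = log_probs - np.log(np.exp(log_probs).sum(axis=-1, keepdims=True))
    # pick the log-probability of each ground-truth code
    picked = log_probs[np.arange(T)[:, None], np.arange(D)[None, :], codes]
    return -picked.mean()

# toy usage with random logits and codes
rng = np.random.default_rng(3)
T, D, K = 64, 4, 16
loss = ar_nll(rng.normal(size=(T, D, K)), rng.integers(0, K, size=(T, D)))
print(loss)
```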

A stronger approximation of the feature map enables reduction of the code map resolution (e.g., $8 \times 8$ instead of $16 \times 16$ for $256 \times 256$ images) without loss in reconstruction quality. This is not achievable with a single-code VQ-VAE without exponentially increasing the codebook size.

5. Empirical Performance and Benchmarking

Experimental results document substantial improvements over prior autoregressive models on most benchmarks. The table below reports FID (lower is better) for unconditional generation on LSUN-Cat, LSUN-Bedroom, LSUN-Church, and FFHQ.

Model            Cat     Bedroom   Church   FFHQ
VQ-GAN           17.31   6.35      7.81     11.4
ImageBART        15.09   5.51      7.32     9.57
RQ-Transformer   8.64    3.04      7.45     10.38

Class-conditional image generation on ImageNet ($256 \times 256$) yields FID 7.55 at 3.8B parameters (further improving to 3.80 with classifier guidance). On text-conditioned CC-3M, results (FID 12.33, CLIP score 0.26) surpass VQ-GAN and ImageBART models.

Ablations reveal that increasing the quantization depth $D$ (with a fixed codebook size) improves reconstruction fidelity substantially more than increasing the codebook size. Sampling speed scales efficiently with batch size: RQ-Transformer achieves 0.02 sec/image at batch size 500, outperforming VQ-GAN-Transformer's 0.15 sec/image at batch size 200.

6. Extensions, Design Trade-offs, and Applications

The RQ-Transformer provides several key advantages:

  • By leveraging stacked quantization and the two-level transformer, it enables high-resolution AR modeling with manageable sequence lengths.
  • Computational efficiency fosters large-scale training and rapid inference, supporting deployment scenarios where sampling speed is critical (e.g., conditional and text-to-image generation).
  • Algorithmic separation between spatial and depth-wise prediction allows flexible design for downstream applications and further adaptation (cf. DnD-Transformer (Chen et al., 2 Oct 2024) and Contextual RQ-Transformer (Lee et al., 2022)).

The approach addresses fundamental rate-distortion trade-offs by stacking codes, maintains perceptual image quality, and is extensible to more complex modalities (e.g., vision-language, motion synthesis in Mogo (Fu, 5 Dec 2024), multimodal recommendation in MMGRec (Liu et al., 25 Apr 2024)).

7. Summary Table: RQ-Transformer Highlights

Aspect        Details
Input         Code stack from RQ-VAE: shape $H \times W \times D$ (e.g., $8 \times 8 \times 4$)
Core          Spatial and depth-wise transformers; AR prediction of code stacks
Output        Decoded image via the RQ-VAE decoder from generated code stacks
Efficiency    $O(T^2 + T D^2)$ per layer, enabling faster, larger-batch inference
Performance   State-of-the-art FID and IS for unconditional/conditional generation
Notable extensions and related models include:

  • Draft-and-Revise: Contextual RQ-Transformer (Lee et al., 2022): leverages masked infilling and global context for finer quality-diversity control.
  • DnD-Transformer (Chen et al., 2 Oct 2024): generalizes autoregression to a unified two-dimensional (depth plus sequence) prediction, surpassing RQ-Transformer in fidelity and efficiency at similar sequence length.
  • Mogo (Fu, 5 Dec 2024): extends RQ-Transformer principles to hierarchical causal generation of 3D motion, achieving state-of-the-art streaming synthesis.

RQ-Transformer integrates stacked residual quantization and two-level transformer modeling. This results in a compact, information-rich code sequence enabling efficient, high-resolution autoregressive generation, demonstrably advancing rate-distortion limits, sampling speed, and unconditional/conditional image generation benchmark performance (Lee et al., 2022).
