Residual Vector Quantization Overview

Updated 30 June 2025
  • Residual vector quantization is an additive technique that approximates high-dimensional vectors by sequentially summing optimized codewords from multiple codebooks.
  • It enables fine-grained compression and adaptable rate-distortion trade-offs, with applications in large-scale retrieval, generative modeling, and neural codec design.
  • Modern extensions integrate subspace methods, multi-path encoding, and neural networks to overcome classical limitations and achieve high fidelity in diverse applications.

Residual vector quantization (RVQ) is an additive quantization technique in which a high-dimensional vector is encoded as the sum of codewords from multiple codebooks, each sequentially approximating the residual error left by previous stages. RVQ enables fine-grained compression, scalability in quantization fidelity, and flexibility in rate-distortion trade-offs. It has become foundational across large-scale retrieval, compression, generative modeling, neural codec design, and, most recently, efficient LLM systems.

1. Principles of Residual Vector Quantization

RVQ aims to approximate an input vector $\mathbf{x} \in \mathbb{R}^d$ as a sum of codewords drawn from $M$ codebooks:

$$\mathbf{x} \approx \sum_{m=1}^{M} \mathbf{c}_m(i_m(\mathbf{x})),$$

where each codebook $\mathbf{C}_m = \{\mathbf{c}_m(1), \ldots, \mathbf{c}_m(K)\}$ contains $K$ codewords. The encoding at stage $m$ selects the index $i_m$ that minimizes the error between the current residual and the candidate codewords.

Sequential Encoding:

  • At stage $m$, the residual is

$$\mathbf{e}^{(m)} = \mathbf{x} - \sum_{j=1}^{m-1} \mathbf{c}_j(i_j(\mathbf{x}))$$

  • The next codeword minimizes

$$i_m(\mathbf{x}) = \arg\min_k \|\mathbf{e}^{(m)} - \mathbf{c}_m(k)\|^2$$

  • The process continues for $M$ stages, with the quantized output given by the sum of the selected codewords (see the sketch below).
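The sequential encoder above is straightforward to implement. Below is a minimal NumPy sketch under the assumption that the codebooks are already trained and stored as arrays of shape (K, d); it is illustrative rather than an optimized implementation.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ encoding; codebooks is a list of M arrays, each (K, d)."""
    residual = x.copy()
    indices = []
    for C in codebooks:                                 # stage m = 1, ..., M
        dists = np.sum((C - residual) ** 2, axis=1)     # ||e^(m) - c_m(k)||^2 for all k
        k = int(np.argmin(dists))                       # i_m(x)
        indices.append(k)
        residual = residual - C[k]                      # residual passed to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction as the sum of the selected codewords."""
    return sum(C[k] for C, k in zip(codebooks, indices))

# Toy usage: d = 8, M = 4 stages, K = 256 codewords per stage (random codebooks).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]
x = rng.normal(size=8)
x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)
```

Each stage contributes $\log_2 K$ bits, so this toy example encodes an 8-dimensional vector in 32 bits.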

Capacity and Hierarchy:

  • RVQ exponentially increases the effective number of represented clusters to $K^M$ with $M$ codebooks of size $K$.
  • The approach supports variable quantization depth for adaptive bitrate (notably in audio codecs and generative models).

2. Algorithmic Advances and Modern Architectures

Improved and Generalized RVQ

Classical RVQ encounters limitations:

  • Performance gains diminish after a few stages, because the residuals retain little exploitable structure.
  • Standard greedy encoding can be suboptimal; finding the globally optimal codeword assignment across stages is NP-hard.

IRVQ (Liu et al., 2015):

  • Employs hybrid codebook learning: subspace PCA-based clustering and warm-started k-means to maintain high codebook entropy at all stages.
  • Introduces multi-path encoding, akin to beam search, which keeps several hypothesis encodings at each stage rather than committing to the single best path, reducing overall distortion (see the sketch below).
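A hedged sketch of the multi-path idea, in the spirit of IRVQ but not the paper's exact algorithm: a beam of partial encodings is expanded and re-ranked at every stage by the energy of the remaining residual.

```python
import numpy as np

def rvq_encode_beam(x, codebooks, beam_width=4):
    """Multi-path RVQ encoding: keep the `beam_width` best partial encodings
    at every stage and return the index sequence with the lowest final error."""
    beams = [([], x.copy())]                             # each hypothesis: (indices, residual)
    for C in codebooks:                                  # stage m
        candidates = []
        for idx_list, residual in beams:
            dists = np.sum((C - residual) ** 2, axis=1)  # error for every codeword
            for k in np.argsort(dists)[:beam_width]:     # expand with the best codewords
                candidates.append((idx_list + [int(k)], residual - C[k]))
        # Keep the globally best hypotheses by remaining residual energy.
        candidates.sort(key=lambda h: float(np.sum(h[1] ** 2)))
        beams = candidates[:beam_width]
    return beams[0][0]
```

With `beam_width=1` this reduces to the greedy encoder; larger beams trade encoding time for lower distortion.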

GRVQ (Liu et al., 2016):

  • Iterative codebook refinement: codebooks are revisited and updated using up-to-date residuals, not purely sequential optimization.
  • Transition clustering increases PCA subspace dimensionality stepwise, mitigating the curse of dimensionality.
  • Adds regularization to eliminate the need for $\epsilon$-term corrections in Euclidean distance for similarity search.

Neural Codebook Extensions

  • QINCo (Huijben et al., 26 Jan 2024) and similar methods leverage neural networks to generate codebook vectors per quantization cell and context, sidestepping fixed-codebook inefficiency in deep hierarchies and enabling prefix truncation/multi-rate operation.
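As a rough illustration of context-dependent codebooks (a toy sketch, not QINCo's architecture), the module below perturbs a base codebook with a small MLP conditioned on the reconstruction produced by earlier stages, so that later stages are no longer tied to one fixed set of codewords.

```python
import torch
import torch.nn as nn

class NeuralResidualStage(nn.Module):
    """One RVQ stage whose codewords are adapted by a small MLP conditioned
    on the partial reconstruction from earlier stages (illustrative only)."""
    def __init__(self, K, d, hidden=64):
        super().__init__()
        self.base_codebook = nn.Parameter(torch.randn(K, d))
        self.adapt = nn.Sequential(            # context -> per-sample codeword offset
            nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d)
        )

    def forward(self, residual, context):
        # residual, context: (B, d); context is the sum of earlier-stage codewords.
        offset = self.adapt(context)                                       # (B, d)
        codebook = self.base_codebook.unsqueeze(0) + offset.unsqueeze(1)   # (B, K, d)
        dists = ((codebook - residual.unsqueeze(1)) ** 2).sum(-1)          # (B, K)
        idx = dists.argmin(dim=-1)                                         # (B,)
        chosen = codebook[torch.arange(residual.shape[0]), idx]            # (B, d)
        return idx, chosen
```

In a real codec the offsets are trained end-to-end with the rest of the model; the sketch only shows how a codeword can depend on the decoding context.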

3. Applications: Compression, Retrieval, and Generative Modeling

Neural Network Compression

RVQ, as examined in (Gong et al., 2014), is less effective than product quantization or scalar quantization for fully-connected layer compression in deep CNNs but illustrates the general principle: capturing additive redundancy per weight or feature vector offers strong compression with controlled accuracy drop.

Audio and Video Codecs

Audio Models:

  • Encodec, SoundStream, DAC, HiFi-Codec, and APCodec use RVQ to discretize latent spaces, with each codebook refining structure from coarse-to-fine.
  • Group-residual RVQ (Yang et al., 2023) splits the latent representation into groups and quantizes each group in parallel with its own RVQ stack, achieving high fidelity with few codebooks per group (see the sketch below).
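A minimal sketch of the group-residual pattern, assuming the latent vector is split into equal contiguous groups, each coded by an independent RVQ stack:

```python
import numpy as np

def greedy_rvq(vec, codebooks):
    """Greedy residual coding of one vector (codebooks: list of (K, d) arrays)."""
    residual, idxs = vec.copy(), []
    for C in codebooks:
        k = int(np.argmin(np.sum((C - residual) ** 2, axis=1)))
        idxs.append(k)
        residual = residual - C[k]
    return idxs

def group_rvq_encode(x, group_codebooks):
    """Split x into len(group_codebooks) equal groups and code each group
    independently; the per-group codes can be produced in parallel."""
    groups = np.split(x, len(group_codebooks))
    return [greedy_rvq(g, cbs) for g, cbs in zip(groups, group_codebooks)]
```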

Enhancements:

  • ERVQ (Zheng et al., 16 Oct 2024) adds intra-codebook balancing (online clustering, code balancing loss) and inter-codebook diversity regularization (SSIM loss) to eliminate codebook collapse and optimize entropy, reaching 100% code utilization in challenging codecs.

Variable Rate and Specialized Structures:

  • VRVQ (Chae et al., 8 Oct 2024) enables per-frame variable codebook allocation, using an importance map and a straight-through gradient estimator for end-to-end training (a generic straight-through sketch follows this list).
  • RSVQ (Jiang et al., 9 Apr 2025) in StreamCodec cascades scalar and vector quantizers over residuals, efficiently assigning coarse structure and fine detail for streamable, causal, real-time codecs.
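The straight-through trick referenced above is commonly implemented with the pattern below (a generic sketch of the estimator, not VRVQ's specific formulation): the forward pass applies the non-differentiable operation, while the backward pass lets gradients flow through unchanged.

```python
import torch

def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Straight-through rounding: forward pass rounds, backward pass acts as
    the identity so gradients reach the inputs of the rounding operation."""
    return x + (torch.round(x) - x).detach()
```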

Video Compression:

  • VQ-NeRV (Xu et al., 19 Mar 2024) uses codebook quantization of shallow (local detail) features and inter-frame residuals, with optimization to maximize codebook usage and reconstruct dynamic content.

Generative and Representation Learning

Generative Models:

  • RQ-VAE and RQ-Transformer (Lee et al., 2022) combine RVQ-based encoders with AR models for efficient high-resolution image generation, achieving short code sequences and high fidelity.
  • ResGen (Kim et al., 13 Dec 2024) employs RVQ tokens in a discrete diffusion process, predicting sums of masked token embeddings to enable efficient, parallel sampling—improving both speed and fidelity over AR baselines.

Music and Multimodal Representation:

  • MuQ (Zhu et al., 2 Jan 2025) introduces Mel-RVQ, a lightweight, residual linear tokenizer for Mel spectrograms, showing superior speed and stability for self-supervised music representation learning.
  • SRCID (Huang et al., 26 Dec 2024) generalizes the residual concept to semantic space: hierarchical disentanglement of modal-general and modal-specific features using quantization over semantic residuals, with mutual information minimization for alignment.

LLM Weights, Cache, and Model Merging:

  • VPTQ (Liu et al., 25 Sep 2024) adapts RVQ to extreme low-bit LLM weight quantization, using channel-independent second-order optimization and residual coding to support 2-bit deployments.
  • RVQ for KV cache quantization (Kumar, 21 Oct 2024) achieves 5.5× compression of the LLM KV cache with minimal accuracy loss by grouping channels and performing depth-8 residual coding (see the sketch after this list).
  • Task model merging via residual quantization (Kim et al., 10 Mar 2025) compresses task vectors decomposed into a shared base and per-task offset, enabling ultra-low memory multi-task checkpoint storage.
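To make the KV-cache bullet concrete, the sketch below groups cache channels and codes each group of each token with a depth-8 RVQ; the shapes, group size, and helper names are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def rvq_code(vec, codebooks):
    """Greedy residual coding of one vector with a list of (K, d) codebooks."""
    residual, idxs = vec.astype(np.float64), []
    for C in codebooks:
        k = int(np.argmin(np.sum((C - residual) ** 2, axis=1)))
        idxs.append(k)
        residual = residual - C[k]
    return idxs

def quantize_kv_slice(kv, group_codebooks, group_size=8):
    """kv: (num_tokens, num_channels) slice of a key or value cache.
    Channels are split into contiguous groups of `group_size`; each group of
    each token vector is coded with a depth-8 RVQ (8 codebooks per group)."""
    num_tokens, num_channels = kv.shape
    num_groups = num_channels // group_size
    depth = len(group_codebooks[0])                      # 8 residual stages per group
    codes = np.empty((num_tokens, num_groups, depth), dtype=np.int32)
    for t in range(num_tokens):
        for g in range(num_groups):
            vec = kv[t, g * group_size:(g + 1) * group_size]
            codes[t, g] = rvq_code(vec, group_codebooks[g])
    return codes
```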

Memory and Retrieval:

  • TurboQuant (Zandieh et al., 28 Apr 2025) achieves near-optimal, data-oblivious vector quantization rates by combining a random rotation that decorrelates features, optimal per-coordinate scalar quantization, and a QJL-augmented stage, covering both mean-squared-error and unbiased inner-product preservation. Its key advantage is codebook-free, streaming quantization with strong theoretical guarantees and strong empirical performance for nearest-neighbor retrieval and LLM cache compression (a rotate-then-quantize sketch follows).
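The rotate-then-scalar-quantize pattern at the core of such data-oblivious schemes can be sketched as follows; this illustrates the general idea only, not TurboQuant's actual construction (which includes the QJL stage and an optimal scalar quantizer design).

```python
import numpy as np

def random_rotation(d, seed=0):
    """Orthogonal matrix obtained from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def rotate_and_quantize(x, rotation, bits=4):
    """Rotate to spread energy across coordinates, then uniformly quantize each one."""
    z = rotation @ x
    scale = np.max(np.abs(z)) / (2 ** (bits - 1) - 1)
    codes = np.round(z / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale, rotation):
    """Invert the scalar quantization and undo the (orthogonal) rotation."""
    return rotation.T @ (codes.astype(np.float64) * scale)
```

Because the rotation is data-oblivious, encoding is stateless and streamable: each incoming vector is processed independently, with no codebook lookups.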

4. Mathematical Formulations and Optimization

The quantization at each residual stage generally follows:

$$\begin{aligned}
\text{Initialize:}\quad & r^{(0)} = \mathbf{x} \\
\text{For } m = 1,\ldots,M:\quad & i_m = \arg\min_k \| r^{(m-1)} - c_m(k) \|^2 \\
& r^{(m)} = r^{(m-1)} - c_m(i_m)
\end{aligned}$$

Total quantized output: $\hat{\mathbf{x}} = \sum_{m=1}^M c_m(i_m)$.

Codebook learning (classic, as in (Liu et al., 2016)) minimizes the empirical quantization error:

$$E = \frac{1}{N} \sum_{\mathbf{x}} \left\| \mathbf{x} - \sum_{m=1}^M c_m(i_m(\mathbf{x})) \right\|^2$$
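The classic learning procedure can be sketched as stage-wise k-means on the residuals left by the previous stages (a simplified version; improved schemes such as IRVQ and GRVQ then revisit and refine these codebooks rather than learning them purely sequentially):

```python
import numpy as np

def train_rvq_codebooks(X, M=4, K=256, iters=20, seed=0):
    """Stage-wise codebook learning: run k-means (Lloyd iterations) on the
    current residuals, subtract the assigned centroids, move to the next stage.
    X: (N, d) training vectors with N >= K; kept small and dense for clarity."""
    rng = np.random.default_rng(seed)
    residuals = X.astype(np.float64).copy()
    codebooks = []
    for _ in range(M):
        C = residuals[rng.choice(len(residuals), size=K, replace=False)].copy()
        for _ in range(iters):
            d2 = ((residuals[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (N, K)
            assign = d2.argmin(axis=1)
            for k in range(K):
                members = residuals[assign == k]
                if len(members) > 0:
                    C[k] = members.mean(axis=0)
        # Final assignment with the refined centroids, then pass on the residuals.
        assign = ((residuals[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        codebooks.append(C)
        residuals = residuals - C[assign]
    return codebooks
```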

Enhancements optimize:

  • Codebook usage entropy and inter-codebook independence: maximize the entropy of each codebook's usage while encouraging diversity across codebooks (IRVQ, ERVQ).
  • Balancing inter/intra-modal semantic information (SRCID) via mutual information minimization/maximization.

5. Performance Metrics and Practical Implications

Generative Modeling:

  • On ImageNet 256×256, RVQ-based (multi-depth) diffusion models such as ResGen outperform previous AR methods in both fidelity (FID=1.95 with classifier-free guidance) and sampling speed due to multi-token prediction and group-wise embedding regression.
  • For audio codecs (HiFi-Codec, ERVQ-enhanced Encodec, StreamCodec), RVQ variants produce perceptually superior reconstruction (ViSQOL 4.3 at 1.5 kbps, subjective listener preference), with codebook utilization approaching 100%.

LLM Efficiency:

  • LLM weight quantization with RVQ (VPTQ) attains perplexity and QA accuracy comparable or superior to SOTA 2-bit methods (up to 22% gain on Llama3), with 1.6–1.8× faster inference and 10× faster quantization time than SOTA.
  • KV cache quantization with RVQ yields 5.5× memory reduction at minimal loss.

6. Broader Impact, Generalization, and Future Directions

  • Theoretical optimality: TurboQuant approaches information-theoretic lower bounds on distortion for both MSE and inner product, by reducing high-dimensional quantization to a random-rotation-invariant, coordinate-wise optimal process.
  • Online scalability: Data-oblivious RVQ, as exemplified in streaming codecs and TurboQuant, supports stateless, parallel, hardware-friendly deployment.
  • Semantic extension: By shifting from arithmetic to semantic residuals, SRCID reconciles cross-modal alignment and quantization fidelity, currently outperforming RVQ-based approaches in zero-shot multimodal retrieval tasks.
  • Adaptive granularity and bitrate: Variable-depth (VRVQ), group-wise (GRVQ), and multi-token RVQ structures enable locally adaptive, content-aware quantization settings—foundational for efficient, high-fidelity model deployment in generative modeling and communication.

Summary Table: Modern RVQ and Extensions

| Aspect | Classic RVQ | Improved/Generalized | Neural/Semantic Extensions |
|---|---|---|---|
| Codebook learning | k-means on residuals | Subspace/warm start, iterative | Context-dependent neural, EMA, random rotation |
| Encoding | Greedy sequential | Multi-path, beam, variable rate | Embedding regression, QJL, semantic MI |
| Collapse/utilization | Often poor | Addressed via entropy loss | Explicit balancing, clustering, regularization |
| Target domains | Retrieval, compression | Audio, video, LLM, model merging | Multimodal, generative, memory-efficient LLM |
| Notable papers/tools | Liu et al., 2015; Liu et al., 2016 | Yang et al., 2023; Chae et al., 8 Oct 2024; Kim et al., 13 Dec 2024 | Huijben et al., 26 Jan 2024; Zheng et al., 16 Oct 2024; Huang et al., 26 Dec 2024; Zandieh et al., 28 Apr 2025 |