RQ-KMeans: Hierarchical Residual Quantization
- Residual Quantization (RQ)-KMeans is a hierarchical vector quantization method that iteratively refines data approximations using multi-stage k-means clustering.
- It integrates techniques like variance regularization, beam search, and local transformations to minimize quantization error and enhance codebook efficiency.
- This approach balances computational complexity and storage requirements while leveraging rate-distortion theory to improve tasks such as ANN search and image restoration.
Residual Quantization (RQ)-KMeans is a hierarchical vector quantization technique in which multiple stages of k-means clustering iteratively quantize the residuals left by previous approximations, reducing distortion in the representation of high-dimensional data and increasing codebook efficiency. RQ-KMeans and its regularized and transformed variants have become central to tasks such as approximate nearest neighbor (ANN) search, self-supervised representation learning, and image restoration, leveraging principles from rate-distortion theory and statistical models of data decorrelation.
1. Mathematical Formulation and Core Algorithm
Residual Quantization (RQ) models a data vector $x \in \mathbb{R}^D$ by a sum of codewords selected from $M$ codebooks $\{C_1, \dots, C_M\}$, one per stage:

$$\hat{x} = \sum_{m=1}^{M} c_m(i_m), \qquad c_m(i_m) \in C_m,$$

where $c_m(i_m)$ is the codeword assigned to $x$ at stage $m$. The residual for stage $m$ is defined recursively:

$$r_0 = x, \qquad r_m = r_{m-1} - c_m(i_m).$$

At each stage, k-means is used to minimize the squared residual error:

$$C_m = \arg\min_{C} \sum_{i} \min_{c \in C} \left\| r_{m-1}^{(i)} - c \right\|^2.$$

This process is repeated over all $M$ layers, with each assignment and codebook update step performed by classic k-means clustering. The total quantization error is:

$$E = \sum_{i} \left\| x^{(i)} - \sum_{m=1}^{M} c_m\big(i_m^{(i)}\big) \right\|^2 = \sum_{i} \left\| r_M^{(i)} \right\|^2.$$
This greedy layerwise approach ensures each subsequent codebook approximates the residual left by the previous stages, with assignments and centroid updates decoupled across data points and stages (Ferdowsi et al., 2017, Nguyen et al., 4 Feb 2025, Liu et al., 2015, Yuan et al., 2015).
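The stagewise training and greedy encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the cited papers; the helper names (`rq_kmeans_train`, `rq_encode`) are ours, and a plain Lloyd's k-means stands in for whatever clustering routine a production system would use.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and final assignments."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(a == j):
                C[j] = X[a == j].mean(0)
    # recompute assignments against the final centroids
    a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return C, a

def rq_kmeans_train(X, n_stages=3, k=8):
    """Stagewise codebook learning: each stage runs k-means on the
    residual left by all previous stages."""
    codebooks, R = [], X.astype(float).copy()
    for _ in range(n_stages):
        C, a = kmeans(R, k)
        codebooks.append(C)
        R = R - C[a]          # residual passed to the next stage
    return codebooks

def rq_encode(x, codebooks):
    """Greedy encoding: pick the nearest codeword at each stage."""
    codes, r = [], x.astype(float).copy()
    for C in codebooks:
        j = int(((C - r) ** 2).sum(1).argmin())
        codes.append(j)
        r = r - C[j]
    return codes
```

On training data the mean squared residual is non-increasing across stages, since each stage's k-means distortion is bounded by the single-centroid distortion of the previous residuals.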
2. Hierarchical Codebook Training and Encoding
The hierarchical nature of RQ enables distributed codebook capacity:
- Stagewise Codebook Learning: Each codebook is learned from residuals using k-means. Initial residuals are simply the original data vectors; subsequent stages use the quantization residual from prior stages (Nguyen et al., 4 Feb 2025, Liu et al., 2015).
- Greedy Encoding: A vector is encoded by sequentially selecting nearest codeword indices at each stage, cumulatively representing the input by the sum of selected codewords.
- Multi-Path / Beam Search Encoding: Greedy assignment is efficient but suboptimal; beam search instead maintains $L$ candidate representations per vector and selects the codeword sequences that minimize total distortion, at polynomial computational cost. Exact encoding is NP-hard due to cross-stage interaction terms (Liu et al., 2015).
| Method | Complexity per Sample | Encoding Quality |
|---|---|---|
| Greedy k-means | $O(MKD)$ | Suboptimal; ignores cross-stage terms |
| Beam search (width $L$) | $O(MLKD)$ | Near-optimal, lower error |
Stagewise k-means codebook learning and encoding form the computational backbone of most RQ frameworks (Liu et al., 2015, Nguyen et al., 4 Feb 2025).
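A beam-search encoder can be sketched as follows; this is an illustrative NumPy version (the function name and tuple-based beam representation are our own, not from the cited work). It keeps the `width` best partial codeword sequences at each stage, ranked by residual norm, and with `width = 1` it reduces exactly to greedy encoding.

```python
import numpy as np

def beam_search_encode(x, codebooks, width=4):
    """Multi-path encoding: keep the `width` best partial codeword
    sequences at each stage instead of only the single greedy choice.
    Returns (codes, final_residual) of the best sequence found."""
    beams = [([], x.astype(float).copy())]       # (codes, residual)
    for C in codebooks:
        cand = []
        for codes, r in beams:
            d = ((C - r) ** 2).sum(1)            # distance to every codeword
            for j in np.argsort(d)[:width]:      # expand top-`width` children
                cand.append((codes + [int(j)], r - C[j]))
        # keep the `width` candidates with smallest residual norm
        cand.sort(key=lambda cr: (cr[1] ** 2).sum())
        beams = cand[:width]
    return beams[0]
```

With `width` large enough to cover every sequence ($L \ge K^{M}$ candidates), the search becomes exhaustive and recovers the globally optimal encoding; in practice small widths already close most of the gap to optimal.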
3. Variance-Regularization and Rate-Distortion Theory
Regularized Residual Quantization (RRQ) augments RQ-KMeans with a variance regularization term consistent with reverse water-filling principles in rate-distortion theory (Ferdowsi et al., 2017):
- Variance Regularization: Augments the k-means objective with a penalty:

$$\min_{C} \; \sum_{i} \left\| r^{(i)} - c_{a(i)} \right\|^2 + \lambda \sum_{j} \left( \mathrm{Var}_j(C) - \sigma_j^{*2} \right)^2,$$

where $S = \mathrm{diag}(\sigma_1^{*2}, \dots, \sigma_D^{*2})$ is a diagonal matrix of target variances obtained from reverse water-filling, $\gamma$ is the water-filling threshold, and $\lambda$ controls the tradeoff between data fit and variance matching.
- Active Dimension Selection: At each layer, only dimensions with variance above the water-filling threshold $\gamma$ are quantized, suppressing overfitting and enforcing sparsity in high dimensions.
- Modified K-means Update: Assignment via nearest centroid on active dimensions, codebook update by minimizing regularized quartic terms, solved efficiently by per-row Newton steps.
This regularization yields sparse dictionaries and prevents overtraining when scaling to high dimensionality $D$ and deep hierarchies (large $M$) (Ferdowsi et al., 2017).
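The reverse water-filling step that drives active-dimension selection can be made concrete with a short sketch. This is a generic illustration of the classical rate-distortion condition $\sum_j \min(\gamma, \sigma_j^2) = D_{\text{total}}$ for independent Gaussian coordinates, not code from the RRQ paper; the helper names are ours.

```python
import numpy as np

def waterfill_threshold(variances, total_distortion):
    """Find gamma such that sum_j min(gamma, sigma_j^2) == total_distortion,
    the classic reverse water-filling condition for Gaussian sources."""
    lo, hi = 0.0, float(variances.max())
    for _ in range(60):                       # simple bisection
        mid = 0.5 * (lo + hi)
        if np.minimum(mid, variances).sum() > total_distortion:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def active_dimensions(variances, gamma):
    """Only dimensions whose variance exceeds the threshold are quantized;
    the rest are reconstructed as zero (their distortion equals their
    variance)."""
    return np.flatnonzero(variances > gamma)
```

Dimensions below the threshold absorb their full variance as distortion, which is exactly the sparsity mechanism that keeps high-dimensional tails of a variance-decaying spectrum out of the codebooks.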
4. Data Preprocessing and Transformations
Effective RQ-KMeans depends on statistical whitening and decorrelation of input data, especially for natural images:
- Global 2D-DCT: Applied to images for spectrum decay and energy compaction.
- Subband Partition and PCA: DCT coefficients segregated into frequency bands; within each band, full-rank PCA decorrelates features, yielding approximately independent, variance-decaying coordinates.
- Local Transforms in TRQ: Transformed Residual Quantization (TRQ) uses per-cluster orthogonal transformations (learned by orthogonal Procrustes analysis) at each stage to align cluster-specific residual subspaces before k-means, reducing overall quantization error (Yuan et al., 2015).
Such preprocessing ensures data is amenable to reverse-water-filling regularization and allows efficient RQ operation at scale (Ferdowsi et al., 2017, Yuan et al., 2015).
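The PCA decorrelation step above can be sketched directly with NumPy's symmetric eigendecomposition; this is a generic full-rank PCA, not the exact subband pipeline of the cited papers (the function name is ours, and the DCT/subband partition is omitted).

```python
import numpy as np

def pca_decorrelate(X):
    """Decorrelate features with full-rank PCA so coordinates are
    uncorrelated with decaying variances, as the reverse water-filling
    model assumes.  Returns (transformed data, sorted variances)."""
    Xc = X - X.mean(0)                        # center the data
    cov = np.cov(Xc, rowvar=False)
    w, V = np.linalg.eigh(cov)                # eigh: ascending eigenvalues
    order = np.argsort(w)[::-1]               # variance-decaying order
    return Xc @ V[:, order], w[order]
```

After this transform, the sample covariance of the output is exactly diagonal, with entries equal to the sorted eigenvalues, so the per-dimension variances feed straight into the water-filling threshold computation.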
5. Computational and Storage Complexity
Storage and compute for RQ-KMeans scale with the number of codebooks, codewords, and dimensionality:
- Assignment Step: At layer $m$, cost $O(nK|d_m|)$, where $d_m$ is the active dimension set.
- Codebook Update: $O\big((n + K)|d_m|\big)$ per iteration for the regularized update (per-row Newton steps).
- Total (RRQ, $M$ layers): $O\big(nK\sum_m |d_m|\big)$ assignment and $O\big((n+K)\sum_m |d_m|\big)$ codebook update (summed over layers).
- Test Encoding: $O\big(K\sum_m |d_m|\big)$ per query, at most $O(MKD)$.
- Storage: $\sum_m K|d_m| \le MKD$ floats for codebooks.
TRQ introduces an additional $D \times D$ orthogonal transform per cluster per level, i.e. $O(D^2)$ extra floats each, but this remains modest compared to overall scale (Yuan et al., 2015, Ferdowsi et al., 2017).
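A quick back-of-the-envelope comparison makes these storage trade-offs concrete. The sizes below ($M = 4$ stages, $K = 256$ codewords, $D = 128$ dimensions, 4-byte floats) are hypothetical, chosen only for illustration; a flat VQ with the same total code length would need $K^M$ centroids.

```python
# Hypothetical sizes, for illustration only.
M, K, D, BYTES = 4, 256, 128, 4

rq_bytes = M * K * D * BYTES            # RQ codebooks: ~0.5 MB
trq_extra = M * K * D * D * BYTES       # TRQ DxD transforms per cluster/level
flat_vq_bytes = K**M * D * BYTES        # flat VQ at the same bit rate
```

The hierarchy keeps codebook storage linear in $M$, while a single-codebook VQ at the same rate would be astronomically large; the TRQ transforms sit in between, growing with $D^2$ but not with the effective codebook size $K^M$.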
6. Empirical Performance and Applications
RQ-KMeans and its variants exhibit improved codebook utilization, reduced distortion, and competitive downstream task performance:
- Code Utilization: RQ achieves near-100% codebook utilization rate (CUR) per stage, avoiding the dead codes prevalent in large single-codebook VQ (CUR often <21%) (Nguyen et al., 4 Feb 2025).
- Quantization Quality: On synthetic data, RRQ attains train and test distortions close to the rate-distortion optimum, whereas standard k-means overfits, with train distortion far below its test distortion (Ferdowsi et al., 2017).
- Nearest Neighbor Retrieval: Improved RVQ (IRVQ) with beam search and PCA-based warm start boosts Recall@1 and Recall@4 metrics (e.g., on SIFT1M, IRVQ Recall@1 0.38 vs RVQ 0.32; Recall@4 0.80 vs RVQ 0.70) (Liu et al., 2015).
- Representation Learning: BRIDLE leverages RQ for self-supervised pretraining, outperforming VQ in audio, image, and video benchmarks (with effective code usage 0.03–0.05 for RQ vs 0.004–0.015 for single-codebook VQ) (Nguyen et al., 4 Feb 2025).
- Image Restoration: RRQ restores high-frequency content in super-resolution tasks, reconstructing sharp facial images from low-res data via multi-layer codebooks (Ferdowsi et al., 2017).
| Variant | Key Feature | Application |
|---|---|---|
| RRQ | Variance regularization | Super-resolution, decorrelated images |
| IRVQ | Hybrid codebook + beam | High-dimensional ANN search |
| TRQ | Per-cluster transforms | ANN with low distortion |
| BRIDLE | Hierarchical RQ | Self-supervised representation |
In practice, these approaches combine statistical theory, efficient clustering, and practical heuristics for scalable high-accuracy quantization across diverse domains.
7. Limitations and Extensions
RQ-KMeans presents several inherent and practical challenges:
- Greedy Encoding Suboptimality: Sequential encoding ignores cross-stage interactions, rendering the global optimum intractable (NP-hard). Beam search is a practical compromise yielding near-optimal encodings (Liu et al., 2015).
- Performance Saturation: For classical RVQ, incremental distortion reduction wanes with increasing stages; improved variants address this by maintaining codebook entropy through hybrid training schemes.
- Overfitting in High Dimensions: Vanilla RQ overtrains on variance-decaying, highly correlated data; variance regularization and careful preprocessing suppress this pathology (Ferdowsi et al., 2017).
- Storage Overhead in TRQ: Local transforms increase memory requirements linearly with number of clusters and dimensions, limiting applicability in resource-constrained scenarios (Yuan et al., 2015).
- Empirical Tuning: Hyperparameters such as number of layers, codebook size, regularization weights, and beam width demand empirical calibration for application-specific optimality.
Residual Quantization with k-means—augmented by variance regularization, subspace transforms, and multipath encoding—remains an active area of research, with ongoing advances in codebook design, scalable optimization, and integration into deep neural architectures (Ferdowsi et al., 2017, Nguyen et al., 4 Feb 2025, Yuan et al., 2015, Liu et al., 2015).