Residual Neural Quantization (QINCo/QINCo2)
- Residual Neural Quantization (QINCo/QINCo2) is a data-adaptive approach that replaces static codebooks with neural networks producing implicit centroids conditioned on past quantizations.
- The method refines centroids through learned corrective terms and cross-stage attention, significantly improving rate-distortion performance and operational efficiency.
- Its integration in applications such as neural codecs and billion-scale ANN search yields practical gains: lower reconstruction error and higher search recall.
Residual Neural Quantization (QINCo/QINCo2) encompasses a family of data-adaptive, multi-stage quantization methods that replace the static codebooks of classical residual quantization (RQ) with small neural networks producing “implicit” codebooks conditioned on the quantized history. These methods achieve state-of-the-art performance for vector compression, approximate nearest neighbor (ANN) search, and neural codec design, notably advancing both rate-distortion and operational efficiency at scale.
1. Mathematical Formulation of Residual Neural Quantization
Let $x \in \mathbb{R}^d$ be a target vector to be quantized using $M$ sequential codebooks. Classical RQ represents $x$ as the sum $\hat{x} = \sum_{m=1}^{M} c_m$, where each $c_m$ is a centroid selected from a fixed codebook $\mathcal{C}_m$ at stage $m$ based on the current residual $r_m = x - \sum_{j<m} c_j$:

$$ c_m = \arg\min_{c \in \mathcal{C}_m} \lVert r_m - c \rVert^2 . $$

In QINCo/QINCo2, each centroid is instead produced by a neural network $f_{\theta_m}$, conditioned on the intermediate reconstruction $\hat{x}_{m-1} = \sum_{j<m} c_j$ and a base centroid $\bar{c}_{m,k}$:

$$ c_{m,k} = f_{\theta_m}\!\left(\bar{c}_{m,k},\, \hat{x}_{m-1}\right), \qquad k = 1, \dots, K . $$

The nearest-neighbor selection rule is retained, but the centroid itself is adaptive:

$$ i_m = \arg\min_{k} \lVert r_m - f_{\theta_m}(\bar{c}_{m,k}, \hat{x}_{m-1}) \rVert^2 , \qquad c_m = f_{\theta_m}(\bar{c}_{m,i_m}, \hat{x}_{m-1}) . $$
This neuralization of codebooks transforms residual quantization into a parameter-rich, data-dependent process, where the codebook at each stage is implicitly a function of the quantization path and the input statistics (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025, Lahrichi et al., 19 Mar 2025).
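To make the formulation concrete, the following minimal NumPy sketch runs one vector through $M$ residual stages with input-conditioned codebooks. The `refine` function is a toy linear stand-in for the stage network $f_{\theta_m}$; all names, shapes, and weight values are illustrative, not the reference implementation.

```python
import numpy as np

def refine(base_centroids, x_hat, W):
    """Toy stand-in for the stage network f_theta: base centroid plus a
    corrective term conditioned on the partial reconstruction (skip connection)."""
    # base_centroids: (K, d), x_hat: (d,), W: (d, 2d) toy "weights"
    cond = np.concatenate(
        [base_centroids, np.broadcast_to(x_hat, base_centroids.shape)], axis=1
    )
    return base_centroids + cond @ W.T

def encode(x, base_codebooks, Ws):
    """Greedy residual encoding with implicit (input-conditioned) codebooks."""
    x_hat = np.zeros_like(x)
    codes = []
    for c_bar, W in zip(base_codebooks, Ws):       # one (K, d) base codebook per stage
        candidates = refine(c_bar, x_hat, W)       # implicit codebook for this stage
        residual = x - x_hat
        i = int(np.argmin(((candidates - residual) ** 2).sum(axis=1)))
        codes.append(i)
        x_hat = x_hat + candidates[i]              # add the selected refined centroid
    return codes, x_hat

rng = np.random.default_rng(0)
d, K, M = 8, 16, 4
x = rng.normal(size=d)
base_codebooks = [rng.normal(scale=0.5, size=(K, d)) for _ in range(M)]
Ws = [rng.normal(scale=0.01, size=(d, 2 * d)) for _ in range(M)]
codes, x_hat = encode(x, base_codebooks, Ws)
print(codes, float(((x - x_hat) ** 2).sum()))      # residual error after M stages
```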
2. Implicit Neural Codebooks: Architecture and Parameterization
The implicit codebook $\mathcal{C}_m(\hat{x}_{m-1}) = \{ f_{\theta_m}(\bar{c}_{m,k}, \hat{x}_{m-1}) \}_{k=1}^{K}$ for stage $m$ is generated as follows:
- Each $\bar{c}_{m,k}$ is a base centroid (typically from k-means) and is fixed or fine-tuned.
- The neural network $f_{\theta_m}$ receives the pair $(\bar{c}_{m,k}, \hat{x}_{m-1})$ as input and outputs a refined centroid.
- The architecture for $f_{\theta_m}$ consists of (see the sketch after this list):
- An initial affine layer projecting the concatenated input from $\mathbb{R}^{2d}$ to $\mathbb{R}^{d}$.
- $L$ residual MLP blocks (hidden width $d_h$).
- A final affine projection back to $\mathbb{R}^{d}$.
- A skip connection ensures $f_{\theta_m}(\bar{c}_{m,k}, \hat{x}_{m-1}) = \bar{c}_{m,k} + \Delta_{m,k}$, so the network learns a corrective term $\Delta_{m,k}$.
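A minimal PyTorch sketch of one stage network following the blueprint above; the layer sizes, default hyperparameters, and the class name `QincoStage` are illustrative assumptions, not the reference code.

```python
import torch
import torch.nn as nn

class QincoStage(nn.Module):
    """One implicit-codebook stage: refines base centroids conditioned on the
    current partial reconstruction. Sketch only; hyperparameters are illustrative."""
    def __init__(self, d, d_hidden=256, n_blocks=2, K=256):
        super().__init__()
        self.base = nn.Parameter(torch.randn(K, d) * 0.1)   # base centroids (e.g., k-means init)
        self.proj_in = nn.Linear(2 * d, d)                   # initial affine layer on [c_bar, x_hat]
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d))
            for _ in range(n_blocks)                         # L residual MLP blocks of width d_h
        )
        self.proj_out = nn.Linear(d, d)                      # final affine projection

    def forward(self, x_hat):
        # x_hat: (d,) partial reconstruction; returns the implicit codebook, shape (K, d)
        c_bar = self.base
        h = self.proj_in(torch.cat([c_bar, x_hat.expand_as(c_bar)], dim=-1))
        for block in self.blocks:
            h = h + block(h)                                 # residual MLP blocks
        return c_bar + self.proj_out(h)                      # skip connection: learn a corrective term
```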
QINCo2 further refines this blueprint by:
- Sharing/tying parameters across stages to lower memory cost.
- Incorporating cross-stage attention, so each stage's network can depend on all prior residual steps, not only $\hat{x}_{m-1}$ (see the sketch after this list).
- Improving base-centroid initialization and warm-starting.
- Enhancing lookup speed through efficient architecture design (Vallaeys et al., 6 Jan 2025, Lahrichi et al., 19 Mar 2025).
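The sketch below illustrates, under explicit assumptions, how parameter sharing and cross-stage attention might be wired: a single network reused across all stages (distinguished by a learned stage embedding) attends over the centroids selected at earlier stages. This is a schematic reading of the QINCo2 ideas; the module name, embedding scheme, and attention layout are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class SharedStage(nn.Module):
    """Illustrative shared stage network with cross-stage attention (sketch only)."""
    def __init__(self, d, n_stages, K=256, n_heads=4):
        super().__init__()
        assert d % n_heads == 0                               # required by MultiheadAttention
        self.base = nn.Parameter(torch.randn(n_stages, K, d) * 0.1)  # per-stage base centroids
        self.stage_emb = nn.Embedding(n_stages, d)                   # shared weights, per-stage embedding
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, stage, x_hat, history):
        # stage: int; x_hat: (d,) partial reconstruction; history: (m, d) prior selected centroids
        c_bar = self.base[stage]                              # (K, d)
        q = x_hat.view(1, 1, -1)                              # query: current partial reconstruction
        kv = history.unsqueeze(0)                             # keys/values: the prior residual path
        ctx, _ = self.attn(q, kv, kv)                         # cross-stage attention context
        ctx = ctx.view(-1) + self.stage_emb(torch.tensor(stage))
        cond = torch.cat([c_bar, x_hat.expand_as(c_bar), ctx.expand_as(c_bar)], dim=-1)
        return c_bar + self.mlp(cond)                         # skip connection, as in QINCo
```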
3. Training Objectives and Algorithms
For end-to-end training, QINCo and QINCo2 optimize the summed squared quantization error across all residual stages over all training samples:

$$ \mathcal{L}(\theta) = \sum_{x \in \mathcal{X}} \sum_{m=1}^{M} \left\lVert x - \hat{x}_m \right\rVert^2 , \qquad \hat{x}_m = \sum_{j \le m} f_{\theta_j}(\bar{c}_{j,i_j}, \hat{x}_{j-1}) . $$

Key points (a minimal training-step sketch follows the list below):
- The quantization indices are obtained via hard nearest-neighbor search; no commitment loss or straight-through estimator is required.
- During training, gradients are backpropagated only into the parameters $\theta_m$ of the stage networks $f_{\theta_m}$ (and optionally the base centroids).
- In pipeline applications such as neural codecs (QINCODEC), encoder and quantizer are frozen when the decoder is fine-tuned; only the decoder receives gradients (Lahrichi et al., 19 Mar 2025).
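A minimal training-step sketch consistent with these points, assuming `QincoStage`-like modules as sketched earlier; detaching the partial reconstruction between stages is a simplification made here for clarity.

```python
import torch

def training_step(x, stages, optimizer):
    """One gradient step on the summed per-stage quantization error.
    `stages` is a list of QincoStage-like modules (illustrative)."""
    optimizer.zero_grad()
    x_hat = torch.zeros_like(x)
    loss = x.new_zeros(())
    for stage in stages:
        codebook = stage(x_hat)                                  # implicit codebook for this stage
        with torch.no_grad():                                    # hard nearest-neighbor assignment:
            residual = x - x_hat                                 # no commitment loss, no straight-through
            i = torch.argmin(((codebook - residual) ** 2).sum(-1))
        loss = loss + ((x - (x_hat + codebook[i])) ** 2).sum()   # per-stage reconstruction error
        x_hat = (x_hat + codebook[i]).detach()                   # advance the residual path (detached here)
    loss.backward()
    optimizer.step()
    return float(loss)
```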
The greedy (or beam) search during encoding selects, at each stage, the index minimizing the distortion to the current residual; decoding reconstructs from the sum of the selected refined centroids. In QINCo2, beam search with codeword pre-selection is used: a lightweight scorer pre-selects a small subset of candidates to minimize evaluation cost, and a beam of width $B$ maintains multiple partial reconstructions for improved accuracy at a higher computational load (Vallaeys et al., 6 Jan 2025).
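The pre-selection idea can be sketched as follows: rank candidates cheaply against the base centroids, then run the expensive neural refinement only on the shortlisted codewords (beam maintenance is omitted for brevity). The scorer shown is a stand-in; QINCo2's actual pre-selection model is learned.

```python
import numpy as np

def encode_with_preselection(x, base_codebooks, refine_fn, shortlist_size=8):
    """Greedy residual encoding with codeword pre-selection (illustrative sketch).
    refine_fn(c_bar_subset, x_hat) stands in for the stage network."""
    x_hat = np.zeros_like(x)
    codes = []
    for c_bar in base_codebooks:                              # c_bar: (K, d) base centroids
        residual = x - x_hat
        cheap = ((c_bar - residual) ** 2).sum(axis=1)         # lightweight scoring pass
        shortlist = np.argsort(cheap)[:shortlist_size]        # pre-select a few candidates
        refined = refine_fn(c_bar[shortlist], x_hat)          # neural refinement on the shortlist only
        best = int(np.argmin(((refined - residual) ** 2).sum(axis=1)))
        codes.append(int(shortlist[best]))
        x_hat = x_hat + refined[best]
    return codes, x_hat
```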
4. Pipeline Integration and Applications
QINCo/QINCo2 integrate directly into modular pipelines such as QINCODEC for neural audio compression (Lahrichi et al., 19 Mar 2025):
- Autoencoder pretraining: A continuous autoencoder is trained on the raw domain (e.g., waveforms for audio) with spectral and adversarial losses and no quantization bottleneck.
- Offline quantizer training: The pre-trained encoder produces a large set of latent representations. The QINCo2 quantizer is fit on this dataset, selecting the number of stages $M$ and codebook size $K$ to meet the target bitrate via $R = M \cdot \log_2(K) \cdot F$, where $F$ is the latent frame rate (see the worked example after this list).
- Decoder fine-tuning: With encoder and quantizer fixed, the decoder is fine-tuned using quantized latents. This stage restores fidelity lost to quantization, training only the decoder and discriminator.
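As a worked example of the bitrate relation above (all values below are hypothetical, chosen only to land on the 16 kbps setting reported later):

```python
import math

M = 8       # residual stages (hypothetical)
K = 1024    # codewords per stage -> log2(K) = 10 bits per stage (hypothetical)
F = 200     # latent frames per second produced by the encoder (hypothetical)

bitrate = M * math.log2(K) * F   # bits per second
print(bitrate)                   # 16000.0 -> 16 kbps
```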
In large-scale vector search, the QINCo2 encoding and decoding steps are adapted for billion-scale nearest neighbor indices. To expedite decoding for retrieval, pairwise additive decoders are trained on pairs of indices, approximating the neural decoded vector as a sum of a small set of table-lookup terms with minimal accuracy loss (Vallaeys et al., 6 Jan 2025).
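A sketch of the pairwise additive decoding idea: the neural decoder is distilled into lookup tables indexed by pairs of codes, so reconstruction reduces to a handful of table reads and additions. The consecutive-pair layout and the table shapes here are illustrative assumptions.

```python
import numpy as np

def decode_pairwise(codes, pair_tables):
    """Approximate the neural decoder with additive table lookups over consecutive
    code pairs. pair_tables[m] has shape (K, K, d) and maps the pair (i_m, i_{m+1})
    to a d-dimensional additive term. Illustrative sketch only."""
    terms = [pair_tables[m][codes[m], codes[m + 1]] for m in range(len(codes) - 1)]
    return np.sum(terms, axis=0)
```

Each table holds $K^2$ vectors of dimension $d$, which is why pairwise rather than higher-order combinations are the usual storage compromise.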
5. Empirical Results and Impact
QINCo2 consistently improves rate–distortion performance, ANN search recall, and codebook utilization compared to previous methods. Representative results (Vallaeys et al., 6 Jan 2025, Lahrichi et al., 19 Mar 2025):
| Dataset | Rate / code size | Metric | RQ / RVQ | QINCo | QINCo2 |
|---|---|---|---|---|---|
| BigANN1M | 16 B/v | MSE × 1e-4 | 1.30 | 0.32 | 0.18 |
| Deep1M | 8 B/v | Recall@1 (%) | 21.4 | 36.3 | 45.1 |
| QINCODEC (audio) | 16 kbps | SI-SDR (dB) | 6.09 | — | 7.22 |
| QINCODEC (audio) | 16 kbps | MS-Mel distance | 0.96 | — | 0.79 |
- On vector datasets, QINCo2 reduces reconstruction MSE by up to 34% over QINCo and raises Recall@1 by 24% for high-compression settings.
- On audio, replacing RVQ with QINCo2 in QINCODEC increases reconstructed SDR by ~1 dB and decreases mel error by ~0.2 across bitrates, with higher codebook perplexity indicating better codebook usage.
- QINCo2’s improvements are robust to overparameterization, larger data regimes, and various modality types (vision, speech, text embeddings) (Vallaeys et al., 6 Jan 2025, Lahrichi et al., 19 Mar 2025).
6. Extensions, Limitations, and Practical Considerations
Extensions:
- QINCo2’s architecture scales to dynamic multi-rate compression by truncating the number of residual stages at decode time with little MSE or recall loss (a minimal decoding sketch follows this list).
- Further gains are observed by extending from pairwise to higher-order combinatorial decoders, though at increased storage and training complexity.
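The truncation in the first extension is straightforward given the stage-wise decoder: run only the first m′ stage networks and ignore the remaining codes. A minimal sketch, assuming `QincoStage`-like modules as above:

```python
import torch

def decode_truncated(codes, stages, d, m_prime):
    """Dynamic multi-rate decoding sketch: reconstruct from only the first m_prime
    of M stages; later codes are ignored, giving a coarser reconstruction at a
    lower effective bitrate. `stages` are QincoStage-like modules (illustrative)."""
    x_hat = torch.zeros(d)
    for m in range(m_prime):
        codebook = stages[m](x_hat)          # implicit codebook given the partial reconstruction
        x_hat = x_hat + codebook[codes[m]]   # add the refined centroid selected at stage m
    return x_hat
```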
Operational Trade-offs:
- Encoding is necessarily slower than with fixed-codebook RQ because of the network evaluations, but pre-selection and beam-width choices expose a controllable accuracy–speed trade-off.
- Decoding through the full neural stack takes on the order of microseconds per vector, whereas pairwise additive decoders recover near-classical lookup efficiency with minimal distortion loss (Vallaeys et al., 6 Jan 2025).
Limitations and Open Challenges:
- Optimal design of pre-selection/beam strategies for variable-bitrate and low-power scenarios remains active research.
- Memory footprint of large, adaptive neural codebooks must be balanced against accuracy, especially for extremely high-dimensional data.
- Extending implicit neural codebooks to triplet- or higher-order recombinations could yield further accuracy, but increases index complexity and training overhead.
A plausible implication is that neural residual quantizers with implicit codebooks will increasingly supplant fixed-codebook baselines in high-fidelity compression and billion-scale retrieval, particularly as modeling and hardware advances further reduce inference and decode costs.
7. Relation to High-Order Residual Quantization in Network Acceleration
The QINCo2 methodology is closely related to high-order residual quantization (HORQ) in network binarization (Li et al., 2017). In HORQ, any vector $x \in \mathbb{R}^d$ is approximated by a sum of scaled signed binary vectors with decreasing residual energy:

$$ x \approx \sum_{i=1}^{K} \beta_i b_i , $$

with recursive updates $r_i = r_{i-1} - \beta_i b_i$ (starting from $r_0 = x$), where $b_i = \operatorname{sign}(r_{i-1}) \in \{-1,+1\}^d$ and $\beta_i = \tfrac{1}{d} \lVert r_{i-1} \rVert_1$.
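A short NumPy illustration of the HORQ recursion above; the function name and the fixed-order stopping rule are illustrative.

```python
import numpy as np

def horq_approximate(x, order):
    """High-order residual binarization: x ~= sum_i beta_i * b_i, where b_i is the
    sign of the running residual and beta_i its mean absolute value (the optimal
    scale for a sign vector). The residual energy decreases at every step."""
    r = np.asarray(x, dtype=np.float64).copy()
    terms = []
    for _ in range(order):
        b = np.sign(r)
        b[b == 0] = 1.0                      # keep entries in {-1, +1}
        beta = np.abs(r).mean()
        terms.append((beta, b))
        r = r - beta * b
    x_hat = sum(beta * b for beta, b in terms)
    return terms, x_hat

x = np.random.default_rng(1).normal(size=16)
terms, x_hat = horq_approximate(x, order=3)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # relative error shrinks with order
```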
QINCo2 generalizes this approach from binarized settings to vector quantization with data-adaptive neural codebooks, retaining the greedy residual structure but introducing complex, context-dependent centroids and learned mapping functions. In the context of neural network acceleration, the same principle enables layers that operate with multiple binary maps, providing improved accuracy–speed trade-offs by capturing more residual detail with each quantization step (Li et al., 2017).
The development of QINCo and QINCo2 represents a major advance in residual vector quantization by introducing flexible, data-conditioned, and tractable codebooks, leading to superior empirical performance in both lossy compression and approximate search (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025, Lahrichi et al., 19 Mar 2025).