Multi-Codebook Quantization
- Multi-Codebook Quantization is a vector quantization method that encodes high-dimensional vectors by combining (typically summing) codewords from multiple distinct codebooks, offering improved efficiency and lower quantization error.
- It employs various approaches such as additive, product, and residual quantization to optimize code utilization and balance computational load in applications like model compression and similarity search.
- MCQ underpins advances in neural network compression, semantic communication, and federated learning by enabling significant storage savings, robust performance, and adaptive codebook specialization.
Multi-Codebook Quantization (MCQ) is a broad family of vector quantization architectures in which multiple distinct codebooks are used to discretize and reconstruct high-dimensional vectors. Unlike single-codebook quantization, where each input vector is represented via a nearest neighbor in a single codebook, MCQ encodes a vector as a function—typically a sum—of codewords drawn from two or more codebooks. This approach underpins modern advances in model compression, representation learning, communication systems, large-scale retrieval, and knowledge distillation by enabling lower quantization error, improved codebook utilization, adaptive robustness, and more efficient computation.
1. Mathematical Formulation and Variants
Let $x \in \mathbb{R}^d$ be the vector to quantize, and suppose we have $M$ codebooks $C_1, \dots, C_M$, where $C_m = \{c_{m,1}, \dots, c_{m,K}\}$ (for $K$ codewords per codebook). MCQ expresses $x$ as either
- an additive composition:
$$\hat{x} = \sum_{m=1}^{M} c_{m,k_m}, \qquad c_{m,k_m} \in C_m \subset \mathbb{R}^d$$
(Additive Quantization, AQ, and general MCQ (Egiazarian et al., 2024, Vallaeys et al., 6 Jan 2025)),
- or via partitioning $x = [x^{(1)}, \dots, x^{(M)}]$ into $M$ disjoint subvectors and quantizing each separately:
$$\hat{x} = [\,c_{1,k_1}, \dots, c_{M,k_M}\,], \qquad c_{m,k_m} \in C_m \subset \mathbb{R}^{d/M}$$
(Product Quantization, PQ, and Multi-Codebook PQ (Yang et al., 2024)).
Another frequent MCQ instance is Residual Quantization (RQ), where quantization proceeds sequentially: at each stage $m = 1, \dots, M$, one quantizes the residual $r_m = x - \sum_{j=1}^{m-1} c_{j,k_j}$ with codebook $C_m$ (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
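For concreteness, the following NumPy sketch implements the PQ and RQ reconstruction rules above with toy random codebooks; in practice the codebooks are learned (e.g., via k-means or the methods of Section 2), and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, K = 8, 4, 16                       # dimension, codebooks, codewords each

# --- Product Quantization: M codebooks over disjoint (d/M)-dim subvectors ---
pq_books = rng.normal(size=(M, K, d // M))

def pq_encode(x):
    subs = x.reshape(M, d // M)
    return np.array([np.argmin(((pq_books[m] - subs[m]) ** 2).sum(1))
                     for m in range(M)])

def pq_decode(codes):
    return np.concatenate([pq_books[m][codes[m]] for m in range(M)])

# --- Residual Quantization: M codebooks over the full space, chained ---
rq_books = rng.normal(size=(M, K, d))

def rq_encode(x):
    codes, r = [], x.copy()
    for m in range(M):                   # greedily quantize the current residual
        k = np.argmin(((rq_books[m] - r) ** 2).sum(1))
        codes.append(k)
        r = r - rq_books[m][k]
    return np.array(codes)

def rq_decode(codes):
    return sum(rq_books[m][codes[m]] for m in range(M))

x = rng.normal(size=d)
print(np.linalg.norm(x - pq_decode(pq_encode(x))))   # PQ distortion
print(np.linalg.norm(x - rq_decode(rq_encode(x))))   # RQ distortion
```

Both schemes transmit $M \log_2 K$ bits per vector; they differ in how the codebooks couple (disjoint subspaces for PQ, chained residuals for RQ).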
Hybrid and specialized MCQ structures are used for cross-level semantics (Wu et al., 19 Jan 2025, Malidarreh et al., 13 Mar 2025), per-token or per-group assignment (Wang et al., 27 Oct 2025), or channel-adaptive assignments (Shin et al., 16 Apr 2025).
2. Training Algorithms and Quantization Strategies
The objective is to minimize the distortion,
$$\min_{\{C_m\},\, \{k_m(x)\}} \; \mathbb{E}_x \left\| x - \sum_{m=1}^{M} c_{m,k_m(x)} \right\|_2^2,$$
under constraints on total code length and computational cost. Several approaches are employed:
- Alternating minimization: Alternates between updating the codebooks (often via k-means or least-squares regression) and updating the code assignments (via maximum-likelihood, beam search, or greedy assignment). This guarantees monotonic improvement in quantization error (Egiazarian et al., 2024, Guo et al., 2022, Vallaeys et al., 6 Jan 2025).
- Neural residual codebooks: Each stage's codeword is generated by an MLP conditioned on the partial reconstruction so far, rather than by a fixed table. This provides context-aware codebook specialization and improved rate-distortion (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
- Beam search and codeword pre-selection: To address the exponential search space in AQ or neural MCQ, encoding can be accelerated with codeword pre-selection (via shallow selectors) and beam search for joint codebook assignment (Vallaeys et al., 6 Jan 2025).
- Straight-through estimator and Gumbel-softmax: For end-to-end differentiable MCQ modules, gradients are propagated through discrete assignments by replacing argmax with a straight-through estimator, or using soft relaxations (You et al., 22 Dec 2025, Shin et al., 16 Apr 2025).
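Of these, the straight-through estimator is the simplest to state in code. The following PyTorch sketch quantizes a batch against a single codebook; in an MCQ module the same operation is applied once per codebook (e.g., stage-wise on residuals). The function name and shapes are illustrative, not drawn from any cited implementation.

```python
import torch

def ste_quantize(x, codebook):
    """Nearest-codeword quantization with straight-through gradients.

    x:        (B, d) input vectors
    codebook: (K, d) codewords (e.g., an nn.Parameter)
    """
    # Squared distances between every input and every codeword: (B, K)
    d2 = (x.pow(2).sum(1, keepdim=True)
          - 2.0 * x @ codebook.t()
          + codebook.pow(2).sum(1))
    idx = d2.argmin(dim=1)            # hard, non-differentiable assignment
    x_q = codebook[idx]               # (B, d) selected codewords
    # Straight-through: the forward pass uses x_q, while gradients flow
    # back to x as if quantization were the identity map.
    return x + (x_q - x).detach(), idx
```

A Gumbel-softmax variant instead replaces the hard argmin with a temperature-annealed soft sample, trading bias for gradient variance.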
3. Architectural Designs and Practical MCQ Variants
The following table summarizes principal MCQ instantiations in representative domains:
| Variant | Codebook Coupling | Quantization Rule | Key Application Domains |
|---|---|---|---|
| Additive Quantization | Full, non-orthogonal | Sum of codewords | Model compression, vector search |
| Product Quantization | Disjoint subvectors | Subvector to codeword | Retrieval, federated learning |
| Residual Quantization | Residual chaining | Residual per stage | Large-scale search, compression |
| Dual-Codebook (VQ) | Channel-wise split | Sum of global+local code | Image/point cloud reconstruction |
| Switchable/Token-Spec. | Per-token/group assigned | Indexed per token/group | Face compression, adaptive coding |
| Neural (QINCo/Qinco2) | Context-dependent MLP | MLP codeword adjustment | Neural search, extreme compression |
Examples:
- Dual Codebook VQ partitions feature channels into global and local halves, each quantized by a separate codebook with distinct update mechanisms (transformer for global, EMA for local), and summed for reconstruction (Malidarreh et al., 13 Mar 2025).
- Token-specific MCQ assigns different codebooks to different spatial/semantic tokens, with a routing module minimizing code length per token while enhancing expressivity (Wang et al., 27 Oct 2025); a schematic sketch of this routing pattern follows the list below.
- ESC-MVQ uses multiple codebooks tailored to varying channel conditions, with per-symbol adaptation and power allocation optimized for end-to-end semantic communication (Shin et al., 16 Apr 2025).
- QINCo2 employs implicit, neural, context-adaptive codebooks with beam search for code assignment, achieving state-of-the-art MSE and recall (Vallaeys et al., 6 Jan 2025).
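The routing pattern referenced above can be sketched as a hypothetical PyTorch module in which a small router selects one of several codebooks per token before nearest-codeword assignment. This is a schematic of the general per-token idea, not the architecture of any cited paper; in practice the discrete routing and assignment would be trained with the straight-through or Gumbel-softmax relaxations of Section 2.

```python
import torch
import torch.nn as nn

class RoutedMCQ(nn.Module):
    """Schematic per-token codebook routing (hypothetical)."""
    def __init__(self, d=64, n_books=4, K=256):
        super().__init__()
        self.books = nn.Parameter(torch.randn(n_books, K, d))
        self.router = nn.Linear(d, n_books)      # scores one codebook per token

    def forward(self, tokens):                   # tokens: (B, T, d)
        book_id = self.router(tokens).argmax(-1)            # (B, T) routing
        books = self.books[book_id]                         # (B, T, K, d)
        d2 = ((books - tokens.unsqueeze(-2)) ** 2).sum(-1)  # (B, T, K)
        idx = d2.argmin(-1)                                 # (B, T) assignment
        x_q = torch.gather(
            books, 2,
            idx[..., None, None].expand(-1, -1, 1, books.size(-1)),
        ).squeeze(2)                                        # (B, T, d)
        return x_q, book_id, idx
```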
4. MCQ in Knowledge Distillation and Representation Compression
In large-scale knowledge distillation, MCQ is used to discretize teacher embeddings into compact code index sequences, producing hard labels that student models predict via cross-entropy loss (Guo et al., 2022, You et al., 22 Dec 2025); a minimal sketch of the resulting training objective follows the list below. This approach yields:
- Extreme storage savings: Reducing per-frame storage of teacher signals by 128–256× (e.g., from 4096 bytes to 16 bytes per frame) with negligible performance drop (Guo et al., 2022, You et al., 22 Dec 2025).
- Streaming and low-latency: Precomputed MCQ indices enable fast student training, avoiding repeated teacher inference (You et al., 22 Dec 2025).
- Task robustness: Empirical ablations show maintained or improved WER and PER in ASR, even compared to regression (Guo et al., 2022, You et al., 22 Dec 2025).
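The sketch below illustrates the distillation objective described above, assuming RQ-style codes: teacher embeddings are encoded offline into $M$ index sequences, and the student carries $M$ classification heads over $K$ codewords each. Shapes and names are illustrative, not taken from the cited systems.

```python
import torch
import torch.nn.functional as F

B, T, d, M, K = 4, 100, 256, 8, 256          # illustrative sizes

codes = torch.randint(K, (B, T, M))          # precomputed teacher code indices
student_logits = torch.randn(B, T, M, K)     # outputs of M per-codebook heads

# One cross-entropy term per codebook, averaged over the M codebooks.
loss = sum(
    F.cross_entropy(student_logits[..., m, :].reshape(-1, K),
                    codes[..., m].reshape(-1))
    for m in range(M)
) / M
```

With $K = 256$, each frame's target costs $M$ bytes of storage rather than $4d$ bytes of raw float32 teacher features, which is the source of the savings quoted above.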
In model compression, MCQ supports compression of massive model weights (e.g., LLAMA-13B) to below 3 bits/parameter, with block-wise joint optimization and input-aware fine-tuning (AQLM (Egiazarian et al., 2024)). Jointly learnable multi-codebooks with clustering and structured mapping enable LLMs to be deployed on resource-constrained devices while retaining up to 95% accuracy (Yvinec et al., 2023).
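The achievable rate follows directly from the code structure. As a worked example with hypothetical settings (groups of $g$ weights quantized jointly by $M$ codebooks of $K$ codewords each), the cost is $M \log_2 K / g$ bits per parameter:

```python
import math

g, M, K = 8, 1, 2 ** 16          # hypothetical: group size, codebooks, codewords
bits_per_param = M * math.log2(K) / g
print(bits_per_param)            # 2.0 bits/parameter, inside the sub-3-bit regime
```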
5. Applications in Communication, Search, and Federated Learning
MCQ architectures are used in both semantic communication systems and large-scale similarity search:
- Semantic communication: MCQ variants such as multilevel codebook + RVQ (Zhou et al., 2024) and ESC-MVQ (Shin et al., 16 Apr 2025) enable bandwidth-efficient digital transmission by closely matching codebook structure to channel constraints. Multi-head, small-arity codebooks map to QAM constellations for robust performance at fixed or adaptive SNR (Zhou et al., 2024).
- Federated learning: FedMPQ applies MCQ with product quantization for secure, communication-efficient aggregation of client updates. Multiple server-maintained, client-informed codebooks are used per round, with selection via reconstruction error minimization and in-TEE aggregation to guarantee privacy and convergence under non-IID data (Yang et al., 2024); a simplified sketch of this compression path appears after this list.
- Nearest neighbor search: QINCo/QINCo2 MCQ with context-conditioned neural decoders improve compression and top-1 recall, reducing MSE by up to 34% and improving recall@1 by up to 24% over prior PQ/RQ approaches (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
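A simplified sketch of the federated compression path: each client sends per-subvector code indices instead of raw floats, and the server decodes and averages. Codebook learning, selection across multiple codebooks, secure aggregation, and the TEE machinery of the cited work are all omitted; names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 1024, 64, 256                  # update length, subvectors, codewords
books = rng.normal(size=(M, K, D // M))  # server-maintained PQ codebooks

def compress(update):                    # -> (M,) uint8 code indices
    subs = update.reshape(M, D // M)
    return np.array([np.argmin(((books[m] - subs[m]) ** 2).sum(1))
                     for m in range(M)], dtype=np.uint8)

def decompress(codes):
    return np.concatenate([books[m][codes[m]] for m in range(M)])

# Each client uploads M = 64 bytes instead of 4*D = 4096 bytes (64x less).
client_updates = [rng.normal(size=D) for _ in range(5)]
avg_update = np.mean([decompress(compress(u)) for u in client_updates], axis=0)
```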
6. Empirical Performance, Limitations, and Outlook
MCQ architectures have established new state-of-the-art results in diverse domains:
- Image/text: Dual Codebook VQ achieves FID improvements of 4–7 points and up to 1 dB PSNR gain versus same-memory single-codebook VQ (Malidarreh et al., 13 Mar 2025).
- Speech: MCQ-based KD reduces WER and PER, with 128–256× storage savings (Guo et al., 2022, You et al., 22 Dec 2025).
- Model compression: Pareto-optimality in accuracy versus size is reached below 3 bits/weight in LLMs (Egiazarian et al., 2024, Yvinec et al., 2023).
- Communication: Bandwidth and distortion efficiency are improved under challenging channel conditions via codebook/channel co-design (Shin et al., 16 Apr 2025, Zhou et al., 2024).
- Retrieval: MCQ with neural decoders yields MSE and recall@1 improvements of up to 34% and 24%, respectively, over RQ/PQ (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
Limitations include increased encoding complexity (especially with beam search), potential bottlenecks in codebook size growth (for per-token/group MCQ), and the requirement for large, representative training data sets for maximum efficacy (Vallaeys et al., 6 Jan 2025, Wang et al., 27 Oct 2025). Future directions highlighted in these works encompass dynamic bit-allocation, higher-order dependency modeling (beyond pairs), efficient codebook architectures, and further extensions to activations or hybrid quantization paradigms (Vallaeys et al., 6 Jan 2025, Egiazarian et al., 2024).
7. Comparisons to Single-Codebook Schemes and Design Considerations
MCQ decisively overcomes several limitations of single-codebook quantization methods:
- Code utilization: MCQ achieves close to 100% code utilization versus as low as 40–60% in high-cardinality single-codebook schemes (Malidarreh et al., 13 Mar 2025, Wang et al., 27 Oct 2025); the snippet after this list shows one way to measure this.
- Rate-distortion superiority: Additive or residual sum MCQ achieves dramatically lower distortion for the same number of bits/indices (Egiazarian et al., 2024, Vallaeys et al., 6 Jan 2025, Guo et al., 2022).
- Adaptive robustness: By training multiple codebooks for different features, tokens, channel conditions, or residuals, MCQ architectures offer situational or spatially-aware adaptation versus the “one-size-fits-all” limitation of global codebooks (Shin et al., 16 Apr 2025, Wang et al., 27 Oct 2025).
- Computation and storage: Memory and compute cost can be closely matched to single-codebook approaches via strategic dimension/channel splitting or codebook pooling (see Dual-Codebook VQ (Malidarreh et al., 13 Mar 2025)).
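Code utilization, as referenced in the first item above, is straightforward to monitor. The snippet below computes the fraction of live codewords and the perplexity of the empirical code distribution, two common diagnostics for codebook collapse; it is a generic utility, not tied to any cited system.

```python
import numpy as np

def code_utilization(indices, K):
    """Return (fraction of codewords used, perplexity of code histogram)."""
    counts = np.bincount(indices.ravel(), minlength=K)
    p = counts / counts.sum()
    usage = (counts > 0).mean()
    perplexity = np.exp(-(p[p > 0] * np.log(p[p > 0])).sum())
    return usage, perplexity

# Example: if only half of a 1024-entry codebook is ever selected,
# usage comes out near 0.5, signalling dead codes.
idx = np.random.default_rng(2).integers(0, 512, size=100_000)
print(code_utilization(idx, K=1024))
```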
Practitioners must consider codebook cardinality, the assignment/routing mechanism, algorithmic scalability, and per-application trade-offs between codebook specialization and assignment overhead (Wang et al., 27 Oct 2025, Yang et al., 2024).
Multi-Codebook Quantization stands as a foundational paradigm for efficient, adaptive, and high-fidelity vector discretization across modern machine learning, communication, and information retrieval systems. Its evolving forms—ranging from additive, product, and residual quantization to neural, switchable, and adaptive codebook architectures—continue to advance the frontiers of model compression, representation learning, and signal processing.