
Multi-Codebook Quantization

Updated 4 February 2026
  • Multi-Codebook Quantization is a vector quantization method that encodes high-dimensional vectors by summing codewords from multiple distinct codebooks, offering improved efficiency and lower quantization error.
  • It employs various approaches such as additive, product, and residual quantization to optimize code utilization and balance computational load in applications like model compression and similarity search.
  • MCQ underpins advances in neural network compression, semantic communication, and federated learning by enabling significant storage savings, robust performance, and adaptive codebook specialization.

Multi-Codebook Quantization (MCQ) is a broad family of vector quantization architectures in which multiple distinct codebooks are used to discretize and reconstruct high-dimensional vectors. Unlike single-codebook quantization, where each input vector is represented via a nearest neighbor in a single codebook, MCQ encodes a vector as a function—typically a sum—of codewords drawn from two or more codebooks. This approach underpins modern advances in model compression, representation learning, communication systems, large-scale retrieval, and knowledge distillation by enabling lower quantization error, improved codebook utilization, adaptive robustness, and more efficient computation.

1. Mathematical Formulation and Variants

Let $x \in \mathbb{R}^d$ be the vector to quantize, and suppose we have $M$ codebooks $\{C^{(1)},\ldots,C^{(M)}\}$, where $C^{(m)} \in \mathbb{R}^{K \times d}$ (for $K$ codewords per codebook). MCQ expresses $x$ as either

  • an additive composition:

$$\hat x = \sum_{m=1}^M c^{(m)}_{e_m},\quad e_m \in \{1,\ldots,K\}$$

(Additive Quantization, AQ, and general MCQ (Egiazarian et al., 2024, Vallaeys et al., 6 Jan 2025)),

  • or via partitioning $x$ into $M$ disjoint subvectors and quantizing each separately:

$$x = [x^{(1)},\ldots,x^{(M)}],\quad \hat x = [c^{(1)}_{e_1}, \ldots, c^{(M)}_{e_M}]$$

(Product Quantization, PQ, and Multi-Codebook PQ (Yang et al., 2024)).
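As a concrete illustration of the PQ rule above, here is a minimal NumPy sketch with a deliberately naive k-means; the helper names are our own, and production systems use optimized libraries such as faiss:

```python
import numpy as np

def pq_train(X, M, K, iters=10):
    """Train M sub-codebooks of K centroids each by running a naive
    k-means independently on each subvector slice of the data X."""
    d = X.shape[1]
    ds = d // M  # subvector dimension (assumes d is divisible by M)
    codebooks = []
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        C = sub[np.random.choice(len(sub), K, replace=False)].copy()
        for _ in range(iters):
            # assign each subvector to its nearest centroid
            dist = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(K):
                pts = sub[assign == k]
                if len(pts):  # keep old centroid if cluster is empty
                    C[k] = pts.mean(0)
        codebooks.append(C)
    return codebooks

def pq_encode(x, codebooks):
    """Code e_m = index of the nearest codeword for each subvector."""
    ds = len(x) // len(codebooks)
    return [int(((x[m * ds:(m + 1) * ds] - C) ** 2).sum(1).argmin())
            for m, C in enumerate(codebooks)]

def pq_decode(codes, codebooks):
    """Reconstruct by concatenating the selected codewords."""
    return np.concatenate([C[c] for c, C in zip(codes, codebooks)])
```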

Another frequent MCQ instance is Residual Quantization (RQ), where quantization proceeds sequentially: at each stage $m$, one quantizes the residual $r^{(m)} = x - \sum_{i=1}^{m-1} c^{(i)}_{e_i}$ with codebook $C^{(m)}$ (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
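The residual chaining follows directly from this formula. A greedy encoder can be sketched as below (beam-search encoders instead keep several candidate code sequences per stage):

```python
import numpy as np

def rq_encode(x, codebooks):
    """Greedy residual quantization: at stage m, pick the codeword in
    C^(m) nearest to the current residual, then subtract it."""
    residual = x.copy()
    codes = []
    for C in codebooks:  # each C has shape (K, d)
        k = int(((residual - C) ** 2).sum(axis=1).argmin())
        codes.append(k)
        residual = residual - C[k]
    return codes

def rq_decode(codes, codebooks):
    """Additive reconstruction: x_hat = sum of one codeword per codebook."""
    return sum(C[k] for k, C in zip(codes, codebooks))
```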

Hybrid and specialized MCQ structures are used for cross-level semantics (Wu et al., 19 Jan 2025, Malidarreh et al., 13 Mar 2025), per-token or per-group assignment (Wang et al., 27 Oct 2025), or channel-adaptive assignments (Shin et al., 16 Apr 2025).

2. Training Algorithms and Quantization Strategies

The objective is to minimize the distortion,

$$E(x) = \|x - \hat x\|_2^2$$

under constraints on total code length and computational cost. Several approaches are employed:

  • Alternating minimization: Alternates between updating the codebooks (often via k-means or least-squares regression) and updating the code assignments (via maximum-likelihood, beam search, or greedy assignment). This guarantees monotonic improvement in quantization error (Egiazarian et al., 2024, Guo et al., 2022, Vallaeys et al., 6 Jan 2025).
  • Neural residual codebooks: Each stage's codeword is generated by an MLP conditioned on the partial reconstruction so far, rather than by a fixed table. This provides context-aware codebook specialization and improved rate-distortion (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
  • Beam search and codeword pre-selection: To address the exponential search space in AQ or neural MCQ, encoding can be accelerated with codeword pre-selection (via shallow selectors) and beam search for joint codebook assignment (Vallaeys et al., 6 Jan 2025).
  • Straight-through estimator and Gumbel-softmax: For end-to-end differentiable MCQ modules, gradients are propagated through discrete assignments by replacing argmax with a straight-through estimator, or using soft relaxations (You et al., 22 Dec 2025, Shin et al., 16 Apr 2025).
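A toy version of the alternating scheme for the additive model, assuming ICM-style coordinate descent for the assignments and a joint least-squares refit for the codebooks (a simplified sketch, not the exact AQLM/QINCo procedure):

```python
import numpy as np

def aq_alternating(X, M, K, iters=5, seed=0):
    """Alternating minimization for additive quantization (sketch).
    Each full iteration is guaranteed not to increase the distortion:
    the assignment step picks the best codeword per codebook with the
    others fixed, and the codebook step solves a least-squares refit."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    C = rng.standard_normal((M, K, d)) * 0.1
    codes = rng.integers(0, K, size=(N, M))
    for _ in range(iters):
        # --- assignment step: coordinate descent over codebooks ---
        for m in range(M):
            partial = X - sum(C[j][codes[:, j]] for j in range(M) if j != m)
            dist = ((partial[:, None, :] - C[m][None, :, :]) ** 2).sum(-1)
            codes[:, m] = dist.argmin(1)
        # --- codebook step: joint least-squares refit of all codewords ---
        B = np.zeros((N, M * K))  # one-hot design matrix over all codes
        for m in range(M):
            B[np.arange(N), m * K + codes[:, m]] = 1.0
        C = np.linalg.lstsq(B, X, rcond=None)[0].reshape(M, K, d)
    err = ((X - sum(C[m][codes[:, m]] for m in range(M))) ** 2).mean()
    return C, codes, err
```

Because both steps minimize the same objective, distortion decreases monotonically across iterations, which is the guarantee cited above.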

3. Architectural Designs and Practical MCQ Variants

The following table summarizes principal MCQ instantiations in representative domains:

| Variant | Codebook Coupling | Quantization Rule | Key Application Domains |
|---|---|---|---|
| Additive Quantization | Full, non-orthogonal | Sum of $M$ codewords | Model compression, vector search |
| Product Quantization | Disjoint subvectors | Subvector to codeword | Retrieval, federated learning |
| Residual Quantization | Residual chaining | Residual per stage | Large-scale search, compression |
| Dual-Codebook (VQ) | Channel-wise split | Sum of global + local code | Image/point cloud reconstruction |
| Switchable/Token-Specific | Per-token/group assignment | Indexed per token/group | Face compression, adaptive coding |
| Neural (QINCo/QINCo2) | Context-dependent MLP | MLP codeword adjustment | Neural search, extreme compression |

Examples:

  • Dual Codebook VQ partitions feature channels into global and local halves, each quantized by a separate codebook with distinct update mechanisms (transformer for global, EMA for local), and summed for reconstruction (Malidarreh et al., 13 Mar 2025).
  • Token-specific MCQ assigns different codebooks to different spatial/semantic tokens, with a routing module minimizing code length per token while enhancing expressivity (Wang et al., 27 Oct 2025).
  • ESC-MVQ uses multiple codebooks tailored to varying channel conditions, with per-symbol adaptation and power allocation optimized for end-to-end semantic communication (Shin et al., 16 Apr 2025).
  • QINCo2 employs implicit, neural, context-adaptive codebooks with beam search for code assignment, achieving state-of-the-art MSE and recall (Vallaeys et al., 6 Jan 2025).
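The routing idea behind token-specific MCQ can be illustrated generically. The sketch below is a hypothetical error-minimizing router, not the specific module of (Wang et al., 27 Oct 2025): each token is assigned whichever codebook reconstructs it best.

```python
import numpy as np

def switchable_quantize(tokens, codebooks):
    """Per-token codebook routing (generic sketch): each token gets the
    (codebook id, codeword id) pair with the lowest squared error."""
    out_codes = []
    for t in tokens:                       # tokens: (N, d)
        best = None
        for b, C in enumerate(codebooks):  # each C: (K, d)
            d2 = ((t - C) ** 2).sum(1)
            k = int(d2.argmin())
            if best is None or d2[k] < best[2]:
                best = (b, k, d2[k])
        out_codes.append(best[:2])         # (codebook id, codeword id)
    return out_codes
```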

4. MCQ in Knowledge Distillation and Representation Compression

In large-scale knowledge distillation, MCQ is used to discretize teacher embeddings into compact code index sequences, producing hard labels that student models predict via cross-entropy loss (Guo et al., 2022, You et al., 22 Dec 2025). This approach yields:

  • Extreme storage savings: Reducing per-frame storage of teacher signals by 128–256$\times$ (e.g., 4096 bytes $\rightarrow$ 16 bytes for $D=512$, $M=16$, $K=256$) with negligible performance drop (Guo et al., 2022, You et al., 22 Dec 2025).
  • Streaming and low-latency: Precomputed MCQ indices enable fast student training, avoiding repeated teacher inference (You et al., 22 Dec 2025).
  • Task robustness: Empirical ablations show maintained or improved WER and PER in ASR, even compared to $\ell_2$ regression (Guo et al., 2022, You et al., 22 Dec 2025).
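The storage arithmetic behind these figures is easy to check: an MCQ code stores $M$ indices of $\log_2 K$ bits each (the 4096-byte figure corresponds to 8 bytes per dimension, an assumption on our part):

```python
import math

def mcq_code_bytes(M, K):
    """Bytes needed to store one vector's MCQ code:
    M indices of log2(K) bits each."""
    return M * math.log2(K) / 8

raw = 512 * 8                      # D=512 dims at 8 bytes each -> 4096 bytes
coded = mcq_code_bytes(16, 256)    # M=16 indices of 8 bits -> 16 bytes
print(raw, coded, raw / coded)     # 4096 16.0 256.0
```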

In model compression, MCQ supports compression of massive model weights (e.g., LLAMA-13B) to below 3 bits/parameter, with block-wise joint optimization and input-aware fine-tuning (AQLM (Egiazarian et al., 2024)). Jointly learnable multi-codebooks with clustering and structured mapping enable LLMs to be deployed on resource-constrained devices while retaining up to 95% accuracy (Yvinec et al., 2023).

5. Applications in Communication, Search, and Federated Learning

MCQ architectures are used in both semantic communication systems and large-scale similarity search:

  • Semantic communication: MCQ variants such as multilevel codebook + RVQ (Zhou et al., 2024) and ESC-MVQ (Shin et al., 16 Apr 2025) enable bandwidth-efficient digital transmission by closely matching codebook structure to channel constraints. Multi-head, small-arity codebooks map to QAM constellations for robust performance at fixed or adaptive SNR (Zhou et al., 2024).
  • Federated learning: FedMPQ applies MCQ with product quantization for secure, communication-efficient aggregation of client updates. Multiple server-maintained, client-informed codebooks are used per round, with selection via reconstruction error minimization and in-TEE aggregation to guarantee privacy and convergence under non-IID data (Yang et al., 2024).
  • Nearest neighbor search: QINCo/QINCo2 MCQ with context-conditioned neural decoders improve compression and top-1 recall, outperforming previous PQ/RQ approaches by 24–34% in recall and MSE (Huijben et al., 2024, Vallaeys et al., 6 Jan 2025).
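For PQ-style codes, search typically avoids explicit decompression via asymmetric distance computation (ADC): per-query lookup tables reduce each distance evaluation to $M$ table reads. A minimal sketch of that mechanism (illustrative only; neural decoders such as QINCo replace the fixed tables):

```python
import numpy as np

def adc_tables(q, codebooks):
    """Per-query lookup tables: tables[m][k] = squared distance between
    the m-th query subvector and codeword k of codebook m."""
    ds = len(q) // len(codebooks)
    return np.stack([((q[m * ds:(m + 1) * ds] - C) ** 2).sum(1)
                     for m, C in enumerate(codebooks)])

def adc_search(q, db_codes, codebooks, topk=1):
    """Approximate the distance to every encoded database vector with
    M table lookups each, then return the topk nearest indices."""
    T = adc_tables(q, codebooks)                                 # (M, K)
    dists = T[np.arange(len(codebooks))[:, None], db_codes.T].sum(0)
    return np.argsort(dists)[:topk]
```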

6. Empirical Performance, Limitations, and Outlook

MCQ architectures have established new state-of-the-art results across compression, retrieval, and communication domains.

Limitations include increased encoding complexity (especially with beam search), potential bottlenecks in codebook size growth (for per-token/group MCQ), and the requirement for large, representative training data sets for maximum efficacy (Vallaeys et al., 6 Jan 2025, Wang et al., 27 Oct 2025). Future directions highlighted in these works encompass dynamic bit-allocation, higher-order dependency modeling (beyond pairs), efficient codebook architectures, and further extensions to activations or hybrid quantization paradigms (Vallaeys et al., 6 Jan 2025, Egiazarian et al., 2024).

7. Comparisons to Single-Codebook Schemes and Design Considerations

MCQ overcomes several limitations of single-codebook quantization: $M$ codebooks of $K$ entries span $K^M$ reconstructions, a representational capacity that a single codebook could only match by growing exponentially in size.

Practitioners must consider codebook cardinality, the assignment/routing mechanism, algorithmic scalability, and per-application trade-offs between codebook specialization and assignment overhead (Wang et al., 27 Oct 2025, Yang et al., 2024).


Multi-Codebook Quantization stands as a foundational paradigm for efficient, adaptive, and high-fidelity vector discretization across modern machine learning, communication, and information retrieval systems. Its evolving forms—ranging from additive, product, and residual quantization to neural, switchable, and adaptive codebook architectures—continue to advance the frontiers of model compression, representation learning, and signal processing.
