
Sub-Codebooks for Quantization

Updated 13 July 2025
  • Sub-codebooks for quantization are structured partitions that improve representation efficiency by dividing large codebooks into specialized subsets.
  • They align sub-codebooks with data-dependent regions to optimize rate-distortion trade-offs and reduce computational and memory overhead.
  • Applications span adaptive deep network quantization, efficient beamforming in communications, and enhanced representation learning in generative models and speech.

Sub-codebooks for quantization are structured subsets or partitions of larger codebooks, employed within quantization schemes to improve representation efficiency, optimize rate-distortion trade-offs, and enhance computational and memory efficiency. By dividing a full codebook into multiple sub-codebooks—each specialized for a segment, spatial region, residual, or other data-dependent partition—quantizers more flexibly capture data heterogeneity, mitigate codebook collapse, and facilitate hierarchical or compositional quantization. Recent literature demonstrates sub-codebooks' relevance across domains such as high-rate vector quantization, large-scale approximate nearest neighbor search, adaptive deep network quantization, efficient beamforming in communications, and representation learning for generative models and speech.

1. Theoretical Foundations and Rate–Distortion Perspective

The motivation for sub-codebooks is rooted in finite block-length rate–distortion theory, which quantifies how codebook partitioning impacts achievable distortion. The optimal rate–distortion function $D^*$ is achievable only asymptotically as the block length $n \to \infty$; for finite $n$, there is a nonzero excess distortion $\Delta D^n$. The excess can be bounded via a convex optimization framework as

$$D^n = D^* + \frac{\hat\lambda}{n}\left[nR - \sum_{j=1}^{Q}\int_{R_j} p(x^n)\,\log\frac{\hat q(x^n \mid y_j)}{p(x^n)}\,dx^n\right]$$

where the sum/integral is over the codebook regions $R_j$ and $\hat q(x^n \mid y_j)$ is the optimal reconstruction conditional density for reconstruction point $y_j$ (Gong et al., 2013).

The structure of these bounds indicates that the way codebook regions are delineated (or, equivalently, how the codebook is partitioned into sub-codebooks) directly determines quantizer performance. In particular, the optimal partitioning of the source space into regions—each potentially served by a different sub-codebook—enables the quantizer to "capture" high-probability regions efficiently (for example, by aligning sub-codebooks with clusters in the data or geometric shells in high-dimensional Gaussians).
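The region-then-sub-codebook idea above can be sketched with plain k-means: first partition the source space into data-dependent regions, then fit a small dedicated sub-codebook inside each. This is a minimal illustrative sketch (NumPy, Lloyd iterations standing in for any optimal quantizer design), not the construction from the cited paper; all names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, iters=20):
    """Plain Lloyd iterations: returns (centroids, assignments)."""
    c = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        a = np.argmin(((x[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(a == j):
                c[j] = x[a == j].mean(axis=0)
    return c, a

# A heterogeneous source: a two-cluster Gaussian mixture.
x = np.concatenate([rng.normal(-3, 0.5, (500, 2)),
                    rng.normal(+3, 0.5, (500, 2))])

# Stage 1: partition the source space into high-probability regions.
regions, assign = kmeans(x, k=2)

# Stage 2: train a dedicated sub-codebook inside each region.
sub_codebooks = [kmeans(x[assign == j], k=8)[0] for j in range(2)]

def quantize(v):
    j = np.argmin(((regions - v) ** 2).sum(-1))        # pick region
    cb = sub_codebooks[j]
    return cb[np.argmin(((cb - v) ** 2).sum(-1))]      # nearest codeword

err = np.mean([((quantize(v) - v) ** 2).sum() for v in x])
print(f"mean squared distortion: {err:.4f}")
```

Because each sub-codebook only has to cover one tight region, its 8 codewords go much further than 16 codewords spread over the whole mixture would.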

2. Hierarchical and Compositional Approaches

Hierarchical quantization strategies (such as residual or stacked quantization) leverage sub-codebooks at different stages, each targeting the residual error left by previous stages. In the Stacked Quantizer (SQ) framework, a set of $m$ sub-codebooks $\{C_1, \dots, C_m\}$ is used sequentially:

  • At each stage $i$, the residual $r_{i-1}$ is quantized by $C_i$ and updated as $r_i = r_{i-1} - C_i b_i$, where $b_i$ is a one-hot code.
  • The final quantized approximation of $x$ is $x \approx \sum_{i=1}^m C_i b_i$ (Martinez et al., 2014).

This approach enables each sub-codebook to focus only on the structure not captured by earlier stages, yielding compositional representation with reduced redundancy and enhanced reconstruction accuracy for a given global codebook cardinality. Complexity-wise, hierarchical schemes like SQ retain near-linear encoding cost and can be orders of magnitude more efficient than fully dependent additive quantization schemes.
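The stagewise recursion can be sketched in a few lines. This is a simplified illustration of the residual scheme (k-means per stage stands in for the full SQ training procedure, and encoding is greedy nearest-codeword per stage); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stacked_quantizer(x, m=4, k=16, iters=20):
    """Sequentially fit m sub-codebooks, each on the residual of the last."""
    codebooks, residual = [], x.copy()
    for _ in range(m):
        c = residual[rng.choice(len(residual), k, replace=False)].copy()
        for _ in range(iters):
            a = np.argmin(((residual[:, None] - c[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(a == j):
                    c[j] = residual[a == j].mean(0)
        codebooks.append(c)
        residual = residual - c[a]      # pass the residual to the next stage
    return codebooks

def encode(v, codebooks):
    codes, r = [], v.copy()
    for c in codebooks:
        b = np.argmin(((c - r) ** 2).sum(-1))   # greedy nearest codeword
        codes.append(b)
        r = r - c[b]
    return codes

def decode(codes, codebooks):
    return sum(c[b] for b, c in zip(codes, codebooks))

x = rng.normal(size=(2000, 8))
cbs = train_stacked_quantizer(x)
v_hat = decode(encode(x[0], cbs), cbs)
print("reconstruction error:", np.linalg.norm(x[0] - v_hat))
```

With $m = 4$ stages of $k = 16$ codewords each, the quantizer spends 16 bits per vector while every individual codebook search remains a 16-way comparison, which is the near-linear encoding cost noted above.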

Hierarchical or residual sub-codebook-based designs are also prominent in neural codebook methods. For example, in QINCo2, neural networks generate stepwise codebooks conditioned on previous reconstruction, and pairs of codebook indices are merged into joint sub-codebooks to capture dependencies between codewords, thus improving rate-distortion efficiency in large-scale vector compression and search (Vallaeys et al., 6 Jan 2025).
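The pair-merging idea can be illustrated generically: given two stage codebooks, precomputing the sum for every index pair yields one joint sub-codebook, so a dependent pair of codewords decodes in a single lookup. This is only a toy illustration of merging index pairs, not QINCo2 itself (whose codebooks are generated by a neural network conditioned on the running reconstruction).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 16, 8
C1 = rng.standard_normal((K, d))   # stage-1 sub-codebook
C2 = rng.standard_normal((K, d))   # stage-2 sub-codebook

# Joint table: entry (i, j) holds C1[i] + C2[j], flattened to K*K rows.
joint = (C1[:, None, :] + C2[None, :, :]).reshape(K * K, d)

def decode_pair(i, j):
    """Decode a merged index pair with one table lookup instead of two adds."""
    return joint[i * K + j]

print(joint.shape)   # (256, 8)
```

The joint table trades memory ($K^2$ entries) for lookup speed, and, when the table is re-optimized jointly rather than built from sums, it can also capture statistical dependencies between the two codewords.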

3. Sub-Codebooks in Communication Systems and Signal Processing

Sub-codebooks have practical importance in communications—especially in massive MIMO and millimeter wave systems. In these domains, codebook design is tightly coupled with the physical and statistical structure of the channel.

  • Spatial Partitioning: Non-uniform quantization (NUQ) codebooks exploit the sparsity of mmWave channels by assigning more quantization bits to effective angular regions (spatial lobes) and using dedicated codebooks for these sectors. For the $i$th spatial lobe, the codebook is constructed over its coverage angle $\mathcal{CV}(SL_i) = [\tilde\theta_i - \omega_i/2,\ \tilde\theta_i + \omega_i/2]$ with high-resolution quantization, whereas ineffective regions are ignored (Chen et al., 2018).
  • Hybrid and Hierarchical Feedback: Trellis-extended codebooks (TEC) and trellis-extended successive phase adjustment (TE-SPA) deploy a trellis framework—effectively a sequence of interdependent sub-codebooks corresponding to state transitions—thus enabling fractional bits per entry, reduced search complexity, and efficient exploitation of temporal or spatial channel correlation (Choi et al., 2014).
  • Beam Management and Neural Design: In 5G NR, codebooks for different beam management stages (initial access, channel sounding, feedback) are optimized via learning-based techniques which, in essence, coordinate a sequence of sub-codebooks adapted to different granularity or feedback constraints (Dreifuerst et al., 2023).
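The spatial-lobe construction in the first bullet can be sketched for a uniform linear array: all of a lobe's quantization budget is spent on beams inside its coverage angle, and nothing outside it. This is a simplified stand-in for the NUQ construction cited above (uniform angular spacing, half-wavelength ULA); the parameter values are illustrative.

```python
import numpy as np

def steering_vector(theta, n_ant):
    """ULA steering vector for angle theta (radians), half-wavelength spacing."""
    k = np.arange(n_ant)
    return np.exp(1j * np.pi * k * np.sin(theta)) / np.sqrt(n_ant)

def lobe_codebook(theta_c, omega, bits, n_ant=32):
    """Dedicated sub-codebook for one spatial lobe: 2**bits beams spread
    uniformly over the coverage angle [theta_c - omega/2, theta_c + omega/2]."""
    angles = np.linspace(theta_c - omega / 2, theta_c + omega / 2, 2 ** bits)
    return np.stack([steering_vector(a, n_ant) for a in angles])

# High-resolution sub-codebook on one effective lobe; ineffective
# angular regions get no codewords at all.
cb = lobe_codebook(theta_c=0.3, omega=0.2, bits=6)
print(cb.shape)   # (64, 32): 64 beams covering a 0.2 rad sector
```

Concentrating 6 bits on a 0.2 rad sector gives far finer angular resolution than spreading the same 64 codewords over the full $[-\pi/2, \pi/2]$ range.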

4. Sub-Codebooks in Deep Learning Quantization

In deep neural networks, sub-codebooks are used to adapt to the statistical variability across weight groups or tensor subregions:

  • Fine-Grained Scaling: The SYQ method groups weights in convolutional filters by spatial locality or rows/columns, with each subgroup assigned its own symmetric scaling coefficient—effectively forming a set of sub-codebooks that better match local statistics and minimize gradient mismatch during training (Faraone et al., 2018).
  • Groupwise Codebooks in LLMs: For ultra-low-precision quantization, groupwise non-uniform codebook-based methods divide weights into groups, cluster the histograms of their value distributions, and assign each to a representative sub-codebook optimized for that group's statistics. During inference, dequantization is performed by fetching the appropriate codebook and local scale factor, supporting both high accuracy (e.g., better perplexity) and efficient vectorized computation (Gope et al., 23 Dec 2024).
  • Block Clustered Quantization (BCQ): LO-BCQ iteratively clusters blocks of tensor elements and reoptimizes dedicated sub-codebooks for each cluster using the Lloyd-Max algorithm; this process provides nearly optimal 4-bit quantization of both weights and activations with minimal accuracy loss. The effective bitwidth is $B + \frac{\log_2(N_c)}{L_b} + \frac{B_s}{L_A}$, where $N_c$ is the number of sub-codebooks, $L_b$ is the block size, and $B_s$ is the number of bits for each scale factor, amortized over $L_A$ elements (Elangovan et al., 7 Feb 2025).
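The BCQ bullet above can be made concrete with a rough sketch: cluster blocks by a simple statistic, fit one Lloyd-Max sub-codebook per cluster, and evaluate the effective-bitwidth formula. This is not the full LO-BCQ iterative cluster/re-optimize loop; the clustering criterion and parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd_max(x, levels, iters=30):
    """1-D Lloyd-Max: locally optimal scalar codebook for samples x."""
    c = np.quantile(x, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        a = np.argmin(np.abs(x[:, None] - c[None]), axis=1)
        for j in range(levels):
            if np.any(a == j):
                c[j] = x[a == j].mean()
    return c

B, Lb, Nc = 4, 16, 8               # bits/element, block size, #sub-codebooks
w = rng.standard_normal(64 * Lb).reshape(-1, Lb)   # 64 blocks of 16 elements

# Crude Nc-way clustering of blocks by dynamic range (rank-based split),
# then one dedicated Lloyd-Max sub-codebook per cluster.
spread = np.abs(w).max(axis=1)
labels = np.argsort(np.argsort(spread)) * Nc // len(w)
codebooks = [lloyd_max(w[labels == j].ravel(), 2 ** B) for j in range(Nc)]

# Effective bitwidth: B + log2(Nc)/Lb + Bs/LA (Bs-bit scale per LA elements).
Bs, LA = 8, 64
eff_bits = B + np.log2(Nc) / Lb + Bs / LA
print(f"effective bits/element: {eff_bits:.4f}")
```

With these illustrative parameters the sub-codebook selector costs $\log_2(8)/16 \approx 0.19$ extra bits per element and the scale factor $8/64 = 0.125$, so the total stays close to the nominal 4 bits.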

5. Sub-Codebooks in Representation Learning and Generative Models

Sub-codebooks have emerged as crucial design elements in discrete representation learning and vector quantized architectures for neural generative models:

  • Sub-Codebooks for Latent Partitioning: The MGVQ approach divides the high-dimensional latent embedding produced by an encoder into $G$ sub-tokens, each quantized independently using a dedicated sub-codebook $E_i$. This preserves total latent dimensionality, exponentially increases representational capacity to $K^G$ (for $K$ codewords per sub-codebook), and eases codebook optimization by keeping each sub-codebook small and homogeneous. During training, a nested masking strategy prevents over-reliance on only a subset of sub-codebooks and organizes features in a coarse-to-fine order, fostering robust information allocation (Jia et al., 10 Jul 2025).
  • Global–Local (Dual) Sub-Codebooks: In image modeling, Dual Codebook VQ partitions encodings into a global component—updated via a transformer that coordinates all codewords contextually—and a local component—quantized deterministically. This complementary division, trained jointly from scratch, enables both global structure and local detail to be preserved, achieving improved FID and reconstruction over prior monolithic-codebook or pre-training-dependent approaches (Malidarreh et al., 13 Mar 2025).
  • Multi-Granular Sub-Codebooks in Speech: Segmentation-variant codebooks (SVCs) for speech are constructed over different linguistic units (frames, phones, words, utterances), yielding a set of sub-codebooks specialized for capturing paralinguistic/prosodic information at multiple timescales. Pooling features prior to quantization (pre-pooling) and using KMeans clustering at each granularity enhances transmission of expressive features while controlling bitrate (Sanders et al., 21 May 2025).
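The latent-partitioning scheme in the first bullet reduces to a short routine: split the latent into $G$ sub-tokens and match each against its own sub-codebook. This sketch shows only the partition-and-quantize step; MGVQ's nested masking and training procedure are omitted, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_quantize(z, codebooks):
    """Quantize latent z by splitting it into G sub-tokens, each matched
    against its own dedicated sub-codebook E_i."""
    parts = np.split(z, len(codebooks))          # G sub-tokens
    idx, recon = [], []
    for p, cb in zip(parts, codebooks):
        j = np.argmin(((cb - p) ** 2).sum(-1))   # nearest codeword in E_i
        idx.append(j)
        recon.append(cb[j])
    return idx, np.concatenate(recon)

D, G, K = 64, 8, 256     # latent dim, sub-tokens, codewords per sub-codebook
codebooks = [rng.standard_normal((K, D // G)) for _ in range(G)]

z = rng.standard_normal(D)
idx, z_hat = split_quantize(z, codebooks)
# Joint capacity is K**G codes while each lookup stays a K-way search.
print(len(idx), z_hat.shape)
```

The key trade highlighted in the text is visible here: each sub-codebook holds only $K = 256$ entries, yet the composite code ranges over $256^8$ distinct reconstructions.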

6. Implementation Considerations and System-Level Impact

  • Codebook and Cache Organization for Efficient Inference: Efficient inference with vector quantization, particularly on memory-constrained or bandwidth-limited hardware (e.g., CPUs, GPUs, edge devices), often requires careful cache and memory hierarchy design for codebook access. Hierarchical caching of sub-codebooks in registers, shared memory, and off-chip memory (with adaptive allocation based on access frequency) significantly reduces latency, as demonstrated in VQ-LLM, enabling vector quantization schemes to achieve throughput comparable to or better than element-wise quantization at similar bitwidths (Liu et al., 4 Mar 2025).
  • Decoding, Scalability, and Parallel Processing: In large-scale vector search, approximate decoding using pairwise additive sub-codebooks, fast pre-selection heuristics, and parallelized lookup/combination routines ensures that the representational gains of sub-codebooks do not incur prohibitive computational or memory cost (Vallaeys et al., 6 Jan 2025).
  • Hardware Efficiency: Symmetric and structured sub-codebooks, such as those based on the E8 lattice for blockwise quantization or designed for efficient SIMD unpacking, enable high computational density and low dequantization overhead—key requirements for deploying quantized models in real-time applications (Tseng et al., 6 Feb 2024, Gope et al., 23 Dec 2024).

7. Limitations, Trade-offs, and Future Directions

The use of sub-codebooks introduces several design considerations and trade-offs:

  • Overhead vs. Flexibility: Storing per-block or per-group sub-codebook selectors and scaling factors increases overhead, sometimes by ~0.5 bits per scalar, but is generally offset by marked improvements in quantization error and inference accuracy (Elangovan et al., 7 Feb 2025).
  • Scalability: Increasing the number of sub-codebooks enlarges the representational capacity and may ease optimization, but also increases parameter management complexity and necessitates careful hardware-aware tuning to retain throughput gains (Liu et al., 4 Mar 2025, Jia et al., 10 Jul 2025).
  • Joint vs. Independent Modeling: Joint (e.g., pairwise) sub-codebook decoding better captures statistical dependencies but adds complexity in encoding/decoding; thus, practical systems strike a balance between speed and accuracy by selecting pre-selection or beam-search heuristics (Vallaeys et al., 6 Jan 2025).

Future work will likely focus on adaptive partitioning strategies (potentially data-driven or neural network–controlled), dynamic codebook assignment, and domain-specific exploitation of multi-granular sub-codebook hierarchies. Such developments aim to further bridge the performance gap between discrete and continuous coding, maximize hardware and communication efficiency, and extend quantization theory and methodology to new data modalities and application domains.
