
Non-Uniform KV Cache Quantization

Updated 28 February 2026
  • Non-Uniform KV Cache Quantization is a method that dynamically adjusts bit allocation across channels, tokens, or layers to compress key-value caches in attention-based models.
  • It uses statistical structure and sensitivity metrics to guide precision allocation, effectively reducing memory footprint and computation without sacrificing model performance.
  • Practical implementations integrate mixed-bit assignment, tailored codebook construction, and hardware-aware cache layouts to achieve ultra-low-precision storage with minimal degradation.

Non-uniform KV cache quantization refers to a family of techniques for compressing the key-value (KV) cache in attention-based sequence models, such as LLMs, by allocating precision and compressing activations in a data-dependent, structure-aware, or hardware-aware manner. Rather than applying a uniform scalar quantizer (same bit-width, codebook, or quantization function across all elements), non-uniform methods adapt quantization parameters—bit-width, codebook, block size—spatially (across channels, groups, or positions), semantically (according to sensitivity or usage), or hierarchically (with cross-layer or multi-stage schemes). These approaches improve the trade-off between memory efficiency, computational efficiency, and output quality, enabling ultra-low-precision (≤2–3 bits) cache storage with negligible or zero generation degradation.

1. Principles and Motivation

The motivation for non-uniform quantization of the KV cache arises from several empirical and theoretical observations about cache statistics, model sensitivity, and system constraints:

  • Distributional Structure: KV activations exhibit blockwise, channelwise, or groupwise statistical structure, often closely approximated by normal distributions within blocks (Cai et al., 22 May 2025), heavy-tailed outlier behavior in small sub-tensors (Liu et al., 2024), or strong inter-channel dependencies (Zhang et al., 2024).
  • Sensitivity Heterogeneity: The importance of different channels, layers, or tokens to model output is highly non-uniform. Key cache activations tend to be more sensitive to quantization error than value cache, and only specific layers or positions are responsible for most of the accuracy loss under low-precision quantization (Tao et al., 2024, Zhang et al., 22 Dec 2025).
  • Bandwidth and Runtime Constraints: Uniform quantization incurs substantial memory and compute overhead, especially in IO-bound serving environments. Non-uniform, block-adaptive, or hardware-aligned schemes can minimize both storage and memory movement by maximizing the reuse of quantization parameters and aligning the quantization layout with the GPU memory and compute hierarchy (Hosseini et al., 26 Feb 2026).
  • Empirical Pareto Optimality: Non-uniform quantization unlocks much better Pareto frontiers of memory, error, throughput, and end-task accuracy, with >4–10× KV cache compression and <1% accuracy loss across a wide variety of tasks and model scales (Yang et al., 13 Oct 2025, Zhang et al., 22 Dec 2025, Xia et al., 23 Nov 2025, Chen et al., 23 May 2025).

2. Methodological Taxonomy

Non-uniform KV cache quantization encompasses a spectrum of techniques, which can be grouped into the following broad families:

Channel-/Token-/Layer-Adaptive Bit Allocation

Methods such as AsymKV (Tao et al., 2024), XQuant (Yang et al., 13 Oct 2025), MixKVQ (Zhang et al., 22 Dec 2025), Kitty (Xia et al., 23 Nov 2025), and QAQ (Dong et al., 2024) allocate bit-widths or quantization schemes heterogeneously across layers, channels, or tokens:

  • Asymmetric layerwise allocation: AsymKV quantizes early (more sensitive) layers’ keys at higher bit-width (often 2–4 bit), and later layers at 1 bit; values are often quantized entirely at 1 bit. Bit allocation is selected via grid search under accuracy constraints (Tao et al., 2024).
  • Salience- or sensitivity-guided allocation: MixKVQ and Kitty score channels by metrics such as scaling factor, average absolute query activation, or empirical loss gradient, promoting only a small subset of channels to higher precision (e.g., 4 bit or BF16), with the majority at 2 bit (Zhang et al., 22 Dec 2025, Xia et al., 23 Nov 2025).
  • Cross-layer mixed bit-width: XQuant enables 1–2 bit quantization in most layers, using empirical calibration to guarantee overall average bit-width (as low as 1.38–1.4 bits), combined with cross-layer grouped sharing (Yang et al., 13 Oct 2025).
  • Per-token or per-vector adaptivity: QAQ dynamically selects the per-token bit-width, based on attention strength and variance propagation—guaranteeing overall bounded effect on the attention softmax (Dong et al., 2024).
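The per-token adaptivity above can be sketched in a few lines. The following is a simplified illustration, not QAQ's actual criterion: it assumes one recent attention weight per cached token and uses a plain min-max uniform quantizer per token, promoting only the most-attended fraction of tokens to higher precision.

```python
def quantize_dequantize(vec, bits):
    """Min-max uniform quantize/dequantize of one token's vector at `bits` precision."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / scale) * scale for x in vec]

def per_token_mixed_bits(kv_tokens, attn_scores, hi_bits=4, lo_bits=2, top_frac=0.25):
    """Assign hi_bits to the most-attended tokens and lo_bits to the rest.
    `attn_scores` holds one recent attention weight per cached token."""
    k = max(1, int(len(kv_tokens) * top_frac))
    cutoff = sorted(attn_scores, reverse=True)[k - 1]
    out = []
    for tok, a in zip(kv_tokens, attn_scores):
        bits = hi_bits if a >= cutoff else lo_bits
        out.append((bits, quantize_dequantize(tok, bits)))
    return out
```

In a real system the bit decision would also account for variance propagation through the softmax, as QAQ does; this sketch only captures the attention-guided precision split.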

Outlier and Importance-Aware Quantization

  • Dense-and-sparse splitting: Tools such as KVQuant (Hooper et al., 2024) and OTT (Su et al., 16 May 2025) explicitly identify per-token or per-channel outliers, storing them at high precision in a sparse side array, while compressing the dense bulk using non-uniform or uniform quantization. This sharply reduces quantization range and associated error (Su et al., 16 May 2025, Hooper et al., 2024).
  • Attention-based filtering: LogQuant (Chen et al., 25 Mar 2025) implements a log-distributed filtering scheme to maintain full-precision for a geometrically spaced subset of tokens likely to be attended in the future, and 2-bit for the rest.
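The dense-and-sparse split admits a compact sketch. The percentile threshold and list/dict layout below are illustrative simplifications of what KVQuant and OTT implement with packed sparse formats and fused kernels; the key effect, shrinking the dense quantization range by excluding outliers, is the same.

```python
def dense_sparse_split(values, outlier_frac=0.02):
    """Split a flat activation list into a dense bulk (to be quantized at low
    precision) and a sparse dict of full-precision outliers, chosen by
    magnitude percentile."""
    n_out = max(1, int(len(values) * outlier_frac))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    sparse = {i: values[i] for i in order[:n_out]}          # kept at full precision
    dense = [0.0 if i in sparse else v for i, v in enumerate(values)]
    return dense, sparse

def merge(dense_dequant, sparse):
    """Re-insert the full-precision outliers after dequantizing the dense bulk."""
    out = list(dense_dequant)
    for i, v in sparse.items():
        out[i] = v
    return out
```

After the split, the dense part's min-max range is set by the largest non-outlier, so the same bit budget yields a much finer quantization step.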

Block-/Group-/Vector-Quantized Methods

  • Block-/sub-block statistical adaptation: NQKV (Cai et al., 22 May 2025) and NSNQuant (Son et al., 23 May 2025) adapt quantization bins or global codebooks based on local block statistics (e.g., quantile bins matched to intra-block Gaussian statistics), enabling globally shared codebooks and robust out-of-distribution generalization.
  • Vector-based quantization: CommVQ (Li et al., 23 Jun 2025), CQ (Zhang et al., 2024), and residual VQ methods (Kumar, 2024) jointly quantize groups of channels within a vector, leveraging codebooks trained via k-means, EM, or exponential moving average, and account for inter-channel dependencies that lower the required entropy budget at low bits.
  • Polar and groupwise transforms: PolarQuant (Han et al., 4 Feb 2025, Wu et al., 1 Feb 2025) applies random preconditioning and polar decomposition to recast quantization into angle-space, where non-uniform scalar quantization is applied to statistically concentrated distributions, eliminating normalization overhead and achieving competitive reconstruction error at very low bit-widths.
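A minimal sketch of per-block quantile binning in the spirit of NQKV, assuming each block is roughly Gaussian: the levels are the centers of equal-probability bins of a standard normal, so one global codebook is shared across all blocks and only a per-block mean and standard deviation are stored.

```python
from statistics import NormalDist, mean, pstdev

def gaussian_quantile_codebook(bits):
    """Reconstruction levels at the centers of equal-probability bins of a
    standard normal: level_j = Phi^{-1}((j + 0.5) / 2^bits)."""
    n = 1 << bits
    nd = NormalDist()
    return [nd.inv_cdf((j + 0.5) / n) for j in range(n)]

def quantize_block(block, codebook):
    """Standardize the block, then map each value to the nearest codebook level."""
    mu = mean(block)
    sigma = pstdev(block) or 1.0
    idxs = []
    for x in block:
        z = (x - mu) / sigma
        idxs.append(min(range(len(codebook)), key=lambda j: abs(z - codebook[j])))
    return idxs, mu, sigma

def dequantize_block(idxs, mu, sigma, codebook):
    return [mu + sigma * codebook[j] for j in idxs]
```

Because the codebook depends only on the assumed standard-normal prior, it generalizes across blocks, layers, and models without calibration data, which is the property NQKV exploits.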

Decomposition-Driven and Hierarchical Approaches

  • Tensor decomposition: DecoQuant (Liu et al., 2024) applies matrix product operator (MPO) decomposition to partition the cache into nearly outlier-free large blocks (assigned low bits) and small, outlier-heavy sub-tensors (assigned full precision).
  • Multistage hierarchical quantization: Titanus (Chen et al., 23 May 2025) adds hierarchical quantization extension (HQE), where each channel maintains several quantization levels, expanding only when out-of-range values appear, thereby avoiding full cache requantization and maintaining negligible storage overhead.
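The expand-on-demand idea can be illustrated with a loose sketch (the real HQE operates on packed on-chip formats with per-channel metadata): each stored code records the range level under which it was quantized, and a wider level is appended only when an out-of-range value arrives, so earlier codes never need requantization.

```python
class HierChannelQuant:
    """Toy per-channel hierarchical quantizer: codes remember their range
    level, and a new, wider level is appended on out-of-range values, so the
    cached history is never rewritten."""
    def __init__(self, init_range=1.0, bits=4):
        self.levels = [init_range]        # max-abs range per level
        self.bits = bits
        self.codes = []                   # (level_idx, int_code) per value

    def add(self, x):
        if abs(x) > self.levels[-1]:
            self.levels.append(abs(x))    # expand: append a wider level
        lvl = next(i for i, r in enumerate(self.levels) if abs(x) <= r)
        qmax = (1 << (self.bits - 1)) - 1
        self.codes.append((lvl, round(x / self.levels[lvl] * qmax)))

    def dequantize(self):
        qmax = (1 << (self.bits - 1)) - 1
        return [self.levels[lvl] * c / qmax for lvl, c in self.codes]
```

The storage overhead is one small level table per channel plus a level index per code, rather than a cache-wide rescan when the dynamic range grows.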

Cross-Layer and Cross-Group Compression

  • Layer-sharing and group codebooks: XQuant implements cross-layer compression by grouping adjacent layers and sharing the quantized integer cache among them, storing only small per-layer scale/zero-point metadata (Yang et al., 13 Oct 2025).
  • Codebook commutativity and hardware fusion: CommVQ and InnerQ (Hosseini et al., 26 Feb 2026) design codebooks and cache layout to be compatible with rotary position embeddings and hardware memory access patterns, enabling efficient decoding and maximal scale-factor reuse.
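Cross-layer sharing can be illustrated with a toy sketch, assuming (as the cross-layer methods observe empirically) that adjacent layers' caches are approximately affine images of each other: one integer cache is stored for the whole group, and each member layer keeps only two scalars of metadata. The helper names below are hypothetical.

```python
def quantize_uniform(vals, bits=4):
    """Reference-layer quantization: return integer codes plus (scale, zero)."""
    lo, hi = min(vals), max(vals)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) for x in vals], scale, lo

def fit_layer_metadata(shared_codes, layer_vals):
    """Least-squares (scale, zero) so that scale*codes + zero approximates
    this layer's own activations; only these two scalars are stored per layer."""
    n = len(shared_codes)
    mc = sum(shared_codes) / n
    mv = sum(layer_vals) / n
    var = sum((c - mc) ** 2 for c in shared_codes) or 1.0
    cov = sum((c - mc) * (v - mv) for c, v in zip(shared_codes, layer_vals))
    scale = cov / var
    return scale, mv - scale * mc
```

If a grouped layer's cache is close to an affine transform of the reference layer's, the shared integer codes reconstruct it well, and the per-layer cost drops to two scalars plus the shared code array.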

3. Formalization and Practical Algorithms

The implementation of non-uniform KV cache quantization typically involves the following algorithmic primitives:

Mixed-Bit or Importance-Based Bit Assignment

For each quantization unit (channel, block, token, layer), the system computes a precision requirement from one or more estimated or measured metrics, such as channel scale factors, attention magnitude, or Fisher information. Bit allocation is then solved via greedy search, grid search, or explicit optimization to maximize compression under an end-to-end tolerance constraint.
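A hedged sketch of greedy bit assignment under an average-bit budget, assuming a scalar sensitivity score per unit; the `choices` ladder and promotion rule are illustrative, not any specific paper's search procedure.

```python
def allocate_bits(sensitivity, avg_budget, choices=(1, 2, 4)):
    """Greedy mixed-bit assignment: start every unit at the lowest precision
    in `choices`, then repeatedly promote the most sensitive promotable unit
    one rung while the average-bit budget allows it."""
    n = len(sensitivity)
    bits = [choices[0]] * n
    budget = avg_budget * n - sum(bits)   # bit-slack left to distribute
    while True:
        # Units that can move up one rung within the remaining budget.
        cands = [i for i in range(n)
                 if choices.index(bits[i]) + 1 < len(choices)
                 and choices[choices.index(bits[i]) + 1] - bits[i] <= budget]
        if not cands:
            return bits
        i = max(cands, key=lambda i: sensitivity[i])
        nxt = choices[choices.index(bits[i]) + 1]
        budget -= nxt - bits[i]
        bits[i] = nxt
```

With sensitivities `[5, 1, 1, 1]` and a 2-bit average budget, the most sensitive unit is driven to 4 bits before any other unit is promoted, yielding the kind of asymmetric allocation described above.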

Codebook Construction (Non-Uniform Quantization Levels)

  • Weighted or importance-aware k-means: Used to build non-uniform quantization levels that minimize sensitivity-weighted error (KVQuant (Hooper et al., 2024), CQ (Zhang et al., 2024), CommVQ (Li et al., 23 Jun 2025)).
  • Quantile-based (information-theoretic) bins: Per-block quantile thresholds, guaranteeing each bin contains equal probability under a known or assumed distribution, achieving optimal scalar quantization under the assumed block PDF (NQKV (Cai et al., 22 May 2025)).
  • Global/shared codebooks with structural transforms: NSNQuant aligns all sub-vectors to a standard normal prior (via double normalization and Hadamard), enabling robust, calibration-free coding (Son et al., 23 May 2025).
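For scalar codebooks, the weighted k-means construction reduces to 1-D Lloyd iterations with sensitivity-weighted centroid updates. A pure-Python sketch follows; real systems run this on sampled calibration activations with Fisher-information weights.

```python
def weighted_kmeans_1d(values, weights, k, iters=25):
    """1-D Lloyd iterations for a non-uniform codebook: centroids minimize
    the weighted squared error  sum_i w_i * (x_i - c(x_i))^2."""
    lo, hi = min(values), max(values)
    # Initialize centroids evenly over the data range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        wsum = [0.0] * k
        for x, w in zip(values, weights):
            j = min(range(k), key=lambda j: abs(x - centroids[j]))
            sums[j] += w * x
            wsum[j] += w
        centroids = [sums[j] / wsum[j] if wsum[j] else centroids[j]
                     for j in range(k)]
    return sorted(centroids)
```

Raising a value's weight pulls its cluster's centroid toward it, so sensitive activations get finer reconstruction at the expense of insensitive ones.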

Decomposition/Splitting and Hierarchical Quantization

  • Tensor decomposition: Partitioning the cache into large/small sub-tensors with different bit assignments (DecoQuant (Liu et al., 2024)).
  • Hierarchical multi-level schemes: Maintaining per-channel quantization hierarchies that grow if out-of-range values are observed, avoiding cache-wide requantization (Titanus (Chen et al., 23 May 2025)).

Outlier Handling

  • Dense/sparse hybridization: Outliers above a percentile threshold are processed and stored separately at higher precision; this may be done per channel, token, vector, or group (Hooper et al., 2024, Su et al., 16 May 2025).

Efficient Data Movement and Hardware Alignment

Quantization layouts and metadata are co-designed with the memory hierarchy and kernel fusion so that dequantization adds minimal bandwidth; system-level details are covered in Section 6.
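As a hedged illustration of the grouping idea (the real InnerQ kernels fuse dequantization into GEMMs on packed GPU layouts), one (scale, zero) pair can be shared per contiguous group of inner-dimension channels, so metadata loads amortize over the whole group.

```python
def quantize_grouped(row, group_size=32, bits=4):
    """Quantize a head-dim row in contiguous groups of `group_size` channels,
    sharing one (scale, zero) pair per group so metadata loads amortize over
    the group during dequantization."""
    levels = (1 << bits) - 1
    packed = []
    for g in range(0, len(row), group_size):
        grp = row[g:g + group_size]
        lo, hi = min(grp), max(grp)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((x - lo) / scale) for x in grp]
        packed.append((codes, scale, lo))
    return packed

def dequantize_grouped(packed):
    """One multiply-add per element, with metadata fetched once per group."""
    out = []
    for codes, scale, lo in packed:
        out.extend(lo + c * scale for c in codes)
    return out
```

Aligning `group_size` with the hardware's memory/register block size is what allows the dequantization to be fused into the attention GEMM with minimal extra traffic.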

4. Empirical Results and Comparative Performance

Multiple papers demonstrate quantitative improvements in memory footprint, throughput, and model quality using non-uniform KV cache quantization:

| Method | Bit Allocation | KV Compression | Quality Drop | Notable Results |
| --- | --- | --- | --- | --- |
| Titanus (Chen et al., 23 May 2025) | Per-channel mixed 2/3/4/8 | ≥50× (off-chip) | Negligible | 49.6× GPU throughput, 159.9× energy efficiency, 58.9% KV traffic reduction |
| XQuant (Yang et al., 13 Oct 2025) | 1–2 bit, cross-layer | >90% | +0.3–0.5 BLEU | B_eq = 1.38 bit; outperforms KIVI/AsymKV at lower bit-widths |
| Kitty (Xia et al., 23 Nov 2025) | Channelwise, dynamic | ~8× | <1 pp | 2.1–4.1× throughput gain, 8× batch at same budget, <4 pp drop at 32K context |
| MixKVQ (Zhang et al., 22 Dec 2025) | Query-aware: BF16/4/2 | 4–8× | <1–2 pp | Closes within 1–2% of BF16 on reasoning; >2× throughput |
| NQKV (Cai et al., 22 May 2025) | Per-block quantile | 4× | <0.3% | End-to-end throughput +9.3×; allows 2–4× larger batch/context |
| DecoQuant (Liu et al., 2024) | MPO, block non-uniform | ~75% | <0.5 pp | 1.25× throughput, ≤2% overhead, no calibration data |
| CQ (Zhang et al., 2024) | Groupwise vector quantization | 1–2 bit vs. FP16 | Matches/≥ FP16 | Stable at 1 bit/channel, outperforms uniform at 2 bits |
| InnerQ (Hosseini et al., 26 Feb 2026) | Groupwise, hardware-aligned | up to 88% | Negligible | Up to 22% speedup vs. outer grouping, 88% vs. FP16; max device throughput |

5. Theoretical Insights and Error Analysis

Analytical foundations underpinning non-uniform quantization efficacy include:

  • Information-theoretic optimality: When the activation distribution is known (e.g., Gaussian per block), quantile-based (i.e., non-uniform) scalar quantization achieves the minimum mean-squared error for a given bit budget (Cai et al., 22 May 2025).
  • Fisher-weighted distortion minimization: By weighting quantization codebook assignment by sensitivity (Fisher information), overall loss impact is minimized—a formal demonstration in the KVQuant approach shows empirical loss closely tracks theory (Hooper et al., 2024).
  • Amplification of key errors: Theoretical sensitivity analysis demonstrates that quantization error on key cache propagates exponentially through the attention softmax, justifying higher-precision allocation or outlier tracking for keys versus values (Tao et al., 2024, Zhang et al., 22 Dec 2025, Dong et al., 2024).
  • Structural commutativity with RoPE: Methods such as CommVQ enforce block-diagonal commutative codes so vector quantization remains compatible with rotary embedding transforms, avoiding extra computation and error (Li et al., 23 Jun 2025).
  • Error bounding via quantization intervals: Progressive residual and hierarchical approaches guarantee bounded error at each stage, such that compound MSE drops rapidly with stage count or quantization refinement (Chen et al., 23 May 2025, Xi et al., 3 Feb 2026).
  • Empirical attention spike distribution: Logarithmic sparsity in attention spike positions justifies log-distributed allocation of high-precision slots, minimizing accuracy penalty at fixed memory (Chen et al., 25 Mar 2025).
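The information-theoretic point above can be checked numerically: on Gaussian samples at the same 2-bit budget, equal-probability (quantile) levels yield lower mean-squared error than min-max uniform levels, because uniform quantization wastes levels on the sparsely populated tails. A small self-contained experiment (illustrative only):

```python
import random
from statistics import NormalDist

def uniform_quantize(xs, bits):
    """Min-max uniform quantizer: evenly spaced levels over [min, max]."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

def quantile_quantize(xs, bits):
    """Non-uniform quantizer: map each standardized value to the nearest of
    2^bits equal-probability Gaussian quantile-centered levels."""
    nd = NormalDist()
    n = 1 << bits
    levels = [nd.inv_cdf((j + 0.5) / n) for j in range(n)]
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [mu + sd * min(levels, key=lambda c: abs((x - mu) / sd - c))
            for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(7)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
u = mse(xs, uniform_quantize(xs, 2))
q = mse(xs, quantile_quantize(xs, 2))
```

On this synthetic Gaussian sample, the quantile quantizer's MSE comes in below the uniform one's at the same bit budget, matching the optimality argument from (Cai et al., 22 May 2025).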

6. Hardware and System Integration

Practical success of non-uniform KV quantization in LLM deployment hinges on careful co-design with hardware constraints, memory hierarchy, and inference pipelines:

  • Groupwise parameter reuse: By grouping the inner dimension and aligning the cache layout with GPU memory/register blocks, as in InnerQ, dequantization can be fused with GEMMs with minimal additional bandwidth, yielding up to 22% speedup over prior outer-grouped approaches (Hosseini et al., 26 Feb 2026).
  • Sparse mask packing and data movement: Titanus combines aggressive masked pruning with non-uniform quantization; only non-zeros, along with compact masks, are transferred, slashing off-chip traffic by over 50% (Chen et al., 23 May 2025).
  • Page-centric and fusion-friendly layouts: Page-based design (as in Kitty (Xia et al., 23 Nov 2025)) enables efficient, divergence-free, single-kernel dequantization on GPUs, even in the presence of mixed precision.
  • Universal or shared codebooks: Calibration-free codebook approaches (NSNQuant (Son et al., 23 May 2025)) enable robust cross-task deployment without per-model statistics, as the double normalization guarantees invariance.

7. Limitations, Trade-offs, and Open Challenges

Despite state-of-the-art performance, non-uniform quantization approaches entail several trade-offs and limitations:

  • Calibration and Adaptivity: Some methods require offline calibration (e.g., Fisher weighting, codebook k-means), which may degrade under distribution shift. Calibration-free transforms (NSNQuant, NQKV) address this at small additional compute cost (Son et al., 23 May 2025, Cai et al., 22 May 2025).
  • Overhead of Metadata: Block-level or per-channel quantization requires storage of scale/zero-point per group; very small blocks incur metadata dominance. Vector quantization methods must store codebooks (though these are typically negligible, <1% of total footprint).
  • Outlier Storage Management: Tracking, re-insertion, and exclusion of outliers must be carefully balanced to avoid erasing the memory/latency gains.
  • Complexity of Software Pipelines: Optimal speedups require tight coupling of quantization, cache layout, and fused kernel design (e.g., in Triton or CUDA), complicating integration.
  • Extreme Low Bit-Width Stability: At ultra-low bits (<2), error sensitivity increases, and further advances in codebook design, error compensation, or hybridization with eviction might be necessary (Zhang et al., 2024).
  • Generalization Across Modalities: Video (QVG (Xi et al., 3 Feb 2026)), speech, and multimodal models require domain-specific quantization that preserves cross-frame or cross-token correlations.

In summary, non-uniform KV cache quantization has matured into a diverse ecosystem of adaptive, structure-aware, and hardware-aligned algorithms, consistently achieving several-fold compression and throughput gains in long-context and reasoning-centric LLMs while sustaining, or even enhancing, practical deployment quality (Chen et al., 23 May 2025, Yang et al., 13 Oct 2025, Zhang et al., 22 Dec 2025). The field continues to evolve with ongoing research into further adaptation, universality, and integration with next-generation inference hardware.
