
KV-Cache Quantization Techniques

Updated 21 February 2026
  • KV-cache quantization is a family of techniques that compress the key and value tensors in LLM attention caches to ultra-low bit-widths (1–4 bits), reducing memory use and increasing throughput.
  • Techniques include affine/uniform quantization, mixed-precision adaptations, vector quantization, and adaptive token preservation, balancing compression gains with minimal accuracy loss.
  • Advanced methods such as polar transformations and subspace-orthogonal quantization enable significant memory reduction (up to 8×) while preserving accuracy on diverse language tasks.

KV-cache quantization refers to the family of techniques for compressing the key and value tensors stored in the attention “KV cache” during autoregressive inference in LLMs. The KV cache, which grows linearly with context length, batch size, layer depth, and model width, rapidly becomes a primary memory and efficiency bottleneck. Quantizing the cache to ultra-low bit-widths (1–4 bits) enables large gains—longer context support, greater batch sizes, and faster decoding—with minimal impact on downstream quality when executed with carefully engineered algorithms.
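The linear growth described above can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model dimensions are assumptions chosen for round numbers, not figures from any cited paper, and quantization metadata (scales, zero-points, codebooks) is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Bytes needed for keys plus values, ignoring scale/zero-point metadata."""
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys and values
    return elements * bits / 8

# Hypothetical configuration, chosen for illustration only.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=8)
fp16 = kv_cache_bytes(bits=16, **cfg)
int2 = kv_cache_bytes(bits=2, **cfg)
print(f"FP16: {fp16 / 2**30:.1f} GiB, 2-bit: {int2 / 2**30:.1f} GiB")
# prints "FP16: 32.0 GiB, 2-bit: 4.0 GiB"
```

Under these assumptions the cache shrinks by the ratio of bit-widths (16/2 = 8×); real methods fall somewhat short of this because of the metadata they must store alongside the quantized values.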

1. Quantization Fundamentals and Baseline Strategies

The canonical approach to KV cache quantization is affine or uniform quantization, which replaces each floating-point value with a small integer plus a group-dependent scale and zero-point. At bit-width $b$, the quantization is

$$q_i = \operatorname{clip}\left( \left\lfloor \frac{x_i - z}{s} \right\rceil, 0, 2^b-1 \right), \quad \hat x_i = q_i s + z$$

with $s = (\max(x) - \min(x))/(2^b-1)$ and $z = \min(x)$. Quantization may be applied per channel (across tokens), per token (across channels), or per group. Empirically, keys are best quantized per channel and values per token due to their differing distributional properties (Su et al., 16 May 2025). Groupwise quantization mitigates outlier distortion but may incur significant parameter overhead due to many required scale/zero-point arrays (He et al., 2024). Extremely low bit-width scalar quantization (e.g., 1–2 bits) typically leads to sharp accuracy drops unless additional structure or adaptation is introduced (Zhang et al., 2024, Tao et al., 2024).
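A minimal NumPy sketch of the affine quantizer defined above, comparing per-channel and per-token statistics. The function names, the synthetic data, and the single high-magnitude channel are assumptions made for illustration; `np.rint` rounds ties to even, which differs from round-half-up only at exact midpoints.

```python
import numpy as np

def affine_quantize(x, bits, axis):
    """Min/max affine quantization; statistics are shared along `axis`."""
    levels = 2**bits - 1
    z = x.min(axis=axis, keepdims=True)                  # zero-point z = min(x)
    s = (x.max(axis=axis, keepdims=True) - z) / levels   # scale s
    s = np.where(s == 0, 1.0, s)                         # guard against constant groups
    q = np.clip(np.rint((x - z) / s), 0, levels).astype(np.uint8)
    return q, s, z

def affine_dequantize(q, s, z):
    """Reconstruct x_hat = q * s + z."""
    return q * s + z

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 64))   # (tokens, channels), synthetic stand-in for a key cache
keys[:, 3] *= 20.0                 # assumed pattern: one high-magnitude channel

errors = {}
for name, axis in [("per-channel", 0), ("per-token", 1)]:
    q, s, z = affine_quantize(keys, bits=4, axis=axis)
    errors[name] = np.abs(affine_dequantize(q, s, z) - keys).mean()
    print(f"{name}: mean abs error {errors[name]:.4f}")
```

With `axis=0` each channel gets its own scale and zero-point, so the large channel does not stretch the quantization range of the others. With `axis=1` every token's range is dominated by that one channel, which is the kind of effect that motivates per-channel treatment of keys.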

Vector quantization, which jointly quantizes blocks of $d$-dimensional vectors, further reduces the number of parameters, leverages inter-channel correlation, and can dramatically increase compression ratio with minimal quality loss, as shown in approaches like Coupled Quantization (CQ) (Zhang et al., 2024), Residual Vector Quantization (RVQ) (Kumar, 2024), and NSNQuant (Son et al., 23 May 2025).
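The idea of quantizing several channels jointly can be sketched with a plain k-means codebook. This is a generic illustration, not the training procedure of CQ, RVQ, or NSNQuant; the function names, the synthetic correlated data, and the choice of 16 codes over 2 channels are all assumptions.

```python
import numpy as np

def encode(vectors, codebook):
    """Index of the nearest codeword under squared Euclidean distance."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_codebook(vectors, num_codes, iters=20, seed=0):
    """Plain Lloyd (k-means) iterations; returns a (num_codes, d) codebook."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), num_codes, replace=False)].copy()
    for _ in range(iters):
        codes = encode(vectors, codebook)
        for k in range(num_codes):
            members = vectors[codes == k]
            if len(members):                  # keep the old centroid if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(0)
base = rng.normal(size=(4096, 1))
# Two strongly correlated channels (synthetic, for illustration).
pairs = np.hstack([base, 0.9 * base + 0.1 * rng.normal(size=(4096, 1))])

codebook = fit_codebook(pairs, num_codes=16)
codes = encode(pairs, codebook)               # one 4-bit index per 2-channel vector
recon = codebook[codes]
bits_per_channel = np.log2(len(codebook)) / pairs.shape[1]
mse = ((recon - pairs) ** 2).mean()
print(f"{bits_per_channel:.1f} bits/channel, MSE {mse:.4f}")
```

Each pair of channels is stored as a single 4-bit index, that is, 2 bits per channel, and the only side information is one shared codebook rather than per-group scales and zero-points.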

2. Mixed-Precision, Channel-/Token-Aware, and Outlier-Resilient Methods

Several methods now adapt the quantization precision (i.e., number of bits) either by layer, channel, or token to maximize compression while controlling accuracy loss:

  • Dynamic Channel-wise Precision Boost: Kitty augments a baseline 2-bit channelwise quantizer by identifying the most error-sensitive channels (via simple magnitude heuristics or explicit MSE measurements on the attention logits) and boosting only a small fraction (e.g., 12.5–25%) to 4 bits. This achieves up to 8× memory compression with negligible accuracy loss and allows 8× larger batches (Xia et al., 23 Nov 2025).
  • Layerwise Asymmetric Quantization: AsymKV observes that model output is more sensitive to errors in key quantization than value quantization. By quantizing keys at higher precision in early layers and using lower precision in later layers and for values, up to 75% of the KV cache can be stored at 1 bit without significant degradation (Tao et al., 2024).
  • Mixed-Precision with Gradient-based Layer Importance: KVmix assigns higher bit-widths to layers with greater loss sensitivity (as measured by the gradients of the key and value projection weights), and leverages a recent-pivotal-context (RPC) buffer to keep the most recent tokens at full precision (Li et al., 18 May 2025).
  • Token/Channel Outlier Awareness: Outlier Tokens Tracing (OTT) dynamically identifies, during decoding, tokens whose presence would inflate channel quantization range and disproportionately degrade accuracy. These tokens are excluded from quantization and stored in a small side buffer, improving 2-bit KV quantization accuracy by 3–8 points absolute while preserving throughput gains (Su et al., 16 May 2025). ZipCache proposes token “saliency” scored via normalized attention and preserves high-saliency tokens at higher precision (He et al., 2024).
| Method | Adaptivity | Principle | Bit-width/Compression | Notable Results |
|---|---|---|---|---|
| Kitty (Xia et al., 23 Nov 2025) | Channelwise | Sensitivity | 2b+ (w/ 4b boost) | 8× memory cut, <1pt loss |
| AsymKV (Tao et al., 2024) | Layerwise | Structural role | 1–2b (layer-adaptive) | 75% at 1b, negligible degradation |
| KVmix (Li et al., 18 May 2025) | Layerwise | Gradients | 2.19–2.38b | 4.9× compression, minimal loss |
| OTT (Su et al., 16 May 2025) | Tokenwise | Outlier tokens | 2b (plus pool) | 6.4× memory, closes accuracy gap |
| ZipCache (He et al., 2024) | Tokenwise | Saliency score | 4b/2b hybrid | <0.5% loss at 5× compression |
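To illustrate the channel-wise precision boost idea in general terms, the sketch below quantizes every channel at 2 bits and then re-quantizes the widest-range channels at 4 bits. It is not Kitty's actual selection rule or kernel layout: the range-based heuristic, the function names, and the synthetic data are assumptions.

```python
import numpy as np

def fake_quant(x, bits):
    """Quantize then dequantize each column (channel) with min/max affine quantization."""
    levels = 2**bits - 1
    z = x.min(axis=0, keepdims=True)
    s = (x.max(axis=0, keepdims=True) - z) / levels
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.rint((x - z) / s), 0, levels) * s + z

def boosted_quant(x, low_bits=2, high_bits=4, boost_frac=0.125):
    """Give the widest-range channels `high_bits`; all others get `low_bits`."""
    n_boost = max(1, int(round(boost_frac * x.shape[1])))
    ranges = x.max(axis=0) - x.min(axis=0)           # simple magnitude heuristic (assumed)
    boosted = np.argsort(ranges)[-n_boost:]
    out = fake_quant(x, low_bits)
    out[:, boosted] = fake_quant(x[:, boosted], high_bits)
    avg_bits = low_bits + (high_bits - low_bits) * n_boost / x.shape[1]
    return out, avg_bits

rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64))        # (tokens, channels), synthetic
keys[:, :8] *= 10.0                      # assumed: 8 of 64 channels have a large range

mixed, avg_bits = boosted_quant(keys)
mse_mixed = ((mixed - keys) ** 2).mean()
mse_plain = ((fake_quant(keys, 2) - keys) ** 2).mean()
print(f"avg bits {avg_bits:.2f}, MSE 2-bit {mse_plain:.3f} -> mixed {mse_mixed:.3f}")
```

Boosting 12.5% of channels raises the average from 2 to 2.25 bits per element before metadata, so most of the compression is retained while the channels that contribute the largest absolute error receive finer resolution.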

3. Vector, Polar, and Subspace Quantization Approaches

Modern vector quantization approaches exploit nontrivial statistics of the KV representations:

  • Coupled Quantization (CQ): Groups multiple correlated channels and learns a codebook across the group, enabling quantization to as low as 1 bit per channel with only moderate accuracy loss (Zhang et al., 2024).
  • Residual Vector Quantization: Successively encodes the residual error using multiple codebooks. With depth 8, it can reach 5.5× compression relative to FP16 and nearly preserve model performance; noncontiguous grouping of channels yields the best trade-off for keys (Kumar, 2024).
  • Hadamard/PCA-based Rotation: NSNQuant and KVLinC apply a Hadamard (or similar orthogonal) rotation to standardize/whiten KV vectors, then perform blockwise VQ. NSNQuant’s two-step normalization followed by a Hadamard transform enables calibration-free, single-codebook quantization robust to distribution shift (Son et al., 23 May 2025). KVLinC combines this for values with learned linear-correction adapters for keys (Saxena et al., 6 Oct 2025).
  • Polar and Angle-based Quantization: PolarQuant converts pairs (or recursively, larger blocks) of post-RoPE or random-rotated KV vectors into polar coordinates, quantizing the angle(s) and (sometimes) the radius. This inherits smooth, concentrated statistical properties, allowing efficient quantization and removal of per-block scale overhead (Wu et al., 1 Feb 2025, Han et al., 4 Feb 2025).
  • Subspace-Orthogonal Quantization: SQuat quantizes projectively, guaranteeing that the error introduced is orthogonal to the dominant subspace spanned by the queries, minimizing attention output distortion. This approach achieves nearly lossless 2-bit quantization, with no model tuning or calibration (Wang et al., 31 Mar 2025).
  • Additive and Commutative Structures: CommVQ uses additive vector quantization with codebooks constrained to be commutative with RoPE, streamlining decoding and allowing 1–2 bit caching with minimal loss even for 128K-token contexts (Li et al., 23 Jun 2025).
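The effect of an orthogonal rotation on outlier channels can be shown with a small NumPy sketch. It builds a normalized Hadamard matrix by the Sylvester construction and applies it to synthetic vectors with one dominant channel. The helper names and data are assumptions, and the normalization steps and codebooks used by the cited methods are not reproduced here.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def peak_to_rms(a):
    """Mean per-token peak magnitude divided by the overall RMS."""
    return np.abs(a).max(axis=1).mean() / np.sqrt((a**2).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))     # (tokens, channels), synthetic
x[:, 5] *= 30.0                    # assumed: one dominant outlier channel

H = hadamard(64)
rotated = x @ H                    # spreads the outlier energy over all coordinates
restored = rotated @ H.T           # H is orthonormal, so the rotation is exactly invertible

print(f"peak/RMS before: {peak_to_rms(x):.2f}, after: {peak_to_rms(rotated):.2f}")
```

Because the rotation is orthogonal, inner products are preserved when queries are rotated by the same matrix, so attention logits are unchanged in exact arithmetic while the rotated vectors have a flatter distribution that is easier to quantize.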

4. Adaptive, Token- and Importance-Aware Selection

A growing trend is the explicit selection and unquantized storage of a small subset of “important” tokens, channels, or subspaces:

  • Salient/Anchor Token Preservation: Methods including AnTKV and ZipCache introduce theoretical error analyses to assign importance scores (Anchor Score / normalized attention) to each token. By retaining only the top-k (e.g., 1%) tokens in full-precision, these methods enable sub-bit average quantization (e.g., 0.375b) while maintaining acceptable perplexity and accuracy (Li et al., 24 Jun 2025, He et al., 2024).
  • Semantic Token Filtering in VLLMs: For video LLMs, VidKV shows that per-channel value quantization with semantic token filtering (per cross-modal score) outperforms prior per-token methods, even at 1.5–1.66 bits, with negligible loss (Tao et al., 20 Mar 2025).
  • Log-Distributed, Sparsified Caching: LogQuant redistributes the sparsely chosen full-precision tokens over the full context according to a base-2 thinning pattern, preserving influential “spikes” far into the past and delivering 2x–4x boosts in math/code accuracy at the same compression rate compared to prior windowed schemes (Chen et al., 25 Mar 2025).
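The token-preservation pattern shared by these methods can be sketched as follows: score tokens, keep the top fraction in full precision, and quantize the rest. The scores here are random placeholders rather than Anchor Scores or normalized attention, the remaining tokens use plain 2-bit per-token scalar quantization rather than the sub-bit vector quantizers mentioned above, and the function name is hypothetical.

```python
import numpy as np

def quantize_with_anchors(values, scores, keep_frac=0.01, bits=2):
    """Keep the highest-scoring tokens in full precision; quantize the rest per token."""
    tokens = values.shape[0]
    k = max(1, round(keep_frac * tokens))
    anchors = np.argsort(scores)[-k:]                 # indices of the top-k tokens
    levels = 2**bits - 1
    z = values.min(axis=1, keepdims=True)
    s = (values.max(axis=1, keepdims=True) - z) / levels
    s = np.where(s == 0, 1.0, s)
    out = np.clip(np.rint((values - z) / s), 0, levels) * s + z
    out[anchors] = values[anchors]                    # anchors bypass quantization
    avg_bits = (k * 16 + (tokens - k) * bits) / tokens  # FP16 anchors, metadata ignored
    return out, anchors, avg_bits

rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 64))   # (tokens, channels), synthetic value cache
scores = rng.random(1000)              # placeholder importance scores (assumption)

deq, anchors, avg_bits = quantize_with_anchors(values, scores)
print(f"{len(anchors)} anchor tokens, average {avg_bits:.2f} bits/element")
# prints "10 anchor tokens, average 2.14 bits/element"
```

Keeping 1% of tokens at 16 bits adds only 0.14 bits to the average in this setting, which is why a small full-precision set is affordable even in aggressive regimes.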

5. Ultra-Low-Bit, Sub-bit, and Calibration-Free Regimes

Pushing below 2 bits per element presents unique challenges. Recent progress includes:

  • Calibration-Free and Data-Free Methods: XQuant achieves sub-1.4 bit equivalent bit-width by combining a data-free endpoint rescaling (no real calibration samples needed) with cross-layer cache compression, allowing a single cache to be shared between pairs of layers (Yang et al., 13 Oct 2025). NSNQuant’s double-normalization and codebook trained on N(0, I) enables robust operation out-of-domain without explicit calibration (Son et al., 23 May 2025).
  • 1-bit and Sub-bit Quantization: CalibQuant applies 1-bit quantization with per-group quantile bounds, supplementing with calibration on small (20–50 sample) prompt sets and per-head scale/zero offset to recover nearly all accuracy (Han et al., 15 Feb 2025). Polar, coupled, or additive vector quantization likewise achieves 1–1.5 bits average for >90% compression; methods such as AnTKV and CommVQ demonstrate ultra-low bit operation with negligible accuracy loss (Li et al., 23 Jun 2025, Li et al., 24 Jun 2025).
  • Per-Block Optimal Quantization: NQKV exploits empirical normality within token/channel blocks and applies per-block quantile quantization, approximating the information-theoretic minimum distortion strategy. This enables 4x compression versus FP16 without perceptible accuracy drop (Cai et al., 22 May 2025).
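A rough sketch of per-block quantile quantization under a normality assumption: each block is standardized by its own mean and standard deviation, then mapped to the nearest of 2^b levels placed at Gaussian quantile midpoints. This is one plausible reading of the idea, not NQKV's published algorithm; the level placement, the function names, and the Gaussian test data are assumptions.

```python
import numpy as np
from statistics import NormalDist

def normal_codebook(bits):
    """Levels at the midpoints of 2**bits equal-probability bins of N(0, 1)."""
    n = 2**bits
    return np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

def quantile_quant(blocks, bits):
    """blocks: (num_blocks, block_size); returns the dequantized blocks."""
    mu = blocks.mean(axis=1, keepdims=True)
    sigma = blocks.std(axis=1, keepdims=True) + 1e-8
    levels = normal_codebook(bits)
    idx = np.abs(((blocks - mu) / sigma)[..., None] - levels).argmin(axis=-1)
    return levels[idx] * sigma + mu

def uniform_quant(blocks, bits):
    """Min/max affine baseline with one scale and zero-point per block."""
    n = 2**bits - 1
    z = blocks.min(axis=1, keepdims=True)
    s = (blocks.max(axis=1, keepdims=True) - z) / n
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.rint((blocks - z) / s), 0, n) * s + z

rng = np.random.default_rng(0)
blocks = rng.normal(size=(256, 64))    # synthetic blocks that really are Gaussian
mse_q = ((quantile_quant(blocks, 2) - blocks) ** 2).mean()
mse_u = ((uniform_quant(blocks, 2) - blocks) ** 2).mean()
print(f"2-bit MSE: quantile {mse_q:.3f}, uniform min/max {mse_u:.3f}")
```

On Gaussian data the quantile levels concentrate resolution where most values lie, whereas min/max levels are stretched by the block extremes. The advantage depends on how well the normality assumption holds for the actual cache.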

6. Hardware, Runtime, and Implementation Considerations

Modern methods co-design algorithm and system to ensure hardware efficiency:

  • Fused and Custom Kernels: All state-of-the-art methods exploit group/block packing for memory coalescing and design fused CUDA/Triton kernels for on-the-fly dequantization and attention (Xia et al., 23 Nov 2025, Saxena et al., 6 Oct 2025, Li et al., 23 Jun 2025).
  • Parameter Overhead: The size and count of additional scale, zero-point, or codebook parameters are primary limitations—aggressive grouping, codebook sharing, or structure-exploiting transforms (Hadamard, polar) are employed to keep these costs negligible (often <0.5% cache memory) (Xia et al., 23 Nov 2025, Han et al., 4 Feb 2025, Son et al., 23 May 2025).
  • Throughput and Scalability: All top methods report 2–8× throughput gains and support scaling to batch size or context length multiples not feasible in FP16. For example, Kitty enables 8× larger batches on LLaMA-3-8B (Xia et al., 23 Nov 2025), NSNQuant scales batch size 4× with 7–8× memory reduction (Son et al., 23 May 2025), and AnTKV enables 810k-token context length on a single A100 (Li et al., 24 Jun 2025).
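The parameter-overhead point can be quantified with simple arithmetic for groupwise scalar quantization. The sketch assumes one FP16 scale and one FP16 zero-point per group; the function name and the group sizes are illustrative, and codebook-based methods have a different cost structure.

```python
def effective_bits(bits, group_size, scale_bits=16, zero_bits=16):
    """Average stored bits per element, including per-group scale and zero-point."""
    return bits + (scale_bits + zero_bits) / group_size

for g in (32, 64, 128):
    eb = effective_bits(2, g)
    print(f"group={g:>3}: {eb:.2f} bits/element, {16 / eb:.2f}x vs FP16")
# group= 32: 3.00 bits/element, 5.33x vs FP16
# group= 64: 2.50 bits/element, 6.40x vs FP16
# group=128: 2.25 bits/element, 7.11x vs FP16
```

At 2 bits, the metadata for small groups costs as much as half the payload, which is why larger groups, shared codebooks, or transforms that remove per-block scales matter for reaching the nominal compression ratio.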

7. Benchmarking and Trade-Offs

Empirical evaluation is performed on reasoning (GSM8K, MMLU), long-context (LongBench, NIaH), code (HumanEval), and generative (summarization, QA) tasks. Best-practice guidelines are to:

  • Prefer dynamic and adaptive quantization to avoid worst-case performance collapse at low bits.
  • Use vector/group quantization or polar/Hadamard transforms to exploit structure.
  • Retain a small, carefully chosen set of important tokens (anchor tokens, salient/semantic tokens) at higher precision in ultra-low bit regimes.
  • Balance group size, codebook capacity, and scale/zero-point granularity to optimize the accuracy/compression trade-off.
  • Apply offline calibration judiciously for block and codebook statistics unless using methods designed to be calibration-free (XQuant, NSNQuant).

On well-studied tasks, methods such as SQuat, AnTKV, Kitty, and LogQuant demonstrate either parity or clear improvement over FP16 and earlier 2-bit baselines, while consistently achieving 4×–8× memory reduction and 2×–4× decoding throughput increases.

