KV-Cache Quantization Techniques
- KV-cache quantization is a family of techniques that compress the key and value tensors in LLM attention caches to ultra-low bit-widths (1–4 bits), reducing memory use and increasing throughput.
- Techniques include affine/uniform quantization, mixed-precision adaptations, vector quantization, and adaptive token preservation, balancing compression gains with minimal accuracy loss.
- Advanced methods like polar transformations and subspace-orthogonal quantization enable memory reductions of up to 8× while maintaining task performance across diverse language tasks.
KV-cache quantization refers to the family of techniques for compressing the key and value tensors stored in the attention “KV cache” during autoregressive inference in LLMs. The KV cache, which grows linearly with context length, batch size, layer depth, and model width, rapidly becomes a primary memory and efficiency bottleneck. Quantizing the cache to ultra-low bit-widths (1–4 bits) enables large gains—longer context support, greater batch sizes, and faster decoding—with minimal impact on downstream quality when executed with carefully engineered algorithms.
1. Quantization Fundamentals and Baseline Strategies
The canonical approach to KV cache quantization is affine or uniform quantization, which replaces each floating-point value with a small integer plus a group-dependent scale and zero-point. At bit-width $b$, the quantization in its standard asymmetric form is

$$Q(x) = \mathrm{clamp}\left(\left\lfloor \frac{x - z}{s} \right\rceil,\ 0,\ 2^b - 1\right), \qquad \hat{x} = s \cdot Q(x) + z,$$

with $z = \min(x)$ and $s = (\max(x) - \min(x)) / (2^b - 1)$, where the minimum and maximum are taken over the quantization group. Quantization may be applied per channel (across tokens), per token (across channels), or per group. Empirically, keys are best quantized per channel and values per token due to their differing distributional properties (Su et al., 16 May 2025). Groupwise quantization mitigates outlier distortion but may incur significant parameter overhead due to the many required scale/zero-point arrays (He et al., 2024). Extremely low bit-width scalar quantization (e.g., 1–2 bits) typically leads to sharp accuracy drops unless additional structure or adaptation is introduced (Zhang et al., 2024, Tao et al., 2024).
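As a rough illustration of the per-channel versus per-token choice, the following NumPy sketch quantizes a toy key matrix both ways. The function names, the 4-bit setting, and the synthetic outlier channel are assumptions made for this example and are not taken from any of the cited papers.

```python
import numpy as np

def affine_quantize(x, bits, axis):
    """Asymmetric uniform quantization; min/max are reduced over `axis`,
    so every slice along that axis shares one (scale, zero-point) pair."""
    qmax = 2 ** bits - 1
    zero = x.min(axis=axis, keepdims=True)
    scale = np.maximum(x.max(axis=axis, keepdims=True) - zero, 1e-8) / qmax
    codes = np.clip(np.round((x - zero) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, zero

def affine_dequantize(codes, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + zero

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 16)).astype(np.float32)  # (tokens, channels)
keys[:, 3] *= 20.0  # hypothetical outlier channel, for illustration only

# Per channel: statistics over tokens (axis 0). Per token: over channels (axis 1).
codes_c, scale_c, zero_c = affine_quantize(keys, bits=4, axis=0)
codes_t, scale_t, zero_t = affine_quantize(keys, bits=4, axis=1)
recon_c = affine_dequantize(codes_c, scale_c, zero_c)
recon_t = affine_dequantize(codes_t, scale_t, zero_t)

# Compare the error on the ordinary (non-outlier) channels only.
rest = np.arange(keys.shape[1]) != 3
mse_c = float(np.mean((recon_c[:, rest] - keys[:, rest]) ** 2))
mse_t = float(np.mean((recon_t[:, rest] - keys[:, rest]) ** 2))
print(f"per-channel MSE: {mse_c:.4f}, per-token MSE: {mse_t:.4f}")
```

In this toy setup, per-token scales are stretched by the outlier channel, so the remaining channels lose resolution; per-channel scales confine the outlier to its own range.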
Vector quantization (jointly quantizing blocks of channels as multi-dimensional vectors) further reduces the number of parameters, leverages inter-channel correlation, and can dramatically increase the compression ratio with minimal quality loss, as shown in approaches like Coupled Quantization (CQ) (Zhang et al., 2024), Residual Vector Quantization (RVQ) (Kumar, 2024), and NSNQuant (Son et al., 23 May 2025).
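The basic mechanics of vector quantization can be sketched with a small k-means codebook over groups of channels. This is a generic illustration, not the training procedure of CQ, RVQ, or NSNQuant; the group size of 4, the 16-entry codebook, and all function names are assumptions chosen so that the storage cost works out to 1 bit per channel.

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Return the index of the nearest codebook entry for each vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def vq_decode(codes, codebook):
    """Look up the codebook entry for each stored index."""
    return codebook[codes]

def fit_codebook(vectors, num_codes, iters=10, seed=0):
    """Plain k-means (Lloyd iterations), initialized from random data points."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), num_codes, replace=False)].copy()
    for _ in range(iters):
        codes = vq_encode(vectors, codebook)
        for c in range(num_codes):
            members = vectors[codes == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(1)
tokens, channels, group = 512, 8, 4
cache = rng.standard_normal((tokens, channels))

# Split each token's channels into groups of 4 and quantize each group jointly.
grouped = cache.reshape(tokens * (channels // group), group)
num_codes = 16
codebook = fit_codebook(grouped, num_codes)
codes = vq_encode(grouped, codebook)
recon = vq_decode(codes, codebook).reshape(tokens, channels)

bits_per_channel = np.log2(num_codes) / group  # 4 index bits shared by 4 channels
mse = float(np.mean((recon - cache) ** 2))
print(f"{bits_per_channel} bit(s) per channel, MSE {mse:.3f}")
```

Only the indices are stored per token; the codebook is shared, which is why its cost is amortized over the whole cache.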
2. Mixed-Precision, Channel-/Token-Aware, and Outlier-Resilient Methods
Several methods now adapt the quantization precision (i.e., number of bits) either by layer, channel, or token to maximize compression while controlling accuracy loss:
- Dynamic Channel-wise Precision Boost: Kitty augments a baseline 2-bit channelwise quantizer by identifying the most error-sensitive channels (via simple magnitude heuristics or explicit MSE measurements on the attention logits) and boosting only a small fraction (e.g., 12.5–25%) to 4 bits. This achieves up to 8× memory compression with negligible accuracy loss and allows 8× larger batches (Xia et al., 23 Nov 2025).
- Layerwise Asymmetric Quantization: AsymKV observes that model output is more sensitive to errors in key quantization than value quantization. By quantizing keys at higher precision in early layers and using lower precision in later layers and for values, up to 75% of the KV cache can be stored at 1 bit without significant degradation (Tao et al., 2024).
- Mixed-Precision with Gradient-based Layer Importance: KVmix assigns higher bit-widths to layers with greater loss sensitivity (as measured by the gradients of the key and value projection weights), and leverages a recent-pivotal-context (RPC) buffer to keep the most recent tokens at full precision (Li et al., 18 May 2025).
- Token/Channel Outlier Awareness: Outlier Tokens Tracing (OTT) dynamically identifies, during decoding, tokens whose presence would inflate channel quantization range and disproportionately degrade accuracy. These tokens are excluded from quantization and stored in a small side buffer, improving 2-bit KV quantization accuracy by 3–8 points absolute while preserving throughput gains (Su et al., 16 May 2025). ZipCache proposes token “saliency” scored via normalized attention and preserves high-saliency tokens at higher precision (He et al., 2024).
| Method | Adaptivity | Principle | Bit-width/Compression | Notable Results |
|---|---|---|---|---|
| Kitty (Xia et al., 23 Nov 2025) | Channelwise | Sensitivity | 2b+ (w/ 4b boost) | 8× memory cut, <1pt loss |
| AsymKV (Tao et al., 2024) | Layerwise | Structural role | 1–2b (layer-adaptive) | 75% at 1b, negligible degradation |
| KVmix (Li et al., 18 May 2025) | Layerwise | Gradients | $2.19$–$2.38$b | 4.9× compression, minimal loss |
| OTT (Su et al., 16 May 2025) | Tokenwise | Outlier tokens | 2b (plus pool) | 6.4× memory, closes accuracy gap |
| ZipCache (He et al., 2024) | Tokenwise | Saliency score | 4b/2b hybrid | loss at compression |
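The idea of boosting a few sensitive channels from 2 to 4 bits can be sketched as follows. This is a simplified stand-in and not Kitty's actual algorithm: sensitivity is approximated here by the per-channel reconstruction MSE of the 2-bit quantizer, and the function names and the 25% boost fraction are assumptions for illustration.

```python
import numpy as np

def quantize_channels(x, bits):
    """Per-channel asymmetric quantize-dequantize; x has shape (tokens, channels)."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    scale = np.maximum(x.max(axis=0, keepdims=True) - lo, 1e-8) / qmax
    return np.round((x - lo) / scale) * scale + lo

def boosted_quantize(x, low_bits=2, high_bits=4, boost_fraction=0.25):
    """Quantize all channels at low_bits, then redo the worst ones at high_bits."""
    low = quantize_channels(x, low_bits)
    # Assumed sensitivity proxy: per-channel MSE of the low-bit reconstruction.
    channel_mse = ((low - x) ** 2).mean(axis=0)
    num_boost = int(round(boost_fraction * x.shape[1]))
    boosted = np.argsort(channel_mse)[::-1][:num_boost]
    out = low.copy()
    out[:, boosted] = quantize_channels(x[:, boosted], high_bits)
    avg_bits = low_bits + (high_bits - low_bits) * num_boost / x.shape[1]
    return out, boosted, avg_bits

rng = np.random.default_rng(0)
# Toy keys whose channels have different dynamic ranges.
keys = rng.standard_normal((128, 32)) * rng.uniform(0.5, 8.0, size=32)

plain = quantize_channels(keys, 2)
mixed, boosted, avg_bits = boosted_quantize(keys)
mse_plain = float(np.mean((plain - keys) ** 2))
mse_mixed = float(np.mean((mixed - keys) ** 2))
print(f"avg bits {avg_bits}, MSE {mse_plain:.3f} -> {mse_mixed:.3f}")
```

The average bit-width follows directly from the boosted fraction: with 25% of channels at 4 bits and the rest at 2 bits, the payload averages 2.5 bits per element before scale and zero-point overhead.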
3. Vector, Polar, and Subspace Quantization Approaches
Modern vector quantization approaches exploit nontrivial statistics of the KV representations:
- Coupled Quantization (CQ): Groups multiple correlated channels and learns a codebook across the group, enabling quantization to as low as 1 bit per channel with only moderate accuracy loss (Zhang et al., 2024).
- Residual Vector Quantization: Successively encodes the residual error using multiple codebooks. With a residual depth of 8, the method compresses the cache relative to FP16 while nearly preserving model performance; noncontiguous grouping of channels yields the best trade-off for keys (Kumar, 2024).
- Hadamard/PCA-based Rotation: NSNQuant and KVLinC apply a Hadamard (or similar orthogonal) rotation to standardize/whiten KV vectors, then perform blockwise VQ. NSNQuant’s two-step normalization followed by a Hadamard transform enables calibration-free, single-codebook quantization robust to distribution shift (Son et al., 23 May 2025). KVLinC combines this for values with learned linear-correction adapters for keys (Saxena et al., 6 Oct 2025).
- Polar and Angle-based Quantization: PolarQuant converts pairs (or recursively, larger blocks) of post-RoPE or random-rotated KV vectors into polar coordinates, quantizing the angle(s) and (sometimes) the radius. This inherits smooth, concentrated statistical properties, allowing efficient quantization and removal of per-block scale overhead (Wu et al., 1 Feb 2025, Han et al., 4 Feb 2025).
- Subspace-Orthogonal Quantization: SQuat quantizes projectively, guaranteeing that the error introduced is orthogonal to the dominant subspace spanned by the queries, minimizing attention output distortion. This approach achieves nearly lossless 2-bit quantization, with no model tuning or calibration (Wang et al., 31 Mar 2025).
- Additive and Commutative Structures: CommVQ uses additive vector quantization with codebooks constrained to be commutative with RoPE, streamlining decoding and allowing 1–2 bit caching with minimal loss even for 128K-token contexts (Li et al., 23 Jun 2025).
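To make the rotation step used by the Hadamard-based methods above concrete, the sketch below builds an orthonormal Hadamard matrix with the Sylvester construction and applies it to a vector with a single outlier. It only demonstrates the general property that such rotations are invertible and spread outlier energy across coordinates; it is not the NSNQuant or KVLinC pipeline, and the function name is an assumption.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

H = hadamard(8)

# A toy vector whose energy sits in one outlier coordinate.
x = np.zeros(8)
x[2] = 8.0

rotated = H @ x          # outlier energy is spread evenly over all coordinates
restored = H.T @ rotated  # the rotation is orthogonal, so H.T inverts it

print("peak before:", np.abs(x).max())
print("peak after: ", np.abs(rotated).max())
```

Because the rotation preserves the vector norm while lowering the peak magnitude, a uniform quantizer applied after it needs a smaller range for the same data.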
4. Adaptive, Token- and Importance-Aware Selection
A growing trend is the explicit selection and unquantized storage of a small subset of “important” tokens, channels, or subspaces:
- Salient/Anchor Token Preservation: Methods including AnTKV and ZipCache introduce theoretical error analyses to assign importance scores (Anchor Score / normalized attention) to each token. By retaining only the top-k (e.g., 1%) tokens in full-precision, these methods enable sub-bit average quantization (e.g., 0.375b) while maintaining acceptable perplexity and accuracy (Li et al., 24 Jun 2025, He et al., 2024).
- Semantic Token Filtering in VLLMs: For video LLMs, VidKV shows that per-channel value quantization with semantic token filtering (per cross-modal score) outperforms prior per-token methods, even at 1.5–1.66 bits, for negligible loss (Tao et al., 20 Mar 2025).
- Log-Distributed, Sparsified Caching: LogQuant redistributes the sparsely chosen full-precision tokens over the full context according to a base-2 thinning pattern, preserving influential “spikes” far into the past and delivering 2x–4x boosts in math/code accuracy at the same compression rate compared to prior windowed schemes (Chen et al., 25 Mar 2025).
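A minimal sketch of salient-token preservation, in the spirit of the normalized-attention scoring described above, is given below. The exact scoring rules of ZipCache and AnTKV differ; the function names, the 10% keep fraction, and the synthetic attention pattern are assumptions for illustration.

```python
import numpy as np

def token_saliency(attn):
    """Attention mass received by each token, divided by the number of queries
    that can see it under a causal mask (removes the bias toward early tokens)."""
    n = attn.shape[0]
    visible = n - np.arange(n)  # token j is visible to queries j..n-1
    return attn.sum(axis=0) / visible

def mixed_precision_values(values, saliency, keep_fraction, bits):
    """Per-token quantize-dequantize, except for the top-saliency tokens."""
    k = max(1, int(round(keep_fraction * len(saliency))))
    keep = np.sort(np.argsort(saliency)[::-1][:k])
    qmax = 2 ** bits - 1
    lo = values.min(axis=1, keepdims=True)
    scale = np.maximum(values.max(axis=1, keepdims=True) - lo, 1e-8) / qmax
    out = np.round((values - lo) / scale) * scale + lo
    out[keep] = values[keep]  # salient tokens stay in full precision
    return out, keep

rng = np.random.default_rng(0)
n, d = 32, 16
logits = rng.standard_normal((n, n))
logits[:, 5] += 6.0  # hypothetical token that most later queries attend to
mask = np.tril(np.ones((n, n), dtype=bool))
logits = np.where(mask, logits, -np.inf)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

values = rng.standard_normal((n, d))
saliency = token_saliency(attn)
recon, keep = mixed_precision_values(values, saliency, keep_fraction=0.1, bits=2)
print("tokens kept in full precision:", keep)
```

In a real decoder the attention matrix is not materialized by fused attention kernels, so practical methods estimate saliency from a subset of queries or from other statistics.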
5. Ultra-Low-Bit, Sub-bit, and Calibration-Free Regimes
Pushing below 2 bits per element presents unique challenges. Recent progress includes:
- Calibration-Free and Data-Free Methods: XQuant achieves an equivalent bit-width below 1.4 bits by combining a data-free endpoint rescaling (no real calibration samples needed) with cross-layer cache compression, allowing a single cache to be shared between pairs of layers (Yang et al., 13 Oct 2025). NSNQuant’s double normalization and a codebook trained on $\mathcal{N}(0, I)$ enable robust out-of-domain operation without explicit calibration (Son et al., 23 May 2025).
- 1-bit and Sub-bit Quantization: CalibQuant applies 1-bit quantization with per-group quantile bounds, supplementing with calibration on small (20–50 sample) prompt sets and per-head scale/zero offset to recover nearly all accuracy (Han et al., 15 Feb 2025). Polar, coupled, or additive vector quantization likewise achieves 1–1.5 bits average for >90% compression; methods such as AnTKV and CommVQ demonstrate ultra-low bit operation with negligible accuracy loss (Li et al., 23 Jun 2025, Li et al., 24 Jun 2025).
- Per-Block Optimal Quantization: NQKV exploits empirical normality within token/channel blocks and applies per-block quantile quantization, approximating the information-theoretic minimum distortion strategy. This enables 4x compression versus FP16 without perceptible accuracy drop (Cai et al., 22 May 2025).
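Per-block quantile quantization under a normality assumption can be sketched as below. This is a generic normal-quantile codebook and not NQKV's exact procedure; the level placement at the midpoints of equal-probability bins, the per-block mean/std normalization, and the function names are assumptions for illustration.

```python
from statistics import NormalDist
import numpy as np

def normal_levels(bits):
    """Reconstruction levels at the probability midpoints of 2**bits
    equal-probability bins of a standard normal distribution."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    return np.array([NormalDist().inv_cdf(float(p)) for p in probs])

def quantile_quantize(block, bits):
    """Standardize the block, then snap each value to the nearest normal level."""
    mu, sigma = float(block.mean()), float(block.std()) + 1e-8
    levels = normal_levels(bits)
    z = (block - mu) / sigma
    codes = np.abs(z[..., None] - levels).argmin(axis=-1).astype(np.uint8)
    return codes, mu, sigma

def quantile_dequantize(codes, mu, sigma, bits):
    """Look up the level for each code and undo the standardization."""
    return normal_levels(bits)[codes] * sigma + mu

rng = np.random.default_rng(0)
block = rng.normal(loc=0.3, scale=2.0, size=(64, 64))  # toy token/channel block

bits = 4
codes, mu, sigma = quantile_quantize(block, bits)
recon = quantile_dequantize(codes, mu, sigma, bits)
mse = float(np.mean((recon - block) ** 2))
print(f"relative MSE at {bits} bits: {mse / block.var():.4f}")
```

With 4-bit codes in place of 16-bit floats, the payload shrinks by 4×; only one mean and one standard deviation are stored per block.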
6. Hardware, Runtime, and Implementation Considerations
Modern methods co-design algorithm and system to ensure hardware efficiency:
- Fused and Custom Kernels: Leading methods exploit group/block packing for memory coalescing and use fused CUDA/Triton kernels that dequantize on the fly inside the attention computation (Xia et al., 23 Nov 2025, Saxena et al., 6 Oct 2025, Li et al., 23 Jun 2025).
- Parameter Overhead: The size and count of additional scale, zero-point, or codebook parameters are primary limitations—aggressive grouping, codebook sharing, or structure-exploiting transforms (Hadamard, polar) are employed to keep these costs negligible (often <0.5% cache memory) (Xia et al., 23 Nov 2025, Han et al., 4 Feb 2025, Son et al., 23 May 2025).
- Throughput and Scalability: The leading methods report 2–8× throughput gains and support batch sizes or context lengths that are not feasible in FP16. For example, Kitty enables 8× larger batches on LLaMA-3-8B (Xia et al., 23 Nov 2025), NSNQuant scales batch size 4× with 7–8× memory reduction (Son et al., 23 May 2025), and AnTKV enables 810k-token context length on a single A100 (Li et al., 24 Jun 2025).
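The interaction between bit-width and scale/zero-point overhead can be checked with simple arithmetic. The sketch below assumes a LLaMA-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128), 16-bit scale and zero-point values, and hypothetical group sizes; the helper name is made up for this example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits,
                   group_size=None, meta_bits=16):
    """Return (bytes, effective bits per element) for a KV cache.

    If group_size is given, each group of elements stores one scale and one
    zero-point of meta_bits each, which raises the effective bit-width.
    """
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch  # keys + values
    eff_bits = float(bits)
    if group_size is not None:
        eff_bits += 2 * meta_bits / group_size
    return elements * eff_bits / 8, eff_bits

# Assumed model shape and workload, for illustration only.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=1)

fp16_bytes, _ = kv_cache_bytes(bits=16, **cfg)
q128_bytes, q128_bits = kv_cache_bytes(bits=2, group_size=128, **cfg)
q32_bytes, q32_bits = kv_cache_bytes(bits=2, group_size=32, **cfg)

print(f"FP16: {fp16_bytes / 2**30:.2f} GiB")
print(f"2-bit, group 128: {q128_bits} bits/elem, {fp16_bytes / q128_bytes:.2f}x smaller")
print(f"2-bit, group 32:  {q32_bits} bits/elem, {fp16_bytes / q32_bytes:.2f}x smaller")
```

Under these assumptions a nominal 2-bit cache reaches about 7.1× compression with groups of 128 but only about 5.3× with groups of 32, which illustrates why coarse grouping, codebook sharing, or scale-free transforms matter at low bit-widths.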
7. Benchmarking and Trade-Offs
Empirical evaluation is performed on reasoning (GSM8K, MMLU), long-context (LongBench, NIaH), code (HumanEval), and generative (summarization, QA) tasks. Best-practice guidelines are to:
- Prefer dynamic and adaptive quantization to avoid worst-case performance collapse at low bits.
- Use vector/group quantization or polar/Hadamard transforms to exploit structure.
- Retain a small, carefully chosen set of important tokens (anchor tokens, salient/semantic tokens) at higher precision in ultra-low bit regimes.
- Balance group size, codebook capacity, and scale/zero-point granularity to optimize the accuracy/compression trade-off.
- Apply offline calibration judiciously for block and codebook statistics unless using methods designed to be calibration-free (XQuant, NSNQuant).
On well-studied tasks, methods such as SQuat, AnTKV, Kitty, and LogQuant demonstrate either parity with FP16 or clear improvement over earlier 2-bit baselines, while consistently achieving 4–8× memory reduction and 2–4× decoding throughput increases.
References
- (Zhang et al., 2024) "KV Cache is 1 Bit Per Channel: Efficient LLM Inference with Coupled Quantization"
- (He et al., 2024) "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification"
- (Tao et al., 2024) "AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations"
- (Kumar, 2024) "Residual vector quantization for KV cache compression in LLM"
- (Wu et al., 1 Feb 2025) "PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration"
- (Han et al., 4 Feb 2025) "PolarQuant: Quantizing KV Caches with Polar Transformation"
- (Han et al., 15 Feb 2025) "CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs"
- (Tao et al., 20 Mar 2025) "Plug-and-Play 1.x-Bit KV Cache Quantization for Video LLMs"
- (Chen et al., 25 Mar 2025) "LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation"
- (Wang et al., 31 Mar 2025) "SQuat: Subspace-orthogonal KV Cache Quantization"
- (Su et al., 16 May 2025) "Accurate KV Cache Quantization with Outlier Tokens Tracing"
- (Cai et al., 22 May 2025) "NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics"
- (Son et al., 23 May 2025) "NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache"
- (Li et al., 18 May 2025) "KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache"
- (Li et al., 23 Jun 2025) "CommVQ: Commutative Vector Quantization for KV Cache Compression"
- (Li et al., 24 Jun 2025) "AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in LLMs"
- (Saxena et al., 6 Oct 2025) "KVLinC: KV Cache Quantization with Hadamard Rotation and Linear Correction"
- (Yang et al., 13 Oct 2025) "XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression"
- (Xia et al., 23 Nov 2025) "Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost"