Constant-Memory KV Cache Methods
- Constant-memory KV cache techniques bound GPU memory usage during Transformer inference by compressing or re-architecting stored key-value pairs so that the cache footprint stays fixed regardless of sequence length.
- Methods such as mixed-precision quantization and importance-aware token retention achieve up to 20× memory reduction while preserving model accuracy and contextual integrity.
- Advanced strategies combine adaptive token selection, sparse representations, and system-level memory reallocation to enhance throughput and enable long-context, high-throughput inference.
Constant-memory KV cache refers to techniques for bounding the GPU or system memory footprint of the key-value (KV) cache used in Transformer-based LLMs during inference, such that memory usage remains effectively independent of sequence length or input context size. This problem is motivated by the observation that, in standard autoregressive inference, the KV cache grows linearly with the number of input and generated tokens and often dominates memory consumption, representing a severe bottleneck particularly for long-context and high-throughput inference. Recent research has produced a diversity of algorithmic, systems, and architectural approaches that tightly compress, prune, or re-architect the KV cache to provide constant or nearly constant memory usage, with minimal (or zero) accuracy trade-off even at aggressive compression rates.
1. Motivation and Challenges in KV Cache Compression
The KV cache is central to efficient Transformer inference: by storing per-token key and value activations from every attention layer, it lets each new token be generated with O(n) attention computation instead of recomputing the entire prefix. However, for long sequences, the accumulated KV pairs consume substantial memory (often exceeding the model weights themselves), sharply limiting the achievable context length or batch size on accelerators with fixed memory budgets (Yang et al., 28 Feb 2024, Zhang et al., 4 Dec 2024, Zhang et al., 16 Dec 2024). Traditional mitigations, such as discarding past tokens, static quantization, or sliding attention windows, introduce context loss, hallucinations, and degraded accuracy, with especially pronounced deficits in safety-critical or context-sensitive applications (Yang et al., 28 Feb 2024). The core challenge is to decouple cache size from context length while maintaining high-fidelity context for all relevant tasks.
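To make the scale of the problem concrete, the following back-of-the-envelope calculation estimates KV cache size for a hypothetical 7B-parameter configuration; every hyperparameter value below is an illustrative assumption, not a figure from the cited works.

```python
# KV cache size estimate for an assumed Llama-2-7B-like configuration.
# All values below are illustrative assumptions, not taken from the cited papers.
def kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                   seq_len=32_768, batch_size=8, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape [batch, heads, seq, head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem

print(f"{kv_cache_bytes() / 2**30:.0f} GiB")  # 128 GiB at FP16, roughly 10x the ~13 GiB of FP16 weights
```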
2. Mixed-Precision Quantization and Importance-Aware Retention
Several recent methodologies address the constant-memory goal using mixed-precision quantization and importance-aware token retention. MiKV (Yang et al., 28 Feb 2024) employs an adaptive scheme where critical KV pairs (as determined by attention-based importance scoring) are stored at high precision (e.g., FP16 or 8-bit), while unimportant pairs, which would otherwise be evicted, are retained at lower precision (e.g., INT4 or INT2). This ensures that even "evicted" tokens continue to convey partial contextual information, leading to minimal quality degradation. The quantization uses asymmetric per-token formulas, with careful outlier handling via per-channel balancing. This regime bounds total cache growth, enabling constant-memory operation at compression ratios as low as 20–25% of the full cache size, with accuracy and contextuality nearly indistinguishable from uncompressed baselines.
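A minimal sketch of importance-aware mixed precision appears below: the most-attended tokens are kept at full precision and the rest are quantized with a standard asymmetric per-token formula. The function names, the 25% keep ratio, and the 2-bit setting are assumptions for illustration, not MiKV's exact algorithm.

```python
import torch

def asym_quantize(x, n_bits):
    """Asymmetric per-token quantization: each row gets its own scale and zero point."""
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - x_min) / scale), 0, qmax)
    return q, scale, x_min

def dequantize(q, scale, zero):
    return q * scale + zero

def mixed_precision_cache(keys, values, attn_scores, keep_ratio=0.25, low_bits=2):
    """Keep the most-attended tokens at full precision; store the rest at `low_bits`.

    keys, values: [seq, dim]; attn_scores: [seq] accumulated attention mass per token.
    Illustrative sketch of importance-aware mixed precision, not the exact MiKV method.
    """
    seq = keys.shape[0]
    n_keep = max(1, int(seq * keep_ratio))
    mask = torch.zeros(seq, dtype=torch.bool)
    mask[torch.topk(attn_scores, n_keep).indices] = True

    return {
        "hi_idx": mask.nonzero(as_tuple=True)[0], "hi_k": keys[mask], "hi_v": values[mask],
        "lo_idx": (~mask).nonzero(as_tuple=True)[0],
        "lo_k": asym_quantize(keys[~mask], low_bits),
        "lo_v": asym_quantize(values[~mask], low_bits),
    }
```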
LeanKV (Zhang et al., 4 Dec 2024) further refines this by introducing Hetero-KV quantization (allocating higher precision to keys than to values) and per-head adaptive sparsity, while integrating on-GPU memory management to support fine-grained mixed-precision layouts, thereby offering substantial compression with near-lossless performance.
3. Personalized and Adaptive Cache Budgeting
Constant memory usage can also be attained via dynamic, non-uniform allocation of cache resources according to layer- or head-specific metrics. XKV (Li et al., 8 Dec 2024) observes pronounced variance in the retention importance of KV pairs across network layers, a phenomenon it quantifies with a per-layer metric derived from the number of retained KV pairs in each layer and that layer's attention score vector. By formulating cache allocation as a combinatorial optimization, XKV assigns personalized retention budgets per layer, solved via a greedy allocation that maximizes overall contextual retention under a global memory constraint (a simplified version is sketched below). The result is a substantial mean reduction in KV cache size and a corresponding throughput boost on long-context benchmarks.
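The sketch below illustrates the greedy flavor of such budget allocation: a global token budget is handed out one slot at a time to whichever layer gains the most retained attention mass. The function name and data layout are assumptions for illustration; XKV's actual scoring and optimization are more involved.

```python
import heapq

def allocate_layer_budgets(layer_scores, global_budget):
    """Greedy per-layer KV retention under a global budget (illustrative, not exact XKV).

    layer_scores[l][j]: importance (e.g., accumulated attention) of the j-th most
    important KV pair in layer l, sorted in descending order.
    Returns the number of KV pairs retained per layer.
    """
    budgets = [0] * len(layer_scores)
    # Max-heap over the marginal gain of keeping one more KV pair in each layer.
    heap = [(-scores[0], l) for l, scores in enumerate(layer_scores) if scores]
    heapq.heapify(heap)
    for _ in range(global_budget):
        if not heap:
            break
        _, l = heapq.heappop(heap)
        budgets[l] += 1
        if budgets[l] < len(layer_scores[l]):
            heapq.heappush(heap, (-layer_scores[l][budgets[l]], l))
    return budgets

# Example: allocate_layer_budgets([[0.9, 0.2], [0.8, 0.7, 0.1]], 3) -> [1, 2]
```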
BaKlaVa (Gulhan et al., 18 Feb 2025) generalizes this approach to head-level granularity, using a one-time, attention-profile-driven budget search that reallocates memory saved on low-importance heads and layers to high-importance ones, yielding substantial compression without baseline performance loss.
CAKE (Qin et al., 16 Mar 2025) introduces a "cake-slicing" methodology, assigning cache budgets in proportion to each layer's preference score, which combines the layer's attention entropy (spatial dispersion) and attention variance (temporal dispersion). Guided by these metrics, CAKE adaptively slices the global memory budget and performs cascading eviction, using a novel mean-variance eviction indicator to support dynamic, context-sensitive retention.
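One way such a preference score could be computed is sketched below: per-layer attention entropy and variance are combined (here simply multiplied) and the global budget is split proportionally. The product form and the specific tensor reductions are assumptions for illustration, not CAKE's published formula.

```python
import torch

def layer_preference(attn, eps=1e-9):
    """Preference of one layer, computed from its attention map [heads, q_len, k_len]."""
    p = attn.clamp_min(eps)
    entropy = -(p * p.log()).sum(dim=-1).mean()      # spatial dispersion over keys
    variance = attn.sum(dim=1).var(dim=-1).mean()    # dispersion of per-key mass over queries
    return (entropy * variance).item()

def slice_budget(per_layer_attn, total_budget):
    """Split `total_budget` cache slots across layers in proportion to preference."""
    prefs = torch.tensor([layer_preference(a) for a in per_layer_attn])
    shares = prefs / prefs.sum()
    # Flooring may leave a few slots unassigned; hand any remainder to the top layer.
    budgets = (shares * total_budget).floor().long()
    budgets[prefs.argmax()] += total_budget - int(budgets.sum())
    return budgets.tolist()
```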
4. Sparse and Approximate Representations
A distinct approach to constant-memory caching transforms the dense KV tensor into sparse or compressed forms. CSR (Zhang et al., 16 Dec 2024) constructs a sparse representation via Matching Pursuit over a dictionary learned with NeuralDict: each key or value vector is approximated by a small, fixed number of non-zero coefficients over dictionary atoms, and only those indices and coefficients are stored. This drives the per-channel memory cost down toward the theoretical compressibility limit for practical context lengths. Inference reconstructs approximate keys and values on demand, preserving accuracy at extreme compression rates.
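A toy Matching Pursuit routine is shown below to make the sparse-coding step concrete. The dictionary here is assumed to have unit-norm atoms and is a stand-in for CSR's learned NeuralDict; the default sparsity level is an arbitrary choice.

```python
import torch

def matching_pursuit(x, dictionary, n_nonzero=8):
    """Greedily approximate x as a sparse combination of dictionary atoms.

    x: [dim]; dictionary: [n_atoms, dim] with (approximately) unit-norm rows.
    Returns (indices, coefficients), the only data a sparse KV cache needs to store.
    """
    residual = x.clone()
    idx, coef = [], []
    for _ in range(n_nonzero):
        scores = dictionary @ residual              # correlation with each atom
        j = int(torch.argmax(scores.abs()))
        idx.append(j)
        coef.append(scores[j].item())
        residual = residual - scores[j] * dictionary[j]
    return idx, coef

def reconstruct(idx, coef, dictionary):
    """Rebuild the approximate key/value vector on demand at attention time."""
    return sum(c * dictionary[j] for j, c in zip(idx, coef))
```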
Streaming and subsampling-based approximations, exemplified by BalanceKV (Han et al., 11 Feb 2025), use discrepancy theory to subsample a balanced subset of key-value states via a recursive streaming merge-and-reduce procedure. Such techniques rigorously bound the approximation error while capping memory usage sublinearly in sequence length.
5. Token Selection, Merging, and Residual Techniques
A spectrum of methods compresses the KV cache by contextually selecting and aggregating tokens. MorphKV (Ghadia et al., 2 Mar 2025) maintains a constant-size cache by dynamically partitioning tokens into a fixed window of recent tokens and a set of correlated distant tokens ranked by the attention they receive from recent queries. This adaptive, correlation-aware retention avoids early-token bias, achieving substantial average memory savings without context degradation (a simplified selection routine is sketched below).
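The routine below sketches this kind of window-plus-correlation selection: keep the last `window` positions plus the distant positions most attended to by recent queries, so the retained set is bounded independently of sequence length. The parameter names and defaults are assumptions, not MorphKV's actual configuration.

```python
import torch

def morph_select(attn_recent, window=64, n_distant=192):
    """Choose which token positions to retain in the cache.

    attn_recent: [recent_q, seq] attention from the last few queries to all positions.
    Returns a sorted index tensor of size at most window + n_distant.
    """
    seq = attn_recent.shape[-1]
    recent = torch.arange(max(0, seq - window), seq)
    distant_scores = attn_recent[:, : max(0, seq - window)].sum(dim=0)
    k = min(n_distant, distant_scores.numel())
    distant = (torch.topk(distant_scores, k).indices
               if k > 0 else torch.empty(0, dtype=torch.long))
    # Cache size stays bounded by window + n_distant regardless of sequence length.
    return torch.unique(torch.cat([distant, recent]))
```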
ZeroMerge (ZSMerge) (Liu et al., 13 Mar 2025) combines head-level token-importance budgeting, residual merging (token fusion based on dot-product compatibility and momentum), and compensated attention scoring, shrinking the cache to a small fraction of its original size with essentially unchanged quality and improved throughput (a minimal merging sketch follows).
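The snippet below sketches the residual-merging idea: instead of discarding an evicted KV pair, fuse it into the most compatible retained slot with a running-average (momentum-style) update. The function and the exact weighting are illustrative assumptions, not ZeroMerge's precise rule.

```python
import torch

def merge_evicted(cache_k, cache_v, new_k, new_v, counts):
    """Fuse an evicted KV pair into its most compatible retained slot, in place.

    cache_k, cache_v: [slots, dim]; counts: [slots] tensor of tokens merged per slot.
    Returns the slot index that absorbed the evicted token.
    """
    compat = cache_k @ new_k                 # dot-product compatibility with each slot
    j = int(torch.argmax(compat))
    w = 1.0 / (counts[j] + 1)                # running-average (momentum-style) weight
    cache_k[j] = (1 - w) * cache_k[j] + w * new_k
    cache_v[j] = (1 - w) * cache_v[j] + w * new_v
    counts[j] += 1
    return j
```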
KeepKV (Tian et al., 14 Apr 2025) eliminates the "attention sag" of convex merging by recording merging history ("electoral votes") and applying zero-perturbation merging with adaptive scaling; the merge weights are proportional to prior attention and the recorded votes, yielding exact output preservation at each merging step.
GraphKV (Li et al., 30 Aug 2025) advances the field by recasting token retention as a graph signal-propagation problem, penalizing redundancy through iterative decay of importance scores based on cosine similarity. Retained tokens are thus both high-importance and contextually diverse, with demonstrable gains under constant-memory budgets (see the sketch below).
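A greedy importance-plus-diversity selector in this spirit is sketched below: each time a token is retained, the scores of similar tokens are decayed in proportion to their cosine similarity to it. The decay rule and parameters are illustrative assumptions rather than GraphKV's actual propagation scheme.

```python
import torch
import torch.nn.functional as F

def diverse_topk(keys, importance, budget, decay=0.5):
    """Select up to `budget` tokens that are both important and mutually dissimilar.

    keys: [seq, dim]; importance: [seq]. Returns the retained token indices.
    """
    scores = importance.clone().float()
    normed = F.normalize(keys.float(), dim=-1)
    selected = []
    for _ in range(min(budget, keys.shape[0])):
        i = int(torch.argmax(scores))
        selected.append(i)
        sim = (normed @ normed[i]).clamp(min=0)   # cosine similarity to the new pick
        scores = scores * (1 - decay * sim)       # decay redundant (similar) tokens
        scores[i] = float("-inf")                 # never pick the same token twice
    return torch.tensor(selected)
```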
6. Quantization and Coding Optimized for KV Cache
Quantization-specific strategies, such as NQKV (Cai et al., 22 May 2025), exploit the normal-like distribution of KV cache blocks to perform information-theoretically optimal, blockwise quantile quantization in 4 bits ("NF4"). Combined with blockwise dequantization at inference time, this yields large memory reductions with sub-percent accuracy loss, allowing larger batch sizes or longer contexts within the same memory budget.
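The sketch below shows the flavor of quantile-based 4-bit block quantization: build a codebook from equal-probability quantiles of a standard normal, scale each block by its absolute maximum, and snap values to the nearest level. The codebook construction is a simplified stand-in for the actual NF4 table, and the function names are assumptions.

```python
import torch

def normal_codebook(n_levels=16):
    """One level per equal-probability bin of a standard normal, normalized to [-1, 1]."""
    probs = (torch.arange(n_levels) + 0.5) / n_levels
    levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    return levels / levels.abs().max()

def nf_quantize_block(x, codebook):
    """Blockwise quantile quantization: absmax scaling, then nearest-level lookup."""
    scale = x.abs().max().clamp(min=1e-8)
    idx = ((x / scale).unsqueeze(-1) - codebook).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scale      # 4-bit index per element plus one scale per block

def nf_dequantize_block(idx, scale, codebook):
    return codebook[idx.long()] * scale
```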
KVComp (Jiang et al., 30 Aug 2025) extends this direction by coupling error-controlled, token-wise quantization with GPU-optimized Huffman encoding. Critical architectural co-designs, such as cache-resident decompression and fused decoding and matrix multiplication, support markedly higher memory reduction while outperforming standard attention kernels in total throughput.
7. Advanced System and Architectural Adaptations
Some solutions achieve a constant-memory profile not (only) by compressing the KV cache, but through systemic or architectural memory management. MIRAGE (Li et al., 15 Jul 2025) dynamically repurposes preallocated GPU parameter memory (from inactive models or temporarily unused layers) for the KV cache, with a layer-selection protocol ensuring that forced reallocation does not impede computation. This strategy, optimized for high-bandwidth interconnects (e.g., NVIDIA GH200), yields substantially lower latency and higher throughput in multi-tenant LLM serving.
8. Plug-and-Play and Orthogonal Methods
Recent innovations emphasize modularity: KVCrush (Jha et al., 24 Feb 2025) introduces a binary, head-behavior-based token representation with low-overhead Hamming-distance grouping, operating in conjunction with other approaches such as mixed precision. CommonKV (Wang et al., 22 Aug 2025) employs SVD-based cross-layer parameter sharing to align latent representations, with adaptive budget allocation based on cache similarity; being orthogonal to quantization and eviction strategies, its integration with them achieves high compression ratios with limited loss.
Other frameworks include PQCache (Zhang et al., 1 Jul 2024), which applies product quantization to keys in a two-phase (prefill and decoding) scheme, supporting approximate MIPS for token selection and overlapped CPU–GPU operation, and KVzip (Kim et al., 29 May 2025), which uses LLM-driven context reconstruction as a universal, query-agnostic importance metric for retention, with scoring performed in constant-memory chunked sweeps.
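For concreteness, a toy product-quantization encoder for keys is sketched below: keys are split into subvectors, each subvector space gets a small k-means codebook, and every key is then stored as one byte per subvector. This illustrates product quantization in general (function names and defaults are assumptions), not PQCache's actual pipeline, which additionally handles approximate MIPS retrieval and CPU–GPU overlap.

```python
import torch

def pq_train(keys, n_subvectors=8, n_centroids=256, iters=10):
    """Learn per-subspace codebooks with a few k-means steps. keys: [n, dim]."""
    n, dim = keys.shape
    n_centroids = min(n_centroids, n)                   # guard for very short prompts
    sub = keys.reshape(n, n_subvectors, dim // n_subvectors)
    codebooks = []
    for s in range(n_subvectors):
        x = sub[:, s]
        c = x[torch.randperm(n)[:n_centroids]].clone()  # initialize centroids from data
        for _ in range(iters):
            assign = torch.cdist(x, c).argmin(dim=1)
            for k in range(n_centroids):
                pts = x[assign == k]
                if len(pts):
                    c[k] = pts.mean(dim=0)
        codebooks.append(c)
    return codebooks

def pq_encode(keys, codebooks):
    """Replace each full-precision key with one codebook index (one byte) per subvector."""
    n, dim = keys.shape
    m = len(codebooks)
    sub = keys.reshape(n, m, dim // m)
    codes = torch.stack(
        [torch.cdist(sub[:, s], codebooks[s]).argmin(dim=1) for s in range(m)], dim=1
    )
    return codes.to(torch.uint8)   # assumes at most 256 centroids per subspace
```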
Conclusion
Constant-memory KV cache is an active field of research at the intersection of quantization, pruning, selection, approximation, and systems engineering. No single method dominates: mixed-precision, sparse coding, adaptive selection, and architectural strategies can be co-designed or composed for diverse model deployments. Current methods demonstrate the feasibility of reducing KV cache memory usage by an order of magnitude or more with minimal context or accuracy loss, enabling long-context and batch-intensive LLM inference on commodity hardware and representing a foundational advance in the tractability and scalability of Transformer models for natural language and vision.