
Constant-Memory KV Cache Methods

Updated 30 September 2025
  • Constant-memory KV cache refers to techniques that bound GPU memory usage in Transformer inference by compressing or re-architecting stored key-value pairs so that the cache footprint is independent of sequence length.
  • Methods such as mixed-precision quantization and importance-aware token retention achieve up to 20× memory reduction while preserving model accuracy and contextual integrity.
  • Advanced strategies combine adaptive token selection, sparse representations, and system-level memory reallocation to enhance throughput and enable long-context, high-throughput inference.

Constant-memory KV cache refers to techniques for bounding the GPU or system memory footprint of the key-value (KV) cache used in Transformer-based LLMs during inference, such that memory usage remains effectively independent of sequence length or input context size. This problem is motivated by the observation that, in standard autoregressive inference, the KV cache grows linearly with the number of input and generated tokens and often dominates memory consumption, representing a severe bottleneck particularly for long-context and high-throughput inference. Recent research has produced a diversity of algorithmic, systems, and architectural approaches that tightly compress, prune, or re-architect the KV cache to provide constant or nearly constant memory usage, with minimal (or zero) accuracy trade-off even at aggressive compression rates.

1. Motivation and Challenges in KV Cache Compression

The KV cache is central to efficient Transformer inference: by storing the per-token key and value activations of every attention layer, it avoids recomputing past activations and keeps the cost of generating each new token linear, rather than quadratic, in the context length. However, for long sequences, accumulated KV pairs consume substantial memory—often exceeding the model weights themselves—severely limiting achievable context length or batch size on accelerators with fixed memory budgets (Yang et al., 28 Feb 2024, Zhang et al., 4 Dec 2024, Zhang et al., 16 Dec 2024). Traditional mitigation strategies—such as discarding past tokens, static quantization, or sliding attention windows—introduce context loss, hallucinations, and degraded accuracy, with especially pronounced deficits in safety-critical or context-sensitive applications (Yang et al., 28 Feb 2024). The core challenge is to decouple cache size from context length while maintaining high-fidelity context for all relevant tasks.

2. Mixed-Precision Quantization and Importance-Aware Retention

Several recent methodologies address the constant-memory goal using mixed-precision quantization and importance-aware token retention. MiKV (Yang et al., 28 Feb 2024) employs an adaptive scheme where critical KV pairs (as determined by attention-based importance scoring) are stored at high precision (e.g., FP16 or 8-bit), while unimportant pairs—typically destined for eviction—are retained at lower precision (e.g., INT4 or INT2). This approach ensures that even "evicted" tokens continue to convey partial contextual information, leading to minimal quality degradation. The quantization is performed via asymmetric per-token formulas, with careful outlier handling using per-channel balancing:

\hat{x} = I(x) = \alpha \left\lfloor\frac{x - \beta}{\alpha}\right\rfloor + \beta, \quad \alpha = \frac{\max(x) - \min(x)}{2^N - 1}, \quad \beta = \min(x)

This regime bounds total cache growth, enabling constant memory operation at compression ratios as low as 20–25% of the full cache size, with accuracy and contextuality nearly indistinguishable from uncompressed baselines.
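A minimal PyTorch sketch of this asymmetric per-token quantization is shown below; the function names, the [tokens, channels] layout, and the choice of bit-widths are illustrative assumptions rather than MiKV's reference implementation. In a mixed-precision cache, the same routine would simply be called with a higher bit-width for KV pairs flagged as important and a lower one for the rest.

```python
import torch

def quantize_per_token(x: torch.Tensor, n_bits: int):
    """Asymmetric per-token quantization following the formula above (one scale/zero-point per row)."""
    beta = x.min(dim=-1, keepdim=True).values                  # zero point: per-token minimum
    alpha = (x.max(dim=-1, keepdim=True).values - beta) / (2 ** n_bits - 1)
    alpha = alpha.clamp(min=1e-8)                               # guard against constant rows
    codes = torch.floor((x - beta) / alpha)                     # integer codes in [0, 2^N - 1]
    return codes.to(torch.uint8), alpha, beta

def dequantize_per_token(codes, alpha, beta):
    return codes.float() * alpha + beta

# Example: keep "important" keys in FP16, push the rest down to 2-bit codes.
keys = torch.randn(128, 64, dtype=torch.float16)                # 128 cached tokens, head dim 64
codes, alpha, beta = quantize_per_token(keys.float(), n_bits=2)
keys_hat = dequantize_per_token(codes, alpha, beta)             # approximate reconstruction
```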

LeanKV (Zhang et al., 4 Dec 2024) further refines this by introducing Hetero-KV quantization (allocating higher precision to keys than values) and per-head adaptive sparsity, while integrating on-GPU memory management to support fine-grained mixed-precision layouts, thereby offering 2.7× to 5.7× compression at near-lossless performance.

3. Personalized and Adaptive Cache Budgeting

Constant memory usage can also be attained via dynamic, non-uniform allocation of cache resources according to layer- or head-specific metrics. XKV (Li et al., 8 Dec 2024) observes pronounced variance in the retention importance of KV pairs across network layers—a phenomenon quantified using per-layer attention score vectors and the metric

R_i = \frac{\sum \mathrm{Topk}(n_i, w_i)}{\sum w_i} \times 100\%

where $n_i$ is the retained KV pair count for layer $i$ and $w_i$ its attention score vector. By formulating cache allocation as a combinatorial optimization, XKV assigns personalized retention budgets per layer, solving via a greedy allocation that maximizes overall contextual retention under a global memory constraint. The result is a mean 61.6% reduction in KV cache size and up to a 5.2× throughput boost on long-context benchmarks.
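For concreteness, the sketch below greedily distributes a global budget of retained KV pairs across layers, always giving the next slot to the layer whose next-most-attended token adds the most normalized attention mass; the heap-based loop and the normalization are expository assumptions, not XKV's exact algorithm.

```python
import heapq
import numpy as np

def allocate_layer_budgets(layer_weights, total_budget, min_per_layer=1):
    """layer_weights: list of 1-D arrays of per-token attention mass, one array per layer.
    Returns the number of KV pairs retained per layer under a global budget."""
    # Sort each layer's weights descending so the marginal gain of one more retained pair
    # is simply the next-largest weight, normalized by that layer's total attention mass.
    sorted_w = [np.sort(w)[::-1] / w.sum() for w in layer_weights]
    budgets = [min_per_layer] * len(layer_weights)
    heap = [(-w[b], i) for i, (w, b) in enumerate(zip(sorted_w, budgets)) if b < len(w)]
    heapq.heapify(heap)
    remaining = total_budget - sum(budgets)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)        # layer with the largest marginal gain
        budgets[i] += 1
        remaining -= 1
        if budgets[i] < len(sorted_w[i]):
            heapq.heappush(heap, (-sorted_w[i][budgets[i]], i))
    return budgets

budgets = allocate_layer_budgets([np.random.rand(512) for _ in range(32)], total_budget=4096)
```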

BaKlaVa (Gulhan et al., 18 Feb 2025) generalizes this approach to head-level granularity, using a one-time attention-profile driven budget search that reallocates saved memory from low-importance heads/layers to high-importance ones, yielding up to a 70% compression ratio without baseline performance loss.

CAKE (Qin et al., 16 Mar 2025) introduces a "cake-slicing" methodology, assigning cache budgets in proportion to each layer’s preference

\mathcal{P} = \mathcal{H}^{1/\tau_1} \, \mathcal{V}^{1/\tau_2}

where $\mathcal{H}$ and $\mathcal{V}$ are the attention entropy and variance (spatial and temporal dispersion). Guided by these metrics, CAKE adaptively slices the global memory budget and performs cascading eviction while using a novel mean-variance eviction indicator to support dynamic, context-sensitive retention.
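The snippet below sketches one way such preference-driven slicing could be computed in PyTorch; the specific definitions of entropy and variance over the attention tensor, and the temperature values, are assumptions made for illustration.

```python
import torch

def layer_preference(attn: torch.Tensor, tau1: float = 1.0, tau2: float = 1.0) -> float:
    """attn: [heads, queries, keys] attention probabilities for one layer."""
    entropy = -(attn * attn.clamp(min=1e-9).log()).sum(-1).mean()    # spatial dispersion H
    variance = attn.mean(0).var(dim=0).mean()                         # temporal dispersion V across query steps
    return (entropy ** (1.0 / tau1) * variance ** (1.0 / tau2)).item()

def slice_budget(preferences, total_budget: int):
    """Divide the global cache budget across layers in proportion to their preference scores."""
    p = torch.tensor(preferences, dtype=torch.float32)
    return (total_budget * p / p.sum()).round().long()

prefs = [layer_preference(torch.softmax(torch.randn(8, 32, 256), dim=-1)) for _ in range(4)]
print(slice_budget(prefs, total_budget=1024))
```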

4. Sparse and Approximate Representations

A distinct approach to constant-memory caching transforms the dense KV tensor into sparse or compressed forms. CSR (Zhang et al., 16 Dec 2024) constructs a sparse representation using Matching Pursuit over a learned dictionary $D$:

x \approx D \cdot r(x, D, s)

with $r$ constrained to $s$ non-zero coefficients, learned using NeuralDict. Only indices and coefficients are stored, reducing memory to as little as $\sim 1$ bit per channel, approaching the theoretical compressibility limit for practical context lengths. Inference reconstructs approximate keys and values on demand, preserving accuracy at extreme compression rates.
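A self-contained matching-pursuit sketch of the sparse coding step is given below; the random dictionary and the dimensions are placeholders for illustration, whereas the method above learns the dictionary (NeuralDict).

```python
import torch

def matching_pursuit(x: torch.Tensor, D: torch.Tensor, s: int) -> torch.Tensor:
    """x: [d] vector; D: [d, m] dictionary with unit-norm columns. Returns codes r with at most s non-zeros."""
    r = torch.zeros(D.shape[1])
    residual = x.clone()
    for _ in range(s):
        scores = D.T @ residual                     # correlation of the residual with each atom
        j = scores.abs().argmax()                   # pick the best-matching atom
        r[j] += scores[j]
        residual = residual - scores[j] * D[:, j]   # remove that atom's contribution
    return r

d, m, s = 64, 256, 4
D = torch.nn.functional.normalize(torch.randn(d, m), dim=0)
x = torch.randn(d)
r = matching_pursuit(x, D, s)
x_hat = D @ r                                       # on-demand reconstruction at inference time
```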

Streaming and subsampling-based approximations, exemplified by BalanceKV (Han et al., 11 Feb 2025), use discrepancy theory to subsample a balanced subset of key-value states via recursive streaming merge-and-reduce:

\|z - \mathrm{Attn}(q, K, V)\|_2 \leq \epsilon \, \|\mathrm{softmax}(Kq/\sqrt{d})\|_2 \, \|V\|_F

Such techniques rigorously bound the approximation error while capping memory usage sublinearly in sequence length.

5. Token Selection, Merging, and Residual Techniques

A spectrum of methods compresses the KV cache by contextually selecting and aggregating tokens. MorphKV (Ghadia et al., 2 Mar 2025) maintains a constant-size cache by dynamically partitioning tokens into a fixed window of recent tokens and a selection of correlated distant tokens ranked using recent attention scores:

G_{i+1} = \mathrm{Top}_C(F_i) \cup \{\text{R recent tokens}\}

This adaptive correlation-aware retention avoids early-token bias, achieving average savings of over 50% without context degradation.
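The selection rule can be sketched as follows, assuming per-token scores are obtained by summing the attention that the R most recent queries place on each distant token; that aggregation choice and the function name are illustrative.

```python
import torch

def select_tokens(attn_recent: torch.Tensor, R: int, C: int) -> torch.Tensor:
    """attn_recent: [R, n] attention from the R most recent queries over all n cached tokens.
    Returns indices of tokens to keep, giving a constant-size cache of at most R + C entries."""
    n = attn_recent.shape[1]
    recent_idx = torch.arange(n - R, n)                  # always keep the recent window
    scores = attn_recent[:, : n - R].sum(0)              # correlation of distant tokens with recent queries
    distant_idx = scores.topk(min(C, n - R)).indices     # top-C correlated distant tokens
    return torch.cat([distant_idx, recent_idx])
```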

ZeroMerge (ZSMerge; Liu et al., 13 Mar 2025) combines head-level token importance budgeting, residual merging (token fusion by dot-product compatibility and momentum), and compensated attention scoring:

\tilde{a}_t(T) = \exp(q^\top k_t / V_d + \alpha \log w_t)

enabling memory reductions of 20× (to 5% of the original cache) at constant performance and improved throughput.
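A hedged sketch of such compensated scoring is shown below, interpreting $V_d$ as the usual $\sqrt{d}$ scaling and treating $w_t$ as the number of original tokens fused into each cache slot; both readings, and the value of $\alpha$, are assumptions.

```python
import torch

def compensated_attention(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                          w: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """q: [d]; K, V: [n, d]; w: [n] count of original tokens merged into each slot."""
    d = q.shape[-1]
    logits = K @ q / d ** 0.5 + alpha * w.log()   # log-space compensation for merged slots
    probs = torch.softmax(logits, dim=0)
    return probs @ V                              # attention output over the compressed cache
```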

KeepKV (Tian et al., 14 Apr 2025) eliminates the “attention sag” of convex merging by recording merging history (electoral votes) and applying zero-perturbation merging with adaptive scaling:

o_t = \frac{\sum_i p_i s_i^t v_i}{\sum_i p_i s_i^t}

The merge weights are proportional to prior attention and votes, yielding exact output preservation at each step.
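The merged output can be written directly from the formula; the sketch below assumes p holds the recorded vote counts and s the current step's attention scores for the entries being fused.

```python
import torch

def keep_kv_output(p: torch.Tensor, s: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """p, s: [n] vote counts and attention scores; v: [n, d] value vectors.
    Returns the merged output o_t, weighted proportionally to p_i * s_i."""
    w = p * s
    return (w.unsqueeze(-1) * v).sum(0) / w.sum()
```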

GraphKV (Li et al., 30 Aug 2025) advances the field by recasting token retention as a graph signal-propagation problem, penalizing redundancy through iterative cosine-similarity decay:

s_j^{(t)} = s_j^{(t-1)} \prod_{o_i \in N(o_j)} (1 - e_{ij})

Retained tokens are both high-importance and contextually diverse, with demonstrable gains under constant-memory budgets.
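One possible rendering of this decay step is sketched below; the k-nearest-neighbor neighborhood, the clamping of negative similarities, and the single iteration are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def decay_scores(scores: torch.Tensor, keys: torch.Tensor, k: int = 8) -> torch.Tensor:
    """scores: [n] token importance; keys: [n, d] key vectors. One round of similarity decay."""
    k = min(k, keys.shape[0] - 1)
    kn = F.normalize(keys, dim=-1)
    sim = kn @ kn.T                                    # cosine similarities e_ij
    sim.fill_diagonal_(0)
    neighbors = sim.topk(k, dim=-1).values             # neighborhood N(o_j)
    decay = (1 - neighbors.clamp(0, 1)).prod(dim=-1)   # penalize tokens with highly similar neighbors
    return scores * decay
```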

6. Quantization and Coding Optimized for KV Cache

Quantization-specific strategies, such as NQKV (Cai et al., 22 May 2025), exploit the normal-like distribution of KV cache blocks to perform information-theoretically optimal, blockwise quantile quantization in 4 bits ("NF4"):

I_K = \mathrm{quantize}_{\mathrm{NF4}}(X_K)

These methods, combined with blockwise dequantization at inference, realize 4× memory reductions with sub-percent accuracy loss, allowing for 2× larger batch sizes or 4× longer contexts within the same memory constraints.
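The sketch below illustrates the blockwise quantile idea: values in a block are scaled by the block's absolute maximum and snapped to 16 levels placed at quantiles of a standard normal. The level construction and block handling are simplified assumptions, not the NF4 reference code.

```python
import torch

# 16 levels placed at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1].
probs = torch.linspace(0.02, 0.98, 16)
levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
levels = levels / levels.abs().max()

def quantize_nf4_block(x: torch.Tensor):
    """x: [block_size] values from one KV cache block. Returns 4-bit codes and the block scale."""
    scale = x.abs().max().clamp(min=1e-8)
    idx = (x / scale).unsqueeze(-1).sub(levels).abs().argmin(-1)    # nearest-level lookup
    return idx.to(torch.uint8), scale

def dequantize_nf4_block(idx: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return levels[idx.long()] * scale

block = torch.randn(64)
codes, scale = quantize_nf4_block(block)
block_hat = dequantize_nf4_block(codes, scale)
```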

KVComp (Jiang et al., 30 Aug 2025) extends this direction by coupling error-controlled, token-wise quantization and GPU-optimized Huffman encoding. Critical architectural co-designs—such as cache-resident decompression and fused decoding-matrix multiplication—support up to 83% higher memory reduction and outperform standard attention kernels in total throughput.

7. Advanced System and Architectural Adaptations

Some solutions achieve a constant-memory profile not (only) by compressing the KV cache, but by systemic or architectural memory management. MIRAGE (Li et al., 15 Jul 2025) dynamically repurposes preallocated GPU parameter memory (from inactive models or temporarily unused layers) for KV cache, with a layer selection protocol ensuring forced reallocation does not impede computation:

T_T \times N \leq T_{\mathrm{Compute}}

This strategy, optimized for high-bandwidth interconnects (e.g., NVIDIA GH200), yields up to an 82.5% reduction in latency and 86.7% higher throughput in multi-tenant LLM serving.

8. Plug-and-Play and Orthogonal Methods

Recent innovations emphasize modularity: KVCrush (Jha et al., 24 Feb 2025) introduces a binary, head-behavior-based token representation and low-overhead Hamming grouping, operating in conjunction with other approaches such as mixed precision. CommonKV (Wang et al., 22 Aug 2025) employs SVD-based cross-layer parameter sharing to align latent representations, with adaptive budget allocation based on cache similarity, and is orthogonal to quantization and eviction strategies; combining them achieves up to 98% compression with limited loss.

Other frameworks, like PQCache (Zhang et al., 1 Jul 2024), apply product quantization to keys in a two-phase scheme (prefill and decoding), supporting approximate MIPS for token selection and overlapped CPU–GPU operation; KVzip (Kim et al., 29 May 2025) instead uses LLM-driven context reconstruction as a universal, query-agnostic importance metric for retention, with scoring performed in constant-memory chunked sweeps.
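As an illustration of the product-quantization idea applied to cached keys, the sketch below splits each key into sub-vectors and stores only the index of the nearest centroid per sub-space; the random centroids and dimensions are placeholders (in practice the codebooks would come from clustering during prefill).

```python
import torch

def pq_encode(keys: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """keys: [n, d]; centroids: [m, c, d//m] with m sub-spaces of c centroids each.
    Returns [n, m] uint8 codes (one centroid index per sub-space)."""
    n, d = keys.shape
    m, c, sub = centroids.shape
    parts = keys.view(n, m, sub)                                           # split each key into m sub-vectors
    dists = (parts.unsqueeze(2) - centroids.unsqueeze(0)).pow(2).sum(-1)   # [n, m, c] squared distances
    return dists.argmin(-1).to(torch.uint8)

def pq_decode(codes: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    m = centroids.shape[0]
    return torch.cat([centroids[j, codes[:, j].long()] for j in range(m)], dim=-1)

n, d, m, c = 512, 64, 8, 256
centroids = torch.randn(m, c, d // m)
codes = pq_encode(torch.randn(n, d), centroids)    # 8 bytes per key instead of 128 bytes in FP16
keys_hat = pq_decode(codes, centroids)             # approximate keys reconstructed on demand
```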

Conclusion

Constant-memory KV cache is an active field of research at the intersection of quantization, pruning, selection, approximation, and systems engineering. No single method dominates: mixed-precision, sparse coding, adaptive selection, and architectural strategies can be co-designed or composited for diverse model deployments. Current methods demonstrate the feasibility of reducing KV cache memory usage by 20× or more, with minimal context or accuracy loss—enabling long-context and batch-intensive LLM inference on commodity hardware, and representing a foundational advance in the tractability and scalability of transformer models for natural language and vision.
