KV Cache Growth Control in Transformers

Updated 27 March 2026

KV cache growth control is a system of algorithms and strategies that manage the space complexity of key-value caches in Transformer-based models.
It employs methods such as token-selective retention, sliding window, and reinforcement learning to balance memory usage and throughput during autoregressive decoding.
Advanced techniques including quantization, compression, and hybrid block-wise approaches enable efficient handling of long-context generation in LLMs and multimodal models.

KV cache growth control refers to the set of algorithms, mechanisms, and system strategies designed to bound, compress, or dynamically manage the space complexity of the key-value (KV) cache in Transformer-based models. As context lengths scale into tens or hundreds of thousands of tokens for language, vision, and multimodal models, the linear accumulation of KV pairs during autoregressive decoding becomes the dominant memory and computational bottleneck. Effective KV cache growth control is essential for practical long-context generation, throughput optimization, and resource-efficient inference across modern large models. The following sections present major paradigms, methodologies, theoretical tools, system implementations, and empirical results in KV cache growth control as established by the recent literature.

1. Growth Bottlenecks and Theoretical Framework

The KV cache in Transformer architectures accumulates a new set of key and value vectors per token, per head, and per layer, so memory usage scales as $O(L \cdot d \cdot N)$ for a context of $L$ tokens, hidden size $d$, and $N$ attention heads or layers. In unified autoregressive video models, a single 48-frame $384 \times 672$ video can generate more than 50K tokens, roughly an order of magnitude above the typical training window of 4K–8K, so end-to-end inference is dominated by attention over a massive KV cache (Li et al., 7 Jan 2026). For LLMs, this linear growth can quickly exceed the memory occupied by the model parameters and, if uncontrolled, results in either out-of-memory errors or severe throughput degradation (Liu et al., 8 Aug 2025, Li et al., 7 Jan 2026, Kampeas et al., 6 Jan 2026).

Memory scaling formula:

$M_{\mathrm{orig}}(L) = L \cdot d \cdot N \cdot B$

($B$ = bytes per element; e.g., $B=2$ for FP16 and $B=4$ for FP32. Counting keys and values separately contributes a further constant factor of 2.)

Performance depends on both the raw memory footprint and the memory bandwidth needed to load the growing K/V matrices at every generation step.
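As a rough illustration of the formula above, the following sketch evaluates $M_{\mathrm{orig}}(L)$ for a few context lengths. The model shape (hidden size 4096, 32 layers, FP16) is a hypothetical example rather than a configuration taken from the cited papers, and the `kv_factor` argument is an added assumption that lets the caller count keys and values separately.

```python
def kv_cache_bytes(seq_len, hidden_size, num_layers, bytes_per_elem=2, kv_factor=1):
    """Evaluate M_orig(L) = L * d * N * B, optionally scaled by kv_factor.

    kv_factor=2 counts keys and values separately; this constant is an
    assumption added here and is not part of the formula in the text.
    """
    return seq_len * hidden_size * num_layers * bytes_per_elem * kv_factor


# Hypothetical model shape: d = 4096, N = 32 layers, FP16 cache entries.
for context_len in (4_096, 50_000, 128_000):
    size = kv_cache_bytes(context_len, 4096, 32, bytes_per_elem=2, kv_factor=2)
    print(f"L={context_len:>7}: {size / 2**30:.2f} GiB")
```

Because the expression is linear in `seq_len`, every doubling of the context doubles both the stored bytes and the bytes that must be read per decoding step.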

2. Token-Selective and Budgeted Retention Strategies

A core class of methods enforces a global or per-layer hard budget—either as a total token count or per-interval allocation—via token selection and importance-based eviction. These may be static, requiring no model or parameter updates, or adaptive, depending on runtime statistics:

Heavy-hitter tracking: Methods such as H2O and SnapKV-D maintain cumulative attention scores per token and greedily evict the tokens that receive the least attention (Liu et al., 12 Dec 2025). SnapKV-D generalizes the SnapKV prefill scheme to long decoding: at a fixed token interval, it selects the cache entries with the highest cumulative windowed attention, preserving the task-critical "heavy hitters." These methods dominate on reasoning tasks, with SnapKV-D and H2O outperforming alternatives on GSM8K, MATH500, and similar benchmarks across the cache budgets evaluated (Liu et al., 12 Dec 2025).
Sliding window: The EvictOldest/FIFO strategy keeps only a fixed-size window of the most recent tokens, which ensures a contiguous cache layout and preserves positional fidelity at the cost of discarding long-term context (Poudel, 23 Oct 2025).
Hybrid block-wise and anchor retention: LASER-KV implements a protection-divisor budgeting scheme, maintaining both a global anchor region (primacy block) and local sliding window per block, with long-term recall budgeted via Exact-LSH selection. This avoids the positional disruption and semantic loss typical in pure recency or attention-based methods, ensuring stable recall at 128k context (Sood et al., 2 Feb 2026).
RL-based adaptive eviction: KV Policy (KVP) agents are trained (offline) via reinforcement learning to rank tokens by estimated future utility, achieving state-of-the-art adaptive budgeted eviction with low overhead and strong generalization to unseen domains (Moschella et al., 10 Feb 2026).
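The heavy-hitter and sliding-window ideas in the list above can be combined in a few lines. The sketch below is a generic illustration, not the H2O or SnapKV-D implementation: the function name, the tie-breaking rule, and the always-kept recency window are assumptions made for the example.

```python
def select_kept_positions(cum_attention, budget, recent_window):
    """Return the sorted cache positions to keep under a hard token budget.

    cum_attention: cumulative attention score per cached token
                   (index = position in the cache, oldest first).
    budget:        maximum number of tokens to keep.
    recent_window: number of most recent tokens that are always kept.
    """
    n = len(cum_attention)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))
    older = range(n - recent_window)
    # Greedy heavy-hitter choice: highest cumulative attention first;
    # ties are broken in favour of the more recent token (an assumption).
    n_heavy = budget - recent_window
    heavy = sorted(older, key=lambda i: (cum_attention[i], i), reverse=True)[:n_heavy]
    return sorted(heavy) + recent


scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.4, 0.5, 0.6]
print(select_kept_positions(scores, budget=4, recent_window=2))  # [0, 2, 6, 7]
print(select_kept_positions(scores, budget=3, recent_window=3))  # [5, 6, 7]
```

Setting `recent_window` equal to `budget` reduces the policy to a pure sliding window, while `recent_window=0` gives pure heavy-hitter retention.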

3. Quantization, Compression, and Merging

To address dimensionality and redundancy, several studies introduce lossy and lossless compression layers, quantization, or KV merging schemes:

Quantization: PackKV applies per-token, low-bit quantization and bit-packing to both K and V, reducing the memory footprint of each, and fuses decompression with the attention mat-vec computation to raise throughput (Jiang et al., 30 Dec 2025). VQKV applies vector quantization through codebooks, representing each vector by integer indices and reconstructing it at decode time, which saves memory while retaining accuracy (Wang et al., 17 Mar 2026).
Dimensional compression and KV reuse: KV-CAR compresses K/V via per-layer autoencoders and provides head-wise reuse, storing only distinct representations when similarity exceeds a high threshold, which reduces total cache memory on benchmark LLMs (Roy et al., 7 Dec 2025).
Merging with compensation: KeepKV merges less-important cache entries (as determined via similarity) into preserved entries, recording electoral votes and mathematically adjusting attention scoring to achieve zero perturbation at the current step. This outperforms naive merging or pruning, maintaining generation quality even at tight cache budgets while also improving throughput (Tian et al., 14 Apr 2025). ZSMerge merges residual tokens into fixed slots, compensating attention mass and demonstrating high compression with minimal impact on quality or throughput at long context lengths in LLaMA2-7B (Liu et al., 13 Mar 2025).
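To make the per-token quantization and bit-packing idea concrete, here is a minimal sketch using plain uniform (min/max) quantization to 4 bits with two codes per byte. It is an illustration under stated assumptions and not PackKV's actual encoding; all function names are hypothetical, and a real kernel would fuse the unpacking step into the attention computation.

```python
def quantize_token(vec, bits=4):
    """Uniformly quantize one key or value vector to `bits`-bit integer codes."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in vec]
    return codes, lo, scale


def pack_4bit(codes):
    """Pack pairs of 4-bit codes (values 0..15) into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count without mutating the input
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_4bit(packed, n):
    """Recover the first n 4-bit codes from a packed byte string."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:n]


def dequantize_token(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


vec = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_token(vec)
packed = pack_4bit(codes)
restored = dequantize_token(unpack_4bit(packed, len(vec)), lo, scale)
print(len(packed), "bytes instead of", 2 * len(vec), "bytes in FP16")
```

The per-token `lo` and `scale` values are stored alongside the packed codes, so the effective saving is somewhat smaller than the raw 4x reduction relative to FP16.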

4. Vision, Video, and Multimodal Models

KV cache growth is particularly severe in models that handle long sequences in vision and video:

Spatiotemporal decay and anchoring: PackCache, deployed for unified video models, allocates persistent budget quotas to "semantic anchors" (prompt/image conditions) and applies exponentially decaying budgets over previous frames, guided by empirically observed attention decay (Li et al., 7 Jan 2026). Frames are compacted or evicted according to their temporal lag, while a spatial positional rebase maintains a coherent 3D RoPE structure. PackCache reports both faster inference and lower memory use than the baseline in 48-frame video generation.
Multi-scale and layer-aware methods: AMS-KV and ScaleKV, designed for visual autoregressive transformers and next-scale prediction architectures, segment layers into "drafters" (high cache demand, broad attention) and "refiners" (low cache demand, local attention) (Xu et al., 20 Nov 2025, Li et al., 26 May 2025). Early, coarse scales ("condensed scales") are always cached, and budgets are dynamically assigned via cross-scale KV similarity or per-layer selectivity indices. The resulting memory reductions allow larger batches and higher throughput.
Multimodal frequency analysis: FlashCache ranks KV pairs by their deviation in the frequency domain rather than by attention scores, preserving "outlier" tokens that are critical for inference. Dynamic per-layer budgets are allocated according to high-frequency "energy," yielding KV memory savings and a decoding speedup in multimodal models (Yang et al., 20 Nov 2025).
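The anchor-plus-decay budgeting described for video models can be sketched as a simple schedule. This is a hedged illustration only: the geometric decay rate, the budget sizes, and the function name are hypothetical and are not the values or API used by PackCache.

```python
def frame_budgets(num_frames, anchor_budget, newest_frame_budget, decay=0.5, min_budget=0):
    """Assign token budgets: a fixed quota for the semantic anchors plus a
    per-frame quota that shrinks geometrically with temporal lag
    (lag 0 is the most recent frame)."""
    per_frame = [
        max(min_budget, int(newest_frame_budget * decay ** lag))
        for lag in range(num_frames)
    ]
    return {"anchor": anchor_budget, "frames_by_lag": per_frame}


plan = frame_budgets(num_frames=4, anchor_budget=256, newest_frame_budget=1024)
print(plan["frames_by_lag"])  # [1024, 512, 256, 128]
print(plan["anchor"] + sum(plan["frames_by_lag"]))  # total tokens kept: 2176
```

With a decay factor below 1 the per-frame budgets form a geometric series, so the total cache size stays bounded however many frames are generated.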

5. System-Level and Enterprise Solutions

Scaling KV cache beyond a single GPU or across distributed inference necessitates advanced memory management:

LMCache: This enterprise-scale caching layer manages KV storage, movement, and orchestration across GPU, CPU, and storage layers (Cheng et al., 8 Oct 2025). A first-class API exposes pinning, lookup, clear, move, and compress operations, supporting features such as watermark-based eviction, adaptive offloading, reference counting, and hybrid device placement. In combination with vLLM, LMCache achieves throughput gains and robust, steady-state memory utilization under high concurrency.
Joint encoding for high-concurrency serving: Joint Encoding (Fast-Fusion) fuses similar KV-cache blocks (across requests or input chunks) using high-threshold cosine similarity. Blocks whose similarity exceeds the threshold are merged, which shrinks memory and substantially boosts throughput without custom hardware or kernel modifications (Kampeas et al., 6 Jan 2026).
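The similarity-based fusion of KV blocks can be pictured with a small greedy sketch. Everything here is an assumption for illustration: the threshold of 0.98, the choice to keep the first block of each group as its representative, and the function names are hypothetical, and the published method may select or combine representatives differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse_blocks(blocks, threshold=0.98):
    """Greedy fusion: map each flattened block to the first stored
    representative with cosine similarity >= threshold; otherwise store
    the block as a new representative."""
    representatives = []
    mapping = []  # mapping[i] = index of the representative used by block i
    for block in blocks:
        for idx, rep in enumerate(representatives):
            if cosine(block, rep) >= threshold:
                mapping.append(idx)
                break
        else:
            representatives.append(block)
            mapping.append(len(representatives) - 1)
    return representatives, mapping


blocks = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [1.0, 0.001]]
reps, mapping = fuse_blocks(blocks)
print(len(reps), mapping)  # 2 [0, 0, 1, 0]
```

Only the representatives and the integer mapping need to stay resident, so the saving grows with the number of near-duplicate blocks shared across requests.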

6. Task-Adaptive and Semantic-Aware Policies

Task and data domain critically influence optimal cache management:

Task-aware compression: DynamicKV adaptively redistributes per-layer budgets based on observed cross-layer activation and task-driven attention patterns, enabling cache retention as low as 1.7% while retaining a large share of full-cache accuracy on LongBench (Zhou et al., 2024).
Semantic- and segment-aware compression: SABlock segments text into linguistically coherent units, then applies adaptive, budget-driven block-size selection to minimize semantic fragmentation. This preserves retrieval accuracy on Needle-in-a-Haystack with a reduced number of cache entries and lowers peak memory at 128k context (Chen et al., 26 Oct 2025).
Conversational context via episodic partitioning: EpiCache clusters conversation history into topic-based episodes, with block-wise prefill eviction and adaptive budget allocation according to layer sensitivity, yielding memory reductions and speedups in long conversational QA (Kim et al., 22 Sep 2025).
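Several of the policies above share one step: splitting a global token budget across layers according to some per-layer score. The sketch below shows a generic proportional split with a guaranteed minimum per layer. The scoring signal, the minimum, and the rounding rule are assumptions for illustration and do not reproduce the exact allocation used by DynamicKV or EpiCache.

```python
def allocate_layer_budgets(layer_scores, total_budget, min_per_layer=1):
    """Split total_budget across layers in proportion to layer_scores,
    guaranteeing each layer at least min_per_layer tokens."""
    n = len(layer_scores)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")
    spare = total_budget - n * min_per_layer
    total_score = sum(layer_scores)
    if total_score <= 0:
        shares = [spare / n] * n  # fall back to a uniform split
    else:
        shares = [spare * s / total_score for s in layer_scores]
    budgets = [min_per_layer + int(share) for share in shares]
    # Hand out the tokens lost to rounding, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


print(allocate_layer_budgets([1.0, 3.0], total_budget=10))  # [3, 7]
```

The scores could come from any sensitivity measure a method defines, such as how concentrated a layer's attention is on a small set of tokens.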

7. Preservation of Structural Constraints and Practical Guidelines

Multiple studies document the dangers of unprincipled or position-disruptive eviction:

Architectural context limits and positional fidelity: Accumulated KV length must not exceed the model's pretrained context window. Non-contiguous pruning (e.g., attention-score-top pruning) can scramble positional encoding (RoPE), sharply degrading output coherence, even with high retention ratios. SlidingWindowGist, which preserves initial contiguous context blocks, offers better coherence at a fraction of the memory (Poudel, 23 Oct 2025).
Calibration: Thresholds for memory use, retention, and block size should be set with both model and task requirements in mind, and fidelity metrics (e.g., positional disruption, attention loss) monitored for semantic drift (Poudel, 23 Oct 2025, Sood et al., 2 Feb 2026).

8. Quantitative Results and Comparative Metrics

A non-exhaustive summary of empirical performance:

Method	Compression Ratio	Speedup	Accuracy/Fidelity Drop	Domain
PackCache	[…]–[…] end-to-end, up to […] tail	up to […]	[…] 0.2 FID	video (Li et al., 7 Jan 2026)
AMS-KV	[…] ([…] reduction)	[…] latency	[…] 2% FID	visual autoregressive (Xu et al., 20 Nov 2025)
KVCrush	[…]	[…] latency	[…] average accuracy	LLM (Jha et al., 24 Feb 2025)
VQKV	[…] memory reduction	[…] longer context	[…] 1% avg. accuracy	LLaMA (Wang et al., 17 Mar 2026)
SABlock	[…] memory	[…] speed (128k ctx)	[…] 1% score drop	LLM (Chen et al., 26 Oct 2025)
Joint Encoding	[…]	[…] throughput	[…] 1%	LLM (Kampeas et al., 6 Jan 2026)
FlashCache	[…] memory	[…] decode	[…] 0.2% accuracy	multimodal (Yang et al., 20 Nov 2025)
EpiCache	[…]–[…]	[…]	[…] 90% full accuracy retained	multi-turn QA (Kim et al., 22 Sep 2025)
DynamicKV	[…] (1.7%)	–	[…] full accuracy	LLM (Zhou et al., 2024)

These results indicate that effective KV cache growth control can deliver large memory savings and throughput improvements while keeping quality close to full-cache baselines, provided it is parameterized and calibrated appropriately for the task and architecture.