
DynamicKV: Adaptive KV Cache Management

Updated 20 March 2026
  • DynamicKV is an adaptive key-value cache compression framework that dynamically retains important tokens for efficient long-context LLM inference.
  • It employs layer- and task-aware methods, including attention-based top-k retention, graph-based similarity, and semantic splitting to optimize compute and memory usage.
  • DynamicKV achieves high accuracy with significant memory compression, outperforming static methods by up to 57% in challenging scenarios.

DynamicKV refers to a set of adaptive Key-Value (KV) cache compression and management techniques for LLMs designed to address the limitations of static caching and compression paradigms. The primary objective of DynamicKV methods is to maximize memory and compute efficiency during long-context inference, while preserving model accuracy, by dynamically selecting, quantizing, or evicting cache entries based on importance metrics, task requirements, or input semantics. DynamicKV encompasses a broad spectrum of techniques, including layer- and task-adaptive token retention, graph-based cache management, semantic-aware retrieval, mixed-precision quantization, and dynamic splitting strategies, with each subsuming specific algorithmic and architectural advances.

1. Motivation and Limitations of Static KV Cache Compression

Traditional KV cache strategies for LLMs retain all past activations in memory, incurring $O(LN)$ memory overhead, where $L$ is the number of transformer layers and $N$ the sequence length. Static compression methods (e.g., StreamingLLM, SnapKV, PyramidKV, H2O) either fix a uniform cache size per layer or apply pre-defined, layer-dependent cache budgets. These static schemes cannot exploit the highly variable and task-dependent token importance distributions observed across layers and attention heads. Empirical studies reveal that layerwise retention needs differ sharply between summarization, code completion, QA, and multi-document retrieval: for example, summarization often shows pyramid-like decreasing cache needs, while code completion tasks exhibit a resurgence in middle and late layers (Zhou et al., 2024). Static retention squanders memory on nonessential tokens and fails to guarantee that each layer retains the most salient context for the task at hand. DynamicKV was proposed to address these inefficiencies through adaptive, data-driven KV retention strategies.

2. Task- and Layer-Aware Adaptive KV Retention

The DynamicKV framework (Zhou et al., 2024) dynamically optimizes token retention by allocating both global and per-layer KV budgets according to observed importance patterns during inference. Key principles and workflow:

  • Global Budgeting: Define a user-selected scaling ratio $r_\text{max}\in(0,1]$ and allocate a total KV budget $B=(\text{mean input length} - ws)\times r_\text{max}$, where $ws$ is the size of the always-retained window.
  • Attention-Based Top-K Retention: For each transformer layer $\ell$ and head $h$, compute pooled attention maps on recent tokens to identify the $\mathrm{TopK}$ attention scores; retain tokens accordingly.
  • Dynamic Reallocation: Periodically during prefill, concatenate historical attention scores, compute normalized retention counts $C_\ell$ per layer, and update each layer's KV buffer length $B_\ell$ by proportional allocation, i.e., $Z_\ell = \lfloor \frac{B\,C_\ell}{\max_m C_m}\rfloor$, $r = \sum_\ell Z_\ell/B$, $B_\ell = \lfloor Z_\ell/r\rfloor$.
  • Extreme Compression Regimes: Even at 1–2% KV retention, DynamicKV matches or substantially exceeds the performance of static baselines, especially in the Needle-in-a-Haystack setting. On LongBench, DynamicKV retains about 85% of full-cache performance at 1.7% retention, and it outperforms H2O, SnapKV, and PyramidKV by up to 57% under extreme compression (Zhou et al., 2024).

This technique empirically adapts to each input and task, ensuring maximal utility per retained token and robust performance under tight memory regimes.
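The proportional reallocation and window-protected top-k retention described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the per-layer counts $C_\ell$ and the pooled per-token scores have already been computed, and the function names, the even-split fallback, and the choice to protect the most recent tokens are assumptions made for the example.

```python
import math

def allocate_layer_budgets(counts, total_budget):
    """Split a total KV budget B across layers in proportion to counts C_l.

    Follows Z_l = floor(B * C_l / max_m C_m), r = sum(Z_l) / B,
    B_l = floor(Z_l / r). The fallbacks below are assumptions.
    """
    if total_budget <= 0:
        return [0] * len(counts)
    c_max = max(counts)
    if c_max == 0:
        # Assumed fallback: no layer has any top-ranked token, so split evenly.
        return [total_budget // len(counts)] * len(counts)
    z = [math.floor(total_budget * c / c_max) for c in counts]
    r = sum(z) / total_budget
    return [math.floor(z_l / r) for z_l in z]

def topk_with_window(scores, k, window):
    """Keep the k highest-scoring older tokens plus the last `window` tokens."""
    n = len(scores)
    split = max(0, n - window)
    older = range(split)
    top = sorted(older, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top) + list(range(split, n))

budgets = allocate_layer_budgets([4, 2, 2], total_budget=100)
kept = topk_with_window([0.1, 0.9, 0.2, 0.8, 0.3, 0.4], k=2, window=2)
```

With counts of 4, 2, and 2 and a budget of 100 entries, the layer with the most top-ranked tokens receives half of the budget and the other two layers a quarter each, so the per-layer sizes still sum to the global budget.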

3. Dynamic Importance Scoring and Graph-Based Methods

To further improve KV retention, DynamicKV-inspired frameworks such as GraphKV (Li et al., 30 Aug 2025) leverage token similarity structures and redundancy suppression:

  • Sparse Graph Construction: Each token is a node, initial importance scores (e.g., attention, $\ell_2$-norm) are assigned, and edges link top-K "source" nodes to other tokens via cosine similarity of key vectors.
  • Decay Signal Propagation: Importance propagates via a decay mechanism over $T$ rounds, either as the linear update

$$I^{(t+1)} = \alpha W I^{(t)} + (1-\alpha)s$$

or as elementwise multiplicative updates $s_j \leftarrow s_j \cdot (1 - w_{ij})$, which suppress tokens semantically similar to sources.

  • Dynamic Selection: Final token scores are computed post-propagation; the top-k are kept, balancing importance and diversity.
  • Plug-and-Play: GraphKV refines outputs of existing static methods without retraining, yielding empirical improvements up to 8 pp in accuracy with minimal or negative latency overhead under tight budgets (Li et al., 30 Aug 2025).

This graph-driven dynamic selection effectively avoids clusters of near-duplicate tokens and adapts to evolving context during inference.
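The multiplicative decay variant can be illustrated with a short plain-Python sketch. It is a simplified, single-round version under stated assumptions: the edge weight $w_{ij}$ is taken to be the cosine similarity between key vectors clamped to be non-negative, the sources are simply the highest-scoring tokens, and `decay_select` is a hypothetical name, not GraphKV's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def decay_select(keys, scores, num_sources, budget):
    """Suppress tokens similar to the top-scoring sources, then keep top-`budget`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sources = set(order[:num_sources])
    s = list(scores)
    for i in sources:
        for j in range(len(keys)):
            if j in sources:
                continue
            # Assumption: negative similarities do not boost a token's score.
            w_ij = max(0.0, cosine(keys[i], keys[j]))
            s[j] *= (1.0 - w_ij)  # s_j <- s_j * (1 - w_ij)
    kept = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:budget]
    return sorted(kept)

# Token 1 nearly duplicates token 0; token 2 points in a different direction.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
scores = [1.0, 0.9, 0.5]
kept = decay_select(keys, scores, num_sources=1, budget=2)
```

In this toy input, plain top-2 selection on the raw scores would keep tokens 0 and 1, which carry almost the same key vector. After decay, the near-duplicate is suppressed and the dissimilar token 2 is kept instead, which is the diversity effect the method aims for.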

4. Dynamic Semantic Splitting and Retrieval

DynamicKV approaches also encompass adaptive chunking and retrieval, exemplified by DynSplit-KV (Ye et al., 3 Feb 2026) and LouisKV (Wu et al., 13 Oct 2025):

  • Semantic-Aware Delimiter Selection: Identify candidate boundary tokens (punctuation, newlines) and compute attention-based importance scores for each delimiter using

$$s_i = \mathbb{E}_{l,h;\,q\in\mathcal{F}_i} \left[ \sum_{k\in\mathcal{O}_i} A^{(l,h)}_{q,k} - \alpha \sum_{k\in\mathcal{D}_i} A^{(l,h)}_{q,k} \right]$$

selecting the boundaries that maximize retention of relevant local context.

  • Variable-to-Fixed Mapping: Map variable-length semantic blocks (computed via importance-aligned segmentation) to fixed-length matrices for efficient block-level selection and parallel computation, reducing selection overhead by up to $4.9\times$ (Ye et al., 3 Feb 2026).
  • Semantic-Aware Retrieval Triggers: In LouisKV, per-token retrieval is replaced with retrieval at semantic boundaries, which are detected by thresholding the cosine similarity of consecutive queries. With threshold $\tau$ (e.g., 0.7) and $H$ attention heads,

$$r_t = \frac{1}{H} \sum_{h=1}^{H} \text{cosine}(q_{t-1}^h, q_t^h)$$

Retrieval is triggered only when $r_t < \tau$, reducing retrieval overhead by up to 85% and maintaining near-lossless accuracy with up to $4.7\times$ speedup over state-of-the-art methods (Wu et al., 13 Oct 2025).

These advancements optimize cache access and data transfer by exploiting the temporal and semantic structure of input/output sequences.
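The retrieval trigger can be sketched in a few lines of plain Python. This is an illustrative reading of the formula for $r_t$, not LouisKV's code: queries are assumed to be available as nested lists indexed by decoding step and head, and `retrieval_steps` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def mean_head_similarity(q_prev, q_curr):
    """r_t: cosine similarity of consecutive queries, averaged over heads."""
    sims = [cosine(a, b) for a, b in zip(q_prev, q_curr)]
    return sum(sims) / len(sims)

def retrieval_steps(queries, tau=0.7):
    """Return the decoding steps t at which r_t < tau, i.e. retrieval fires.

    queries[t][h] is the query vector of head h at step t.
    """
    return [
        t for t in range(1, len(queries))
        if mean_head_similarity(queries[t - 1], queries[t]) < tau
    ]

# One head, four steps: the query direction changes sharply at step 2.
queries = [
    [[1.0, 0.0]],
    [[1.0, 0.1]],
    [[0.0, 1.0]],
    [[0.0, 1.0]],
]
steps = retrieval_steps(queries, tau=0.7)
```

Only the step where the query direction changes sharply falls below the threshold, so a single retrieval is issued for the four decoding steps instead of one per token.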

5. Dynamic Budgeting and Performance Preservation

Static KV budgets fail to account for input or task variability. DBudgetKV (Ni et al., 24 Feb 2025) introduces a dynamic compression objective:

  • Performance-Bounded Pruning: Instead of pre-setting $|S|\leq B$, prune the least important tokens first and halt once the relative drop in the last-row attention norm $\|a_i\|_2$ for layer $i$ reaches a small threshold $t$:

$$\frac{F_i - F_i(S)}{F_i} \geq t$$

where $F_i$ is the norm over all positions and $F_i(S)$ is the norm computed on the retained positions $S$. Empirically, this halting rule is reported to preserve full-cache generation quality while adapting to varying task or context properties.

  • Empirical Results: DBudgetKV achieves average 25%–36% compression and matches or exceeds full-cache accuracy in Llama3, Qwen2.5, and Mistral models across QA, code, and summarization tasks. It robustly outperforms fixed-budget baselines and reduces memory and latency (Ni et al., 24 Feb 2025).

The dynamic budget approach lets the retained cache size follow the difficulty of each input and task rather than a budget fixed in advance.
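A minimal plain-Python sketch of the halting rule for a single layer follows. It makes several assumptions that are not taken from the paper: each token's importance is its own last-row attention weight, tokens are pruned one at a time, and the prune that would push the relative drop to the threshold is not applied. The name `prune_until_threshold` is hypothetical.

```python
import math

def prune_until_threshold(attn_row, t=0.01):
    """Prune low-attention positions until the relative norm drop would reach t.

    attn_row: last-row attention weights for one layer (list of floats).
    Returns the sorted indices of the retained positions S.
    """
    full_sq = sum(a * a for a in attn_row)
    full = math.sqrt(full_sq)  # F_i over all positions
    kept = set(range(len(attn_row)))
    if full == 0.0:
        return sorted(kept)
    kept_sq = full_sq
    # Assumption: importance equals the attention weight; prune smallest first.
    for i in sorted(kept, key=lambda idx: attn_row[idx]):
        new_sq = max(kept_sq - attn_row[i] ** 2, 0.0)
        drop = (full - math.sqrt(new_sq)) / full  # (F_i - F_i(S)) / F_i
        if drop >= t:
            break  # halting rule: stop before crossing the threshold
        kept.discard(i)
        kept_sq = new_sq
    return sorted(kept)

row = [0.6, 0.3, 0.05, 0.03, 0.02]
retained = prune_until_threshold(row, t=0.01)
```

On this toy row the three smallest weights contribute less than 1% of the norm, so they are pruned, while removing the next position would cost roughly 11% and stops the loop. The resulting cache size is an outcome of the input rather than a preset number.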

6. Dynamic Mixed-Precision and Quantization

DynamicKV is further extended to the mixed-precision compression domain to address throughput and memory bottlenecks:

  • Layer Importance Profiling: KVmix (Li et al., 18 May 2025) leverages gradient-based sensitivity analysis to compute layer-specific importance for Keys/Values by evaluating the $\ell_2$-norm of gradients w.r.t. projection matrices.
  • Adaptive Bit Allocation: Top-quantile important layers are assigned higher precision (3 or 4 bits); less critical layers are quantized aggressively (2 bits), while the recent pivotal context (RPC) is kept at full precision. At each decoding step, old tokens are quantized as needed and recent tokens are retained at higher precision.
  • Implementation: Efficient low-bit CUDA kernels support seamless fusion of quantization and attention operations. Tested on Llama and Mistral, KVmix provides near-lossless accuracy ($<1\%$ drop), $4.9\times$ memory compression, and $5.3\times$ throughput gains (Li et al., 18 May 2025).

Adaptive quantization harmonizes memory savings and accuracy via data-driven, layer-wise policies.
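The layer-wise bit allocation can be sketched as follows in plain Python. The sketch assumes that per-layer importance scores (for example, gradient norms from an offline profiling pass) are already available, and it uses simple min-max uniform quantization as a stand-in for the actual low-bit kernels. The function names, the `high_fraction` parameter, and the `recent` window size are illustrative assumptions.

```python
def assign_bits(importance, high_fraction=0.2, high_bits=4, low_bits=2):
    """Give the most important layers `high_bits`, all other layers `low_bits`."""
    n_high = max(1, round(len(importance) * high_fraction))
    ranked = sorted(range(len(importance)),
                    key=lambda l: importance[l], reverse=True)
    top = set(ranked[:n_high])
    return [high_bits if l in top else low_bits
            for l in range(len(importance))]

def quantize_dequantize(values, bits):
    """Uniform min-max quantization followed by dequantization (simulation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]

def compress_layer(values, bits, recent):
    """Quantize older entries; keep the last `recent` entries at full precision."""
    split = max(0, len(values) - recent)
    old = quantize_dequantize(values[:split], bits) if split > 0 else []
    return old + list(values[split:])

layer_importance = [0.1, 0.9, 0.3, 0.2, 0.05]   # hypothetical profiling output
bits_per_layer = assign_bits(layer_importance, high_fraction=0.2)
cache = [0.0, 0.1, 0.6, 1.0, 0.42, 0.37]
compressed = compress_layer(cache, bits=bits_per_layer[0], recent=2)
```

Only the layer with the largest importance score receives 4 bits here; the others fall back to 2 bits. The last two cache entries pass through unchanged, mirroring the idea that recent context stays at full precision.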

7. Tri-State, Per-Layer Adaptive Management

ARKV (Lei et al., 19 Feb 2026) exemplifies DynamicKV applied to tri-state per-layer cache management under memory budgets:

  • Attention-Driven OQ Ratio Estimation: For each layer, compute the entropy, variance, and kurtosis of the post-softmax attention distributions, and combine them into a per-layer OQ-ratio $\rho_\ell$ that sets the original/quantized partition.
  • Heavy-Hitter Scoring: During decoding, compute $\mu_k$ (average attention to token $k$) and $\sigma_k^2$ (its variance), and score $S_k = \mu_k + \gamma \sigma_k^2$. Tokens are assigned to Original, Quantized, or Evicted states per layer, with a protected sliding window for recency.
  • Empirical Viability: On LongBench and GSM8K, ARKV achieves $\sim97\%$ accuracy preservation (vs. baseline) and an average $4\times$ memory reduction while maintaining high throughput, outperforming static quantization baselines (Lei et al., 19 Feb 2026).

DynamicKV-driven tri-state policies introduce fine-grained, per-layer, and per-token control over cache precision and retention.
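The heavy-hitter score and tri-state assignment can be sketched in plain Python. This is a toy version under stated assumptions: attention history is a small list of rows, the numbers of Original and Quantized slots are passed in directly rather than derived from the OQ-ratio $\rho_\ell$, and the function names are hypothetical.

```python
from statistics import mean, pvariance

def heavy_hitter_scores(attn_history, gamma=1.0):
    """S_k = mu_k + gamma * sigma_k^2 over the recorded decoding steps.

    attn_history[t][k] is the attention paid to token k at step t.
    """
    n_tokens = len(attn_history[0])
    scores = []
    for k in range(n_tokens):
        column = [row[k] for row in attn_history]
        scores.append(mean(column) + gamma * pvariance(column))
    return scores

def assign_states(scores, n_original, n_quantized, window):
    """Label each token 'original', 'quantized', or 'evicted'.

    The last `window` tokens are protected and always stay 'original'.
    """
    n = len(scores)
    protected = set(range(max(0, n - window), n))
    ranked = sorted((k for k in range(n) if k not in protected),
                    key=lambda k: scores[k], reverse=True)
    states = {k: "original" for k in protected}
    for rank, k in enumerate(ranked):
        if rank < n_original:
            states[k] = "original"
        elif rank < n_original + n_quantized:
            states[k] = "quantized"
        else:
            states[k] = "evicted"
    return [states[k] for k in range(n)]

history = [[0.5, 0.1, 0.3, 0.1],
           [0.7, 0.1, 0.1, 0.1]]
scores = heavy_hitter_scores(history, gamma=1.0)
states = assign_states(scores, n_original=1, n_quantized=1, window=1)
```

The variance term rewards tokens whose attention fluctuates across steps, so token 2 outranks token 1 even though both receive little attention at the latest step. The final token is kept at full precision only because it lies inside the protected window.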


| DynamicKV Method | Key Mechanism | Representative Results/Claims |
| --- | --- | --- |
| Task/Layer Adaptive | Per-layer attention-based token retention | $1.7\%$ cache, $85\%$ performance; up to $+57\%$ vs SOTA at $<1\%$ |
| Graph-Based (GraphKV) | Redundancy-suppressing decay in similarity graph | Up to $+8$ pp accuracy under tight budgets; negligible/negative latency |
| Dynamic Splitting (DynSplit-KV) | Semantic-aware splitting, variable-to-fixed mapping | $2.2\times$ GPU speedup, $2.6\times$ memory reduction; $0.2\%$ KV gets perf. |
| Dynamic Budget (DBudgetKV) | Halt pruning by attention norm threshold | $25\%+$ memory saving; matches full-cache accuracy, task-adaptive |
| Mixed-Precision (KVmix) | Layer-sensitivity profiling, adaptive bits | $4.9\times$ memory, $5.3\times$ throughput; $<1\%$ accuracy drop |
| Tri-State Adaptive (ARKV) | Entropy/variance/kurtosis OQ partition, heavy hitter | $97\%$ accuracy, $4\times$ memory reduction, $86\%$ throughput |

8. Implementation, Performance, and Limitations

DynamicKV methods are designed for straightforward integration, typically as inference-time plug-in modules requiring no retraining or model modification. Most approaches amortize overheads with vectorized CUDA/Triton kernels, memory-efficient top-K gathers, and fused attention/quantization operators (Wu et al., 13 Oct 2025; Li et al., 18 May 2025). Performance gains are reported across varying sequence lengths, tasks, and model scales, with pronounced advantages in long-context and few-shot settings. Notable limitations include possible performance gaps under extreme compression for certain tasks, the need for offline profiling in gradient-based methods, and small startup costs for semantic splitting due to the prefill attention pass. Extensions to multimodal, retrieval-augmented, or learnable budget controls remain open research directions.

References

  • (Zhou et al., 2024): "DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs"
  • (Li et al., 30 Aug 2025): "GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction"
  • (Wu et al., 13 Oct 2025): "LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences"
  • (Ye et al., 3 Feb 2026): "DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference"
  • (Ni et al., 24 Feb 2025): "DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance"
  • (Li et al., 18 May 2025): "KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache"
  • (Lei et al., 19 Feb 2026): "ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs"
