
Dynamic Key-Value Cache Placement

Updated 26 February 2026
  • Dynamic key-value cache placement is an adaptive strategy that manages memory in transformer LLMs by selectively retaining, evicting, or migrating key-value pairs.
  • It leverages a mix of heuristic, graph-based, reinforcement learning, and randomized approaches to balance model output fidelity with system latency and throughput.
  • The strategy effectively addresses distributed and heterogeneous memory challenges, achieving notable reductions in memory usage and improvements in inference speed.

Dynamic key-value (KV) cache placement encompasses mechanisms for adaptively selecting which key-value pairs are retained, evicted, or migrated in memory-constrained or distributed LLM inference settings. This problem arises ubiquitously in transformer-based autoregressive generation, distributed and hierarchical serving systems, and modern heterogeneous AI hardware, where both GPU/DRAM and advanced memory hierarchies serve as the backing store. Dynamic placement strategies balance latency, memory footprint, throughput, and model output fidelity by leveraging a wide range of heuristics, optimization routines, and increasingly learning-based policies.

1. Fundamental Problem and Architectural Motivations

The KV cache contains all key and value tensors generated for past tokens during sequence generation. In transformer decoders, this cache allows constant-time retrieval of past representations and amortizes computation in the autoregressive process. However, cache sizes scale linearly with context length, layers, heads, and precision: for an L-layer, H-head transformer and sequence length T, memory usage is B_{KV} = 2 L H T d_k b bytes for head size d_k and per-element bitwidth b (Poudel, 23 Oct 2025). Without adequate control, cache usage quickly threatens device limits and leads to either memory exhaustion or generative collapse.
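As a concrete illustration, the footprint formula above can be computed directly; the model dimensions below are hypothetical examples, not taken from any cited paper:

```python
def kv_cache_bytes(layers: int, heads: int, seq_len: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Memory footprint of a transformer KV cache: 2 tensors (K and V)
    per layer, each of shape [heads, seq_len, head_dim]."""
    return 2 * layers * heads * seq_len * head_dim * bytes_per_elem

# Example: a 32-layer, 32-head model with head_dim=128 at fp16 (2 bytes)
# holding a 32k-token context.
size = kv_cache_bytes(layers=32, heads=32, seq_len=32_768,
                      head_dim=128, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB
```

At these (illustrative) dimensions the cache alone consumes 16 GiB, which is why eviction and placement policies become unavoidable at long context lengths.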

Moreover, resource imbalances in disaggregated serving (separating prefill/decoding pipelines), heterogeneous memory systems (HBM/DRAM), and multi-GPU clusters all exacerbate the need for adaptive, dynamic strategies for cache placement and migration (Fang et al., 17 Aug 2025, He et al., 15 Oct 2025, Wu et al., 26 Jan 2026). The challenge is to select—at inference time and possibly subject to rate, load, and architectural constraints—which tokens are most valuable or costly to retain, where to place them, and when to evict or offload.

2. Algorithmic Strategies in Dynamic Placement

Dynamic placement is operationalized via a spectrum from heuristic eviction (recency, LRU, top-attention) to learning-driven, optimization-based, and graph-theoretic approaches.

Heuristic and Pre-attention Policies

  • Pre-attention hashing (HashEvict): Projects queries and keys into low-bit hash codes and computes Hamming distances to identify and evict tokens least likely to receive future attention, avoiding expensive score recomputation at each step. Experimental results indicate that pre-attention policies such as HashEvict can compress the KV cache by 30–70% with only 1–3% performance degradation, outperforming L2-norm baselines, especially at tight budgets (Liu et al., 2024).
  • Attention-score-based top-k and sliding window ("EvictOldest") policies are computationally cheap but often ignore shifting token importance and positional structure, leading to severe collapse when context window limits are breached or contiguity is lost (Poudel, 23 Oct 2025).
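A minimal sketch of the pre-attention hashing idea, assuming random sign-projection hash codes and treating the largest query–key Hamming distance as a proxy for low future attention (the helper names are illustrative, not HashEvict's actual API):

```python
import random

def hash_code(vec, planes):
    """Sign-bit hash: one bit per random hyperplane projection."""
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def hashevict_candidates(query, keys, planes, n_evict):
    """Indices of the n_evict cached keys whose hash codes are farthest
    (in Hamming distance) from the current query's code -- the tokens
    least likely to receive future attention, hence eviction candidates."""
    q_code = hash_code(query, planes)
    dists = [(sum(a != b for a, b in zip(q_code, hash_code(k, planes))), i)
             for i, k in enumerate(keys)]
    dists.sort(reverse=True)  # farthest first
    return [i for _, i in dists[:n_evict]]

random.seed(0)
dim, n_bits = 8, 16
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
query = [1.0] * dim
keys = [[1.0] * dim,    # aligned with the query: should be kept
        [-1.0] * dim]   # anti-aligned: every hash bit flips
print(hashevict_candidates(query, keys, planes, n_evict=1))  # [1]
```

Because the comparison happens on short bit codes rather than full attention scores, the victim selection costs only Hamming-distance computations per step.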

Graph-Based and Adaptive Selection

  • GraphKV: Constructs a sparse token similarity graph, with nodes for tokens and cosine-similarity-weighted edges, and applies a decay-signal propagation to iteratively suppress redundant tokens in neighborhoods of high importance. This enables contextually diverse, dynamically adjustable retention, consistently outperforming static heuristics and yielding ∼3–8% absolute accuracy gains over SnapKV and PyramidKV baselines with only ±10% change in latency (Li et al., 30 Aug 2025).
  • CAKE: Frames the cache as a cake-slicing problem and allocates per-layer cache budgets adaptively, guided by per-layer attention entropy and variance. A per-token eviction indicator combining mean and variance over a temporal window is used to select tokens most robustly needed downstream. CAKE achieves substantial memory reduction (>48% at 128k context) and >10× speedup with negligible accuracy loss, allocating more budget to mid-layers with high attention dispersion (Qin et al., 16 Mar 2025).
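The entropy-guided per-layer budgeting described above can be sketched as follows; the proportional-allocation rule is a simplified stand-in for CAKE's actual spatial–temporal indicator, and all names are illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy of an attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budgets(layer_attn, total_budget):
    """Split a global KV budget across layers in proportion to each
    layer's attention entropy (a proxy for attention dispersion):
    layers whose attention spreads over many tokens get more slots."""
    scores = [entropy(a) for a in layer_attn]
    norm = sum(scores)
    budgets = [int(round(total_budget * s / norm)) for s in scores]
    # fix rounding drift so budgets still satisfy the global constraint
    budgets[-1] += total_budget - sum(budgets)
    return budgets

# Layer 0 attends sharply to one token; layer 1 spreads attention evenly.
attn = [[0.97, 0.01, 0.01, 0.01],
        [0.25, 0.25, 0.25, 0.25]]
b = allocate_budgets(attn, total_budget=100)
print(b)  # the dispersed layer receives the larger share
```

This mirrors the paper's observation that mid-layers with high attention dispersion deserve larger cache slices than sharply focused layers.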

Reinforcement Learning and Learning-to-Rank

  • KV-Policy (KVP): Treats eviction as a budget-agnostic permutation learning problem: lightweight per-head RL agents are trained offline on sequence traces to learn a ranking function over key-value entries, directly optimizing for future utility as measured by downstream attention. Deployments generalize across models and tasks, outperforming all prior baselines by 10–15 points in accuracy on long-context reasoning benchmarks (e.g., RULER, OASST2), while incurring only ∼1% additional inference FLOPs (Moschella et al., 10 Feb 2026).

Randomized and Joint Routing–Eviction Frameworks

  • Randomized Leaf-Token Eviction (RLT): Evicts non-marked tokens uniformly at random, provably achieving O(\log B)-competitiveness, versus the O(B) competitive ratio of deterministic LRU. This result holds for both online single-query and batched settings (Wu et al., 26 Jan 2026).
  • Learning-Based Greedy Routing (LBGR): Jointly learns to route query batches to workers (LLM servers) by regressing on cache hit rates, queue backlog, and expected latency. Combined with RLT, it yields a 6.9× cache hit-rate improvement, an 11.9× latency reduction, and a 77.4% throughput gain over CacheAware+LRU on multi-LLM benchmarks (Wu et al., 26 Jan 2026).
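A toy sketch of randomized eviction over the non-marked ("leaf") tokens; the names and the pinned-token set are chosen purely for illustration:

```python
import random

def rlt_evict(cache, marked, rng):
    """Randomized leaf-token eviction: pick a victim uniformly at random
    among the non-marked (evictable) entries. Unlike deterministic LRU,
    an adversary cannot predict -- and thus cannot always force -- the
    next eviction, which underlies the O(log B) competitive bound."""
    candidates = [tok for tok in cache if tok not in marked]
    victim = rng.choice(candidates)
    cache.remove(victim)
    return victim

rng = random.Random(42)
cache = ["t0", "t1", "t2", "t3"]
marked = {"t0"}  # e.g. pinned prompt tokens that must never be evicted
victim = rlt_evict(cache, marked, rng)
print(victim, cache)
```

The policy needs no per-token bookkeeping beyond the marked set, which is what keeps it cheap relative to score-tracking evictors.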

Dynamic Placement in Heterogeneous Memory

  • Heterogeneous Scheduling: With High-Bandwidth Memory (HBM) and slower DRAM, dynamic scheduling involves continually migrating high-future-use KV entries into HBM to maximize bandwidth utilization and minimize decode latency. Static fill-until-spill approaches are suboptimal; formal mixed-integer programming shows up to 5.87× speedup possible over such baselines under realistic access traces (Fang et al., 17 Aug 2025).
  • BanaServe: In distributed LLM serving, dynamic placement is implemented via attention-head-level KV migration (parallelized across GPUs), layer-level whole-weight migration, and global cache-store sharing that decouples routing from cache locality. Latency incurred by migration is hidden via overlapped pipelines. BanaServe achieves up to 3.9× throughput and 78.4% lower latency compared to vLLM, and scales robustly across loads and context lengths (He et al., 15 Oct 2025).
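A greedy two-tier placement sketch approximating the heterogeneous scheduling described above; the cited paper formulates this as a mixed-integer program, whereas the score-ordered greedy below is only an illustrative simplification with hypothetical field names:

```python
def place_kv(entries, hbm_capacity):
    """Greedy tier placement: keep the KV entries with the highest
    predicted future-use score in fast HBM until capacity is exhausted;
    the remainder spill to slower DRAM."""
    ranked = sorted(entries, key=lambda e: e["score"], reverse=True)
    hbm, dram, used = [], [], 0
    for e in ranked:
        if used + e["size"] <= hbm_capacity:
            hbm.append(e["id"])
            used += e["size"]
        else:
            dram.append(e["id"])
    return hbm, dram

entries = [{"id": "a", "score": 0.9, "size": 2},
           {"id": "b", "score": 0.5, "size": 2},
           {"id": "c", "score": 0.8, "size": 2}]
hbm, dram = place_kv(entries, hbm_capacity=4)
print(hbm, dram)  # ['a', 'c'] ['b']
```

In the dynamic setting, this placement would be recomputed (and entries migrated) as future-use predictions change at each decode step, rather than filled once as in static fill-until-spill baselines.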

3. Optimization Formalisms and Theoretical Properties

Many recent studies frame the dynamic placement challenge as explicit optimization problems:

  • Layer- and task-aware budgeting (Qin et al., 16 Mar 2025): CAKE sets up per-layer cache allocation under a global constraint

\max_{\{K_l\}} \text{Accuracy}(\{I_l, K_l\}) \quad \text{s.t. } \sum_{l} K_l \leq M

where per-layer eviction indicators I_l are derived from spatial–temporal statistics.

  • KV placement in heterogeneous memory (Fang et al., 17 Aug 2025): Placement variables x_{t,i} \in \{0,1\} denote per-token/timestep residence (HBM vs DRAM). The objective is to maximize bandwidth utilization, or equivalently, minimize total latency T_{\rm total}, subject to per-time HBM capacity:

\forall t: \sum_{i} x_{t,i} s_i \leq C_{HBM}

  • Reinforcement Learning and Policy Gradients: In KVP, the policy is a Plackett–Luce distribution over permutations with per-budget, future-attention-weighted reward functions. The RL objective is budget-agnostic, ensuring robustness across different cache constraints (Moschella et al., 10 Feb 2026).
  • Randomization vs Determinism: RLT achieves a tight O(\log B) competitive ratio and is proven optimal among randomized online eviction strategies (Wu et al., 26 Jan 2026).
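The Plackett–Luce ranking at the heart of the budget-agnostic policy can be sampled sequentially; this sketch shows only the sampling step, not the RL training loop, and the scores are placeholders:

```python
import math
import random

def sample_permutation(scores, rng):
    """Sample a ranking from a Plackett-Luce distribution: repeatedly
    draw the next item with probability proportional to exp(score),
    without replacement. Truncating the sampled ranking at any budget B
    yields the retained set -- this is what makes the learned policy
    budget-agnostic."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for i, w in zip(remaining, weights):
            acc += w
            if r <= acc:
                order.append(i)
                remaining.remove(i)
                break
    return order

rng = random.Random(0)
order = sample_permutation([3.0, -1.0, 0.5], rng)  # per-entry utilities
keep_top2 = set(order[:2])  # apply a cache budget of B = 2
print(order, keep_top2)
```

Because one learned scoring function induces rankings for every budget, the same policy serves tight and loose cache constraints without retraining.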

4. Impact on Memory Usage, Throughput, and Output Fidelity

Dynamic placement regimes offer quantifiable benefits across multiple axes:

| Technique | Memory Reduction | Accuracy vs. Full Cache | Throughput Gain | Latency Reduction |
|-----------|------------------|-------------------------|-----------------|-------------------|
| HashEvict | 30–70% | −1–3% (50–70% budget) | 1.5–2× (prefill) | 1.1–1.2× (decoding) |
| CAKE | >48% (128k ctx) | >95% (LongBench) | >10× (128k ctx) | −32–90% |
| GraphKV | 50% (by ratio) | +3% absolute | — | ±10% (decoding) |
| BanaServe | — | — | 1.2–3.9× | 3.9–78.4% |
| LBGR+RLT | — | — | +77.4% | 11.9× (latency) |
| KVP | — | +10–15pp (accuracy) | — | ∼1% overhead |

In stateful, multi-turn LLM protocols, cache overflow beyond the model context window (W_{max}) can induce catastrophic loss of output coherence unless placement respects not only cache size but also contiguous token spans, which RoPE and other positional encoding schemes require (Poudel, 23 Oct 2025).

5. Placement and Eviction in Distributed and Heterogeneous Memory

In high-throughput inference clusters, dynamic placement extends to routing and migration:

  • Attention- and layer-level migration (He et al., 15 Oct 2025): Dynamic load balancing is achieved by partitioning attention heads and migrating their KV storage (or entire layer weights+KV) across GPUs. This allows fully load-agnostic routing, eliminating hotspots and decoupling cache placement from query scheduling.
  • Global KV Store: Shared, overlapped layer-wise KV transmission pipelines enable all prefill GPUs to access a global, up-to-date KV cache, fully hiding communication latency behind compute.
  • Hierarchical Scheduling: In HBM/DRAM systems, optimal policies must dynamically select and migrate the most valuable KV entries between fast and slow memory tiers, adjusting at each decode step subject to capacity, page-migration, and runtime limits (Fang et al., 17 Aug 2025). Empirical results demonstrate attainable speedups of 4–6× over static allocation baselines.
  • Joint Optimization of Routing and Eviction: Combining learning-based query routing (LBGR) with randomized eviction (RLT) ensures both per-worker cache hit rate and global latency/makespan minimization. Extensive experiments validate theory, with major improvements in end-to-end service QoS (Wu et al., 26 Jan 2026).

6. Position Integrity, Model Constraints, and Future Directions

Eviction policies must respect model constraints beyond cache size. In particular, for models with learned or rotary positional encodings, non-contiguous pruning violates positional fidelity (PF) and can severely degrade output (Poudel, 23 Oct 2025). Empirical evidence suggests that simple rigid chunking (preserving an initial "gist" plus a sliding window) can, in some scenarios, outperform sophisticated high-retention evictors when the latter fragment positional structure.
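A contiguity-preserving retention rule of the kind described above ("gist" prefix plus sliding window) can be sketched as follows, with illustrative parameter names:

```python
def chunked_retention(seq_len, gist, window):
    """Positional-fidelity-preserving retention: keep the first `gist`
    tokens and the most recent `window` tokens as two contiguous spans,
    so rotary position encodings never see a fragmented sequence."""
    if seq_len <= gist + window:
        return list(range(seq_len))          # everything still fits
    head = list(range(gist))                 # initial "gist" span
    tail = list(range(seq_len - window, seq_len))  # sliding window
    return head + tail

kept = chunked_retention(seq_len=12, gist=2, window=4)
print(kept)  # [0, 1, 8, 9, 10, 11]
```

The retained positions always form exactly two contiguous runs, whereas a top-k attention evictor may keep an arbitrary scatter of indices and thereby break positional structure.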

Open challenges include:

  • Hyperparameter selection and tuning for graph- or RL-based schemes (e.g., number of rounds, decay rates, per-layer budgets).
  • Theoretical analysis of convergence and optimality in more complex settings.
  • Extension to multi-GPU, sharded environments with distributed policy synchronization and data movement.
  • Joint scheduling at the intersection of cache importance, migration cost, and workload evolution in the context of rapidly evolving AI hardware.

Overall, dynamic KV cache placement is emerging as a pivotal subfield in efficient, scalable LLM inference, linking algorithmic memory management, distributed systems, and learning-based resource control (Qin et al., 16 Mar 2025, Moschella et al., 10 Feb 2026, Fang et al., 17 Aug 2025, He et al., 15 Oct 2025, Wu et al., 26 Jan 2026, Poudel, 23 Oct 2025, Li et al., 30 Aug 2025, Liu et al., 2024).
