Papers
Topics
Authors
Recent
Search
2000 character limit reached

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System

Published 17 Aug 2025 in cs.AR, cs.AI, and cs.PF | (2508.13231v2)

Abstract: LLM inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.

Summary

  • The paper introduces a rigorous mathematical framework with simulated annealing to optimize KV cache placement and improve inference throughput by up to 5.87×.
  • It examines the dynamic variation of token importance and its impact on memory bandwidth, highlighting the challenges of reducing migration overhead.
  • Empirical evaluations on heterogeneous systems, such as NVIDIA GH200, reveal significant performance gains over static and reactive cache scheduling strategies.

Formal Analysis of Dynamic Key-Value Cache Placement for LLM Inference Acceleration in Heterogeneous Memory Systems

Motivation and Problem Formulation

Transformer-based LLMs exhibit pronounced bottlenecks during inference, primarily due to frequent read-dominant accesses to the key-value (KV) cache in the decode stage. The evolution of token importance, revealed through fluctuating attention scores, necessitates full KV cache accessibility for optimal model output—causing substantial load on both memory bandwidth and capacity. The mismatch between the rapidly advancing compute capabilities of modern AI hardware and relatively stagnant memory subsystem bandwidth exacerbates this bottleneck, restricting throughput scalability.

The integration of high-bandwidth memory (HBM) with increasingly comparable off-package DRAM (e.g., LPDDR5X, interfaced via NVLink) on AI platforms such as NVIDIA GH200 offers a heterogeneous memory system where runtime KV cache placement is non-trivial; it directly modulates aggregated bandwidth utilization while obeying strict high-capacity constraints. This paper provides a rigorous mathematical formalism for the dynamic KV cache placement problem, aiming to characterize attainable performance upper bounds rather than prescribing specific scheduling algorithms.

Temporal Variability of Token Importance and Its Implications

Token importance is highly dynamic during LLM inference, with distinct tokens exhibiting fluctuating attention scores that reflect changing contextual dependencies throughout sequential decoding steps (Figure 1). Figure 1

Figure 1

Figure 1: Visualization of attention score variation for token 2777 at layer 8, illustrating substantial temporal fluctuation in contextual importance.

This volatility implies that adaptive token bypassing, while reducing bandwidth pressure, cannot relax the full KV cache retention requirement. Hence, memory systems must be equipped to dynamically manage KV entries in real time, tracking evolving importance without sacrificing accessibility—posing a severe challenge as LLMs become longer-context and more sophisticated. HBM’s limited capacity thus becomes a core constraint motivating heterogeneous memory orchestration.

Heterogeneous Memory Architecture: Architectural and Performance Considerations

Recent advances in LPDDR interfaces and high-speed interconnects have narrowed the bandwidth disparity between HBM and external DRAM. For example, GH200 features HBM3 with up to 4 TB/s bandwidth and 24 GB capacity alongside LPDDR5X-backed DRAM delivering 900 GB/s interconnect and 500 GB/s DRAM bandwidth with notably higher capacity. Efficient KV cache distribution across this memory hierarchy can drastically impact inference throughput, underscoring the necessity of intelligent placement strategies as bandwidth utilization increasingly hinges on runtime data locality.

Mathematical Framework for Placement Optimization

The paper models inference latency as a sum of stepwise delays, each determined by maximum access time to HBM and off-package DRAM, modulated by KV cache placement and migration operations. Given a scheduling decision S(n,l)\mathcal{S}_{(n,l)} per step (composed of migration and write/read volume for each memory type), latency is derived analytically using bandwidth and migration overhead parameters from the underlying hardware configuration.

Formally, the optimization target is to minimize overall decode latency across all inference steps, subject to HBM capacity constraints. This encapsulates both the placement and migration schedule for each KV cache entry, given access patterns and memory architecture.

Theoretical Upper Bound Discovery Using Simulated Annealing

Exhaustive search of the placement solution space is infeasible due to its combinatorial complexity. The paper leverages simulated annealing (SA), assuming perfect knowledge of future access patterns, to approximate a near-optimal scheduling solution. SA dynamically adjusts look-ahead window size (WW) and migration ratio (RR), probabilistically accepting migration schedules with lower latency or exploring coordinated parameter perturbations to escape local minima.

This approach quantifies a theoretical performance ceiling, unattainable by any real-time system without omniscient foresight but revealing substantial potential for future heuristics employing predictive modeling or reinforcement learning. The SA-guided placement routinely identifies optimal migration points and allocation strategies that minimize unnecessary migration and maximize aggregate bandwidth utilization.

Empirical Evaluation and Sensitivity Analyses

A behavioral simulator is used, parameterized on GH200’s memory hierarchy and LLaMA-3.1-8B access patterns, to benchmark several placement strategies: unlimited HBM (ideal), static placement, reactive (LRU) scheduling, Quest-style page granularity scheduling, and the SA-guided upper bound.

Throughput results demonstrate that the SA-guided scheduler achieves up to 5.87×\times higher normalized tokens/sec than static placement, with performance approaching the unlimited HBM case under low importance variation. Migration overhead undermines reactive and page-based approaches under low attention sparsity, while high sparsity benefits all strategies by reducing active working set size. Figure 2

Figure 2

Figure 2: Left—tokens/sec normalized relative to unlimited HBM across synthetic importance variation scenarios; right—HBM hit rates at 60% sparsity.

Sensitivity analysis reveals that dynamic in-memory placement yields superior gains for inference tasks where token importance remains stable across decoding steps. High variation leads to excessive migration, reducing effective bandwidth utilization. The HBM hit rate, critical to throughput, is tightly regulated under SA, validating its heuristic advantage.

Theoretical and Practical Implications

The results and formalism illustrate large performance gaps between naive placement heuristics and theoretically optimal dynamic schemes, indicating substantial room for systems research in adaptive memory scheduling for LLMs. Practically, runtime solutions should evolve towards leveraging predictive access pattern analytics, RL, or online learning to adapt placement dynamically, exploiting near-optimal aggregate bandwidth utilization in heterogeneous memory environments.

From a theoretical perspective, this paper provides the first formal foundation for KV cache scheduling in LLM inference over heterogeneous memory, exposing explicit bounds and stimulating subsequent algorithmic exploration for aggressive latency minimization.

Conclusion

This paper establishes a rigorous mathematical framework for dynamic KV cache placement in LLM inference on heterogeneous memory architectures, deriving a theoretical upper bound using simulated annealing with perfect foresight. Simulation results demonstrate up to a 5.87×\times throughput improvement over static strategies, motivating the pursuit of advanced adaptive scheduling methods to approach this bound and fully harness emerging AI hardware memory hierarchies.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.