Multi-Level Cache Architectures for LLM Inference

Updated 8 March 2026
  • Multi-level cache architectures are systems that partition key-value activations across hierarchical memory tiers to enhance throughput and reduce latency in LLM inference.
  • They utilize techniques like low-rank compression, dynamic sparsity, and selective caching to optimize storage while maintaining model accuracy.
  • Asynchronous data movement and multi-tier orchestration enable energy-efficient scaling and improved performance in large-scale LLM deployments.

The design and implementation of multi-level cache architectures for LLM inference have become central as context windows, model sizes, and real-world deployment demands outpace on-device memory capacity. By partitioning and optimizing the storage and retrieval of intermediate activations (almost always the key-value "KV" pairs of the attention mechanism) across device, system, and often global networks, these approaches dramatically increase throughput, reduce latency, lower cost, and improve sustainability for LLM inference at scale. This article provides a technical synthesis of state-of-the-art multi-level cache designs, their principles, algorithmic foundations, key performance characteristics, and engineering trade-offs as realized in production and research systems.

1. Foundations and Motivations for Multi-Level Caching

Modern autoregressive LLMs retain an ever-growing key-value (KV) cache to accelerate incremental decoding. Without architectural interventions, KV cache memory scales as $O(\text{batch size} \times \text{sequence length} \times d_\text{model})$, often dwarfing available high-bandwidth device memory (e.g., GPU HBM). This fundamental scaling bottleneck has been exacerbated by the advent of million-token contexts, high-throughput serving use-cases, retrieval-augmented generation (RAG), and multi-agent or multitask settings.
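
To make this scaling concrete, consider a back-of-the-envelope calculation (the model dimensions below are illustrative, roughly LLaMA-2-7B-shaped, and not taken from any cited system):

```python
# Illustrative KV cache sizing; dimensions are assumptions, not from the cited papers.
# Per token, each layer stores one key and one value vector per attention head:
#   bytes_per_token = 2 (K and V) * n_layers * n_heads * head_dim * bytes_per_elem
n_layers, n_heads, head_dim = 32, 32, 128   # d_model = n_heads * head_dim = 4096
bytes_per_elem = 2                          # fp16
batch_size, seq_len = 8, 32_768

per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem
total_gib = batch_size * seq_len * per_token / 2**30
print(f"{per_token / 2**20:.2f} MiB/token, {total_gib:.0f} GiB total")
# -> 0.50 MiB/token, 128 GiB total: well beyond a single 80 GB HBM device,
#    which is precisely the gap that multi-level caching targets.
```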

The impetus for multi-level cache architectures is thus threefold:

  • Breaking the GPU memory/batch size ceiling: Enabling larger batch sizes and ultra-long contexts by offloading KVs to more capacious but slower memory (CPU, SSD, network, even distributed systems).
  • Reducing operational (energy/carbon) cost: By leveraging older GPUs or memory-optimized compute, and minimizing hardware footprint in constrained or legacy environments.
  • Cutting latency and maximizing throughput: Through decoupling of prefill and decode, serving fine-grained cache needs from the fastest possible tier, and pipelining compute and I/O.

These motivations inform all contemporary systems, which exploit structure (e.g., low-rankness, locality) in model activations, as well as system-level parallelism, prefetching, and overlapping data movement and compute (Sun et al., 2024, Peng et al., 2024, Gao et al., 10 Apr 2025).

2. Multi-Level Cache Hierarchies: Designs and Variants

A multi-level cache for LLM inference organizes KV data across a memory/storage hierarchy, typically spanning:

  • L1 (GPU Device/HBM): Fastest latency; hosts recent or high-reuse activations.
  • L2 (CPU DRAM / Host Memory): Larger but slower; main spillover tier for KVs or activations.
  • L3 (Disk/SSD/NVMe): Cold storage for rarely-accessed KVs, with high capacity.
  • L4 (Remote Store, Distributed DRAM/SSD, Networked Caches): For node disaggregation, multi-engine or cross-query reuse, or even geographically-disparate serving.

Notable instantiations include:

  • ShadowKV: Maintains a low-rank key cache on GPU (via SVD compression), offloading the dense value cache to CPU memory, with on-the-fly chunk-wise sparse KV selection (Sun et al., 2024; see §3).
  • LMCache/Shared-Disk Cache: Extends L1/L2/L3/L4 to support persistent SSD, remote, and multi-instance caching, with chunked, pipelined RDMA/NVMe I/O and fine-grained control APIs (Lee et al., 16 Apr 2025, Cheng et al., 8 Oct 2025).
  • M2Cache: HBM (for per-layer, per-neuron mixed-precision LRU caches), DRAM (for a layer-wise, pattern-aware "window"), SSD (for full model weights); dynamic mixed precision and sparsity jointly optimize transfer and compute (Peng et al., 2024).
  • UniCAIM: Hardware-level unification of cache management and compute-in-memory, using static and dynamic KV pruning directly on FeFET-based CAM/CIM arrays (Xu et al., 10 Apr 2025).
  • SkyMemory: Extends the cache hierarchy to the LEO satellite edge (an inter-satellite-link mesh) for ultra-distributed LLM serving (Sandholm et al., 20 May 2025).
  • CLO: Retains only head-wise coarse-grained "hard-to-reuse" KVs on GPU, offloading most to CPU, with a zero-copy PCIe data-movement engine (Yi et al., 18 Nov 2025).
  • LLMCache: Introduces a per-layer, semantic fingerprint-based cache, matching activations via compact projections, thus allowing caching at any transformer layer for both encoder and decoder architectures (Bansal, 18 Dec 2025).

A typical dataflow involves hierarchical cache lookup, with rapid fallbacks through each tier (GPU→CPU→SSD→remote), asynchronous prefetching, and often pipelined compute/I/O overlap to hide data movement within attention or projection computation (Lee et al., 16 Apr 2025, Cheng et al., 8 Oct 2025).
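
A minimal sketch of that lookup path (tier names, latencies, and the promotion policy here are hypothetical, not drawn from any cited system):

```python
import asyncio
from typing import Optional

class TierCache:
    """One level of the hierarchy (GPU, CPU, SSD, remote); purely illustrative."""
    def __init__(self, name: str, latency_s: float):
        self.name, self.latency_s, self.store = name, latency_s, {}

    async def get(self, key: str) -> Optional[bytes]:
        await asyncio.sleep(self.latency_s)       # stand-in for real device I/O
        return self.store.get(key)

async def lookup(tiers: list, key: str) -> Optional[bytes]:
    """Fall through GPU -> CPU -> SSD -> remote; promote hits into faster tiers."""
    for i, tier in enumerate(tiers):
        value = await tier.get(key)
        if value is not None:
            for faster in tiers[:i]:              # promote so the next access hits L1
                faster.store[key] = value
            return value
    return None                                    # full miss: recompute the KV chunk

async def main():
    tiers = [TierCache("gpu", 1e-6), TierCache("cpu", 1e-4),
             TierCache("ssd", 1e-3), TierCache("remote", 1e-2)]
    tiers[2].store["chunk:42"] = b"kv-bytes"
    print(await lookup(tiers, "chunk:42"))         # hits SSD, promotes to CPU and GPU
    prefetch = asyncio.create_task(lookup(tiers, "chunk:43"))  # overlap with compute
    await prefetch

asyncio.run(main())
```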

3. Compression, Low-Rank, and Structure-Aware KV Organization

For further KV footprint reduction beyond rigid tiering, recent systems exploit the statistical and structural regularities of LLM activations:

  • Low-rank compression: Pre-RoPE keys in transformers are empirically low-rank. ShadowKV exploits this by storing only the SVD factors $(\mathbf{U}, \mathbf{V}\boldsymbol{\Sigma})$ of the key matrix ($\mathbf{K} \approx \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$) in GPU memory. Storage drops from $O(Sd)$ to $O(Sr + rd)$ per head ($r \ll d$, e.g., $r = 160$, $d = 1024$) (Sun et al., 2024); a code sketch follows after the table below.
  • Activation modularization and dynamic sparsity: M2Cache modularizes the FFN into neuron-level units, scoring and ranking them for mixed-precision, dynamically sparse inference, and per-layer caching, thereby reducing both memory and compute (Peng et al., 2024).
  • LoRA and Adapter-based Decomposition: In multi-agent settings with LoRA adapters, caches decompose into a large shared "base" and small per-agent low-rank deltas, further split into "shared-A" low-rank structures for full cross-agent reuse (Jeon et al., 1 Feb 2026; see the table below).

| System   | Shared Cache Layer(s)      | Low-rank/Compression      | Technique                           |
|----------|----------------------------|---------------------------|-------------------------------------|
| ShadowKV | Key (GPU), Value (CPU)     | SVD (pre-RoPE keys)       | On-the-fly SVD chunk-wise selection |
| LRAgent  | Base + Adapter (per-agent) | LoRA: rank-$r$ deltas     | Shared base, low-rank per-adapter   |
| M2Cache  | Per-neuron, per-layer      | Mixed precision, LRU, ATU | Dynamic sparse mixed-prec quant.    |

These approaches deliver significant memory reduction (6–8× for key caches in ShadowKV; full per-agent cache sharing with <1% accuracy drop in LRAgent) and enable much larger batches or longer context windows at constant hardware cost (Sun et al., 2024, Jeon et al., 1 Feb 2026).
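
The low-rank key path can be sketched in a few lines of PyTorch (a simplified, single-head illustration of the ShadowKV idea, not the authors' implementation; chunking, RoPE handling, and outlier tokens are omitted):

```python
import torch

S, d, r = 8192, 1024, 160   # seq length, head dim, rank (r=160, d=1024 per Sun et al., 2024)
K = torch.randn(S, d)       # pre-RoPE key cache for one head (random stand-in data)

# Rank-r truncated SVD: keep U_r (S x r) and W_r = V_r @ diag(sigma_r) (d x r)
# on GPU instead of the full S x d key matrix.
U, sigma, Vh = torch.linalg.svd(K, full_matrices=False)
U_r = U[:, :r]                          # S x r
W_r = Vh[:r].T * sigma[:r]              # d x r

K_approx = U_r @ W_r.T                  # reconstruct keys on the fly during attention
rel_err = (K - K_approx).norm() / K.norm()

full_elems = S * d                      # O(S d)
lowrank_elems = S * r + r * d           # O(S r + r d)
print(f"compression: {full_elems / lowrank_elems:.1f}x, rel. error: {rel_err:.3f}")
# ~5.7x here; random data is NOT low-rank, so the error is pessimistic --
# real pre-RoPE keys compress far more faithfully, per the paper's 6-8x claims.
```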

4. Sparse, Dynamic, and Hierarchical Selection Algorithms

Scaling inference while preserving quality under extreme cache compression necessitates informed selection of which KVs (or activations) to fetch or recompute per step:

  • Landmark/Chunk Top-K Selection: ShadowKV divides KVs into non-overlapping chunks and precomputes chunk-level "landmarks" (mean keys). Decoding proceeds by scoring landmarks, selecting the top-$K$ chunks (and a static set of outlier tokens), then fetching only those for dense attention (see "on-the-fly sparse KV selection" equations; a minimal sketch follows this list) (Sun et al., 2024).
  • Pruning (Static + Dynamic): UniCAIM (hardware) employs static prefill pruning via accumulative attention scores and dynamic per-token top-$K$ selection using content-addressable memory in $O(1)$ cycles (Xu et al., 10 Apr 2025).
  • Head/Block Approximate Reuse: CLO uses a head-wise similarity (cosine-based) heuristic. Only KV buffers for “hard-to-reuse” heads are retained persistently in HBM; others are fetched on-demand guided by fast GPU-side heuristics (Yi et al., 18 Nov 2025).
  • Layer-wise Fingerprinting: LLMCache applies semantic hash/fingerprint functions (e.g., PCA, SimHash) to input embeddings, enabling rapid sequence-level activation matching at each transformer layer (Bansal, 18 Dec 2025).

These algorithms are typically optimized to overlap control, data movement, and compute across CUDA streams or hardware blocks to minimize the effective latency of sparse fetch/reconstruction against standard attention (Sun et al., 2024, Yi et al., 18 Nov 2025).
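
A condensed PyTorch sketch of the landmark scoring and top-$K$ chunk gather (single head, illustrative shapes; the real systems fuse this with stream-overlapped fetching and outlier handling):

```python
import torch

S, d, chunk, k = 8192, 128, 64, 16       # tokens, head dim, chunk size, top-K chunks
K = torch.randn(S, d)                    # key cache (conceptually resident off-GPU)
q = torch.randn(d)                       # current decode-step query

# Precompute one "landmark" (mean key) per non-overlapping chunk.
landmarks = K.view(S // chunk, chunk, d).mean(dim=1)       # (S/chunk, d)

# Score landmarks against the query and keep only the top-K chunks.
scores = landmarks @ q                                     # (S/chunk,)
top_chunks = torch.topk(scores, k).indices                 # chunk ids worth fetching

# Expand chunk ids to token indices; fetch just those KVs for dense attention.
token_idx = (top_chunks[:, None] * chunk + torch.arange(chunk)).flatten()
K_sparse = K[token_idx]                                    # (k*chunk, d) actually moved
attn = torch.softmax(K_sparse @ q / d**0.5, dim=0)         # attention over selected keys
print(f"fetched {K_sparse.shape[0]} of {S} keys ({100 * k * chunk / S:.1f}%)")
```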

5. Systems, Data Movement, and Pipelined Execution

End-to-end throughput is determined not only by which data are kept in each tier, but by how efficiently data are moved and scheduled across devices:

  • Layer/Chunk Overlap: Most systems run the main decode or attention loop and chunked data-movement streams in parallel (e.g., via CUDA multi-streams in ShadowKV and LMCache) (Sun et al., 2024, Cheng et al., 8 Oct 2025).
  • Zero-Copy/DMA I/O: CLO implements a zero-copy memory transfer engine using GDRCopy, allowing AVX-enabled CPU threads to write directly into GPU HBM, saturating PCIe 4.0×16 with only 4 CPU threads (Yi et al., 18 Nov 2025).
  • Prefetch and Persistent Caching: Anticipatory prefetch of “next-needed” chunks/heads, combined with LRU-based persistent caching (e.g., 60% GPU hit in ShadowKV, persistent hard heads in CLO), hides the bulk of I/O latency under compute (Sun et al., 2024, Yi et al., 18 Nov 2025).
  • Synchronization and Scheduling: GPU-centric semaphore/doorbell mechanisms (CLO) decouple the CPU kernel submission from GPU completion, allowing full overlap and eliminating CPU-side launch stalls (Yi et al., 18 Nov 2025). Adaptive in-iteration scheduling (Apt-Serve) dynamically demotes/promotes requests to different cache tiers to maximize batch size under SLO constraints (Gao et al., 10 Apr 2025).

These engineering techniques are critical for sustaining “effective throughput” as measured by tokens/s or queries/s under realistic service-level objectives.
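
As an illustration of the overlap pattern itself (a generic PyTorch sketch using pinned host memory and a dedicated copy stream; this is not any cited system's transfer engine):

```python
import torch

assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()        # side stream dedicated to H2D transfers

# KV chunks staged in pinned host memory so async copies can truly overlap compute.
host_chunks = [torch.randn(1024, 128).pin_memory() for _ in range(8)]
gpu_chunks = [None] * len(host_chunks)
x = torch.randn(4096, 4096, device="cuda")

for i, chunk in enumerate(host_chunks):
    with torch.cuda.stream(copy_stream):             # enqueue copy on the side stream
        gpu_chunks[i] = chunk.to("cuda", non_blocking=True)
    x = x @ x                                         # compute runs on the default stream,
    x = x / x.norm()                                  # concurrently with the copies above

torch.cuda.current_stream().wait_stream(copy_stream)  # sync before consuming the chunks
print(gpu_chunks[0].device, x.norm().item())
```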

6. Empirical Performance and Scalability

Across platforms and application domains, multi-level cache systems yield substantial gains over single-level KV strategies. Key results include:

  • ShadowKV: Enables up to 6× larger batches and 3.04× throughput on A100; empirically, memory use drops 6×, and accuracy is preserved or improved over infinite-memory baselines (Sun et al., 2024).
  • M2Cache: Achieves up to 10.5× (LLaMA-7B) and 14× (LLaMA-13B) speedup over Zero-Infinity, with up to 7.67× lower carbon emissions per completion; supports full inference of 70B and 40B models on RTX 3090 with a ∼0.5–1% accuracy loss (Peng et al., 2024).
  • LMCache: Throughput improvement by 2.3–14× (single node, multi-round QA); prefill–decode disaggregation yields 1.53–1.84× TTFT reduction; up to 15× overall throughput in large-scale enterprise inference settings (Cheng et al., 8 Oct 2025).
  • CLO: Decoding step latency is 2.1–5.2× lower than RetroInfer; throughput speedups over baseline offload systems range from 9.3% to 66.6%, with CPU overhead under 1% of layer time (Yi et al., 18 Nov 2025).
  • LLMCache: Delivers 2.1–3.1× inference speedup for BERT and GPT-2 with accuracy loss <0.5% (Bansal, 18 Dec 2025).
  • CXL+PNM: CXL-attached LPDDR5X PNM modules boost throughput by 16–22× (1M-token, 8B–70B models) and cut per-token energy up to 60×, with cost efficiency rising up to 7.3× (Kim et al., 31 Oct 2025).
  • Apt-Serve: Hybrid KV/hidden cache with adaptive scheduling yields up to 8.8× higher effective throughput than previous best inference serving systems (Gao et al., 10 Apr 2025).

A broad observation is that hit rates and performance depend on careful tuning of cache sizes, batch sizes, prefetch windows, and selection heuristics, and, critically, on the effective pipelining of communication and compute.

7. Implementation Practices and Future Directions

Current best practice for multi-level cache architectures in LLM inference combines:

  • Hierarchical KV placement across GPU HBM, CPU DRAM, SSD/NVMe, and remote stores, with fast fall-through lookup and promotion across tiers (§2).
  • Structure-aware compression, such as low-rank keys, mixed precision, and per-neuron quantization, to shrink the resident footprint (§3).
  • Sparse, dynamic selection of only the KVs actually needed at each decode step (§4).
  • Asynchronous, pipelined data movement and overlap-aware scheduling that hide fetch latency under compute (§5).

Promising directions include CXL and PNM-enabled offloading architectures (Kim et al., 31 Oct 2025), in-memory compute-integrated caching (UniCAIM (Xu et al., 10 Apr 2025)), LEO-edge distributed caching for ultra-low-latency global inference (Sandholm et al., 20 May 2025), and generalization to multi-tenant, multi-session, multi-agent environments through modular cache composition and per-task sharing (Jeon et al., 1 Feb 2026).

The architecture and engineering of multi-level cache for LLM inference have thus evolved into a foundational area at the intersection of model systems, hardware systems, scheduling, and algorithmic compression. As model scales, sequence lengths, and operational constraints continue to escalate, the design space of multi-level caching will remain a core research and deployment challenge.
