KVCache-centric Buffering: Mooncake

Updated 18 September 2025
  • KVCache-centric buffering is a strategy that optimizes the storage and reuse of intermediate attention states in Transformer LLMs by decoupling compute and offloading caches to alternative memory resources.
  • Mooncake demonstrates significant throughput gains by employing a disaggregated architecture with advanced scheduling, admission control, and non-GPU memory utilization to alleviate latency and memory constraints.
  • Algorithmic techniques such as low-bit quantization, structured pruning, and adaptive token grouping are key to reducing memory footprint while preserving model accuracy in long-context inference.

KVCache-centric buffering refers to a methodological, architectural, and algorithmic focus on optimizing the storage, transfer, eviction, compression, and efficient reuse of key–value (KV) caches in Transformer-based LLM inference. In such systems, the KV cache stores intermediate attention states (key and value tensors) for all processed tokens, enabling fast autoregressive generation but imposing significant memory, latency, and throughput constraints—especially at long context lengths or under workload spikes. The “Mooncake” architecture (Qin et al., 24 Jun 2024) prominently demonstrates a KVCache-centric, disaggregated approach by decoupling compute stages, leveraging non-GPU memory, and introducing sophisticated scheduling and admission control policies. Recent research has developed the algorithmic, systems-level, and theoretical underpinnings of effective KVCache-centric buffering.
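
To make the memory argument concrete, the following minimal sketch shows how a per-layer, per-head KV cache grows by one entry per decoded token, which is why its footprint scales linearly with context length. The class name, shapes, and attention math are illustrative (standard scaled dot-product attention), not any particular serving engine's kernels.

```python
import numpy as np

# Minimal per-layer KV cache for one attention head (illustration only).
# keys/values have shape [num_tokens, head_dim]; the cache gains one row
# per generated token, so memory grows linearly with context length.
class KVCache:
    def __init__(self, head_dim: int):
        self.keys = np.empty((0, head_dim), dtype=np.float16)
        self.values = np.empty((0, head_dim), dtype=np.float16)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Store the newest token's key/value so later steps can attend to it
        # without recomputing earlier projections.
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Scaled dot-product attention over all cached tokens.
        scores = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values

cache = KVCache(head_dim=64)
for _ in range(8):  # pretend we decode 8 tokens
    k, v, q = (np.random.randn(64).astype(np.float16) for _ in range(3))
    cache.append(k, v)
    _ = cache.attend(q)
print(cache.keys.shape)  # (8, 64): one KV row per processed token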

1. Architectural Foundations of KVCache-centric Buffering

A KVCache-centric system's architecture departs from traditional monolithic, GPU-resident inference by explicitly decoupling inference stages (“prefill” and “decoding”) and offloading KVCache to underutilized resources such as CPU DRAM, SSDs, or remote nodes (Qin et al., 24 Jun 2024). This disaggregated cache—composed of paged blocks, indexed by cryptographic hashes of token prefixes for deduplication—enables scalable, elastic KV storage while reducing pressure on GPU memory. Bulk data transfers leverage high-speed interconnects (e.g., GPUDirect RDMA) with a dedicated message service for efficient intra-cluster KV data movement. The architecture typically includes:

| Resource Layer | Role in Buffering | Example in Mooncake |
|---|---|---|
| GPU memory | Active decoding KV | Short-term, fast access |
| CPU DRAM / SSD | Disaggregated KV pool | Long-term, high-capacity |
| RDMA / SmartNIC | Transfer acceleration | Messenger, FlexiNS (Chen et al., 25 Apr 2025) |
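
As a rough sketch of the prefix-hash indexing described above, the snippet below deduplicates paged KV blocks by hashing each block together with everything before it, so requests sharing a prefix map to the same leading block hashes. The block size, tier names, and `BlockIndex` API are hypothetical, not Mooncake's actual data structures.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV page (illustrative)

def prefix_block_hashes(token_ids: List[int]) -> List[str]:
    """Hash each full block together with its preceding prefix, so two
    requests sharing a prefix produce identical leading block hashes."""
    hashes, running = [], hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes

class BlockIndex:
    """Maps a prefix hash to the location of its KV page (GPU, DRAM, SSD, remote)."""
    def __init__(self):
        self.table: Dict[str, Tuple[str, int]] = {}

    def lookup(self, token_ids: List[int]) -> int:
        """Return how many leading blocks are already cached (prefix reuse)."""
        hit = 0
        for h in prefix_block_hashes(token_ids):
            if h not in self.table:
                break
            hit += 1
        return hit

    def insert(self, token_ids: List[int], tier: str = "cpu_dram") -> None:
        for i, h in enumerate(prefix_block_hashes(token_ids)):
            self.table.setdefault(h, (tier, i))

idx = BlockIndex()
idx.insert(list(range(100)))                # cache a 100-token prompt
print(idx.lookup(list(range(100)) + [7]))   # a request sharing the prefix hits 6 full blocks
```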

This approach offers both cost-effective scaling and fault tolerance in large-scale serving environments, as evidenced by Mooncake's deployment in Kimi, where throughput increased by up to 525% relative to monolithic baselines under long-context, SLO-constrained workloads (Qin et al., 24 Jun 2024).

2. Algorithmic Strategies for KVCache Reduction and Compression

Due to the linear growth of the KV cache with context length, modern methods aggressively compress or sparsify the cache while minimizing loss in downstream accuracy. Strategies can be categorized as follows:

a) Low-Bit Quantization:

Techniques such as SKVQ (Duanmu et al., 10 May 2024) perform extreme low-bit KV quantization using group-wise channel reordering and clipped dynamic quantization, while selectively preserving recent tokens (a "sliding window") in full precision. MiniKV (Sharma et al., 27 Nov 2024) employs a 2-bit, layer-discriminative approach with heavy-hitter token selection and fused CUDA kernels, achieving up to an 86% compression ratio while retaining 98.5% of accuracy.
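
The sketch below illustrates the general recipe in simplified form: per-group quantization of older cache entries plus a full-precision sliding window of recent tokens. It is a stand-in for SKVQ/MiniKV, not their kernels; the group size, bit width, and function names are illustrative.

```python
import numpy as np

def quantize_kv(cache: np.ndarray, window: int = 32, bits: int = 2, group: int = 64):
    """Quantize all but the most recent `window` tokens to `bits` bits per value,
    using a per-group (along the channel axis) scale and zero-point. Returns the
    packed representation plus the full-precision sliding window."""
    old, recent = cache[:-window], cache[-window:]
    levels = 2 ** bits - 1
    # Reshape channels into groups so each group shares one scale/zero-point.
    g = old.reshape(old.shape[0], -1, group)
    lo, hi = g.min(axis=-1, keepdims=True), g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-6) / levels
    q = np.clip(np.round((g - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo, recent

def dequantize_kv(q, scale, lo, recent):
    deq = (q.astype(np.float32) * scale + lo).reshape(q.shape[0], -1)
    return np.concatenate([deq, recent.astype(np.float32)], axis=0)

kv = np.random.randn(1024, 128).astype(np.float32)   # 1024 tokens, 128 channels
q, scale, lo, recent = quantize_kv(kv, window=32, bits=2, group=64)
approx = dequantize_kv(q, scale, lo, recent)
print(approx.shape, q.dtype)  # (1024, 128), uint8 storage for the older tokens
```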

b) Channel Shrinking and Low-Rank Decomposition:

CSKV (Wang et al., 16 Sep 2024) leverages singular value analysis and low-rank decompositions, arranging the KV matrix into a bi-branch structure: a full-precision recent window and a compressed low-dimensional historical cache. This achieves up to 95% compression with quantization-aware training.
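
A minimal way to see the bi-branch idea, assuming a plain truncated SVD rather than CSKV's trained and quantization-aware projections:

```python
import numpy as np

def lowrank_compress(kv: np.ndarray, window: int = 64, rank: int = 32):
    """Keep the last `window` rows exactly; store the rest as a rank-`rank`
    factorization U @ V, so the historical cache costs rank*(T + d) floats
    instead of T*d."""
    hist, recent = kv[:-window], kv[-window:]
    U, s, Vt = np.linalg.svd(hist, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]          # [T_hist, rank]
    V_r = Vt[:rank]                       # [rank, d]
    return (U_r, V_r), recent

def lowrank_reconstruct(factors, recent):
    U_r, V_r = factors
    return np.concatenate([U_r @ V_r, recent], axis=0)

kv = np.random.randn(2048, 128).astype(np.float32)
factors, recent = lowrank_compress(kv, window=64, rank=32)
approx = lowrank_reconstruct(factors, recent)
err = np.linalg.norm(kv - approx) / np.linalg.norm(kv)
print(approx.shape, f"relative error ~ {err:.3f}")
```

In practice the projections are calibrated offline and combined with quantization-aware training, but the storage arithmetic is the same: rank·(T + d) values rather than T·d for the historical branch.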

c) Structured and Hierarchical Pruning:

TreeKV (He et al., 9 Jan 2025) uses a tree-structured cache, inspired by wavelet analysis of attention, to implement “sparse to dense” retention; this enables fixed-size caches with smooth hierarchical compression, outperforming position- or attention-based rivals—especially in very long contexts.

PagedEviction (Chitty-Venkata et al., 4 Sep 2025) integrates with paged attention by evicting entire blocks/pages using a proxy importance metric, maintaining high accuracy and cache alignment without kernel modifications.
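
As a block-granular illustration (not PagedEviction's exact importance metric), one can score each page by recent attention mass and evict the lowest-scoring pages wholesale, keeping eviction aligned with the paged layout:

```python
import numpy as np

def evict_blocks(attn_weights: np.ndarray, block_size: int, budget_blocks: int):
    """attn_weights: [num_queries, num_tokens] recent attention probabilities.
    Score each page by its accumulated attention mass and keep the
    `budget_blocks` highest-scoring pages (the newest page is always kept)."""
    num_tokens = attn_weights.shape[1]
    num_blocks = num_tokens // block_size
    pages = attn_weights[:, :num_blocks * block_size].reshape(
        attn_weights.shape[0], num_blocks, block_size)
    scores = pages.sum(axis=(0, 2))              # one proxy score per page
    keep = set(np.argsort(scores)[-budget_blocks:].tolist())
    keep.add(num_blocks - 1)                     # never evict the newest page
    return sorted(keep)

attn = np.random.rand(8, 256)                    # 8 recent queries, 256 cached tokens
attn /= attn.sum(axis=1, keepdims=True)
print(evict_blocks(attn, block_size=16, budget_blocks=6))  # indices of retained pages
```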

d) Layer-Personalized Allocation and Cascading Schedulers:

XKV (Li et al., 8 Dec 2024) and CAKE (Qin et al., 16 Mar 2025) frame KV retention as a constrained combinatorial optimization where cache budgets are dynamically distributed across layers, proportional to per-layer importance statistics (e.g., attention entropy or shift variance). CAKE further introduces a temporally robust eviction indicator and a “cascading” O(S) cache management that matches full-cache performance with only 3.2% of the storage.
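
A simplified version of the layer-wise budgeting step, using attention entropy as the importance statistic; the statistic, floor, and rounding rule here are illustrative rather than XKV's or CAKE's exact formulas.

```python
import numpy as np

def allocate_layer_budgets(attn_entropy: np.ndarray, total_budget: int,
                           min_per_layer: int = 16) -> np.ndarray:
    """Distribute `total_budget` cached tokens across layers in proportion to
    each layer's attention entropy, with a small guaranteed floor per layer."""
    num_layers = attn_entropy.shape[0]
    floor = min_per_layer * num_layers
    weights = attn_entropy / attn_entropy.sum()
    budgets = min_per_layer + np.floor(weights * (total_budget - floor)).astype(int)
    # Hand any rounding leftovers to the most important layers.
    leftover = total_budget - budgets.sum()
    budgets[np.argsort(weights)[::-1][:leftover]] += 1
    return budgets

entropy = np.random.uniform(1.0, 4.0, size=32)       # one statistic per layer
budgets = allocate_layer_budgets(entropy, total_budget=4096)
print(budgets.sum(), budgets.min(), budgets.max())   # 4096, >=16, varies by layer
```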

e) Token Similarity and Redundancy Mining:

KVCrush (Jha et al., 24 Feb 2025) groups tokens by binary “head-behaviour” vectors and selects representatives via fast Hamming-distance clustering, achieving a 4× cache reduction with <1% accuracy drop and minimal latency impact.

SpindleKV (Tang et al., 9 Jul 2025) merges similar tokens in shallow layers using codebooks (built via cosine similarity and greedy graph-based matching), and evicts tokens in deep layers using attention-weighted importance, addressing grouped-query attention (GQA) scenarios that prior methods often handle poorly.
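
A toy rendering of the token-grouping idea follows; the binarization threshold, anchor seeding, and representative choice are placeholders, not KVCrush's or SpindleKV's actual procedures.

```python
import numpy as np

def head_behaviour_bits(attn: np.ndarray) -> np.ndarray:
    """attn: [num_heads, num_tokens] attention mass per head for each cached token.
    Binarize to 1 where a token receives above-median mass for that head, giving
    each token a compact per-head 'behaviour' bit vector."""
    median = np.median(attn, axis=1, keepdims=True)
    return (attn > median).T.astype(np.uint8)          # [num_tokens, num_heads]

def pick_representatives(bits: np.ndarray, num_groups: int) -> np.ndarray:
    """Greedy clustering by Hamming distance to randomly seeded anchors; keep one
    token (the first assigned) per non-empty group as its representative."""
    rng = np.random.default_rng(0)
    anchors = bits[rng.choice(len(bits), size=num_groups, replace=False)]
    dists = (bits[:, None, :] != anchors[None, :, :]).sum(axis=2)   # Hamming distance
    assignment = dists.argmin(axis=1)
    reps = [np.flatnonzero(assignment == g)[0]
            for g in range(num_groups) if np.any(assignment == g)]
    return np.array(sorted(reps))

attn = np.random.rand(16, 512)            # 16 heads, 512 cached tokens
bits = head_behaviour_bits(attn)
kept = pick_representatives(bits, num_groups=128)
print(bits.shape, kept.shape)             # (512, 16), roughly 128 tokens retained
```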

f) Adaptive Block-Wise and Streaming Techniques:

RocketKV (Behnam et al., 19 Feb 2025) combines coarse-grained prompt eviction with dynamic, hybrid (per-head and per-sequence) top-k attention, attaining up to 31% peak-memory savings with minimal accuracy loss.

StreamMem (Yang et al., 21 Aug 2025), in the multimodal domain, implements query-agnostic, fixed-size streaming KV buffering for video understanding, using attention proxies and continuous memory condensation, supporting real-time and memory-constrained QA settings.
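
The hybrid selection step can be sketched as blending per-head and sequence-aggregated attention votes before a top-k cut; the blend weight and aggregation functions below are illustrative, not RocketKV's.

```python
import numpy as np

def hybrid_topk_tokens(scores: np.ndarray, k: int, alpha: float = 0.5) -> np.ndarray:
    """scores: [num_heads, num_tokens] attention logits for the current query.
    Blend the strongest per-head vote with the head-averaged (per-sequence) vote,
    then return indices of the k tokens actually fetched from the cache."""
    per_head = scores.max(axis=0)                 # strongest single-head vote per token
    per_seq = scores.mean(axis=0)                 # sequence-aggregated vote
    blended = alpha * per_head + (1 - alpha) * per_seq
    return np.argpartition(blended, -k)[-k:]

scores = np.random.randn(16, 4096)                # 16 heads, 4096 cached tokens
selected = hybrid_topk_tokens(scores, k=256)
print(selected.shape)                             # only 256 KV rows are loaded this step
```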

3. Scheduling, Admission Control, and Dynamic Buffer Management

A KVCache-centric architecture must balance maximizing effective throughput (model FLOPs utilization) against service-level objectives (SLOs) on latency, such as time-to-first-token (TTFT) and time-between-tokens (TBT) (Qin et al., 24 Jun 2024). The Mooncake “Conductor” scheduler admits and assigns requests to compute resources based on real-time estimates of cache reuse (prefix match), queuing delay, and transfer time. Early-rejection policies, applied especially under overload, are based on predicted decoding load (using system-level models that consider token budgets and estimated decode times). This reduces wasted prefill computation and flattens load oscillation, directly improving throughput and SLO satisfaction rates.
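
A highly simplified, hypothetical version of such a scheduler is sketched below: it scores candidate prefill instances by an estimated TTFT (queueing, plus transferring the reused KVCache prefix, plus computing only the uncached suffix) and rejects requests early when either the TTFT SLO or the predicted decoding load cannot be met. The cost model, field names, and thresholds are placeholders, not Mooncake's Conductor.

```python
from dataclasses import dataclass

@dataclass
class PrefillInstance:
    queued_tokens: int         # tokens already waiting in this instance's queue
    cached_prefix: int         # tokens of this request's prefix already cached here
    prefill_tok_per_s: float
    transfer_tok_per_s: float  # KVCache pull bandwidth from the disaggregated pool

def estimated_ttft(req_tokens: int, inst: PrefillInstance) -> float:
    """Placeholder cost model: queueing + transferring the reused prefix +
    computing only the uncached suffix."""
    queue = inst.queued_tokens / inst.prefill_tok_per_s
    transfer = inst.cached_prefix / inst.transfer_tok_per_s
    compute = (req_tokens - inst.cached_prefix) / inst.prefill_tok_per_s
    return queue + transfer + compute

def schedule(req_tokens: int, instances, ttft_slo: float,
             decode_load: float, decode_capacity: float):
    """Pick the instance with the lowest estimated TTFT; reject early if either
    the TTFT SLO or the predicted decoding load cannot be met."""
    if decode_load >= decode_capacity:
        return None                                    # early rejection: decoding overloaded
    best = min(instances, key=lambda i: estimated_ttft(req_tokens, i))
    return best if estimated_ttft(req_tokens, best) <= ttft_slo else None

pool = [PrefillInstance(3000, 2048, 8000, 40000), PrefillInstance(500, 0, 8000, 40000)]
print(schedule(4096, pool, ttft_slo=1.0, decode_load=0.7, decode_capacity=1.0))
```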

Complementary approaches (e.g., Cake (Jin et al., 4 Oct 2024)) exploit bidirectional (compute–I/O) prefill scheduling: chunked prefill proceeds in parallel from both ends of the prompt (GPU computation from the front, cached-KV loading via I/O from the back), with the two streams merging at the point that minimizes overall TTFT, further reducing latency bottlenecks under variable hardware conditions.
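
Under the simplifying assumption that the whole prompt's KV is available in slower storage and bandwidths are constant, the optimal merge point is simply where compute-from-the-front and load-from-the-back finish at the same time; the function name and rates below are illustrative.

```python
def bidirectional_split(total_tokens: int, compute_tok_per_s: float,
                        io_tok_per_s: float) -> int:
    """Tokens to prefill on the GPU from the front; the remaining (cached) KV is
    streamed from storage from the back. Balancing the two finish times minimizes
    overall TTFT:  s / c = (T - s) / b  =>  s = T * c / (b + c)."""
    c, b = compute_tok_per_s, io_tok_per_s
    return round(total_tokens * c / (c + b))

split = bidirectional_split(total_tokens=32768, compute_tok_per_s=10000, io_tok_per_s=25000)
print(split, 32768 - split)   # front tokens recomputed vs. back tokens loaded from storage
```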

4. Systems, Metadata, and Buffer Transfer Optimizations

As KVCache disaggregation and remote access become common, transfer and metadata management overheads threaten to become bottlenecks. Practical findings include:

  • Metadata Patterns: Real-world KVCache access traces reveal high temporal and spatial locality, dominated by sequential block accesses, which is the basis for high hit rates with range queries (Zhu et al., 28 May 2025). Lookup latency from suboptimal metadata indices (e.g., Redis, or RDMA-based stores not optimized for sequential/range queries) can come to dominate TTFT.
  • SmartNIC Optimization: FlexiNS (Chen et al., 25 Apr 2025) offloads the network stack to SmartNICs using header-only transmit, in-cache receive-side processing, and DMA-only notification pipes, outperforming Mooncake's hardware-offloaded and CPU-centric stacks in throughput for bulk KVCache transfers.
  • Workload-Aware Cache Eviction: Real service traces indicate an exponential decay in KV block reuse probability, with rates tied to specific workload categories (Wang et al., 3 Jun 2025). A workload-aware, priority-based eviction policy that combines predicted reuse probability with a block's positional offset in the sequence improves cache hit rates by 8–24% and TTFT by up to 42%, substantially outperforming generic LRU/LFU policies (a minimal sketch of such a priority follows this list).
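
A minimal sketch of such a workload-aware priority follows; the exponential-decay form, the 0.7/0.3 blend weights, and the block metadata are illustrative, not the paper's fitted model.

```python
import math
import time
from typing import Optional

def block_priority(last_access_ts: float, reuse_decay_s: float,
                   offset_in_seq: int, seq_len: int,
                   now: Optional[float] = None) -> float:
    """Workload-aware eviction priority for one cached KV block (higher = keep).
    Blends an exponentially decaying reuse estimate with the block's position in
    the sequence (earlier prefix blocks tend to be reused more often)."""
    now = time.time() if now is None else now
    reuse_prob = math.exp(-(now - last_access_ts) / reuse_decay_s)
    prefix_weight = 1.0 - offset_in_seq / max(seq_len, 1)
    return 0.7 * reuse_prob + 0.3 * prefix_weight      # illustrative blend weights

def pick_victim(blocks: dict) -> str:
    """blocks: block_id -> (last_access_ts, reuse_decay_s, offset_in_seq, seq_len).
    Evict the block with the lowest priority."""
    return min(blocks, key=lambda b: block_priority(*blocks[b]))

now = time.time()
blocks = {
    "shared_prefix_blk0": (now - 30.0, 300.0, 0, 8192),   # older, but a hot shared prefix
    "chat_tail_blk511": (now - 5.0, 60.0, 8176, 8192),    # recent, but deep in the sequence
}
print(pick_victim(blocks))   # the deep, short-lived block is evicted first
```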

5. Multi-expert, Mixed, and Streaming Buffering Scenarios

LLM architectures employing Mixture-of-Experts (MoE) layers, e.g., Switch Transformer, present unique KV buffering challenges. PiKV (Liu et al., 2 Aug 2025) introduces expert-sharded KV storage and per-token adaptive routing to reduce dense cache replication and memory budget, integrating modular compression and scheduling for query-aware eviction. Reported reductions of 3.9× in per-GPU memory with minimal accuracy loss highlight the effectiveness of coordinated buffer management.
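
A toy expert-sharded layout can be sketched as follows; the class name, top-k routing rule, and storage layout are illustrative, not PiKV's implementation.

```python
from collections import defaultdict
import numpy as np

class ExpertShardedKV:
    """Toy expert-sharded KV store: each token's KV entry lives only on the
    shard(s) of the expert(s) it was routed to, instead of being replicated
    on every GPU."""
    def __init__(self, num_experts: int):
        self.shards = defaultdict(list)           # expert_id -> list of (token_id, k, v)
        self.num_experts = num_experts

    def route(self, gate_logits: np.ndarray, top_k: int = 1) -> np.ndarray:
        return np.argsort(gate_logits)[-top_k:]   # top-k expert choice for this token

    def insert(self, token_id: int, gate_logits: np.ndarray, k, v) -> None:
        for e in self.route(gate_logits):
            self.shards[int(e)].append((token_id, k, v))

    def shard_sizes(self) -> dict:
        return {e: len(entries) for e, entries in self.shards.items()}

store = ExpertShardedKV(num_experts=8)
for t in range(1000):
    store.insert(t, np.random.randn(8), np.random.randn(64), np.random.randn(64))
print(store.shard_sizes())   # each shard holds roughly 1/8 of the KV entries
```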

For streaming and multimodal LLMs, frameworks like StreamMem (Yang et al., 21 Aug 2025) achieve query-agnostic, fixed-size KV buffering, leveraging proxy attention and hybrid token merging, facilitating efficient long-range video understanding within tight resource bounds.

6. Empirical Performance, Production Considerations, and Future Directions

Empirical studies consistently emphasize end-to-end throughput, SLO attainment on TTFT/TBT, and memory footprint as the axes along which KVCache-centric designs are judged in production. Much ongoing work addresses the integration of compression, quantization, structured page/block management, and workload prediction into transparent, adaptive serving frameworks, as well as extensions to hierarchical, hardware-adaptive, and multimodal/multi-expert inference pipelines.

7. Summary Table: Principal Techniques in KVCache-centric Buffering

| Strategy | Memory Impact | Accuracy Impact | System Integration | Notable Refs |
|---|---|---|---|---|
| Quantization | 2–10× ↓ | <5% loss | CUDA kernel fusion, paged | (Duanmu et al., 10 May 2024; Sharma et al., 27 Nov 2024) |
| Channel shrink | 4–5× ↓ | Minimal | Bi-branch, QAT-ready | (Wang et al., 16 Sep 2024) |
| Structured eviction | O(1) budget | Task-dependent | Paged alignment, block-aware | (Chitty-Venkata et al., 4 Sep 2025; He et al., 9 Jan 2025) |
| Layer allocation | 2–5× ↓ | Minimal | Cascading, global schedule | (Li et al., 8 Dec 2024; Qin et al., 16 Mar 2025) |
| Token grouping | ~4× ↓ | <1% loss | Page-compatible, real-time | (Jha et al., 24 Feb 2025) |
| Speculative load | ~10× ↓ VRAM | <2% loss | CPU offload, top-k prefetch | (Jie et al., 20 Mar 2025) |
| MoE sharding | 3–4× ↓ per GPU | <1.5% loss | Expert-shard, route, compress | (Liu et al., 2 Aug 2025) |

These methodologies, when combined within a KVCache-centric serving and buffering stack, enable LLMs to scale to million-token contexts and multi-instance deployments, while adhering to stringent latency and cost constraints. Underpinning all recent progress is a recognition that KVCache management is not merely an implementation detail, but a core axis for architectural and algorithmic co-design in modern large-scale LLM systems.
