Mooncake Architecture for LLM Inference
- Mooncake is a disaggregated LLM serving system centered on a global KVCache, separating prefill and decoding into dedicated clusters for optimized inference.
- It utilizes underused CPU, DRAM, and SSD resources with an RDMA-based data transfer mechanism to minimize latency and expand cache capacity.
- Its KVCache-aware scheduler and prediction-based overload handling maximize throughput and ensure strict adherence to Service Level Objectives.
Mooncake is a KVCache-centric, disaggregated serving system for LLMs, developed for Moonshot AI's Kimi service. It is distinguished by its architectural separation of prefill and decoding clusters, its use of underutilized CPU, DRAM, and SSD resources to construct a global, disaggregated KVCache, and a KVCache-aware scheduler designed to maximize throughput under strict Service Level Objectives (SLOs). Mooncake is engineered for both long-context efficiency and robust overload handling, yielding significant performance gains over contemporary serving systems.
1. KVCache-centric Disaggregated Serving Pipeline
Mooncake centers its design on the management and reuse of the KVCache: the key/value tensors produced by a Transformer model's attention layers, first generated during prefill and then reused throughout decoding. The system explicitly separates the serving workflow into two stages: the prefill stage (prompt encoding and initial KVCache generation) and the decoding stage (autoregressive token generation that leverages the cached results). Dedicated GPU clusters serve each stage independently. This separation enables:
- Prefill clusters to optimize for cache reuse, avoiding redundant computation when possible
- Decoding clusters to batch tokens from many requests, improving Model FLOPs Utilization (MFU) while meeting "time between tokens" (TBT) targets
- Independent scaling of prefill and decoding clusters, permitting fine-grained resource allocation responsive to differing latency requirements: “time to first token” (TTFT) for prefill versus TBT for decoding
This model avoids contention and enables specialized scheduling, as each cluster is tuned for distinct workload characteristics.
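The following Python sketch illustrates how such a two-stage, disaggregated flow can be expressed; the names (Request, PrefillInstance, DecodeInstance, serve) are illustrative assumptions rather than Mooncake's actual interfaces, and the KVCache handoff is reduced to an opaque handle instead of an RDMA transfer between clusters.

```python
# Minimal sketch of a disaggregated prefill/decode flow. All names here are
# hypothetical; the real system moves the KVCache between clusters over RDMA
# rather than via an in-process handle.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kvcache_ref: str | None = None                   # handle to KVCache produced by prefill
    output_tokens: list[int] = field(default_factory=list)


class PrefillInstance:
    """Encodes the prompt and publishes a KVCache handle (TTFT-bound stage)."""

    def run(self, req: Request) -> Request:
        # Placeholder for prompt encoding; the handle stands in for the cache blocks.
        req.kvcache_ref = f"kv://{hash(tuple(req.prompt_tokens)) & 0xFFFF:04x}"
        return req


class DecodeInstance:
    """Generates tokens autoregressively from the cached prefill state (TBT-bound stage)."""

    def run(self, req: Request) -> Request:
        assert req.kvcache_ref is not None, "decoding requires a prefilled KVCache"
        req.output_tokens = [0] * req.max_new_tokens  # placeholder generation
        return req


def serve(req: Request,
          prefill_pool: list[PrefillInstance],
          decode_pool: list[DecodeInstance]) -> Request:
    # The two pools scale independently; a scheduler (Conductor) would choose
    # instances here based on cache reuse and load.
    req = prefill_pool[0].run(req)
    return decode_pool[0].run(req)
```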
2. Distributed KVCache Resource Utilization
Mooncake leverages idle resources in GPU clusters—specifically CPU cores, DRAM, and SSD storage—to form a disaggregated global KVCache pool. Architectural features include:
- KVCache blocks are stored in a paged format in CPU memory (DRAM), expanding cache capacity well beyond GPU memory limits
- An RDMA-based “Messenger” component enables high-speed data transfer between CPU/SSD caches and GPU nodes, minimizing cache fetch latency
- KVCache can be served from multiple storage hierarchies: nearby DRAM, SSD-backed storage, or direct GPU memory
This design facilitates efficient reuse of intermediate results, reducing compute by avoiding redundant prompt processing, especially important in long-context inference scenarios.
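As a rough illustration of this layered pool, the sketch below keys paged KVCache blocks by chained token-block hashes and spills from DRAM to SSD. The block size, hashing scheme, and spill policy are simplifying assumptions rather than Mooncake's actual design, and the RDMA fetch performed by the Messenger component is reduced to a comment.

```python
# Sketch of a paged, two-tier KVCache pool (DRAM hot tier, SSD spill tier).
# BLOCK_TOKENS, the chained hashing, and the naive spill policy are assumptions
# for illustration; fetching into GPU memory would be an RDMA read in practice.
import hashlib

BLOCK_TOKENS = 256  # assumed page size in tokens


def block_hashes(prompt_tokens: list[int]) -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes share all leading hashes."""
    hashes, prev = [], b""
    for i in range(0, len(prompt_tokens), BLOCK_TOKENS):
        block = prompt_tokens[i:i + BLOCK_TOKENS]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev.hex())
    return hashes


class KVCachePool:
    def __init__(self, dram_capacity_blocks: int):
        self.dram: dict[str, bytes] = {}   # hot tier: CPU DRAM pages
        self.ssd: dict[str, bytes] = {}    # cold tier: SSD-backed pages
        self.capacity = dram_capacity_blocks

    def put(self, block_hash: str, kv_bytes: bytes) -> None:
        if len(self.dram) >= self.capacity:
            # Naive spill to SSD stands in for a real eviction policy.
            victim, data = self.dram.popitem()
            self.ssd[victim] = data
        self.dram[block_hash] = kv_bytes

    def get(self, block_hash: str) -> bytes | None:
        # Check DRAM first, then SSD; a real fetch would move the page to GPU memory.
        if block_hash in self.dram:
            return self.dram[block_hash]
        return self.ssd.get(block_hash)
```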
3. KVCache-centric Scheduler and Load Balancing
The Conductor scheduler in Mooncake pairs prefill and decoding instances for each incoming request. Its operational workflow involves:
- Prefix match length computation: For each prompt, token block hashes are compared against cached entries to estimate how much computation can be reused
- Latency estimation: the estimated TTFT for a candidate instance is the sum of
  - T_queue: queue waiting time
  - T_prefill: prefill-stage execution time, dependent on input length and cache match
  - T_transfer: time needed to fetch missing cache blocks
- Optimization objective: Minimize TTFT while ensuring no SLO violation in TTFT or TBT
- Proactive cache migration: If a high cache match instance is available, KVCache blocks are migrated to optimize load balancing
This enables deterministic scheduling under multiple latency constraints, maximizing resource utilization and overall system throughput.
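A minimal sketch of this selection logic is shown below, assuming a linear cost model for prefill and transfer time. The coefficients and the PrefillCandidate fields are illustrative, and the real Conductor additionally pairs a decoding instance and enforces the TBT SLO.

```python
# Sketch of KVCache-aware prefill instance selection. The linear cost model,
# its coefficients, and the candidate fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class PrefillCandidate:
    queue_time: float        # T_queue: estimated waiting time in the instance's queue
    matched_blocks: int      # cached prefix length (in blocks) already on the instance
    transfer_blocks: int     # blocks that must be fetched from the remote KVCache pool


def estimate_ttft(cand: PrefillCandidate, total_blocks: int,
                  prefill_per_block: float = 0.05,
                  transfer_per_block: float = 0.002) -> float:
    # prefill_per_block / transfer_per_block: assumed seconds per block (illustrative).
    t_prefill = (total_blocks - cand.matched_blocks) * prefill_per_block
    t_transfer = cand.transfer_blocks * transfer_per_block
    return cand.queue_time + t_prefill + t_transfer  # TTFT = T_queue + T_prefill + T_transfer


def pick_prefill_instance(candidates: list[PrefillCandidate],
                          total_blocks: int,
                          ttft_slo: float) -> PrefillCandidate | None:
    best = min(candidates, key=lambda c: estimate_ttft(c, total_blocks))
    if estimate_ttft(best, total_blocks) > ttft_slo:
        return None          # no placement meets TTFT; hand off to overload handling
    return best
```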
4. Overload Handling and Early Rejection Strategies
Mooncake addresses real-world overload by integrating a prediction-based early rejection policy:
- Requests are evaluated for acceptance or rejection before prefill execution, based on whether the predicted load on both the prefill and decoding clusters allows their SLOs to be met
- The scheduler forecasts decoding-stage capacity using a system-level prediction (e.g., assuming a uniform per-token decode time t_d), avoiding wasteful prefill computation for requests unlikely to meet TBT SLOs (see the sketch after this list)
- The nuanced prediction-based rejection smooths out anti-phase fluctuations between cluster loads, improving effective throughput and reducing wasted computation
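The sketch below illustrates such a prediction-based admission check under the uniform per-token decode time assumption; the slot-based capacity model, threshold, and default values are illustrative simplifications, not Mooncake's actual load predictor.

```python
# Sketch of prediction-based early rejection. The uniform per-token decode
# time t_d and the slot-based capacity model are simplifying assumptions.

def predicted_decode_load(active_remaining_tokens: list[int],
                          prefill_finish_in: float,
                          t_d: float,
                          decode_slots: int) -> float:
    """Predicted decode-slot utilization at the moment this request's prefill would finish.

    active_remaining_tokens: remaining output tokens of requests currently decoding;
    with a uniform per-token time t_d, each finishes in remaining * t_d seconds.
    """
    still_running = sum(1 for remaining in active_remaining_tokens
                        if remaining * t_d > prefill_finish_in)
    return still_running / decode_slots


def admit(prefill_finish_in: float,
          active_remaining_tokens: list[int],
          t_d: float = 0.03,
          decode_slots: int = 64,
          threshold: float = 0.95) -> bool:
    # Reject before spending any prefill compute if the decoding cluster is
    # predicted to be saturated when the request would reach it.
    load = predicted_decode_load(active_remaining_tokens, prefill_finish_in,
                                 t_d, decode_slots)
    return load < threshold
```

Predicting load at the time the request would arrive at the decoding cluster, rather than reacting to current load, is what dampens the anti-phase fluctuations between the two clusters described above.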
In overload simulations, improved early rejection policies led to up to 600 fewer wasted requests versus baseline systems.
5. Performance Evaluation and Throughput Gains
Mooncake demonstrates substantial throughput and latency improvements:
- Up to a 525% throughput improvement relative to baseline methods (e.g., vLLM) in simulated long-context scenarios
- On public datasets such as ArXiv Summarization and L-Eval, Mooncake-[3P+1D] configurations achieved 20–40% higher throughput than a 4-instance vLLM deployment while maintaining SLOs
- With actual Kimi production traces, Mooncake processed approximately 75% more requests than baseline, meeting nearly 100% of TBT SLOs (versus 57% with baseline)
- Load prediction mechanisms in overload mode improved request handling, sharply reducing wasted requests and enabling sustained high effective throughput
The following table summarizes comparative results:
| Scenario | Mooncake Throughput Gain | SLO Adherence |
| --- | --- | --- |
| Simulated long-context | Up to 525% over baseline | TTFT/TBT satisfied |
| End-to-end on public datasets | 20–40% over vLLM | Both TTFT/TBT met |
| Real production traces | ~75% more requests | ~100% of TBT SLOs met |
6. Application in Large-scale LLM Deployment
Mooncake powers the Kimi LLM service at Moonshot AI, demonstrating its suitability for large-scale, production-grade deployments:
- Independent cluster scaling and disaggregated cache avoid node overload, even with long or high-frequency user prompts
- Scheduler-driven cache reuse minimizes redundant computations under strict latency budgets
- Utilization of existing CPU, DRAM, and SSD resources decreases deployment cost and enhances scalability
- Prediction-based overload handling enables graceful system degradation, ensuring service reliability and high SLO satisfaction during peak periods
This suggests Mooncake’s architectural approach is conducive to both cost-effective scaling and robust, latency-sensitive LLM inference serving.