Mooncake Architecture for LLM Inference
- Mooncake is a disaggregated LLM serving system centered on a global KVCache, separating prefill and decoding into dedicated clusters for optimized inference.
- It utilizes underused CPU, DRAM, and SSD resources with an RDMA-based data transfer mechanism to minimize latency and expand cache capacity.
- Its KVCache-aware scheduler and prediction-based overload handling maximize throughput and ensure strict adherence to Service Level Objectives.
Mooncake is a KVCache-centric, disaggregated serving system for LLMs, developed for Moonshot AI's Kimi service. It is distinguished by its architectural separation of prefill and decoding clusters, its use of underutilized CPU, DRAM, and SSD resources to construct a global, disaggregated KVCache, and a KVCache-aware scheduler designed to maximize throughput under strict Service Level Objectives (SLOs). Mooncake is engineered for both long-context efficiency and robust overload handling, yielding significant performance gains over contemporary serving systems.
1. KVCache-centric Disaggregated Serving Pipeline
Mooncake centers its design on the management and reuse of the KVCache: the key/value tensors produced by a Transformer model's attention layers, first generated during prefill and then reused throughout decoding. The system explicitly separates the serving workflow into two stages: the prefill stage (prompt encoding and initial KVCache generation) and the decoding stage (autoregressive token generation that leverages the cached results). Dedicated GPU clusters serve each stage independently. This separation enables:
- Prefill clusters to optimize for cache reuse, avoiding redundant computation when possible
- Decoding clusters to batch tokens from many requests, improving Model FLOPs Utilization (MFU) while meeting "time between tokens" (TBT) targets
- Independent scaling of prefill and decoding clusters, permitting fine-grained resource allocation responsive to differing latency requirements: “time to first token” (TTFT) for prefill versus TBT for decoding
This model avoids contention and enables specialized scheduling, as each cluster is tuned for distinct workload characteristics.
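The following Python sketch illustrates how such a two-stage, disaggregated flow can be expressed; the names (Request, PrefillInstance, DecodeInstance, serve) are illustrative assumptions rather than Mooncake's actual interfaces, and the KVCache handoff is reduced to an opaque handle instead of an RDMA transfer between clusters.

```python
# Minimal sketch of a disaggregated prefill/decode flow. All names here are
# hypothetical; the real system moves the KVCache between clusters over RDMA
# rather than via an in-process handle.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kvcache_ref: str | None = None                   # handle to KVCache produced by prefill
    output_tokens: list[int] = field(default_factory=list)


class PrefillInstance:
    """Encodes the prompt and publishes a KVCache handle (TTFT-bound stage)."""

    def run(self, req: Request) -> Request:
        # Placeholder for prompt encoding; the handle stands in for the cache blocks.
        req.kvcache_ref = f"kv://{hash(tuple(req.prompt_tokens)) & 0xFFFF:04x}"
        return req


class DecodeInstance:
    """Generates tokens autoregressively from the cached prefill state (TBT-bound stage)."""

    def run(self, req: Request) -> Request:
        assert req.kvcache_ref is not None, "decoding requires a prefilled KVCache"
        req.output_tokens = [0] * req.max_new_tokens  # placeholder generation
        return req


def serve(req: Request,
          prefill_pool: list[PrefillInstance],
          decode_pool: list[DecodeInstance]) -> Request:
    # The two pools scale independently; a scheduler (Conductor) would choose
    # instances here based on cache reuse and load.
    req = prefill_pool[0].run(req)
    return decode_pool[0].run(req)
```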
2. Distributed KVCache Resource Utilization
Mooncake leverages idle resources in GPU clusters—specifically CPU cores, DRAM, and SSD storage—to form a disaggregated global KVCache pool. Architectural features include:
- KVCache blocks are stored in a paged format in CPU memory (DRAM), expanding cache capacity well beyond GPU memory limits
- An RDMA-based “Messenger” component enables high-speed data transfer between CPU/SSD caches and GPU nodes, minimizing cache fetch latency
- KVCache can be served from multiple storage hierarchies: nearby DRAM, SSD-backed storage, or direct GPU memory
This design facilitates efficient reuse of intermediate results, reducing compute by avoiding redundant prompt processing, especially important in long-context inference scenarios.
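As a rough illustration of this layered pool, the sketch below keys paged KVCache blocks by chained token-block hashes and spills from DRAM to SSD. The block size, hashing scheme, and spill policy are simplifying assumptions rather than Mooncake's actual design, and the RDMA fetch performed by the Messenger component is reduced to a comment.

```python
# Sketch of a paged, two-tier KVCache pool (DRAM hot tier, SSD spill tier).
# BLOCK_TOKENS, the chained hashing, and the naive spill policy are assumptions
# for illustration; fetching into GPU memory would be an RDMA read in practice.
import hashlib

BLOCK_TOKENS = 256  # assumed page size in tokens


def block_hashes(prompt_tokens: list[int]) -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes share all leading hashes."""
    hashes, prev = [], b""
    for i in range(0, len(prompt_tokens), BLOCK_TOKENS):
        block = prompt_tokens[i:i + BLOCK_TOKENS]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev.hex())
    return hashes


class KVCachePool:
    def __init__(self, dram_capacity_blocks: int):
        self.dram: dict[str, bytes] = {}   # hot tier: CPU DRAM pages
        self.ssd: dict[str, bytes] = {}    # cold tier: SSD-backed pages
        self.capacity = dram_capacity_blocks

    def put(self, block_hash: str, kv_bytes: bytes) -> None:
        if len(self.dram) >= self.capacity:
            # Naive spill to SSD stands in for a real eviction policy.
            victim, data = self.dram.popitem()
            self.ssd[victim] = data
        self.dram[block_hash] = kv_bytes

    def get(self, block_hash: str) -> bytes | None:
        # Check DRAM first, then SSD; a real fetch would move the page to GPU memory.
        if block_hash in self.dram:
            return self.dram[block_hash]
        return self.ssd.get(block_hash)
```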
3. KVCache-centric Scheduler and Load Balancing
The Conductor scheduler in Mooncake pairs prefill and decoding instances for each incoming request. Its operational workflow involves:
- Prefix match length computation: For each prompt, token block hashes are compared against cached entries to estimate how much computation can be reused
- Latency estimation: the estimated TTFT for a candidate instance is the sum of
  - T_queue: queue waiting time
  - T_prefill: prefill-stage execution time, dependent on input length and cache match
  - T_transfer: time needed to fetch missing cache blocks
- Optimization objective: Minimize TTFT while ensuring no SLO violation in TTFT or TBT
- Proactive cache migration: If a high cache match instance is available, KVCache blocks are migrated to optimize load balancing
This enables deterministic scheduling under multiple latency constraints, maximizing resource utilization and overall system throughput.
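A minimal sketch of this selection logic is shown below, assuming a linear cost model for prefill and transfer time. The coefficients and the PrefillCandidate fields are illustrative, and the real Conductor additionally pairs a decoding instance and enforces the TBT SLO.

```python
# Sketch of KVCache-aware prefill instance selection. The linear cost model,
# its coefficients, and the candidate fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class PrefillCandidate:
    queue_time: float        # T_queue: estimated waiting time in the instance's queue
    matched_blocks: int      # cached prefix length (in blocks) already on the instance
    transfer_blocks: int     # blocks that must be fetched from the remote KVCache pool


def estimate_ttft(cand: PrefillCandidate, total_blocks: int,
                  prefill_per_block: float = 0.05,
                  transfer_per_block: float = 0.002) -> float:
    # prefill_per_block / transfer_per_block: assumed seconds per block (illustrative).
    t_prefill = (total_blocks - cand.matched_blocks) * prefill_per_block
    t_transfer = cand.transfer_blocks * transfer_per_block
    return cand.queue_time + t_prefill + t_transfer  # TTFT = T_queue + T_prefill + T_transfer


def pick_prefill_instance(candidates: list[PrefillCandidate],
                          total_blocks: int,
                          ttft_slo: float) -> PrefillCandidate | None:
    best = min(candidates, key=lambda c: estimate_ttft(c, total_blocks))
    if estimate_ttft(best, total_blocks) > ttft_slo:
        return None          # no placement meets TTFT; hand off to overload handling
    return best
```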
4. Overload Handling and Early Rejection Strategies
Mooncake addresses real-world overload by integrating a prediction-based early rejection policy:
- Requests are evaluated for acceptance or rejection before prefill execution, based on whether the predicted load on both the prefill and decoding clusters allows their SLOs to be met
- The scheduler forecasts decoding-stage capacity using a system-level prediction (e.g., assuming a uniform per-token decode time t_d), avoiding wasteful prefill computation for requests unlikely to meet TBT SLOs (see the sketch after this list)
- The nuanced prediction-based rejection smooths out anti-phase fluctuations between cluster loads, improving effective throughput and reducing wasted computation
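The sketch below illustrates such a prediction-based admission check under the uniform per-token decode time assumption; the slot-based capacity model, threshold, and default values are illustrative simplifications, not Mooncake's actual load predictor.

```python
# Sketch of prediction-based early rejection. The uniform per-token decode
# time t_d and the slot-based capacity model are simplifying assumptions.

def predicted_decode_load(active_remaining_tokens: list[int],
                          prefill_finish_in: float,
                          t_d: float,
                          decode_slots: int) -> float:
    """Predicted decode-slot utilization at the moment this request's prefill would finish.

    active_remaining_tokens: remaining output tokens of requests currently decoding;
    with a uniform per-token time t_d, each finishes in remaining * t_d seconds.
    """
    still_running = sum(1 for remaining in active_remaining_tokens
                        if remaining * t_d > prefill_finish_in)
    return still_running / decode_slots


def admit(prefill_finish_in: float,
          active_remaining_tokens: list[int],
          t_d: float = 0.03,
          decode_slots: int = 64,
          threshold: float = 0.95) -> bool:
    # Reject before spending any prefill compute if the decoding cluster is
    # predicted to be saturated when the request would reach it.
    load = predicted_decode_load(active_remaining_tokens, prefill_finish_in,
                                 t_d, decode_slots)
    return load < threshold
```

Predicting load at the time the request would arrive at the decoding cluster, rather than reacting to current load, is what dampens the anti-phase fluctuations between the two clusters described above.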
In overload simulations, improved early rejection policies led to up to 600 fewer wasted requests versus baseline systems.
5. Performance Evaluation and Throughput Gains
Mooncake demonstrates substantial throughput and latency improvements:
- Up to a 525% throughput improvement relative to baseline methods (e.g., vLLM) in simulated long-context scenarios
- On public datasets such as ArXiv Summarization and L-Eval, Mooncake-[3P+1D] configurations achieved 20–40% higher throughput than a 4-instance vLLM deployment while maintaining SLOs
- With actual Kimi production traces, Mooncake processed approximately 75% more requests than baseline, meeting nearly 100% of TBT SLOs (versus 57% with baseline)
- Load prediction mechanisms in overload mode improved request handling, sharply reducing wasted requests and enabling sustained high effective throughput
The following table summarizes comparative results:
| Scenario | Mooncake Throughput Gain | SLO Adherence |
| --- | --- | --- |
| Simulated long-context | Up to 525% over baseline | TTFT/TBT satisfied |
| End-to-end on public datasets | 20–40% over vLLM | Both TTFT/TBT met |
| Real production traces | ~75% more requests | ~100% of TBT SLOs met |
6. Application in Large-scale LLM Deployment
Mooncake powers the Kimi LLM service at Moonshot AI, demonstrating its suitability for large-scale, production-grade deployments:
- Independent cluster scaling and disaggregated cache avoid node overload, even with long or high-frequency user prompts
- Scheduler-driven cache reuse minimizes redundant computations under strict latency budgets
- Utilization of existing CPU, DRAM, and SSD resources decreases deployment cost and enhances scalability
- Prediction-based overload handling enables graceful system degradation, ensuring service reliability and high SLO satisfaction during peak periods
This suggests Mooncake’s architectural approach is conducive to both cost-effective scaling and robust, latency-sensitive LLM inference serving.