
Mooncake Architecture for LLM Inference

Updated 15 October 2025
  • Mooncake Architecture is a disaggregated LLM serving system centered on a global KVCache that separates prefill and decoding clusters for optimized inference.
  • It utilizes underused CPU, DRAM, and SSD resources with an RDMA-based data transfer mechanism to minimize latency and expand cache capacity.
  • Its KVCache-aware scheduler and prediction-based overload handling maximize throughput and ensure strict adherence to Service Level Objectives.

Mooncake Architecture is a KVCache-centric, disaggregated serving system for LLMs, developed for Moonshot AI's Kimi service. It is distinguished by its architectural separation of prefill and decoding clusters, its novel use of underutilized CPU, DRAM, and SSD resources to construct a global, disaggregated KVCache, and a KVCache-aware scheduler designed for throughput maximization under strict Service Level Objectives (SLOs). Mooncake is engineered for both long-context input efficiency and robust overload handling, resulting in significant performance improvements over contemporary methods.

1. KVCache-centric Disaggregated Serving Pipeline

Mooncake centers its design around the management and reuse of KVCache—the key/value pairs output by the prefill stage of a Transformer model's inference. The system explicitly separates the model serving workflow into two stages: the prefill stage (prompt encoding and initial KVCache generation) and the decoding stage (autoregressive token generation leveraging cached results). Dedicated GPU clusters serve each stage independently. This separation enables:

  • Prefill clusters to optimize for cache reuse, avoiding redundant computation when possible
  • Decoding clusters to batch tokens across requests, maximizing model FLOPs utilization (MFU) while keeping “time between tokens” (TBT) within its SLO
  • Independent scaling of prefill and decoding clusters, permitting fine-grained resource allocation responsive to differing latency requirements: “time to first token” (TTFT) for prefill versus TBT for decoding

This model avoids contention and enables specialized scheduling, as each cluster is tuned for distinct workload characteristics.
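
To make the separation concrete, the sketch below shows, in simplified Python, how a request might be handed from a prefill worker to a decoding worker through a shared KVCache pool. The class names (KVCacheStore, PrefillWorker, DecodeWorker) and the placeholder "forward passes" are illustrative assumptions, not Mooncake internals.

```python
# Minimal sketch of a prefill/decode disaggregated pipeline.
# Names (KVCacheStore, PrefillWorker, DecodeWorker) are illustrative only.
from dataclasses import dataclass, field

@dataclass
class KVCacheStore:
    """Shared pool keyed by prompt-prefix hash (stands in for the global KVCache)."""
    blocks: dict = field(default_factory=dict)

    def put(self, prefix_hash: str, kv_blocks: list) -> None:
        self.blocks[prefix_hash] = kv_blocks

    def get(self, prefix_hash: str):
        return self.blocks.get(prefix_hash)

class PrefillWorker:
    """Encodes the prompt once and publishes its KVCache (TTFT-critical stage)."""
    def __init__(self, store: KVCacheStore):
        self.store = store

    def prefill(self, request_id: str, prompt_tokens: list) -> str:
        prefix_hash = str(hash(tuple(prompt_tokens)))
        if self.store.get(prefix_hash) is None:
            # Placeholder for the actual forward pass that produces K/V tensors.
            kv_blocks = [f"kv({t})" for t in prompt_tokens]
            self.store.put(prefix_hash, kv_blocks)
        return prefix_hash  # handle passed to the decoding cluster

class DecodeWorker:
    """Autoregressive generation reusing the published KVCache (TBT-critical stage)."""
    def __init__(self, store: KVCacheStore):
        self.store = store

    def decode(self, prefix_hash: str, max_new_tokens: int) -> list:
        kv_blocks = self.store.get(prefix_hash)
        assert kv_blocks is not None, "KVCache must be transferred before decoding"
        return [f"tok{i}" for i in range(max_new_tokens)]  # placeholder generation

# Usage: the two stages run on separate clusters; only the cache handle crosses.
store = KVCacheStore()
handle = PrefillWorker(store).prefill("req-1", [101, 2023, 2003, 1037])
print(DecodeWorker(store).decode(handle, max_new_tokens=4))
```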

2. Distributed KVCache Resource Utilization

Mooncake leverages idle resources in GPU clusters—specifically CPU cores, DRAM, and SSD storage—to form a disaggregated global KVCache pool. Architectural features include:

  • KVCache blocks are stored in a paged format in CPU memory (DRAM), dramatically expanding cache capacity beyond GPU memory limits
  • An RDMA-based “Messenger” component enables high-speed data transfer between CPU/SSD caches and GPU nodes, minimizing cache fetch latency
  • KVCache can be served from multiple tiers of the storage hierarchy: nearby DRAM, SSD-backed storage, or GPU memory directly

This design facilitates efficient reuse of intermediate results, reducing compute by avoiding redundant prompt processing, especially important in long-context inference scenarios.
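
A minimal sketch of such a tiered, paged cache is shown below. The tier names, page size, and promotion-on-fetch behavior are simplifying assumptions used only to illustrate how KVCache pages might be tracked across GPU memory, DRAM, and SSD.

```python
# Sketch of a tiered, paged KVCache lookup (GPU -> DRAM -> SSD).
# Tier names and page size are illustrative placeholders, not Mooncake internals.
from enum import Enum

class Tier(Enum):
    GPU_HBM = 1
    CPU_DRAM = 2
    SSD = 3

PAGE_TOKENS = 16  # tokens per KVCache page (block size is a modeling assumption)

class PagedKVCachePool:
    def __init__(self):
        # page_id -> tier currently holding that page
        self.page_location: dict[int, Tier] = {}

    def admit(self, page_id: int, tier: Tier) -> None:
        self.page_location[page_id] = tier

    def fetch(self, page_id: int) -> Tier:
        """Return where the page was served from, promoting cold pages toward the GPU."""
        tier = self.page_location.get(page_id)
        if tier is None:
            raise KeyError(f"page {page_id} not cached; prefill must recompute it")
        if tier is not Tier.GPU_HBM:
            # Stand-in for an RDMA/DMA transfer performed by a Messenger-like component.
            self.page_location[page_id] = Tier.GPU_HBM
        return tier

pool = PagedKVCachePool()
pool.admit(0, Tier.CPU_DRAM)
pool.admit(1, Tier.SSD)
print(pool.fetch(0), pool.fetch(1))  # served from DRAM and SSD, then resident on GPU
```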

3. KVCache-centric Scheduler and Load Balancing

The Conductor scheduler in Mooncake pairs prefill and decoding instances for each incoming request. Its operational workflow involves:

  • Prefix match length computation: For each prompt, token block hashes are compared against cached entries to estimate how much computation can be reused
  • Latency estimation:
    • $T_{\text{queue}}$: queue waiting time
    • $T_{\text{prefill}}$: prefill-stage execution time, dependent on input length and cache match
    • $T_{\text{transfer}}$: time needed to fetch missing cache blocks
  • Optimization objective: select the prefill instance that minimizes the estimated $TTFT \approx T_{\text{queue}} + T_{\text{transfer}} + T_{\text{prefill}}$ while ensuring no SLO violation in TTFT or TBT
  • Proactive cache migration: if the instance holding the longest cache match is heavily loaded, its KVCache blocks can be migrated to a less-loaded instance to balance load

This enables deterministic scheduling under multiple latency constraints, maximizing resource utilization and overall system throughput.
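
The sketch below illustrates this style of KVCache-aware selection: for each candidate prefill instance, an estimated TTFT is computed as $T_{\text{queue}} + T_{\text{transfer}} + T_{\text{prefill}}$, and the instance with the smallest estimate that satisfies the SLO is chosen. The cost coefficients, instance fields, and rejection fallback are illustrative assumptions rather than Mooncake's actual cost model.

```python
# Sketch of KVCache-aware instance selection: estimate TTFT per candidate prefill
# instance as T_queue + T_transfer + T_prefill and pick the minimum that meets the SLO.
# Cost coefficients and instance fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PrefillInstance:
    name: str
    queue_time_s: float         # T_queue: current queueing delay
    matched_prefix_tokens: int  # longest cached prefix for this prompt on this instance

PER_TOKEN_PREFILL_S = 0.0005   # assumed prefill cost per uncached token
PER_TOKEN_TRANSFER_S = 0.0001  # assumed cost to pull one cached token's KV blocks remotely

def estimate_ttft(inst: PrefillInstance, prompt_len: int, global_cached_tokens: int) -> float:
    """T_queue + T_transfer (missing blocks pulled from the global pool) + T_prefill."""
    local_hit = min(inst.matched_prefix_tokens, prompt_len)
    remote_hit = max(min(global_cached_tokens, prompt_len) - local_hit, 0)
    uncached = prompt_len - local_hit - remote_hit
    t_transfer = remote_hit * PER_TOKEN_TRANSFER_S
    t_prefill = uncached * PER_TOKEN_PREFILL_S
    return inst.queue_time_s + t_transfer + t_prefill

def schedule(instances, prompt_len, global_cached_tokens, ttft_slo_s):
    scored = [(estimate_ttft(i, prompt_len, global_cached_tokens), i) for i in instances]
    ttft, best = min(scored, key=lambda x: x[0])
    return (best, ttft) if ttft <= ttft_slo_s else (None, ttft)  # None -> reject or queue

instances = [
    PrefillInstance("prefill-0", queue_time_s=0.20, matched_prefix_tokens=4096),
    PrefillInstance("prefill-1", queue_time_s=0.05, matched_prefix_tokens=512),
]
best, ttft = schedule(instances, prompt_len=8192, global_cached_tokens=6144, ttft_slo_s=3.0)
print(best.name if best else "rejected", round(ttft, 3))
```

Under these assumed coefficients, the instance with the longer cached prefix wins despite its longer queue, because the avoided prefill work dominates the transfer cost.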

4. Overload Handling and Early Rejection Strategies

Mooncake addresses real-world overload by integrating a prediction-based early rejection policy:

  • Requests are evaluated for acceptance or rejection before prefill execution, based on whether the predicted load on both the prefill and decoding clusters can still satisfy their SLOs
  • The scheduler forecasts decoding-stage capacity using a system-level prediction (e.g., a uniform per-token decode time $t_d$), which avoids wasting prefill computation on requests unlikely to meet the TBT SLO
  • Prediction-based rejection also smooths out anti-phase load fluctuations between the prefill and decoding clusters, improving effective throughput and reducing wasted computation

In overload simulations, improved early rejection policies led to up to 600 fewer wasted requests versus baseline systems.
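
A minimal sketch of such a policy appears below, assuming a uniform per-token decode time $t_d$, a fixed per-instance decode capacity, and a linear degradation model once the decoding cluster is oversubscribed; all three are placeholders chosen for illustration, not Mooncake's predictor.

```python
# Sketch of prediction-based early rejection: before running prefill, estimate whether
# the decoding cluster can still meet the TBT SLO once this request arrives there.
# t_d, capacity, and the degradation model are modeling assumptions.

T_D = 0.04           # assumed uniform per-token decode time t_d, in seconds
DECODE_SLOTS = 64    # assumed concurrent decode capacity of the cluster
TBT_SLO_S = 0.05     # time-between-tokens SLO

def predicted_decode_load(active_requests: int, arriving_requests: int) -> float:
    """Fraction of decode capacity in use once the new requests finish prefill."""
    return (active_requests + arriving_requests) / DECODE_SLOTS

def predicted_tbt(load: float) -> float:
    """Crude model: TBT degrades linearly once the cluster is oversubscribed."""
    return T_D * max(load, 1.0)

def admit(active_requests: int, queued_requests: int) -> bool:
    """Reject before prefill if the forecast TBT would violate the SLO."""
    load = predicted_decode_load(active_requests, queued_requests + 1)  # +1: this request
    return predicted_tbt(load) <= TBT_SLO_S

# Usage: admit while headroom remains; reject early instead of wasting prefill compute.
print(admit(active_requests=60, queued_requests=20))   # False: forecast overload
print(admit(active_requests=40, queued_requests=10))   # True: SLO still satisfiable
```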

5. Performance Evaluation and Throughput Gains

Mooncake demonstrates substantial throughput and latency improvements:

  • Up to a 525% throughput improvement relative to baseline methods (e.g., vLLM) in simulated long-context scenarios
  • On public datasets such as ArXiv Summarization and L-Eval, Mooncake-[3P+1D] configurations achieved 20–40% higher throughput than 4-instance vLLM while maintaining SLOs
  • With actual Kimi production traces, Mooncake processed approximately 75% more requests than baseline, meeting nearly 100% of TBT SLOs (versus 57% with baseline)
  • Load prediction mechanisms in overload mode improved request handling, sharply reducing wasted requests and enabling sustained high effective throughput

The following table summarizes comparative results:

Scenario                   | Mooncake throughput gain | SLO adherence
Simulated long-context     | 525% over baseline       | TTFT and TBT satisfied
End-to-end on public data  | 20–40% over vLLM         | Both TTFT and TBT met
Real production traces     | ~75% more requests       | ~100% of TBT SLOs met

6. Application in Large-scale LLM Deployment

Mooncake powers the Kimi LLM service at Moonshot AI, demonstrating its suitability for large-scale, production-grade deployments:

  • Independent cluster scaling and disaggregated cache avoid node overload, even with long or high-frequency user prompts
  • Scheduler-driven cache reuse minimizes redundant computations under strict latency budgets
  • Utilization of extant CPU, DRAM, and SSD resources decreases deployment cost and enhances scalability
  • Prediction-based overload handling enables graceful system degradation, ensuring service reliability and high SLO satisfaction during peak periods

This suggests Mooncake’s architectural approach is conducive to both cost-effective scaling and robust, latency-sensitive LLM inference serving.
