Mooncake: Kimi's KVCache-centric Architecture for LLM Serving
The paper, "Mooncake: Kimi's KVCache-centric Architecture for LLM Serving," presents a novel disaggregated architecture aimed at enhancing the serving capabilities of LLMs. This architecture, known as Mooncake, addresses the intricate challenges posed by a diversified workload environment, focusing particularly on optimizing throughput while adhering to Service Level Objectives (SLOs).
Architectural Overview
Mooncake is built around a KVCache-centric global scheduler that manages computational resources across GPU clusters. The architecture separates the prefill and decoding stages into distinct clusters and additionally harnesses the underutilized CPU, DRAM, and SSD resources of the GPU cluster to form a disaggregated KVCache pool. The separation is motivated by the two stages' differing computational profiles: prefill processes the whole prompt in parallel and is compute-bound, while decoding generates tokens one at a time and is memory-bound. By redistributing work across these resources, Mooncake improves both compute and memory utilization without incurring significant additional hardware cost.
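To make this split concrete, here is a minimal Python sketch of two node pools built around a shared KVCache store. The names (`KVCacheStore`, `PrefillNode`, `DecodeNode`) and the byte-string stand-ins for KV tensors are illustrative assumptions, not Mooncake's actual components; in the real system the pool spans CPU DRAM and SSD and caches move between machines over the network.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    kvcache_key: str | None = None  # set once prefill has published its cache

class KVCacheStore:
    """Disaggregated KVCache pool; a dict stands in for CPU DRAM/SSD storage."""
    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, key: str, block: bytes) -> None:
        self._blocks[key] = block

    def get(self, key: str) -> bytes | None:
        return self._blocks.get(key)

class PrefillNode:
    """Compute-bound stage: processes the whole prompt in parallel."""
    def __init__(self, store: KVCacheStore) -> None:
        self.store = store

    def run(self, req: Request) -> Request:
        kv = bytes(len(req.prompt_tokens))    # stand-in for real KV tensors
        req.kvcache_key = f"kv/{req.request_id}"
        self.store.put(req.kvcache_key, kv)   # publish to the shared pool
        return req

class DecodeNode:
    """Memory-bound stage: generates tokens one at a time from the cache."""
    def __init__(self, store: KVCacheStore) -> None:
        self.store = store

    def run(self, req: Request) -> list[int]:
        kv = self.store.get(req.kvcache_key)  # fetch cache written by prefill
        assert kv is not None, "prefill must run before decode"
        return [101, 102, 103]                # stand-in for generated tokens

store = KVCacheStore()
req = PrefillNode(store).run(Request("r1", list(range(32))))
print(DecodeNode(store).run(req))
```

The key point is that any decoding node can continue a request whose prefill ran elsewhere, because the cache lives in the shared pool rather than on the prefill GPU.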
Central to Mooncake is its KVCache-centric scheduling mechanism, which prioritizes reusing previously computed KVCache entries to avoid redundant prefill computation. This design raises system throughput and reduces time to first token (TTFT) and time between tokens (TBT), the two latency metrics that define Mooncake's SLOs, directly improving user experience.
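The reuse decision is essentially a longest-prefix match over fixed-size blocks of the KVCache. The sketch below, assuming chained block hashes (the block size and hashing scheme are illustrative, not the paper's exact parameters), shows how a scheduler can determine how much of a new prompt's prefill can be skipped:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KVCache block (an illustrative choice)

def block_hashes(tokens: list[int]) -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes yield equal keys."""
    hashes: list[str] = []
    prev = b""
    full_blocks = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full_blocks, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev.hex())
    return hashes

def reusable_prefix(tokens: list[int], cache: set[str]) -> int:
    """Number of prompt tokens covered by blocks already in the cache."""
    covered = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        covered += BLOCK_SIZE
    return covered

# A previously served 64-token prompt populates the cache; a new prompt that
# shares its first 64 tokens only needs fresh prefill for the remaining 32.
cache = set(block_hashes(list(range(64))))
new_prompt = list(range(64)) + [999] * 32
print(reusable_prefix(new_prompt, cache))  # -> 64
```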
Performance Metrics and Results
Experimental results validate Mooncake's efficacy. Benchmarked against state-of-the-art systems such as vLLM, Mooncake demonstrated throughput increases of up to 525% in simulated long-context scenarios while staying within its SLOs. Under real-world workloads, Mooncake's architecture enabled Kimi to handle 75% more requests without violating SLOs. These gains are attributed to the KVCache-centric scheduler's ability to balance load across instances and exploit cache reuse.
Improving KVCache Utilization
A distinctive aspect of Mooncake's architecture is its prediction-based early rejection policy. Under heavy overload, this policy forecasts the load on the decoding stage at the time an incoming request would reach it and preemptively rejects requests that are unlikely to meet their SLOs, saving prefill computation that would otherwise be wasted. This mitigates a pitfall of conventional rejection based on current load, which reacts too late and can cause the load to oscillate between the prefill and decoding clusters, degrading performance. The predictive strategy keeps the system better balanced and improves overall resource utilization.
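A minimal sketch of the idea: rather than checking the current load, estimate the request's TTFT from the prefill queue and project the decoding load at the moment the request would start decoding. The linear cost model, constants, and helper names below are assumptions for illustration, not the paper's actual estimator.

```python
from dataclasses import dataclass

TTFT_SLO = 5.0  # seconds; example time-to-first-token objective

@dataclass
class ClusterState:
    queued_prefill_tokens: int   # tokens waiting in the prefill cluster
    active_decode_requests: int  # requests currently decoding
    prefill_tok_per_s: float     # aggregate prefill throughput
    decode_capacity: int         # concurrent decodes that still meet the TBT SLO

def predicted_ttft(state: ClusterState, prompt_tokens: int) -> float:
    """Queueing delay plus this request's own prefill time (linear cost model)."""
    return (state.queued_prefill_tokens + prompt_tokens) / state.prefill_tok_per_s

def predicted_decode_load(state: ClusterState, delay: float,
                          finish_rate_per_s: float) -> int:
    """Decode requests expected to remain once this request finishes prefill."""
    finished = int(delay * finish_rate_per_s)
    return max(state.active_decode_requests - finished, 0) + 1

def admit(state: ClusterState, prompt_tokens: int,
          finish_rate_per_s: float = 2.0) -> bool:
    """Reject early if either the predicted TTFT or the projected decode
    load at arrival time would violate the SLOs."""
    ttft = predicted_ttft(state, prompt_tokens)
    load = predicted_decode_load(state, ttft, finish_rate_per_s)
    return ttft <= TTFT_SLO and load <= state.decode_capacity

state = ClusterState(queued_prefill_tokens=20_000, active_decode_requests=40,
                     prefill_tok_per_s=10_000.0, decode_capacity=48)
print(admit(state, prompt_tokens=8_000))   # True: 2.8 s TTFT, decode load 36
print(admit(state, prompt_tokens=40_000))  # False: 6.0 s TTFT exceeds the SLO
```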
Theoretical and Practical Implications
From a theoretical viewpoint, Mooncake marks a significant shift in how disaggregated architectures can be applied to LLM serving. The KVCache-centric design not only raises throughput but also offers a blueprint for future systems that must balance compute-bound and memory-bound operations across heterogeneous resources. Practically, adopting Mooncake can substantially shape the operational strategies of model-as-a-service (MaaS) providers, enabling them to offer more robust services without the prohibitively high incremental costs typically associated with scaling out GPU infrastructure.
Future Directions
Looking forward, the paper outlines several pathways for further improvement: exploring heterogeneous accelerators to better match memory-bound and compute-intensive operations, integrating advanced KVCache management algorithms such as ZipCache, and adopting hybrid architectures that consolidate memory usage. Algorithmic advances that shrink the KVCache footprint could also yield significant efficiency gains, particularly for long-context requests.
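As one concrete illustration of what footprint reduction buys, the sketch below applies per-block int8 quantization to KV tensors, halving storage relative to fp16. This is not ZipCache itself, only a generic example of the kind of compression such methods pursue; the block shape and scaling scheme are assumptions.

```python
import numpy as np

def quantize_block(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one fp16 KV block to int8 plus a single scale factor."""
    scale = float(np.max(np.abs(kv))) / 127.0 or 1.0  # avoid zero scale
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp16 block for attention computation."""
    return q.astype(np.float16) * scale

kv = np.random.randn(16, 8, 128).astype(np.float16)  # (tokens, heads, head_dim)
q, scale = quantize_block(kv)
print(kv.nbytes, "->", q.nbytes)  # 32768 -> 16384: half the cache footprint
```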
Additionally, Mooncake's architecture could benefit from richer scheduling policies that account for varying request priorities and rebalance instances dynamically. Putting otherwise idle resources to work on batch-oriented offloaded tasks during periods of low demand, as sketched below, could further improve system efficiency and resilience.
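A minimal sketch of such a policy, assuming two priority levels and a load threshold below which batch work is admitted (the names and constants are hypothetical; the paper only suggests these as directions, not a concrete design):

```python
import heapq
import itertools

ONLINE, BATCH = 0, 1  # lower value = higher priority

class PriorityScheduler:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def submit(self, request: str, priority: int = ONLINE) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self, load: float, idle_threshold: float = 0.3) -> str | None:
        """Serve online traffic first; admit batch work only when load is low."""
        if not self._heap:
            return None
        priority, _, request = self._heap[0]
        if priority == BATCH and load > idle_threshold:
            return None  # hold batch work while the cluster is busy
        heapq.heappop(self._heap)
        return request

sched = PriorityScheduler()
sched.submit("interactive-chat", ONLINE)
sched.submit("offline-eval", BATCH)
print(sched.next_request(load=0.8))  # interactive-chat served first
print(sched.next_request(load=0.8))  # None: batch job held back under load
print(sched.next_request(load=0.1))  # offline-eval runs once the cluster idles
```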
Conclusion
Overall, the architecture proposed in "Mooncake: Kimi's KVCache-centric Architecture for LLM Serving" is a substantial contribution to the field of LLM serving. Its design tackles the existing challenges of diversified workloads and sets the stage for scalable, efficient MaaS operations. The empirical results and design principles established by Mooncake are likely to inspire further research and development in LLM serving infrastructure.