
P/D-Serve: Serving Disaggregated Large Language Model at Scale

Published 15 Aug 2024 in cs.DC, cs.CL, and cs.LG | (2408.08147v1)

Abstract: Serving disaggregated LLMs over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the prompts in a mixed pool is inadequate. To facilitate the similarity per scenario and minimize the inner mismatch on P/D (prefill and decoding) processing, fine-grained organization is required, dynamically adjusting P/D ratios for better performance. 2) Due to inaccurate estimation on workload (queue status or maintained connections), the global scheduler easily incurs unnecessary timeouts in prefill. 3) Block-fixed device-to-device (D2D) KVCache transfer over cluster-level RDMA (remote direct memory access) fails to achieve desired D2D utilization as expected. To overcome previous problems, this paper proposes an end-to-end system P/D-Serve, complying with the paradigm of MLOps (machine learning operations), which models end-to-end (E2E) P/D performance and enables: 1) fine-grained P/D organization, mapping the service with RoCE (RDMA over converged ethernet) as needed, to facilitate similar processing and dynamic adjustments on P/D ratios; 2) on-demand forwarding upon rejections for idle prefill, decoupling the scheduler from regular inaccurate reports and local queues, to avoid timeouts in prefill; and 3) efficient KVCache transfer via optimized D2D access. P/D-Serve is implemented upon Ascend and MindSpore, has been deployed over tens of thousands of NPUs for more than eight months in commercial use, and further achieves 60\%, 42\% and 46\% improvements on E2E throughput, time-to-first-token (TTFT) SLO (service level objective) and D2D transfer time. As the E2E system with optimizations, P/D-Serve achieves 6.7x increase on throughput, compared with aggregated LLMs.


Summary

  • The paper introduces P/D-Serve, an end-to-end system that optimizes disaggregated LLM serving across tens of thousands of xPU devices.
  • It employs dynamic P/D ratio adjustments and automated fault management to achieve up to a 6.7x throughput increase over aggregated serving and a 42% improvement in TTFT SLO.
  • Efficient D2D KVCache transfers are enabled by a block-free buffer design and asynchronous retrieval, minimizing software overhead and queuing delays.

P/D-Serve: Serving Disaggregated LLM at Scale

The paper, "P/D-Serve: Serving Disaggregated LLM at Scale" (2408.08147), introduces an innovative end-to-end system named P/D-Serve. The system adresses key challenges associated with serving disaggregated LLMs across expansive infrastructures comprising tens of thousands of xPU devices, including GPUs and NPUs. The paper proposes a comprehensive architecture that addresses the challenges of diversity in prompts, accurate workload estimation, and efficient device-to-device (D2D) KVCache transfers, among others.

Disaggregated LLM Architecture

Autoregressive LLMs: The paper focuses on autoregressive LLMs, which are built on the self-attention mechanism. Transformer models fall into encoder-only and decoder-only families, typified by BERT and Llama respectively; autoregressive serving concerns the decoder-only family. These models generate each token conditioned on all previous outputs, using a key-value cache (KVCache) to store intermediate attention states so that they are not recomputed at every step, a necessity given the rapid growth in model size and in the number of generated tokens.
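
To make the role of the KVCache concrete, here is a minimal toy sketch (single attention head, random weights, no real model) in which each decode step computes key/value projections only for the newest token and reuses cached rows for all earlier ones; every name and dimension is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = (K @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))     # grows by one row per token
V_cache = np.empty((0, d))

for step in range(4):
    x = rng.standard_normal(d)  # embedding of the newest token
    # Only the new token's key/value are computed; earlier rows are
    # reused from the cache instead of being recomputed each step.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attention(x @ Wq, K_cache, V_cache)
    print(f"step {step}: cache holds {len(K_cache)} tokens")
```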

Infrastructure Challenges (Figure 1):

  • Physical Resource Constraint: Tens of thousands of xPU devices connected via RDMA, forming an extensive data plane.
  • Disaggregated Architecture: The partitioning of prefill and decoding instances facilitates lower time-to-first-token (TTFT) and improved throughput for subsequent token decoding.
  • KVCache Transfers: With model parallelism (such as tensor parallelism), the KVCache is sharded across devices, so D2D KVCache transfer requires careful, fine-grained management to avoid mismatched resource allocation between prefill and decoding instances (a schematic sketch follows Figure 1).

    Figure 1: Infrastructure contains tens of thousands of xPUs.
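
As a schematic illustration of the prefill/decoding split, the sketch below (toy placeholder data, not the paper's implementation) shows prefill producing the full-prompt KVCache plus the first token, after which a decoding stage continues generation from the transferred cache; in P/D-Serve this handoff crosses devices over RDMA.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    keys: list      # one entry per prompt or generated token
    values: list

def prefill(prompt_tokens):
    """Process the whole prompt at once; return the cache and the first token."""
    cache = KVCache(keys=[f"K({t})" for t in prompt_tokens],
                    values=[f"V({t})" for t in prompt_tokens])
    return cache, "<tok0>"              # stand-in for real sampling

def decode(cache, first_token, max_new=3):
    """Continue generation on a separate instance, reusing the transferred cache."""
    tokens = [first_token]
    for i in range(1, max_new):
        # Each decode step appends exactly one key/value pair to the cache.
        cache.keys.append(f"K({tokens[-1]})")
        cache.values.append(f"V({tokens[-1]})")
        tokens.append(f"<tok{i}>")
    return tokens

cache, first = prefill(["The", "quick", "brown", "fox"])
# In P/D-Serve the cache moves between devices over RDMA; here the
# "transfer" is simply passing the object to the decode stage.
print(decode(cache, first))             # ['<tok0>', '<tok1>', '<tok2>']
```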

Proposed System: P/D-Serve

P/D-Serve is designed to address the inefficiencies of past disaggregated LLM serving systems, offering efficient resource management at a massive scale. The end-to-end P/D-Serve system integrates several key innovations:

Fine-grained P/D Organization: In response to the wide diversity of real-world scenarios (Figure 2), P/D-Serve maps services onto RDMA over Converged Ethernet (RoCE) as needed and organizes P/D instances per scenario, enabling precise handling of P/D ratios (Figure 3). Dynamically adjusting ratios per scenario minimizes the mismatch between the processing capabilities of prefill and decoding instances, optimizing throughput while meeting TTFT requirements (Figure 4); a grouping sketch follows the figures.

Figure 2: Various Prompts in Services (Scenarios).

Figure 3: Overview of P/D-Serve.

Figure 4: Mismatch and Bottleneck.
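
The paper does not publish its grouping code; the following hypothetical sketch illustrates the idea of pooling requests per scenario, here keyed by service name and a coarse prompt-length bucket, so that each pool can be given its own P/D group and ratio.

```python
from collections import defaultdict

# Hypothetical scenario routing (names are illustrative, not the paper's API):
# requests that share a service and a similar prompt length are pooled
# together, so each pool can be given its own P/D group and ratio.
pools = defaultdict(list)

def scenario_key(request):
    # Bucket by service id and coarse prompt length, so prompts with
    # similar prefill cost land in the same P/D group.
    length_bucket = len(request["prompt"]) // 512
    return (request["service"], length_bucket)

def route(request):
    pools[scenario_key(request)].append(request)

route({"service": "chat",   "prompt": "x" * 300})
route({"service": "chat",   "prompt": "y" * 400})
route({"service": "search", "prompt": "z" * 1500})
for key, reqs in pools.items():
    print(key, "->", len(reqs), "request(s)")
```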

Automated Fault Management (Figure 5): The system employs a custom monitoring mechanism, a Flask service integrated per node, enabling efficient auto-recovery without disrupting ongoing services; this is particularly essential in environments with large-scale NPU usage. A minimal health-endpoint sketch follows Figure 5.

Figure 5: Automatic Fault Detection.
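
The paper mentions a per-node Flask service for monitoring; the endpoint below is a hypothetical reconstruction of such a health check, with check_npus() as a stand-in for a real device probe.

```python
# Hypothetical per-node health endpoint; the paper mentions a Flask service
# per node, but check_npus() and the route are reconstructions, not its API.
from flask import Flask, jsonify

app = Flask(__name__)

def check_npus():
    # Stand-in for a real device probe (driver query, heartbeat, etc.).
    return [{"device": i, "healthy": True} for i in range(8)]

@app.route("/health")
def health():
    devices = check_npus()
    ok = all(d["healthy"] for d in devices)
    # A cluster controller polls this endpoint and can trigger auto-recovery
    # (drain traffic, restart the instance) when ok turns False.
    return jsonify({"ok": ok, "devices": devices}), (200 if ok else 503)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```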

P/D Efficiency Optimization

P/D Ratio Adjustment:

Optimal coordination between prefill and decoding is achieved through dynamic P/D adjustment, which reduces processing mismatch and refines E2E throughput; Figure 6 shows on-demand forwarding for idle prefill and Figure 7 the resulting changes in success rate. By maintaining this balance, P/D-Serve significantly boosts throughput, achieving a 6.7x increase compared with aggregated serving, a 60% enhancement in E2E throughput, and up to a 42% improvement in TTFT SLO; Figure 8 shows throughput under different ratios. A back-of-the-envelope balancing sketch follows the figures.

Figure 6: On-demand Forwarding for Idle Prefill.

Figure 7: Changes on Success Rate.

Figure 8: Throughput under Ratios.
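
The balancing intuition can be captured with simple arithmetic: at steady state, the rate at which prefill instances admit requests must match the rate at which decoding instances complete them. The throughput numbers below are assumed for illustration only.

```python
# Back-of-the-envelope P/D ratio balancing; the throughput numbers are
# assumed for illustration only, not measurements from the paper.
prefill_reqs_per_s = 12.0   # requests one prefill instance can admit per second
decode_reqs_per_s = 4.0     # requests one decoding instance can finish per second

def balanced_ratio(p_rate, d_rate):
    # Steady state requires p_instances * p_rate == d_instances * d_rate,
    # so the P:D instance ratio should be d_rate : p_rate.
    return d_rate / p_rate

ratio = balanced_ratio(prefill_reqs_per_s, decode_reqs_per_s)
print(f"P:D instance ratio ~ 1:{1 / ratio:.0f}")   # 1 prefill : 3 decode here
```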

Efficient KVCache Transfer (Figure 9):

The infrastructure ensures that D2D KVCache transfers occur with minimal overhead. The paper describes transitioning to block-free transfers using a contiguous buffer organization at the sender xPU, minimizing software overhead and making efficient use of RoCE (Figure 9). Asynchronous retrieval with limited local queues in decoding instances allows rapid batch completion without delays from queued requests. A toy cost comparison follows Figure 9.

Figure 9: Block-free D2D KVCache Transfer.
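
The benefit of block-free transfer can be illustrated with a toy cost model, assuming a fixed per-transfer software overhead (all constants below are invented for illustration): packing the KVCache blocks into one contiguous buffer pays that overhead once instead of once per block.

```python
import numpy as np

# Toy cost model for block-fixed vs block-free KVCache transfer; the
# constants are invented for illustration, not measured values.
PER_TRANSFER_OVERHEAD_US = 10.0   # software setup cost per transfer (assumed)
BYTES_PER_US = 10_000.0           # wire rate (assumed)

blocks = [np.ones(4096, dtype=np.float16) for _ in range(64)]  # KVCache blocks

def block_fixed_cost(blocks):
    # One transfer per block: the setup overhead is paid 64 times.
    return sum(PER_TRANSFER_OVERHEAD_US + b.nbytes / BYTES_PER_US for b in blocks)

def block_free_cost(blocks):
    # Pack all blocks into one contiguous buffer at the sender, then issue
    # a single transfer: the setup overhead is paid once.
    packed = np.concatenate(blocks)
    return PER_TRANSFER_OVERHEAD_US + packed.nbytes / BYTES_PER_US

print(f"block-fixed: {block_fixed_cost(blocks):7.0f} us")
print(f"block-free : {block_free_cost(blocks):7.0f} us")
```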

Conclusion

The P/D-Serve system significantly enhances the efficiency of serving disaggregated LLMs at scale on infrastructures with vast xPU resources. By addressing prompt diversity, P/D ratio adjustment, workload estimation, and KVCache transfer, the system delivers substantial improvements in TTFT, throughput, and D2D transfer time in a dynamic, large-scale xPU environment. These contributions extend the capabilities of existing LLM serving systems, enhancing scalability and efficiency. The deployment of P/D-Serve over tens of thousands of NPUs for more than eight months demonstrates its applicability in large-scale commercial environments and suggests promising avenues for further research in optimizing LLM serving.

In summary, P/D-Serve adds a robust and flexible architecture to the spectrum of LLM serving systems, providing substantial improvements in the practical deployment and operation of large-scale LLMs across diverse, high-demand environments. Future work might extend the approach to multimodal models and expand the role of heterogeneous computing for even more efficient machine learning operations.
