semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (2504.19867v1)

Published 28 Apr 2025 in cs.CL, cs.DC, and cs.LG

Abstract: Existing LLM serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.

An Overview of Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

The paper "semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage" addresses critical challenges in LLM serving systems. It identifies an inherent inefficiency in existing approaches, which fall into two main categories: unified systems and disaggregated systems. The work presents semi-PD, a novel serving framework that combines the benefits of disaggregated computation with the advantages of unified storage.

Existing Systems and Their Limitations

Traditional LLM serving systems are either unified, where all processing phases share the same GPU resources, or disaggregated, where phases are separated across different GPUs. Unified systems suffer from latency interference between phases due to resource sharing, negatively impacting throughput. In contrast, disaggregated systems resolve this interference by assigning separate GPUs to different phases, improving phase-specific scheduling but introducing storage inefficiencies, notably with duplicated weights and KV cache transfer overheads.

These storage inefficiencies lead to poor performance under high request rates. The paper extensively analyzes such drawbacks, including storage imbalance, unnecessary cache transfer latency, and coarse-grained resource adjustments. Each inefficiency results in suboptimal GPU memory utilization, hindering deployment flexibility and increasing the likelihood of latency spikes, particularly under demanding workloads.

The Proposed semi-PD Framework

The semi-PD system adopts a hybrid approach: it leverages disaggregated computation to eliminate latency interference while using unified storage to improve memory efficiency. Computation is disaggregated by partitioning resources at the streaming multi-processor (SM) level, enabling the prefill and decode phases to execute asynchronously. This is complemented by a unified memory manager that coordinates access to the shared model weights and KV cache from both phases.
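To make the unified-storage idea concrete, the sketch below shows a minimal shared KV-cache pool from which both prefill and decode workers allocate blocks on the same GPU, so no cross-phase cache transfer is needed. The class name, block-table layout, and locking scheme are illustrative assumptions, not the paper's actual memory manager.

```python
# Minimal sketch of a unified KV-cache pool shared by prefill and decode
# workers on one GPU. Names (UnifiedKVCacheManager, block ids) are
# illustrative, not taken from the semi-PD implementation.
import threading


class UnifiedKVCacheManager:
    """Single pool of KV-cache blocks; both phases allocate from it,
    so a request can move from prefill to decode without any transfer."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.lock = threading.Lock()   # serialize asynchronous access from both phases
        self.table = {}                # request_id -> list of block ids

    def allocate(self, request_id: str, n_blocks: int) -> list[int]:
        with self.lock:
            if len(self.free_blocks) < n_blocks:
                raise MemoryError("KV cache pool exhausted")
            blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
            self.table.setdefault(request_id, []).extend(blocks)
            return blocks

    def release(self, request_id: str) -> None:
        with self.lock:
            self.free_blocks.extend(self.table.pop(request_id, []))
```

Because both phases draw from the same pool, a request whose prompt was processed by the prefill worker can be continued by the decode worker simply by reusing its block indices.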

One of the core contributions of semi-PD is its lightweight resource adjustment mechanism, which allows dynamic adaptation to workload variations. Unlike traditional disaggregated systems that require GPU-level reorganization, semi-PD offers finer-grained control: it uses MPS (Multi-Process Service) to partition SMs between the prefill and decode workers, so that switching between different SM ratios incurs minimal overhead.
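As a rough illustration of this mechanism, the snippet below launches the two workers as separate processes under CUDA MPS and caps each one's SM share with the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, the standard MPS knob for limiting a client process's compute resources. The worker script names and the specific percentages are placeholders; semi-PD's actual controller is more involved.

```python
# Hedged sketch: splitting SMs between prefill and decode worker processes
# via CUDA MPS. prefill_worker.py / decode_worker.py are hypothetical
# entry points used only for illustration.
import os
import subprocess


def launch_workers(prefill_sm_pct: int, decode_sm_pct: int):
    """Start prefill and decode workers on the same GPU, each capped to a
    share of the streaming multiprocessors via MPS."""
    assert 0 < prefill_sm_pct <= 100 and 0 < decode_sm_pct <= 100

    prefill_env = dict(os.environ,
                       CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(prefill_sm_pct))
    decode_env = dict(os.environ,
                      CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(decode_sm_pct))

    prefill = subprocess.Popen(["python", "prefill_worker.py"], env=prefill_env)
    decode = subprocess.Popen(["python", "decode_worker.py"], env=decode_env)
    return prefill, decode


# Example: give prefill 70% of the SMs and decode 30%.
# launch_workers(prefill_sm_pct=70, decode_sm_pct=30)
```

Changing the split then amounts to restarting or reconfiguring the worker processes rather than migrating model weights and KV cache across GPUs, which is why the adjustment can stay lightweight.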

Performance Results

The evaluation indicates that semi-PD significantly improves latency across various benchmarks and models. Compared to state-of-the-art systems, semi-PD reduces average end-to-end latency per request by 1.27-2.58× on DeepSeek series models and serves 1.55-1.72× more requests within latency constraints on Llama series models.

Figures reporting P90 and P99 time to first token (TTFT) and time per output token (TPOT) show that semi-PD consistently exhibits lower latencies even as request rates increase. By integrating a dynamic partitioning algorithm that is aware of service-level objectives (SLOs), semi-PD adjusts the resource split at runtime, further improving throughput while adhering to latency constraints.
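A minimal sketch of what such an SLO-aware adjustment loop might look like is given below: the controller compares measured tail latencies against the TTFT and TPOT targets and shifts SM share toward whichever phase is violating its objective. The step size and bounds are assumptions for illustration, not the algorithm from the paper.

```python
# Hedged sketch of an SLO-aware repartitioning step. TTFT is governed mainly
# by the prefill phase, TPOT by the decode phase, so SM share is nudged toward
# the phase that misses its target. All constants are illustrative.
def adjust_partition(prefill_pct: int,
                     p90_ttft_ms: float, ttft_slo_ms: float,
                     p90_tpot_ms: float, tpot_slo_ms: float,
                     step: int = 5) -> int:
    """Return the new SM percentage for the prefill worker (decode gets the rest)."""
    ttft_violated = p90_ttft_ms > ttft_slo_ms   # prefill-bound objective
    tpot_violated = p90_tpot_ms > tpot_slo_ms   # decode-bound objective

    if ttft_violated and not tpot_violated:
        prefill_pct += step     # prefill is the bottleneck: give it more SMs
    elif tpot_violated and not ttft_violated:
        prefill_pct -= step     # decode is the bottleneck: give it more SMs
    # if both or neither are violated, keep the current split

    return max(10, min(90, prefill_pct))
```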

Implications and Future Directions

The implications of this work are twofold. On a practical level, semi-PD provides a scalable solution for deploying LLMs in environments with fluctuating workloads, mitigating latency interference through a lightweight and efficient system design. On a conceptual level, it challenges the strict bifurcation of existing LLM serving architectures, suggesting that hybrid designs may offer better flexibility and performance.

Future research could explore further integration of semi-PD within larger, distributed systems, assessing its impact on cloud deployment efficiency. Additionally, investigating opportunities to optimize scheduling algorithms in dynamic workloads could yield even better performance metrics. As the demand for LLM services grows, frameworks like semi-PD are foundational in supporting robust, high-efficiency serving solutions in diverse deployment scenarios.

Authors (12)
  1. Ke Hong (8 papers)
  2. Lufang Chen (2 papers)
  3. Zhong Wang (105 papers)
  4. Xiuhong Li (14 papers)
  5. Qiuli Mao (3 papers)
  6. Jianping Ma (2 papers)
  7. Chao Xiong (5 papers)
  8. Guanyu Wu (2 papers)
  9. Buhe Han (1 paper)
  10. Guohao Dai (51 papers)
  11. Yun Liang (42 papers)
  12. Yu Wang (939 papers)