An Effort Toward Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
The paper "semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage" addresses critical challenges in the serving systems for LLMs. It identifies an inherent inefficiency in existing approaches, which fall into two main categories: unified systems and disaggregated systems. This research presents semi-PD, a novel serving framework, as a solution that combines the benefits of disaggregated computation with the advantages of unified storage.
Existing Systems and Their Limitations
Traditional LLM serving systems are either unified, where the prefill and decode phases share the same GPU resources, or disaggregated, where the phases run on separate GPUs. Unified systems suffer from latency interference between the two phases, since requests in different phases compete for the same compute. Disaggregated systems resolve this interference and enable phase-specific scheduling, but they introduce storage inefficiencies, notably duplicated model weights and the overhead of transferring the KV cache between prefill and decode GPUs.
These storage inefficiencies cause poor performance under high request rates. The paper analyzes the drawbacks in detail, including storage imbalance between prefill and decode instances, avoidable KV-cache transfer latency, and coarse-grained (whole-GPU) resource adjustment. Each of these wastes GPU memory, limits deployment flexibility, and increases the likelihood of latency spikes, particularly under demanding workloads.
The Proposed semi-PD Framework
The semi-PD system adopts a hybrid approach: it keeps computation disaggregated to eliminate latency interference while unifying storage to use GPU memory efficiently. Computation is partitioned at the streaming-multiprocessor (SM) level, so the prefill and decode phases execute asynchronously on disjoint fractions of the same GPU. This is complemented by a unified memory manager that coordinates both phases' access to a single copy of the model weights and the KV cache.
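To make the split concrete, here is a minimal sketch of running prefill and decode as separate processes on one GPU, each capped to a fraction of its SMs via NVIDIA MPS's CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable. The worker structure, the run_worker function, and the 70/30 split are illustrative assumptions, not semi-PD's actual code.

```python
import os
import multiprocessing as mp

def run_worker(role: str, sm_percentage: int) -> None:
    # The MPS thread percentage must be exported before the worker
    # creates its CUDA context, so it is set here in the child process,
    # prior to importing any CUDA-backed library.
    os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    # A real worker would now attach to the shared weights and KV-cache
    # pool, then enter the phase-specific scheduling loop for `role`.
    print(f"{role} worker limited to {sm_percentage}% of SMs")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    # Illustrative 70/30 prefill/decode split; semi-PD tunes this
    # ratio dynamically, as described below.
    workers = [
        mp.Process(target=run_worker, args=("prefill", 70)),
        mp.Process(target=run_worker, args=("decode", 30)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```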
A core contribution of semi-PD is its lightweight resource-adjustment mechanism, which adapts to workload variations on the fly. Whereas traditional disaggregated systems must reorganize resources at whole-GPU granularity, semi-PD adjusts at the SM level: it uses NVIDIA's Multi-Process Service (MPS) to partition SMs between the prefill and decode workers, so transitioning between different SM ratios incurs minimal overhead, as the sketch below suggests.
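The following hypothetical transition routine hints at why the switch can be cheap; the spawn/drain hooks and the worker bookkeeping are assumptions, not the paper's implementation. The key point is that, with weights and KV cache in unified storage, workers restarted under a new SM split attach to existing state instead of reloading it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SMPartition:
    """An SM split between the two phases; percentages sum to 100."""
    prefill_pct: int
    decode_pct: int

def transition(old_workers, new_split: SMPartition, spawn, drain):
    """Switch to a new SM ratio without flushing in-flight requests.

    `spawn(role, pct)` launches an MPS-capped worker (as in the sketch
    above) and `drain(worker)` lets a worker finish its current batch
    before releasing its SMs; both are hypothetical hooks.
    """
    new_workers = [
        spawn("prefill", new_split.prefill_pct),
        spawn("decode", new_split.decode_pct),
    ]
    for worker in old_workers:
        drain(worker)  # hand over once the current iteration completes
    return new_workers
```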
Performance Results
The evaluation indicates that semi-PD significantly improves latency across benchmarks and models. For instance, compared to state-of-the-art systems, semi-PD achieves a 1.27-2.58× reduction in average latency on DeepSeek-series models and serves 1.55-1.72× more requests within latency constraints on Llama-series models.
Figures reporting P90 and P99 time-to-first-token (TTFT) and time-per-output-token (TPOT) show that semi-PD consistently maintains lower latencies even as request rates increase. By integrating a dynamic adjustment algorithm driven by service-level objectives (SLOs), semi-PD tunes the resource partition at runtime, further improving throughput while adhering to latency constraints; a sketch of such a controller follows.
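Below is a minimal sketch of an SLO-driven controller; the step size, bounds, and decision rule are assumptions rather than the paper's exact algorithm. The intuition is that TTFT is prefill-bound and TPOT is decode-bound, so SMs shift toward whichever phase is further over its SLO.

```python
def choose_partition(p90_ttft: float, p90_tpot: float,
                     ttft_slo: float, tpot_slo: float,
                     prefill_pct: int, step: int = 5,
                     lo: int = 10, hi: int = 90) -> tuple[int, int]:
    """Hypothetical SLO-driven adjustment, not the paper's exact rule.

    Shifts SMs toward the phase whose latency metric exceeds its SLO
    by the larger ratio; returns the new (prefill, decode) split.
    """
    ttft_ratio = p90_ttft / ttft_slo  # > 1.0 means the SLO is violated
    tpot_ratio = p90_tpot / tpot_slo
    if ttft_ratio > 1.0 and ttft_ratio >= tpot_ratio:
        prefill_pct = min(prefill_pct + step, hi)  # relieve prefill
    elif tpot_ratio > 1.0:
        prefill_pct = max(prefill_pct - step, lo)  # relieve decode
    return prefill_pct, 100 - prefill_pct
```

For example, starting from a 70/30 split with P90 TTFT at 1.4× its SLO and TPOT within budget, the controller returns (75, 25), granting prefill more SMs on the next adjustment cycle.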
Implications and Future Directions
This work has both practical and theoretical implications. Practically, semi-PD offers a scalable way to deploy LLMs under fluctuating workloads, mitigating latency interference with a lightweight, efficient system design. Theoretically, it challenges the strict bifurcation of LLM serving architectures into unified and disaggregated designs, suggesting that hybrid models can offer better flexibility and performance.
Future research could explore integrating semi-PD into larger distributed deployments and assess its impact on cloud serving efficiency. Optimizing the scheduling algorithm for highly dynamic workloads could also improve performance further. As demand for LLM services grows, frameworks like semi-PD lay the groundwork for robust, high-efficiency serving across diverse deployment scenarios.