P/D-Serve: Scalable LLM Inference Architecture
- P/D-Serve is a disaggregated serving architecture for large language models that separates inference into prefill and decoding phases.
- It employs a rejection-based, pull scheduling protocol and dynamic pool scaling to manage heterogeneous resource utilization and reduce queue delays.
- Its contiguous, block-free KVCache transfers with RDMA orchestration cut latency by 46% and significantly boost throughput.
The P/D-Serve system is an end-to-end serving architecture and scheduling methodology for large-scale disaggregated LLM inference. At its core, P/D-Serve decomposes inference requests into two distinct phases—prefill (P) and decoding (D)—and coordinates their execution across pools of heterogeneous accelerator devices to optimize throughput, latency (notably time-to-first-token, TTFT), and resource utilization. The paradigm spans systems-level implementation, queueing-theoretic modeling, RDMA-based memory orchestration, and fine-grained control/scheduling for large cloud-scale deployments (Jin et al., 15 Aug 2024).
1. Architectural Decomposition of LLM Inference
P/D-Serve formalizes the strict division of the LLM inference pipeline:
- Prefill Phase (P): Processes the input prompt, runs the full transformer stack, computes the initial token, and constructs the complete prefix of the KV cache. Its principal service metric is TTFT (time-to-first-token).
- Decoding Phase (D): Sequentially consumes the generated KV cache to produce one token at each iteration, sensitive to memory I/O and cache locality. The critical metric is TPOT (time per output token).
In P/D-Serve, these phases are mapped to separate pools of stateless containers, each managing a set of xPU (GPU or NPU) instances with explicit, dynamic allocation (Jin et al., 15 Aug 2024). The infrastructure employs RoCE v2 for high-speed, low-latency device-to-device (D2D) communication, with precompiled models resident on scalable storage. The system is managed by a highly available MLOps control plane (Kubernetes, ZooKeeper, automated scaling, and fault detection).
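The phase split can be illustrated with a minimal sketch (a toy stand-in model with hypothetical `prefill`/`decode_step` methods, not the P/D-Serve or MindSpore API):

```python
# Minimal sketch of the prefill/decoding split (toy stand-in model, not the P/D-Serve API).

class ToyModel:
    """Stand-in for an LLM: prefill builds the KV cache, decode_step extends it by one token."""
    eos_token_id = 0

    def prefill(self, prompt_tokens):
        kv_cache = list(prompt_tokens)          # pretend the KV cache is just the token history
        first_token = (sum(kv_cache) % 50) + 1  # deterministic fake "next token"
        return kv_cache, first_token

    def decode_step(self, last_token, kv_cache):
        kv_cache = kv_cache + [last_token]
        next_token = (sum(kv_cache) % 50) + 1
        return next_token, kv_cache

def prefill_phase(model, prompt_tokens):
    """Runs on a prefill (P) instance; its latency is the TTFT."""
    kv_cache, first_token = model.prefill(prompt_tokens)
    return first_token, kv_cache                # the KV cache is then shipped to a D instance

def decoding_phase(model, first_token, kv_cache, max_new_tokens):
    """Runs on a decoding (D) instance; each iteration contributes to TPOT."""
    tokens = [first_token]
    while len(tokens) < max_new_tokens and tokens[-1] != model.eos_token_id:
        next_token, kv_cache = model.decode_step(tokens[-1], kv_cache)
        tokens.append(next_token)
    return tokens

model = ToyModel()
first, cache = prefill_phase(model, prompt_tokens=[5, 7, 11])
print(decoding_phase(model, first, cache, max_new_tokens=8))
```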
The P/D ratio, i.e., the number of prefill instances ($N_P$) relative to decoding instances ($N_D$), can be analytically and empirically tuned to balance pipeline stages. Equation (1) expresses the condition for balanced throughput:

$$\frac{N_P \, b_P}{t_P} = \frac{N_D \, b_D}{t_D} \tag{1}$$

where $b_P$, $b_D$ are the batch sizes and $t_P$, $t_D$ are the average service times per batch in each phase (Jin et al., 15 Aug 2024).
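As a worked example, under assumed profiled values (illustrative numbers, not measurements from the paper), the balanced ratio follows directly from Equation (1):

```python
# Illustrative sizing of the P/D ratio from profiled values (numbers are assumptions).
b_p, t_p = 4, 0.40    # prefill batch size and average service time per batch (s)
b_d, t_d = 32, 1.60   # decoding batch size and average service time per batch (s)

prefill_rate_per_instance = b_p / t_p   # 10 requests/s sustained per P instance
decode_rate_per_instance = b_d / t_d    # 20 requests/s sustained per D instance

# Balance N_P * b_P / t_P = N_D * b_D / t_D  =>  N_P / N_D = (b_D / t_D) / (b_P / t_P)
pd_ratio = decode_rate_per_instance / prefill_rate_per_instance
print(f"Balanced P:D ratio ~= {pd_ratio:.1f} : 1")   # two prefill instances per decoding instance
```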
2. Scheduling, On-Demand Forwarding, and Request Handling
P/D-Serve abandons reliance on local queue state or slow hardware utilization metrics for prefill load balancing, as empirical measurements reveal these are poorly correlated with actual queueing delay and can result in request timeouts, especially under bursty and heterogeneous traffic.
Instead, the gateway maintains the state of all SSE streaming connections. For each new inference request, it selects the prefill worker with the fewest active SSE streams (interpreted as least loaded) and attempts to place the request. If a worker is busy (rejects), the gateway immediately tries the next ranked candidate, proceeding until a timeout threshold is reached. If a worker accepts, the request is guaranteed to be queued in a hardware batch; otherwise, failure is signaled immediately—this is a rejection-based, pull scheduling protocol (Jin et al., 15 Aug 2024).
This approach yields two primary effects:
- Minimizes prefill queuing by only dispatching to idle or lightly loaded workers, maximizing hardware utilization.
- Drastically reduces TTFT SLO violations, with SLO attainment rising from 57% (local-queue baseline) to >99% under workload surges (Jin et al., 15 Aug 2024).
Pseudocode for gateway behavior:
```python
def serve_request(req, timeout):
    # Rank prefill workers by their number of active SSE streaming connections (fewest first).
    candidates = rank_by_SSE_connections(prefill_group)
    start = now()
    for p in candidates:
        # Rejection-based pull: a worker accepts only if the request can be queued into a
        # hardware batch; a busy worker rejects, and the next ranked candidate is tried.
        if p.try_accept(req):
            return p.process(req)
        if now() - start > timeout:
            return fail_timeout()
    return fail_timeout()
```
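The worker side of the protocol can be sketched as a bounded admission check (a hypothetical implementation of `try_accept`; the batch-slot bookkeeping is an assumption, not the paper's code):

```python
import threading

class PrefillWorker:
    """Sketch of rejection-based admission: accept only if a hardware batch slot is free."""

    def __init__(self, max_batch_slots):
        self.max_batch_slots = max_batch_slots
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_accept(self, req):
        # Reject immediately when full so the gateway retries the next candidate
        # instead of letting the request sit in a local queue.
        with self.lock:
            if self.in_flight >= self.max_batch_slots:
                return False
            self.in_flight += 1
            return True

    def finish(self, req):
        # Called when the prefill batch containing this request completes.
        with self.lock:
            self.in_flight -= 1
```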
3. Device-to-Device KVCache Transfers
A central performance challenge in disaggregated LLM serving is the transfer of the KV cache (intermediate key/value tensors) from the prefill to decoding stage. Prior systems use block-fixed, RDMA-based KVCache transfer—invoking multiple remote memory copy operations per request. This approach suffers from high control overhead and tail latency variability due to block-by-block scheduling under interference.
P/D-Serve introduces contiguous, block-free KVCache transfer: the sender lays out all required key/value tensors for a request in a single contiguous memory buffer, performs a single RDMA write, and invokes a RecvScatter operation at the receiver to partition the buffer into layer- or block-aligned segments. This reduces the per-request D2D transfer cost to approximately

$$T_{\mathrm{D2D}} \approx \frac{S_{KV}}{B} + \epsilon,$$

where $S_{KV}$ is the KV cache size, $B$ is the link bandwidth, and $\epsilon$ is the minimal handshake overhead (Jin et al., 15 Aug 2024). Experimental results indicate a 46% reduction in average D2D latency and substantial stabilization of tail latencies.
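A minimal sketch of the pack/transfer/scatter pattern, using NumPy arrays as a stand-in for device memory (the real system issues a single RDMA write between xPU HBM regions rather than copying host arrays):

```python
import numpy as np

def pack_kv_cache(per_layer_kv):
    """Sender side: lay out all per-layer K/V tensors in one contiguous buffer (one RDMA write)."""
    flat_parts = [t.ravel() for t in per_layer_kv]
    sizes = [p.size for p in flat_parts]
    return np.concatenate(flat_parts), sizes

def recv_scatter(buffer, sizes, shapes):
    """Receiver side: partition the contiguous buffer back into layer-aligned segments."""
    segments, offset = [], 0
    for size, shape in zip(sizes, shapes):
        segments.append(buffer[offset:offset + size].reshape(shape))
        offset += size
    return segments

# Example: 4 layers of (K/V, heads, seq_len, head_dim) tensors.
shapes = [(2, 8, 128, 64)] * 4
kv = [np.random.rand(*s).astype(np.float16) for s in shapes]
buf, sizes = pack_kv_cache(kv)            # one contiguous buffer instead of many block copies
restored = recv_scatter(buf, sizes, shapes)
assert all(np.array_equal(a, b) for a, b in zip(kv, restored))
```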
4. Performance Modeling and Adaptive Resource Management
P/D-Serve employs analytical and empirical modeling of both end-to-end and per-phase system performance. The system adapts the P and D pools (instance counts $N_P$, $N_D$) and their associated batch sizes, informed by:
- Offline profiling: To derive initial P/D ratios for diverse prompt and scenario classes.
- Online SLO monitoring: Feedback-driven tuning to maintain target TTFT and overall E2E latency.
- Dynamic RoCE endpoint remapping: To reallocate pools without service interruption in response to load shifts.
Performance is mathematically determined by the bottleneck between prefill and decoding stage throughput (Jin et al., 15 Aug 2024):

$$\text{Throughput} = \min\!\left(\frac{N_P \, b_P}{t_P},\ \frac{N_D \, b_D}{t_D}\right) \;\ge\; \lambda,$$

where $\lambda$ is the incoming request rate. The system uses this formulation both for steady-state sizing and real-time auto-scaling.
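A sketch of how the bottleneck formulation could drive steady-state sizing and auto-scaling (the greedy growth policy and all numbers are assumptions, not the paper's controller):

```python
def stage_throughput(n_instances, batch_size, service_time):
    """Requests/s a pool can sustain: n * b / t."""
    return n_instances * batch_size / service_time

def scale_to_demand(arrival_rate, n_p, n_d, b_p, t_p, b_d, t_d):
    """Grow whichever pool is the bottleneck until min-stage throughput covers the arrival rate."""
    while min(stage_throughput(n_p, b_p, t_p),
              stage_throughput(n_d, b_d, t_d)) < arrival_rate:
        if stage_throughput(n_p, b_p, t_p) <= stage_throughput(n_d, b_d, t_d):
            n_p += 1   # prefill is the bottleneck
        else:
            n_d += 1   # decoding is the bottleneck
    return n_p, n_d

# Illustrative load of 150 req/s with the profiled values used earlier (not from the paper).
print(scale_to_demand(arrival_rate=150, n_p=4, n_d=2, b_p=4, t_p=0.4, b_d=32, t_d=1.6))
```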
5. Implementation: Software Stack and Orchestration
P/D-Serve is implemented atop the Ascend NPU platform and MindSpore deep learning framework. Notable engineering features include:
- Device Orchestration: Volcano and Ascend Device Plugin schedule NPU assignment to stateless containers.
- Network and Storage Layer: hccn CLI/API for RoCE endpoint enumeration; scalable file service or SSD for pre-compiled model binaries.
- Control Plane: ZooKeeper channels enable dynamic P/D pool formation and endpoint re-assignment; Flask API-driven fault detection (see the sketch after this list).
- Model Compilation: MindSpore convert tool pre-compiles both P and D model binaries for rapid hot-swapping/recovery.
- Cluster Networking: NPUs are directly cabled via RoCE v2 to high-throughput switching fabric, supporting hundreds of racks and tens of thousands of devices (Jin et al., 15 Aug 2024).
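To illustrate the ZooKeeper-backed pool formation mentioned above, the sketch below uses the generic `kazoo` client with ephemeral registration nodes; the path layout, endpoint payload, and host names are assumptions, not the P/D-Serve control plane:

```python
from kazoo.client import KazooClient

POOL_PATH = "/pd_serve/pools/prefill"   # hypothetical path layout

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()
zk.ensure_path(POOL_PATH)

# A prefill container registers its RoCE endpoint as an ephemeral, sequenced node:
# the entry disappears automatically if the container dies, aiding fault detection.
zk.create(f"{POOL_PATH}/worker-", b"10.0.0.7:4791", ephemeral=True, sequence=True)

# The controller watches the pool and rebuilds its candidate list on membership changes.
@zk.ChildrenWatch(POOL_PATH)
def on_pool_change(children):
    endpoints = [zk.get(f"{POOL_PATH}/{c}")[0].decode() for c in children]
    print("Active prefill endpoints:", endpoints)
```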
Key challenges addressed include bounding HBM metadata footprint for RoCE QPs, enabling seamless insertion/removal of container endpoints in active P/D pools, and supporting cross-region disaster recovery.
6. Experimental Validation and Production Experience
P/D-Serve has been deployed for over eight months at production scale on Huawei's Ascend/MindSpore platforms, supporting the Pangu model family on clusters of more than 10,000 NPUs. Experimental outcomes include:
- Throughput: 6.7× higher versus aggregated LLM serving; 60% higher than first-generation disaggregated baselines.
- TTFT SLO attainment: improvement from 57% to >99% (a 42-percentage-point gain) under stress testing.
- KVCache Latency: 46% reduction in D2D transfer time, with tail-variance of ~5ms.
- Composite Gains: 60% higher E2E throughput, 42% fewer TTFT violations, and 46% lower D2D transfer time. Gains are confirmed on both mirror clusters and live production systems (Jin et al., 15 Aug 2024).
Summary table:
| Metric | Baseline | P/D-Serve | Δ |
|---|---|---|---|
| E2E throughput (normalized) | 1.0 | 1.6 | +60% |
| TTFT SLO success (%) | 57% | >99% | +42 pp |
| D2D KVCache transfer time (normalized) | 100 | 54 | −46% |
| Throughput vs. aggregated serving | 1.0× | 6.7× | ×6.7 |
7. Broader Implications and Applicability
P/D-Serve demonstrates that merely partitioning the prefill and decode phases of LLM serving is insufficient at high scale. Its contributions—fine-grained, scenario-aware resource grouping, dynamic balancing, rejection-based scheduling, and contiguous RDMA transfers—enable stable, predictable performance and resource efficiency even under non-stationary, skewed traffic.
The design patterns established by P/D-Serve—including analytic modeling of bottlenecks, phase separation, and decoupled, pull-based scheduling—are broadly applicable for future large-scale machine learning inference, especially in environments requiring flexible orchestration of heterogeneous computational resources (Jin et al., 15 Aug 2024).