PDP: Partially Disaggregated Prefill
- PDP is an architectural paradigm that selectively separates the prefill and decode phases of LLM inference to balance compute and memory resources across phases.
- It exploits heterogeneity in hardware and dynamic scheduling to minimize inter-phase resource interference, data movement overheads, and storage inefficiencies.
- Implementations across various frameworks demonstrate significant gains in throughput, latency, and cost-effectiveness for large-scale LLM serving.
Partially Disaggregated Prefill (PDP) is an architectural and algorithmic paradigm for LLM inference and serving that occupies a middle ground between fully unified and fully disaggregated systems. PDP aims to isolate and optimize the prefill (prompt encoding) and decode (autoregressive token generation) stages, exploiting the distinct compute-bound and memory-bound characteristics of each phase while minimizing inter-phase resource interference, data-movement overheads, and inefficiencies in storage and scheduling. PDP is realized in various serving frameworks, hardware architectures, scheduling algorithms, and system deployments, demonstrating substantial gains in throughput, latency, hardware utilization, cost-effectiveness, and flexibility across homogeneous and heterogeneous compute clusters.
1. Definition and Core Principles
PDP refers to the selective disaggregation of the prefill phase from the decode phase in LLM inference, partitioning computation, memory, or data-movement resources to ensure phase isolation, while often unifying storage or minimizing costly inter-GPU transfer.
- In fully unified systems, both prefill and decode share GPU compute and GPU memory for weights and KV-cache, which leads to interference between TTFT (time-to-first-token) and TPOT (time-per-output-token).
- Fully disaggregated systems run prefill and decode on separate physical devices, eliminating interference but causing weight replication, storage imbalance, expensive KV cache transfers, and high-overhead resource partitioning.
- PDP adopts intermediate strategies:
- Semi-PD (Hong et al., 28 Apr 2025): Disaggregate compute resources at the streaming multiprocessor (SM) level, binding SM partitions to prefill/decode workers via NVIDIA MPS, but unify storage (model weights and KV cache) in a single CUDA address space, thereby eliminating KV cache transfer and storage waste (a minimal sketch of this SM-binding idea appears after this list).
- Multi-vendor PDP (Chen et al., 22 Sep 2025): Prefill and decode are executed on heterogeneous GPUs (high FLOPS vs. high memory bandwidth) with a compatibility module masking differences and a parallel-strategy joint optimization for deployment.
- Instance-level or temporal PDP (Du et al., 25 Apr 2025): Time-slice instances into long prefill epochs and decode epochs, scheduled cyclically with rolling activation or macro-instance coordination to ensure continuous TTFT availability.
- Chunked and intra-request PDP (Guo et al., 29 Sep 2025): Only encoder and/or prefill phases are disaggregated; decode remains local. Chunking and embedding trackers maximize inter- and intra-request parallelism.
- Partitioned PDP (Liu et al., 22 Sep 2025): Prefill distributed across idle/low-end cluster GPUs for heavy FLOPs, decode centralized on high-end GPUs with minimal KV communication.
- Dispatcher-level PDP (Zhong et al., 18 Jan 2024, Jin et al., 15 Aug 2024): Fine-grained prefill worker pools with adaptive P/D ratios per scenario, optimized RDMA KVCache transfer, on-demand scheduling, and multi-tier cache residency.
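To make the semi-PD-style compute split concrete, here is a minimal Python sketch that launches separate prefill and decode worker processes whose SM shares are capped through NVIDIA MPS resource provisioning while both reference one shared KV-cache handle. The `phase_worker.py` entry point, its flags, and the handle name are hypothetical placeholders, not semi-PD's actual interface.

```python
# Minimal sketch of semi-PD-style compute partitioning: two worker processes are
# capped to a fraction of the GPU's SMs via NVIDIA MPS execution-resource
# provisioning (CUDA_MPS_ACTIVE_THREAD_PERCENTAGE), while both attach to a single
# shared KV-cache region so no KV transfer between phases is needed.
import os
import subprocess

def launch_phase_worker(phase: str, sm_percentage: int, kv_handle: str) -> subprocess.Popen:
    """Start a prefill or decode worker bound to a fraction of the GPU's SMs."""
    env = os.environ.copy()
    # Cap this MPS client's active thread (SM) percentage.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    # The worker script and its flags are placeholders for a real serving engine.
    return subprocess.Popen(
        ["python", "phase_worker.py", "--phase", phase, "--kv-handle", kv_handle],
        env=env,
    )

if __name__ == "__main__":
    # Example split: 70% of SMs to prefill, 30% to decode; one unified KV cache.
    prefill = launch_phase_worker("prefill", sm_percentage=70, kv_handle="kv_pool_0")
    decode = launch_phase_worker("decode", sm_percentage=30, kv_handle="kv_pool_0")
    prefill.wait()
    decode.wait()
```

In practice, the (x, y) percentages would be supplied by a dynamic SM controller of the kind discussed below, rather than hard-coded.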
2. System Architectures and Implementation Strategies
PDP patterns are instantiated in diverse system architectures:
| System | Computation Disaggregation | Storage Management | Scheduling/Control |
|---|---|---|---|
| semi-PD | SM-level partition (CUDA MPS) | Unified CUDA space; no KV transfer | Dynamic SM controller |
| EcoServe | Time-epoch slicing in instance | Single instance holds both KV/weights | Rolling activation, macro-instance coordination |
| Cronus | Prefill distributed per token | KV gathered to decode-pool GPU | Per-request proportional chunk assignment |
| DistServe | Prefill/Decode per-device | Bandwidth-aware placement | Parallelism optimized |
| P/D-Serve | xPU groups per phase/scenario | Bulk RDMA, scenario-wise KV caching | Dynamic P/D ratio, on-pull |
| DynaServe | Request split at optimal s_r | Flexible, adaptive HBM allocation | Two-level (global/local) |
| Mooncake | Prefill/Decode clusters, tiered KVCache | Layer-wise streaming, DRAM/SSD | Early rejection, prediction |
| HydraInfer | ED+P split on multimodal | Overlap via CUDA IPC/NCCL | Stage-level batch |
| TPLA | Aggregated prefill, TP decode | Orthogonal transforms for TP slicing | MLA/TPLA scheduling |
| SPAD | Prefill Chips for partial offload | Fractional layer (α) offloading | Hardware provisioning |
Context:
- On homogeneous clusters, PDP often binds SM or device partitions, or time slices, so both phases have non-interfering compute windows but do not shuffle KV cache or model weights.
- On heterogeneous clusters, PDP exploits the strengths of device classes; e.g., prefill runs on high-FLOPS chips even if VRAM is modest, decode on bandwidth-heavy chips.
- On NPU clusters, PDP organizes per-scenario P/D groups, mapping RDMA RoCE IP tuples to minimize bottlenecks and enable dynamic ratio adjustment (Jin et al., 15 Aug 2024).
Pseudocode examples and algorithms are given for SM-partition switching (Hong et al., 28 Apr 2025), rolling activation (Du et al., 25 Apr 2025), proportional chunk assignment (Liu et al., 22 Sep 2025), and dynamic P/D ratio adjustment (Jin et al., 15 Aug 2024). Algorithms typically employ measurement-driven dynamic adjustment (windowed SLO violation detection, checkpointing, rolling performance metrics).
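As an illustration of such measurement-driven adjustment, the sketch below implements a windowed SLO-violation check that shifts SM share between the phases; the `Window` structure, thresholds, and step size are illustrative assumptions rather than any particular system's controller.

```python
# Sketch of measurement-driven (x, y) adjustment: every window, compare observed
# TTFT/TPOT tail latencies against their SLOs and shift SM share toward whichever
# phase is violating. Thresholds, step size, and the metrics source are assumptions.
from collections import deque
from dataclasses import dataclass

@dataclass
class Window:
    ttft_p99: float          # seconds, observed over the last window
    tpot_p99: float          # seconds per output token

def adjust_split(x: int, win: Window, ttft_slo: float, tpot_slo: float,
                 step: int = 5) -> tuple:
    """Return a new (prefill, decode) SM-percentage split summing to 100."""
    if win.ttft_p99 > ttft_slo and win.tpot_p99 <= tpot_slo:
        x = min(x + step, 90)            # prefill is starved: grow its SM share
    elif win.tpot_p99 > tpot_slo and win.ttft_p99 <= ttft_slo:
        x = max(x - step, 10)            # decode is starved: shrink prefill's share
    # If both or neither violate, keep the current split and let the next window decide.
    return x, 100 - x

# Example: rolling adjustment over three synthetic measurement windows.
x, y = 70, 30
history = deque(maxlen=8)
for win in [Window(0.9, 0.03), Window(0.4, 0.08), Window(0.5, 0.04)]:
    x, y = adjust_split(x, win, ttft_slo=0.8, tpot_slo=0.05)
    history.append((x, y))
print(list(history))                     # [(75, 25), (70, 30), (70, 30)]
```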
3. Optimization Objectives and Analytical Models
PDP systems formalize their objectives using throughput, latency, SLO (Service Level Objective) attainment, and resource/cost constraints.
Typical objective functions (examples shown for semi-PD and multi-vendor PDP):
- Maximize request throughput subject to latency SLOs,
  $\max_{(x,y)} \ \mathrm{Throughput}(x,y) \quad \text{s.t.} \quad \mathrm{TTFT}(x) \le \mathrm{SLO}_{\mathrm{TTFT}}, \ \ \mathrm{TPOT}(y) \le \mathrm{SLO}_{\mathrm{TPOT}},$
  where $(x, y)$ is the SM (or instance or device) split between prefill and decode.
- Joint optimization over parallel degrees $(dp, tp, pp, ep)$ and instance counts, subject to prompt-encoding and decoding latency and memory SLOs,
  $\max_{dp,\,tp,\,pp,\,ep,\,N_P,\,N_D} \ \mathrm{Throughput} \quad \text{s.t.} \quad L_{\mathrm{prefill}} \le \mathrm{SLO}_{\mathrm{TTFT}}, \ \ M_{\mathrm{prefill}} \le M_{\mathrm{HBM}}$
  (with the analogous constraints for decode).
Latency and throughput are modeled using empirical or regression fits to token counts, SM fractions, arrival rates, per-device FLOPS, and memory bandwidths. Models also account for KV-transfer communication time and for pipelined overlap regions.
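A minimal sketch of this joint deployment search is given below. The latency, memory, and throughput models are placeholder functions standing in for the empirical fits described above, expert parallelism (ep) is omitted for brevity, and the discrete search spaces are deliberately tiny.

```python
# Sketch of the joint deployment search: enumerate small discrete spaces of parallel
# degrees and instance counts, keep configurations whose modeled prefill/decode
# latencies and memory footprints satisfy the SLOs, and pick the one with the best
# modeled throughput. All models below are placeholders, not measured fits.
import itertools
from typing import NamedTuple, Optional

class Config(NamedTuple):
    dp: int
    tp: int
    pp: int
    prefill_instances: int
    decode_instances: int

def modeled_latency(cfg: Config, phase: str) -> float:
    # Placeholder analytical model; a real system uses regression fits to
    # token counts, per-device FLOPS, and memory bandwidth.
    base = 1.0 if phase == "prefill" else 0.08
    scale = cfg.prefill_instances if phase == "prefill" else cfg.decode_instances
    return base / (cfg.tp * cfg.pp * scale)

def modeled_memory_gb(cfg: Config) -> float:
    return 80.0 / (cfg.tp * cfg.pp)                 # weights + KV per device, placeholder

def modeled_throughput(cfg: Config) -> float:
    return cfg.dp * cfg.decode_instances * cfg.tp   # placeholder proxy

def search(ttft_slo: float, tpot_slo: float, hbm_gb: float) -> Optional[Config]:
    best, best_tput = None, -1.0
    for dp, tp, pp, npre, ndec in itertools.product([1, 2], [1, 2, 4], [1, 2], [1, 2], [1, 2]):
        cfg = Config(dp, tp, pp, npre, ndec)
        feasible = (modeled_latency(cfg, "prefill") <= ttft_slo
                    and modeled_latency(cfg, "decode") <= tpot_slo
                    and modeled_memory_gb(cfg) <= hbm_gb)
        if feasible and modeled_throughput(cfg) > best_tput:
            best, best_tput = cfg, modeled_throughput(cfg)
    return best

print(search(ttft_slo=0.5, tpot_slo=0.05, hbm_gb=40))
```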
4. Scheduling, Batching, and Load Balancing
PDP deployments leverage scheduler designs that optimize TTFT, TPOT, and goodput under real workload variability:
- SLO-aware dynamic partitioning (Hong et al., 28 Apr 2025): SM allocation (x, y) updated every window W, delaying switches but never draining state.
- Macro-instance coordination, rolling activation (EcoServe (Du et al., 25 Apr 2025)): ensure at least one instance is prefill-ready at any time, keeping worst-case TTFT bounded.
- Chunked pipeline parallelism (Mooncake (Qin et al., 24 Jun 2024), RServe (Guo et al., 29 Sep 2025)): Input chunking, per-step token budgets, and an embedding tracker overlap the encoding, prefill, and decode streams for both intra- and inter-request optimization (see the chunked-prefill sketch after this list).
- Candidate migration / rescheduling (ARES (Wang et al., 15 Oct 2025)): Future token loads predicted via hidden state-driven MLP, requests migrated to minimize token-load variance.
- Stage-level batching in multimodal PDP (HydraInfer (Dong et al., 19 May 2025)): Separate batch sizes for encode, prefill, decode, scheduled per SLO and hardware profile.
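The following sketch illustrates the chunked-prefill idea referenced above: each engine step admits at most a fixed token budget of prefill work alongside the pending decode batch, so long prompts no longer stall decode iterations. The `Request` and step structures are illustrative and do not mirror Mooncake's or RServe's actual APIs.

```python
# Sketch of chunked prefill under a per-step token budget: long prompts are split
# into chunks so that each engine step mixes at most `budget` prefill tokens with
# the pending decode batch, letting decode iterations interleave with long prefills.
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0                    # prompt tokens already prefilled

    @property
    def done_prefill(self) -> bool:
        return self.prefilled >= self.prompt_len

def build_step(prefill_queue: list, decode_batch: list, budget: int) -> dict:
    """Assemble one engine step: the decode batch plus chunked prefill up to budget."""
    step = {"decode": list(decode_batch), "prefill_chunks": []}
    remaining = budget
    for req in prefill_queue:
        if remaining == 0:
            break
        chunk = min(remaining, req.prompt_len - req.prefilled)
        if chunk > 0:
            step["prefill_chunks"].append((req.rid, req.prefilled, req.prefilled + chunk))
            req.prefilled += chunk
            remaining -= chunk
    prefill_queue[:] = [r for r in prefill_queue if not r.done_prefill]
    return step

queue = [Request("long", 5000), Request("short", 300)]
print(build_step(queue, decode_batch=["d1", "d2"], budget=2048))
# -> the decode batch plus a 2048-token chunk of "long"; the rest waits for later steps.
```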
Correctness and load balancing are validated via metrics such as SLO attainment (TTFT and TPOT latencies), migration cost modelling, and load-variance reduction.
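As an illustration of load-variance-driven rebalancing, the sketch below greedily migrates requests from the most-loaded to the least-loaded decode instance as long as the variance reduction outweighs a fixed migration cost; the predicted per-request token loads are assumed given (e.g., by a predictor such as ARES's MLP), and the cost constant is a stand-in.

```python
# Sketch of prediction-driven rebalancing: given predicted remaining token loads per
# request, greedily move requests from the most-loaded to the least-loaded decode
# instance while the move reduces the load variance by more than its migration cost.
from statistics import pvariance

def rebalance(instances: dict, migration_cost: float = 50.0) -> None:
    """instances maps instance id -> {request id: predicted remaining tokens}."""
    while True:
        loads = {i: sum(reqs.values()) for i, reqs in instances.items()}
        src = max(loads, key=loads.get)
        dst = min(loads, key=loads.get)
        if src == dst or not instances[src]:
            return
        gap = loads[src] - loads[dst]
        # Candidate: the request whose move best narrows the src/dst gap.
        req = min(instances[src], key=lambda r: abs(gap - 2 * instances[src][r]))
        before = pvariance(loads.values())
        moved = instances[src][req]
        loads[src] -= moved
        loads[dst] += moved
        if before - pvariance(loads.values()) <= migration_cost:
            return                                 # gain no longer worth migrating
        instances[dst][req] = instances[src].pop(req)

# Example: three decode instances with predicted remaining-token loads per request.
state = {"d0": {"r1": 400, "r2": 300}, "d1": {"r3": 50}, "d2": {"r4": 80, "r5": 60}}
rebalance(state)
print({i: sum(r.values()) for i, r in state.items()})   # {'d0': 400, 'd1': 350, 'd2': 140}
```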
5. Communication and Storage Efficiency
PDP approaches are characterized by:
- Unified storage with no inter-process KV cache transfer (semi-PD, EcoServe).
- Bulk contiguous-buffer RDMA transfer (P/D-Serve (Jin et al., 15 Aug 2024)), minimizing transfer time by aligning and batching PageAttention KV blocks.
- Tiered KVCache pools (Mooncake (Qin et al., 24 Jun 2024)): VRAM for active blocks, DRAM for warm, SSD for cold with block-level deduplication.
- Transmission-module compatibility (multi-vendor (Chen et al., 22 Sep 2025)): flattening tensors to 1D for heterogeneous GPU transfer, aligning parallel strategies.
Quantitative improvements include KV-transfer overhead of roughly 0.1% of end-to-end latency on OPT-175B (DistServe (Zhong et al., 18 Jan 2024)), up to 46% reduction in D2D KV transfer time (P/D-Serve), and elimination of the "second-token penalty" thanks to persistent unified memory maps.
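To make the tiered-KVCache idea concrete, here is a minimal sketch of a VRAM/DRAM/SSD block pool with content-hash deduplication and LRU demotion; the tier capacities, bytes-valued blocks, and class interface are illustrative, and a production pool would manage device tensors with asynchronous copies.

```python
# Sketch of a tiered KV-cache pool with block-level deduplication: blocks are keyed
# by a content hash over their token prefix, kept in a bounded "VRAM" tier, and
# demoted to "DRAM" and then "SSD" tiers in LRU order when a tier overflows.
import hashlib
from collections import OrderedDict
from typing import Optional

class TieredKVPool:
    def __init__(self, vram_blocks: int, dram_blocks: int):
        self.tiers = {"vram": OrderedDict(), "dram": OrderedDict(), "ssd": OrderedDict()}
        self.capacity = {"vram": vram_blocks, "dram": dram_blocks, "ssd": float("inf")}

    @staticmethod
    def block_key(token_ids: tuple) -> str:
        # Content hash lets identical prefix blocks be shared across requests (dedup).
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()[:16]

    def put(self, token_ids: tuple, kv_block: bytes) -> str:
        key = self.block_key(token_ids)
        if any(key in tier for tier in self.tiers.values()):
            return key                             # deduplicated: already resident
        self._insert("vram", key, kv_block)
        return key

    def get(self, key: str) -> Optional[bytes]:
        for tier in self.tiers.values():
            if key in tier:
                block = tier.pop(key)
                self._insert("vram", key, block)   # promote hot block back to VRAM
                return block
        return None

    def _insert(self, name: str, key: str, block: bytes) -> None:
        tier = self.tiers[name]
        tier[key] = block
        if len(tier) > self.capacity[name]:        # evict the LRU block downward
            old_key, old_block = tier.popitem(last=False)
            self._insert({"vram": "dram", "dram": "ssd"}[name], old_key, old_block)

pool = TieredKVPool(vram_blocks=2, dram_blocks=4)
k = pool.put((1, 2, 3, 4), b"kv-bytes-0")
pool.put((5, 6, 7, 8), b"kv-bytes-1")
pool.put((9, 10, 11, 12), b"kv-bytes-2")           # demotes the oldest block to DRAM
print(pool.get(k) is not None)                     # True: found in DRAM, promoted
```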
6. Performance Evaluation and Comparative Analysis
Extensive benchmarks confirm the benefits of PDP:
- semi-PD (Hong et al., 28 Apr 2025): End-to-end latency per request reduced by 1.27–2.58× (DeepSeek), throughput increased by 1.55–1.72× (Llama).
- EcoServe (Du et al., 25 Apr 2025): Goodput up to 127% higher than baseline systems on 30B and 70B LLMs, evaluated on a 32×L20 cluster with commodity Ethernet.
- Cronus (Liu et al., 22 Sep 2025): TTFT99 reduced by 35%, TBT99 improved by 20%, aggregate throughput up 1.7× on heterogeneous clusters.
- DistServe (Zhong et al., 18 Jan 2024): Up to 4.48× higher per-GPU goodput under SLO, 10.2× tighter TPOT SLOs, low KV transfer cost.
- P/D-Serve (Jin et al., 15 Aug 2024): End-to-end throughput up 60%, TTFT SLO success rate up 42%, aggregate throughput up 6.7×.
- DynaServe (Ruan et al., 12 Apr 2025): Goodput improvements of up to 4.34×, 49% better-balanced HBM memory utilization, and serving capacity up to 3.07× at 100 ms ITL.
- HydraInfer (Dong et al., 19 May 2025): Throughput more than 4× that of vLLM; TPOT90 reduced from 0.065 s (monolithic serving) to 0.038 s (PDP).
- Mooncake (Qin et al., 24 Jun 2024): Throughput gains scale superlinearly with context length, topping 525% at 128k tokens compared to vLLM.
- SPAD hardware (Zhang et al., 9 Oct 2025): Prefill throughput +8% at 52% lower cost, decode 97% performance at 28% lower TDP, cluster savings 19–41% TCO.
A plausible implication is that PDP paradigms generally outperform both colocated and fully disaggregated models in realistic, variable workloads, especially as model and context sizes scale.
7. Limitations, Controversies, and Future Directions
Limitations identified in the literature include:
- PDP scheduling and tuning may incur overhead if parallel degree/global search spaces become large; multi-vendor PDP's search is tractable due to discrete space size (Chen et al., 22 Sep 2025).
- Communication bottlenecks persist in some designs, notably at the prefill-decode boundary (KV transfer), with ongoing proposals for layer-wise pipelined RDMA (Chen et al., 22 Sep 2025).
- Prediction-driven migration (ARES) depends on sufficient hidden state fidelity and is sensitive to batch frequency and migration cost.
- PDP architectures deployed at massive scale (e.g., P/D-Serve across tens of thousands of NPUs) rely on bulk RDMA performance and scenario-aware grouping; model and vendor diversity remains a subject of extended experimentation.
A point of ongoing debate is the precise balance between compute and memory offload (e.g., the fraction α in SPAD (Zhang et al., 9 Oct 2025)), which governs cost and efficiency; future work is suggested to optimize this balance under dynamic workload shifts and hardware heterogeneity.
Future research directions focus on refining PDP's:
- Cost minimization algorithms (explicit cost terms in objective functions).
- By-layer pipelining to further overlap compute and communication.
- Integration with novel attention mechanisms (e.g., PDP separation in TPLA (Tang et al., 21 Aug 2025)).
- Adaptation to domestic hardware ecosystems and new NPU/GPU designs.
- Scaling to ultra-large model deployments and online adjustment under bursty, mixed workloads.
PDP thus represents an important and evolving class of design choices that enable LLM serving systems to match the rapidly diversifying technical landscape and workload demands.