Prefill-Decode Disaggregated Architectures
- Prefill-Decode disaggregated architectures are defined by splitting LLM inference into a compute-intensive prefill stage and a memory-intensive decode stage, enabling phase-specific hardware allocation.
- They employ dynamic scheduling, load balancing, and predictive autoscaling to meet service-level objectives and improve performance under diverse workloads.
- Innovative KVCache management and compression techniques, including RDMA transfers and quantization, significantly reduce communication bottlenecks and enhance overall system efficiency.
Prefill-Decode (P/D) Disaggregated Architectures constitute a leading paradigm for optimizing LLM inference at scale. By decoupling the compute-intensive prefill (prompt/context processing) stage from the memory-intensive decode (tokenwise autoregressive generation) stage, these architectures allow for separate, phase-aware hardware allocation, minimizing resource contention and enabling throughput and latency optimizations subject to diverse service-level objectives (SLOs). The following sections survey the defining principles, architectural variants, resource management strategies, scheduling solutions, and empirical findings from key research threads, with a focus on the most impactful methods, metrics, and trade-offs.
1. Architectural Principles and Variants
P/D disaggregated architectures physically or logically separate the prefill and decode stages of LLM inference, typically assigning each phase to distinct GPU (or accelerator) pools, hardware configurations, or temporal resource partitions. The fundamental rationale is the orthogonality of resource bottlenecks: prefill is compute-bound, requiring peak GPU arithmetic (FLOPs) throughput, whereas decode is typically bound by memory bandwidth and working-set size, since it must hold long KV caches while generating tokens sequentially.
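A back-of-the-envelope roofline estimate makes this orthogonality concrete. The sketch below is illustrative only; the hardware peaks and 7B-parameter model size are assumptions, not measurements from any cited system.

```python
# Rough roofline sketch of why prefill is compute-bound and decode is
# bandwidth-bound. All hardware and model numbers are illustrative assumptions.
PEAK_FLOPS = 312e12      # assumed FP16 tensor-core peak, FLOP/s (A100-class)
MEM_BW = 2.0e12          # assumed HBM bandwidth, bytes/s
PARAMS = 7e9             # assumed 7B-parameter model
BYTES_PER_PARAM = 2      # FP16 weights

def prefill_time(prompt_tokens: int) -> float:
    """Prefill processes all prompt tokens in one batched pass (~2*PARAMS FLOPs
    per token), amortizing a single read of the weights across every token."""
    compute_t = 2 * PARAMS * prompt_tokens / PEAK_FLOPS
    mem_t = PARAMS * BYTES_PER_PARAM / MEM_BW
    return max(compute_t, mem_t)     # compute dominates for long prompts

def decode_step_time(kv_cache_bytes: int) -> float:
    """Each decode step emits one token, so it performs only ~2*PARAMS FLOPs
    but must stream all weights plus the growing KV cache from memory."""
    compute_t = 2 * PARAMS / PEAK_FLOPS
    mem_t = (PARAMS * BYTES_PER_PARAM + kv_cache_bytes) / MEM_BW
    return max(compute_t, mem_t)     # memory traffic dominates

print(f"prefill of a 4k-token prompt: {prefill_time(4096) * 1e3:.0f} ms (compute-bound)")
print(f"one decode step with 2 GB KV: {decode_step_time(2 * 2**30) * 1e3:.2f} ms (bandwidth-bound)")
```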
Major architectural variants include:
Variant | Disaggregation Dimension | Characteristic Features
---|---|---
Classical P/D | Resource (node or device-level) | Prefill and decode on separate devices; KVCache transfer via RDMA or NVLink |
Temporal (PaDG, TD-Pipe) | Time (intra-instance time-sharing) | Single device alternates prefill and decode in rolling slots or pipeline phases |
Intra-GPU (Nexus, semi-PD) | Hardware partitioning (SM-level) | Dynamic allocation of GPU SMs or logical cores per phase, decoupling prefill/decode computation within a device |
Multi-stage (EPD, HydraInfer) | Functional (encode, prefill, decode) | Explicit splitting for multimodal models: image/audio encode, textual prefill, token decode, possibly run on separate heterogeneous clusters |
Innovations such as KVCache-centric buffering (Mooncake), block-contiguous D2D KVCache transfer (P/D-Serve, FlowKV), homomorphic KVCache quantization (HACK), and hierarchical scheduling/control (TD-Pipe, Arrow) are tailored to reduce phase interference, resource underutilization, and communication bottlenecks (Qin et al., 24 Jun 2024, Jin et al., 15 Aug 2024, Zhang et al., 5 Feb 2025, Li et al., 3 Apr 2025, Zhang et al., 12 Jun 2025).
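For concreteness, a classical device-level P/D deployment can be summarized by a descriptor along the following lines; the schema and values are hypothetical and do not correspond to any specific system's configuration format.

```python
# Hypothetical deployment descriptor for a classical (device-level) P/D split.
pd_deployment = {
    "prefill_pool": {
        "gpus": ["gpu-0", "gpu-1", "gpu-2", "gpu-3"],  # compute-dense devices
        "parallelism": {"tensor": 4, "pipeline": 1},
        "max_batch_tokens": 16384,                     # large prompt batches
    },
    "decode_pool": {
        "gpus": ["gpu-4", "gpu-5", "gpu-6", "gpu-7"],  # memory/bandwidth-rich devices
        "parallelism": {"tensor": 2, "pipeline": 2},
        "max_concurrent_sequences": 256,               # many in-flight decodes
    },
    "kvcache_transfer": {
        "transport": "rdma",       # e.g. GPUDirect RDMA or NVLink D2D
        "layout": "contiguous",    # merge layer-wise blocks into large segments
    },
}
```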
2. Phase Decoupling and Resource Management
Successful P/D disaggregation requires tightly coordinated resource management:
- KVCache Handling: Effective reuse, remote retrieval, or cross-node migration of key-value caches is central. Designs such as Mooncake feature a disaggregated KVCache that resides in CPU DRAM and SSD, is accessible via GPUDirect RDMA, and supports cache prefix sharing across requests.
- Fine-grained Partitioning: Partitioning computational or memory resources at sub-device (SM/block/core) granularity enables adaptive response to phase-specific workloads (semi-PD, Nexus). For instance, semi-PD partitions streaming multiprocessors between phases using CUDA MPS, supporting asynchronous workers with unified memory pointers and atomic allocation to prevent write-after-read hazards (Hong et al., 28 Apr 2025, Shi et al., 9 Jul 2025); a minimal launch sketch follows this list.
- Topology-aware Deployment: In large heterogeneous clusters, intelligent partitioning and placement (HexGen-2, HeteroScale) optimize the mapping of prefill and decode groups onto GPUs and network subgroups to match bandwidth and compute requirements and minimize cross-switch contention (Jiang et al., 11 Feb 2025, Li et al., 27 Aug 2025).
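As referenced above, here is a minimal sketch of intra-GPU SM partitioning in the spirit of semi-PD, assuming a running CUDA MPS control daemon; `prefill_worker.py` and `decode_worker.py` are hypothetical phase-specific serving processes, and the 70/30 split is only an example.

```python
# Minimal sketch of SM partitioning between phases via CUDA MPS.
# Assumes an MPS control daemon is already running on the target GPU.
import os
import subprocess

def launch_phase_worker(script: str, sm_percentage: int, gpu: int = 0) -> subprocess.Popen:
    """Start a worker whose MPS client is capped to a fraction of the GPU's SMs."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    # CUDA_MPS_ACTIVE_THREAD_PERCENTAGE limits the SM share an MPS client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(["python", script], env=env)

# Example split: compute-hungry prefill gets 70% of SMs, bandwidth-bound decode 30%.
# A runtime controller could adjust these shares as the prefill/decode load ratio drifts.
prefill = launch_phase_worker("prefill_worker.py", sm_percentage=70)  # hypothetical script
decode = launch_phase_worker("decode_worker.py", sm_percentage=30)    # hypothetical script
```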
3. Scheduling, Load Balancing, and Autoscaling
Dynamic and predictive scheduling is essential to meet latency objectives and maximize throughput in the face of varying workloads and phase imbalances.
- KVCache-aware Scheduling: Schedulers (e.g., Mooncake’s Conductor) analyze cache prefix overlaps, queue delays ($T_{\text{queue}}$), prefill compute times ($T_{\text{prefill}}$), and KVCache transfer latencies ($T_{\text{transfer}}$), selecting the instance pair that minimizes the estimated time-to-first-token, $\text{TTFT} \approx T_{\text{queue}} + T_{\text{prefill}} + T_{\text{transfer}}$, together with the inter-token decode time (TBT), subject to kvcache_balancing_thresholds.
- Early Rejection and Load Prediction: To enforce SLOs, systems like Mooncake and P/D-Serve employ early rejection, predicting whether admitting a request would cause a downstream SLO violation. They typically maintain a rolling estimate of the average decoding time per token ($\hat{t}_{\text{decode}}$) and compare it against the SLO: if $\hat{t}_{\text{decode}} / \text{TBT}_{\text{SLO}} > 1$, the request is dropped preemptively (see the scheduling sketch after this list).
- Decoupled and Elastic Scheduling: Arrow, DynaServe, and HyperFlexis dynamically flip instance roles (prefill/decode), size task/batch pools, and enable real-time resource reallocation under SLOs. Arrow, for instance, treats prefill/decode as per-request attributes and maintains multiple elastic pools, using minimal-load greedy assignment to predict and bound TTFT/TPOT per request (Wu et al., 17 May 2025, Ruan et al., 12 Apr 2025, Yousefijamarani et al., 21 Aug 2025).
- Autoscaling Policies: HeteroScale demonstrates that decode-phase tokens-per-second (TPS) is the most robust metric for coordinated scaling, yielding significant GPU-utilization gains over naive SM-utilization or tail-latency signals. The autoscaling logic maintains the P/D pool ratio within empirically determined ranges and uses proportional controllers and topology-aware schedulers to maximize token throughput even in hardware-diverse and network-constrained environments (Li et al., 27 Aug 2025).
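The sketch below combines the KVCache-aware pair selection and early-rejection ideas from this list; the instance fields and cost terms are simplified assumptions rather than Mooncake's actual Conductor logic.

```python
# Sketch of KVCache-aware instance-pair selection plus SLO-based early rejection.
from dataclasses import dataclass

@dataclass
class PrefillInstance:
    queue_delay_s: float          # estimated T_queue
    prefill_rate_tok_s: float     # prefill compute throughput, tokens/s
    cached_prefix_tokens: int     # longest KVCache prefix already resident

@dataclass
class DecodeInstance:
    transfer_bw_tok_s: float      # effective KVCache transfer rate, tokens/s
    pred_time_per_token_s: float  # rolling estimate of per-token decode time

def estimate_ttft(req_tokens: int, p: PrefillInstance, d: DecodeInstance) -> float:
    """TTFT ~= T_queue + T_prefill (uncached tokens only) + T_transfer."""
    uncached = max(req_tokens - p.cached_prefix_tokens, 0)
    return (p.queue_delay_s
            + uncached / p.prefill_rate_tok_s
            + req_tokens / d.transfer_bw_tok_s)

def schedule(req_tokens: int, prefills, decodes, tbt_slo_s: float):
    """Pick the (prefill, decode) pair minimizing estimated TTFT; reject the
    request early if the chosen decode instance's predicted per-token time
    would already violate the TBT SLO."""
    ttft, p, d = min(((estimate_ttft(req_tokens, p, d), p, d)
                      for p in prefills for d in decodes), key=lambda x: x[0])
    if d.pred_time_per_token_s / tbt_slo_s > 1.0:
        return None               # early rejection: predicted TBT exceeds SLO
    return p, d, ttft
```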
4. KVCache Transfer and Compression Strategies
Transmission of KVCache between prefill and decode stages can be a principal bottleneck in P/D architectures, especially under long-context scenarios:
- Contiguous and Block-free KVCache Transfer: Systems such as FlowKV and P/D-Serve restructure KVCache memory layouts to merge layer-wise blocks into large contiguous segments, sharply reducing the number of per-request NVLink/NCCL transfer calls and cutting transmission latency by nearly 96% (Li et al., 3 Apr 2025, Jin et al., 15 Aug 2024).
- Pull-based Tensor-Centric RPC: KVDirect eliminates heavy synchronization by letting decode workers “pull” required KV cache blocks in coalesced RDMA reads, thus decoupling memory reservation/provision and reducing per-request latency by up to 55% over push-mode baselines (Chen et al., 13 Dec 2024).
- Lossy and Lossless Compression: HACK applies asymmetric 2-bit quantization to KV data and performs homomorphic matrix multiplications (approximate attention) directly on the quantized data, circumventing costly dequantization. Writing the quantization as $K \approx s_K \hat{K} + z_K$ for a scale $s_K$ and zero point $z_K$, the key approximation is $QK^{\top} \approx s_K\, Q\hat{K}^{\top} + z_K\, Q\mathbf{1}\mathbf{1}^{\top}$: attention scores are computed on the quantized $\hat{K}$ with only a cheap row-sum correction, as sketched below. This achieves job completion time (JCT) reductions of up to 70.9% over uncompressed baselines (Zhang et al., 5 Feb 2025).
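A small numerical check of this identity, using a per-tensor scale and zero point for clarity; HACK's actual quantization granularity, bit packing, and value-path handling are omitted, so this is only a sketch of the algebra.

```python
# Compute attention scores directly on asymmetrically quantized keys:
# with K ~= scale * Kq + zero, we have Q @ K.T ~= scale*(Q @ Kq.T) + zero*rowsum(Q).
import numpy as np

def quantize_asymmetric(K: np.ndarray, bits: int = 2):
    qmax = 2**bits - 1
    k_min, k_max = K.min(), K.max()
    scale = (k_max - k_min) / qmax
    zero = k_min
    Kq = np.clip(np.round((K - zero) / scale), 0, qmax).astype(np.int8)
    return Kq, scale, zero                      # K ~= scale * Kq + zero

def scores_on_quantized(Q, Kq, scale, zero):
    """No dequantized K is ever materialized; the zero-point term only needs
    the row sums of Q."""
    return scale * (Q @ Kq.T.astype(Q.dtype)) + zero * Q.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64)).astype(np.float32)
K = rng.standard_normal((128, 64)).astype(np.float32)
Kq, s, z = quantize_asymmetric(K)
err = np.abs(scores_on_quantized(Q, Kq, s, z) - Q @ K.T).mean()
print("mean abs error vs exact scores:", err)   # error stems from 2-bit rounding only
```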
5. Specializations: Multimodal and Heterogeneous Deployments
Adapting P/D architectures for heterogeneous hardware and multimodal input increases system complexity but yields notable performance improvements.
- Multimodal EPD Disaggregation: For large multimodal models, the EPD architecture (HydraInfer, EPD Disaggregation) inserts an explicit “encode” stage (for images/audio/video), allocating it to dedicated GPUs and separating prefill/decode to maximize batch size and minimize memory contention. Caching multimedia tokens, parallelizing intra-request image-patch encoding, and black-box resource optimization improve memory efficiency by up to 15× and batch size by up to 22× (Dong et al., 19 May 2025, Singh et al., 25 Dec 2024).
- Heterogeneous GPU Scheduling: HexGen-2 formalizes scheduling as a constraint optimization problem, partitioning the GPU communication topology graph via spectral partitioning and max-flow algorithms to assign heterogeneous GPUs to prefill or decode, considering distinct parallelism strategies (tensor, pipeline) per group (Jiang et al., 11 Feb 2025). HeteroScale’s topology-aware scheduler ensures that prefill and decode groups are assigned to network-affine subgroups (RDMA affinity) for minimal transfer bottlenecks.
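A toy placement sketch in the spirit of these schedulers: keep each prefill group and its partner decode group under the same switch so KVCache transfers stay RDMA-affine. The fields and the greedy heuristic are illustrative assumptions, not HexGen-2's spectral-partitioning/max-flow formulation or HeteroScale's production scheduler.

```python
# Toy topology-aware P/D placement: pair prefill and decode groups within a switch.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GPU:
    gpu_id: str
    switch: str          # network subgroup (RDMA domain / rack switch)
    tflops: float        # peak compute, favors prefill
    mem_bw_gbs: float    # memory bandwidth, favors decode

def place(gpus, prefill_size: int, decode_size: int):
    by_switch = defaultdict(list)
    for g in gpus:
        by_switch[g.switch].append(g)
    groups = []
    for switch, local in by_switch.items():
        # Within a switch, give the most compute-dense GPUs to prefill and the
        # most bandwidth-rich of the rest to decode, so each P/D pair never
        # moves KVCache across a switch boundary.
        local.sort(key=lambda g: g.tflops / g.mem_bw_gbs, reverse=True)
        prefill = local[:prefill_size]
        rest = sorted(local[prefill_size:], key=lambda g: g.mem_bw_gbs, reverse=True)
        decode = rest[:decode_size]
        if len(prefill) == prefill_size and len(decode) == decode_size:
            groups.append({"switch": switch, "prefill": prefill, "decode": decode})
    return groups
```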
6. Comparative Metrics and Empirical Outcomes
Experimental results across the literature consistently report:
System | Throughput Gain | TTFT / TPOT Improvements | Additional Notes
---|---|---|---
Mooncake | up to 525% (simulated) | 75% more requests at SLO | Long-context, early rejection, KVCache reuse (Qin et al., 24 Jun 2024) |
P/D-Serve | 6.7× over aggregation | 42% TTFT reduction, 46% transfer-time reduction | 8+ month production deployment, fine-grained RoCE grouping (Jin et al., 15 Aug 2024)
FlowKV | 15.2–48.9% over baseline | 96% lower KVCache transfer latency | Heterogeneous GPU support, segment management (Li et al., 3 Apr 2025) |
TD-Pipe | 1.91–2.73× over pipeline-parallel baselines | Pipeline bubble minimization | Hierarchical controller, prefill prediction (Zhang et al., 12 Jun 2025)
Arrow | 5.62–7.78× request rate | TTFT predictively bounded | Stateless instances, elastic pools (Wu et al., 17 May 2025) |
Nexus | 2.2× throughput vs vLLM | 20× TTFT, 2.5× TBT reductions | Intra-GPU SM partitioning, cost model (Shi et al., 9 Jul 2025) |
TaiChi | Up to 77% more goodput | 13.2× TTFT, 1.69× TPOT reductions | Hybrid PD (aggreg/disagg), per-request scheduling (Wang et al., 4 Aug 2025) |
These results demonstrate dramatic improvements in both throughput and SLO adherence, especially under diverse prompt lengths, tidal (burst) request patterns, and heterogeneous infrastructures.
7. Practical Considerations and Trade-offs
Common operational trade-offs in P/D architectures include:
- Communication vs. Interference: While disaggregation eliminates prefill/decode interference, it introduces potentially costly KVCache transfers. Hybrid and partially-disaggregated strategies (EcoServe, semi-PD, TaiChi) attempt to balance these competing demands by using temporal disaggregation or intra-instance resource splits (Du et al., 25 Apr 2025, Hong et al., 28 Apr 2025, Wang et al., 4 Aug 2025).
- Granularity of Partitioning: Finer-grained (SM-level, chunked pipelining, micro-request splitting) approaches allow more dynamic adaptation but increase scheduler complexity and may reduce hardware efficiency at very high concurrency.
- Adaptive and Predictive Scheduling: Early rejection, SLO-aware token budgeting, and dynamic pooling (Arrow, HyperFlexis) are most effective when accurate performance models are available for queueing, memory, and transfer overheads (Wu et al., 17 May 2025, Yousefijamarani et al., 21 Aug 2025). Metric-driven autoscaling policies based on empirically validated signals (decode TPS) are necessary for efficient cloud deployment at scale (Li et al., 27 Aug 2025); a proportional-controller sketch follows this list.
- Storage and Weight Duplication: Disaggregated computation with unified storage (semi-PD) avoids weight replication and storage imbalance, achieving both low overhead and high request capacity (Hong et al., 28 Apr 2025).
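As referenced in the autoscaling item above, here is a proportional-controller sketch driven by decode TPS; the per-replica target and ratio band are illustrative assumptions, not HeteroScale's production values.

```python
# Decode-TPS-driven proportional autoscaler that also keeps the prefill:decode
# replica ratio inside an empirically chosen band.
import math

TARGET_DECODE_TPS_PER_REPLICA = 4000.0  # assumed sustainable tokens/s per decode replica
PD_RATIO_BOUNDS = (0.4, 0.8)            # assumed allowed prefill:decode replica ratio

def autoscale(decode_tps: float, current_prefill: int) -> tuple[int, int]:
    """Return desired (prefill, decode) replica counts for the next interval."""
    # Proportional control on the decode pool: size it to the observed token load.
    desired_decode = max(1, math.ceil(decode_tps / TARGET_DECODE_TPS_PER_REPLICA))
    # Clamp the prefill pool into the allowed ratio band relative to decode.
    lo, hi = PD_RATIO_BOUNDS
    desired_prefill = min(max(current_prefill, math.ceil(lo * desired_decode)),
                          max(1, math.floor(hi * desired_decode)))
    return desired_prefill, desired_decode

# Example: 90k decode tokens/s currently served with 8 prefill replicas.
print(autoscale(90_000.0, current_prefill=8))   # -> (10, 23) under these assumptions
```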
P/D Disaggregated Architectures, through phase-aware hardware allocation, fine-grained resource control, predictive and dynamic scheduling, and advanced KVCache transfer and compression techniques, set a robust foundation for scalable, efficient, and SLO-compliant large model inference in modern datacenter deployments. Their continued evolution is characterized by the fusion of algorithmic, systems, and hardware-affinity optimizations, as substantiated by rigorous empirical studies and production-scale deployments.