PD Disaggregation in LLM Inference
- PD Disaggregation is a framework that splits the prefill (prompt encoding) and decode (token generation) stages to allow independent optimization and scaling.
- It incorporates targeted techniques like stage-aware block pruning and token-aware KV cache pruning, achieving up to 20% latency improvements and nearly 5× bandwidth savings.
- In LLM serving, the methodology underpins multi-stage scheduling and dynamic load balancing for high SLO compliance; the same decomposition principle recurs in energy and spatial disaggregation.
Prefill-Decode Disaggregation (PD Disaggregation) is a technical concept that denotes the explicit architectural and computational separation of the "prefill" (prompt encoding) and "decode" (token-by-token generation) stages in large-scale model inference or data analysis. Originating in the context of LLM serving, the principle has broad applicability, appearing as a methodological and optimization motif in energy disaggregation, spatial population/property modeling, and high-throughput online inference. In all cases, "disaggregation" refers to the decomposition of a global, aggregate process into stage- or component-specific workflows, which can then be independently optimized, computed, or interpreted.
1. Core Principles of PD Disaggregation
PD Disaggregation in LLM inference draws a sharp line between two stages with distinct computational and resource profiles:
- Prefill Stage ("Producer"): Processes the entire input sequence in parallel, computes deep representations, and constructs the Key-Value (KV) cache that encodes the context. This stage is compute-bound, favoring devices with high FLOPs and massive batch GEMM throughput.
- Decode Stage ("Consumer"): Autoregressively generates one token at a time, repeatedly referencing the prefabricated KV cache. This stage is memory-bandwidth-bound and is bottlenecked by cache reads.
Disaggregation assigns these stages to specialized compute nodes (GPUs or clusters), thereby allowing independent batching, scaling, and scheduling strategies tailored to the resource constraints and workload dynamics of each stage. By removing mutual interference—where prefill’s compute peaks disrupt decode’s need for low-latency memory bandwidth—PD disaggregation achieves superior throughput, lower tail latencies, and high SLO compliance, especially for high-concurrency regimes (Liao et al., 26 Nov 2025, Wang et al., 4 Aug 2025, Liu et al., 1 Dec 2025, Zhang et al., 29 Aug 2025).
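The producer/consumer relationship can be made concrete with a minimal, model-free sketch; all names, the toy "model", and the in-process queue below are hypothetical stand-ins (real systems transfer the KV cache across GPUs or nodes), not the API of any cited system:

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Request:
    req_id: int
    prompt_tokens: list[int]                                # full prompt, encoded in parallel by prefill
    kv_cache: list[tuple] = field(default_factory=list)     # filled by the prefill stage
    output_tokens: list[int] = field(default_factory=list)


def prefill_worker(req: Request) -> Request:
    """Compute-bound 'producer': encodes the whole prompt at once and builds the KV cache."""
    for pos, tok in enumerate(req.prompt_tokens):
        # Placeholder for a layer's key/value projection of this token.
        req.kv_cache.append((pos, tok))
    return req


def decode_worker(req: Request, max_new_tokens: int) -> Request:
    """Memory-bandwidth-bound 'consumer': generates one token at a time against the KV cache."""
    for _ in range(max_new_tokens):
        # Every step re-reads the growing cache, which is the real bottleneck in practice.
        next_tok = sum(tok for _, tok in req.kv_cache) % 50_000    # toy stand-in for the model
        req.output_tokens.append(next_tok)
        req.kv_cache.append((len(req.kv_cache), next_tok))
    return req


if __name__ == "__main__":
    # The queue stands in for the KV-cache transfer link between the disaggregated P and D pools.
    transfer: Queue = Queue()
    transfer.put(prefill_worker(Request(req_id=0, prompt_tokens=[11, 42, 7])))
    finished = decode_worker(transfer.get(), max_new_tokens=4)
    print(finished.output_tokens)
```

The point of the split is that `prefill_worker` and `decode_worker` can now be batched, scaled, and placed on hardware independently, which is exactly what the pool architectures in the next section exploit.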
2. Disaggregation in LLM Systems: Architectures and Scheduling
Recent LLM serving pipelines implement PD Disaggregation as a multi-pool GPU architecture:
| Stage | Primary Resource Demand | Batched | Representative Systems |
|---|---|---|---|
| Prefill (P) | Compute-bound (GEMM) | Yes | DOPD, DistServe, Trinity |
| Decode (D) | Memory-bandwidth (KV reads) | Yes | DOPD, TaiChi, Trinity |
| Vector-Search (VS) | HBM (ANN search, RAG) | Yes | Trinity |
Trinity (Liu et al., 1 Dec 2025) further disaggregates vector search (retrieval for RAG or answer caches) into its own GPU pool, optimizing bandwidth utilization and request tail-latency by staging retrieval requests as a first-class, stage-preemptible service.
Schedulers in PD-disaggregated systems incorporate:
- Dynamic P/D Ratio Adjustment: Adaptive rebalancing of prefill and decode instance counts based on real-time or forecasted request mix (average input/output length, concurrency), analytically derived so that the prefill pool's KV-cache production rate matches the decode pool's consumption rate for maximal goodput (Liao et al., 26 Nov 2025).
- Length-aware Request Batching and Scheduling: Knapsack heuristics or cost models that batch requests with similar input lengths for prefill, or similar token counts for decode, ensuring balanced resource usage and short queueing (Wang et al., 4 Aug 2025); a batching sketch follows this list.
- Stage-aware Preemption/Prioritization: EDF and FIFO scheduling within multi-stage resource pools (prefill-priority, decode-priority), with requests admitted or flushed to ensure latency SLOs for both TTFT and TPOT (Liu et al., 1 Dec 2025).
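As referenced in the list above, length-aware prefill batching can be approximated with a greedy knapsack over a per-batch token budget; the function name and the budget value are illustrative assumptions, not the scheduler of DOPD, TaiChi, or Trinity:

```python
def batch_by_length(pending, token_budget=8192):
    """Greedy length-aware batching for the prefill pool.

    `pending` is a list of (request_id, prompt_len) pairs. Requests with similar
    lengths are packed together under a per-batch token budget so that one very
    long prompt does not dictate the padding and latency of a whole batch.
    """
    batches, current, used = [], [], 0
    # Sorting by length keeps similarly sized prompts adjacent.
    for req_id, length in sorted(pending, key=lambda r: r[1]):
        if current and used + length > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(req_id)
        used += length
    if current:
        batches.append(current)
    return batches


# Three short prompts share one batch; the very long prompt gets its own.
print(batch_by_length([(1, 900), (2, 1100), (3, 1000), (4, 7000)], token_budget=4096))
# -> [[1, 3, 2], [4]]
```

Sorting by length keeps similarly sized prompts adjacent, so no single long prompt inflates the padding and latency of an otherwise short batch.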
3. Optimizations in PD-Disaggregated Inference
Advanced PD Disaggregation systems incorporate model-level and system-level optimizations:
- Stage-aware Block Pruning: The selective removal (or distillation) of Transformer blocks in a way that is sensitive to the error propagation characteristics of each stage. Prefill must retain high representational fidelity to avoid compounding errors in downstream token generation; decode tolerates more aggressive pruning without significant quality loss (Zhang et al., 29 Aug 2025).
- Token-aware KV Cache Pruning: During decode, transferring and reusing only the necessary portions of the KV cache (primarily the first and last tokens in selected deep layers) yields dramatic reductions in inter-node bandwidth (up to 4.95× reduction) with negligible output degradation (Zhang et al., 29 Aug 2025); a pruning sketch follows this list.
- Continuous Vector Search Batching: For RAG and retrieval, continuously batching vector queries at the granularity of graph-node 'extends', maximizing HBM utilization and avoiding GPU tail latency outliers—integral in PD-disaggregated LLM serving (Liu et al., 1 Dec 2025).
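As referenced in the list above, a rough sketch of token-aware KV-cache pruning applied before the P-to-D transfer is given below; the layer cutoff and the keep-first/keep-last rule are illustrative simplifications under assumed shapes, not the exact policy of (Zhang et al., 29 Aug 2025):

```python
import numpy as np


def prune_kv_for_transfer(kv_cache, deep_layer_start=24, keep_first=1, keep_last=1):
    """Drop KV entries the decode stage is unlikely to need before the P->D transfer.

    `kv_cache` maps layer index -> (keys, values), each shaped [seq_len, head_dim].
    Shallow layers are transferred in full; for deeper layers only the first and
    last tokens are kept, shrinking the transfer volume.
    """
    pruned = {}
    for layer, (keys, values) in kv_cache.items():
        if layer < deep_layer_start:
            pruned[layer] = (keys, values)               # shallow layers: keep everything
        else:
            idx = list(range(keep_first)) + list(range(len(keys) - keep_last, len(keys)))
            pruned[layer] = (keys[idx], values[idx])     # deep layers: first + last tokens only
    return pruned


# Toy cache: 32 layers, 512 prompt tokens, head_dim 128.
cache = {layer: (np.zeros((512, 128)), np.zeros((512, 128))) for layer in range(32)}
slim = prune_kv_for_transfer(cache)
full_bytes = sum(k.nbytes + v.nbytes for k, v in cache.values())
slim_bytes = sum(k.nbytes + v.nbytes for k, v in slim.values())
print(f"transfer shrinks to {slim_bytes / full_bytes:.1%} of the dense cache")
```

In this sketch only the deep layers are thinned and shallow layers are sent in full, mirroring the "selected deep layers" policy described above.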
Summary of performance gains from targeted pruning within PD disaggregation (Zhang et al., 29 Aug 2025):
| Method | Latency (ms) | KV Transfer Volume (GB) |
|---|---|---|
| Dense baseline | 287.35 | 4.0 |
| PD + Targeted Pruning | 228.63 (~20% faster) | 0.8 (~5× reduction) |
4. Analytical and Performance Models
PD Disaggregation is mathematically modeled via queueing and throughput equations:
- Production and Consumption Rates: with $N_P$ prefill instances each completing prefills at per-instance rate $r_P$, and $N_D$ decode instances each draining requests at per-instance rate $r_D$, the pools generate KV caches at rate $N_P \cdot r_P$ (prefill) and consume them at rate $N_D \cdot r_D$ (decode).
- Optimal P/D Ratio: $N_P / N_D \approx r_D / r_P$, matching production to consumption, with the system scaled such that backpressure and queueing are minimized (Liao et al., 26 Nov 2025); a worked sketch follows at the end of this section.
- Goodput under SLOs: Defined as the fraction of requests meeting both TTFT and TPOT targets, optimized via dynamic P/D scaling and fine-grained latency shifting (Wang et al., 4 Aug 2025, Liao et al., 26 Nov 2025).
Highly dynamic scheduling (e.g., DOPD’s ARIMA-driven adjustment) achieves >99% SLO attainment while reducing over-provisioned resources (Liao et al., 26 Nov 2025).
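A minimal numerical sketch of the balance condition and goodput metric above; the per-instance rates, GPU counts, and SLO targets are made-up values rather than measurements from the cited systems:

```python
def optimal_pd_ratio(prefill_rate_per_inst: float, decode_rate_per_inst: float) -> float:
    """Balance condition: N_P * r_P ~= N_D * r_D, i.e. N_P / N_D ~= r_D / r_P."""
    return decode_rate_per_inst / prefill_rate_per_inst


def split_instances(total_gpus: int, ratio: float) -> tuple[int, int]:
    """Split a fixed GPU budget into prefill/decode pools according to the P/D ratio."""
    n_prefill = max(1, round(total_gpus * ratio / (1 + ratio)))
    return n_prefill, total_gpus - n_prefill


def goodput(request_latencies, ttft_slo_ms: float, tpot_slo_ms: float) -> float:
    """Fraction of requests meeting BOTH the TTFT and TPOT targets."""
    ok = sum(1 for ttft, tpot in request_latencies if ttft <= ttft_slo_ms and tpot <= tpot_slo_ms)
    return ok / len(request_latencies)


# Each prefill instance finishes 4 req/s; each decode instance drains 1 req/s.
ratio = optimal_pd_ratio(prefill_rate_per_inst=4.0, decode_rate_per_inst=1.0)         # -> 0.25
print(split_instances(total_gpus=20, ratio=ratio))                                    # -> (4, 16)
print(goodput([(180, 45), (950, 40), (200, 80)], ttft_slo_ms=500, tpot_slo_ms=50))    # -> 0.33...
```

With four prefills per second per prefill instance and one completed request per second per decode instance, the balance condition allocates roughly one prefill GPU for every four decode GPUs.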
5. Application Domains: Disaggregation Beyond LLMs
While PD Disaggregation has become a central organizing principle in high-throughput LLM serving, its core methodology—partitioning aggregate signals or processes to recover stage/component-level information—manifests in several research areas:
- Energy Disaggregation: Recovery of device- or source-specific consumption from aggregated building loads using stage-wise models and advanced inference, e.g., FIR-adaptive filtering (Dong et al., 2013), PSO-based unsupervised segmentation (Brucke et al., 2020), and structured dictionary learning (Pandey et al., 2019).
- Spatial Disaggregation: Bayesian models for recovering high-resolution spatial fields (e.g., population density (Rahman et al., 2023), property value (Archbold et al., 2023), spatial misalignment (Suen et al., 14 Feb 2025), survey-based proportions (Benedetti et al., 2021)) from aggregate or incomplete observations, leveraging disaggregation to propagate uncertainty and enable local inference.
- Power Flow Disaggregation: Unsupervised statistical decomposition of grid-level measurements to recover photovoltaic contributions and demand by leveraging temporal and spectral model separation (Sossan et al., 2017).
- Preference Disaggregation: Reconstruction of additive value functions from aggregate or partial preference information, as in Multi-Criteria Decision Making, by solving reverse-optimization problems (Brunelli et al., 16 Oct 2024).
- Biomedical Subtype Disaggregation: Partitioning heterogeneous diseases (e.g., Parkinson's disease) into molecular or progression-based subpopulations using hierarchical Bayesian clustering or trajectory profile clustering on longitudinal data (Burghardt et al., 2023, Krishnagopal et al., 2019).
6. Trade-offs, Limitations, and Research Directions
PD Disaggregation introduces trade-offs:
- Accuracy vs. Efficiency: Aggressive block or KV cache pruning yields compute/bandwidth savings at the cost of model accuracy. The "elbow point" for pruning hyperparameters is determined via iterative calibration (Zhang et al., 29 Aug 2025); a calibration sketch follows this list.
- Dynamic vs. Static Topologies: Most PD disaggregation architectures and pruning strategies assume static stage assignments. Extending these to dynamic, multi-tenant, or heterogeneous environments remains non-trivial (Zhang et al., 29 Aug 2025).
- Memory Management: Dynamic subgraph loading and fragmentation must be addressed for pruned/disaggregated models at scale (Zhang et al., 29 Aug 2025).
- SLO Regimes: PD Disaggregation excels for strict streaming (TPOT) SLOs with relaxed TTFT, while aggregation can be superior for ultra-tight TTFT and lax streaming, motivating hybrid or unified architectures (e.g., TaiChi, which interpolates between extremes) (Wang et al., 4 Aug 2025).
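As referenced in the first item of the list above, elbow-point calibration can be sketched as a sweep over pruning ratios that accepts the most aggressive setting whose quality drop stays within a tolerance; the tolerance and the toy quality curve are assumptions, not the procedure of (Zhang et al., 29 Aug 2025):

```python
def pick_elbow(candidates, evaluate, max_drop=0.01):
    """Pick the most aggressive pruning ratio whose quality drop stays within tolerance.

    `candidates` are pruning ratios in increasing order of aggressiveness;
    `evaluate` returns a quality score (e.g., calibration-set accuracy) for a ratio.
    """
    baseline = evaluate(0.0)
    chosen = 0.0
    for ratio in sorted(candidates):
        if baseline - evaluate(ratio) <= max_drop:
            chosen = ratio      # still within tolerance: accept the extra savings
        else:
            break               # quality falls off sharply: the elbow has been passed
    return chosen


def toy_quality(ratio):
    """Toy quality curve: flat until ~40% pruning, then degrading linearly."""
    return 0.70 - max(0.0, ratio - 0.4) * 0.5


print(pick_elbow([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], toy_quality))   # -> 0.4
```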
Extensions include:
- Mixture-of-Experts Pruning: Per-stage expert selection for MoE architectures.
- Vector-Search Pool Disaggregation: Dedicated RAG GPU pools decoupled from P/D serving.
- Joint Optimization: Simultaneous tuning of block pruning, cache pruning, and quantization across a multi-stage pipeline.
7. Impact and Benchmark Results
PD Disaggregation and its associated optimizations have achieved significant empirical improvements:
| System / Method | Primary Improvement | TTFT Reduction | TPOT Reduction | Bandwidth Savings |
|---|---|---|---|---|
| DOPD, Trinity | 1.5×–1.77× throughput/goodput over aggregation | 67.5% | 22.8%–50% | 4.95× |
| Targeted Pruning | >20% latency savings | – | – | 4.95× |
| PD Disaggregation vs. Unified | +2.19 pts avg. accuracy (8-task mean, LLaMA-3.1-8B) | – | – | – |
Evaluations span models including LLaMA2-13B and Qwen2.5-7B/14B across 13 benchmarks (e.g., MMLU, ARC, HellaSwag) (Zhang et al., 29 Aug 2025). Integration with advanced schedulers and vector-search scaling yields further throughput and tail-latency improvements (Trinity: 30% higher QPS, 35% lower P95 latency, 25% lower P95 TTFT) (Liu et al., 1 Dec 2025).
PD Disaggregation has become a foundational architecture and modeling strategy across modern distributed inference and signal decomposition, enabling scalable, stage-sensitive optimization in applications ranging from LLMs to energy systems, spatial analytics, and biomedical data science (Liao et al., 26 Nov 2025, Zhang et al., 29 Aug 2025, Wang et al., 4 Aug 2025, Liu et al., 1 Dec 2025, Brucke et al., 2020, Sossan et al., 2017, Rahman et al., 2023, Archbold et al., 2023, Brunelli et al., 16 Oct 2024, Benedetti et al., 2021, Suen et al., 14 Feb 2025).