Phase Disaggregation & Workload Partitioning
- Phase disaggregation is the process of splitting complex workflows into distinct phases to optimize resource utilization and performance.
- Workload partitioning assigns disaggregated phases to heterogeneous resources using deterministic, Bayesian, or adaptive strategies to minimize latency and energy usage.
- Reported results include up to a 2× improvement in the throughput–interactivity Pareto space and energy savings of up to 39% (prefill) and 48% (decode) in large-scale LLM inference.
Phase disaggregation and workload partitioning are foundational concepts in distributed computing and large-scale AI systems. They refer to the systematic decomposition of computational pipelines or workflows into semantically distinct execution phases, followed by the allocation, scheduling, and optimization of these segments across heterogeneous hardware or resources to improve latency, throughput, energy efficiency, or reliability. These techniques have become central in deep learning inference, scientific workflows, and networking, particularly as models and workloads have grown in scale and heterogeneity.
1. Formal Foundations of Phase Disaggregation
Phase disaggregation is the separation of a complex workflow or computational task into disjoint, semantically meaningful stages—each with potentially distinct resource bottlenecks, cost profiles, and optimization criteria. In the context of deep learning inference, the canonical split is between the “prefill” (context or prompt processing) phase and the “decode” (autoregressive generation) phase. The definitions are operationalized as follows (Mitra et al., 5 Jun 2025, Kumar et al., 16 Oct 2025):
- Prefill (context) phase: All input tokens are processed in parallel to construct the entire key-value (KV) cache used for subsequent generation. It is compute-intensive, dominated by dense matrix–matrix multiplications.
- Decode (generation) phase: Tokens are generated one by one in an autoregressive loop, where the KV cache is repeatedly read. This stage is memory-bandwidth-bound, dominated by reads from high-bandwidth memory with low arithmetic intensity (few FLOPs per byte moved).
This structural segmentation allows independent optimization, specialized placement, and separate scaling of each phase based on its dominant resource demands, enabling both microarchitectural specialization and improved system-level utilization (Kumar et al., 16 Oct 2025, Basit et al., 21 Feb 2026).
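The compute-bound versus memory-bound contrast can be made concrete with a rough arithmetic-intensity estimate. The sketch below computes FLOPs per byte for one dense layer when many tokens are processed at once (prefill-like) and when a single token is processed (decode-like). The layer shape, the 2-byte element width, and the simplified traffic model (weights read once, activations read and written once) are illustrative assumptions, not figures from the cited papers.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for Y = X @ W, with X of shape (tokens, d_in)."""
    flops = 2 * tokens * d_in * d_out                     # multiply + add per MAC
    weight_bytes = d_in * d_out * bytes_per_elem          # W is read once
    act_bytes = tokens * (d_in + d_out) * bytes_per_elem  # read X, write Y
    return flops / (weight_bytes + act_bytes)


# Hypothetical layer shape; only the ratio between the two cases matters.
prefill = arithmetic_intensity(tokens=4096, d_in=8192, d_out=8192)
decode = arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)

print(f"prefill-like: {prefill:.1f} FLOPs/byte")
print(f"decode-like:  {decode:.2f} FLOPs/byte")
```

Under these assumptions the prefill-like case reuses each weight across thousands of tokens, while the decode-like case performs roughly one FLOP per byte of weights read, which is why the two phases favor different hardware and batching choices.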
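In classical distributed workflows, phase disaggregation minimizes not only average completion time but also its variance. Let a workflow be split into $n$ independent channels, $i = 1, \dots, n$, with processing times $T_i$ modeled as normal with mean $\mu_i$ and standard deviation $\sigma_i$. The completion time $T = \max_i T_i$ has the distribution
$$P(T \le t) = \prod_{i=1}^{n} \Phi\!\left(\frac{t - \mu_i}{\sigma_i}\right),$$
where $\Phi$ is the standard normal CDF (Huberman et al., 2015, Chua et al., 2015).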
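As a minimal numerical sketch, assume each channel's processing time is normal and the channels are independent, so the completion time is the maximum of the channel times and its CDF is the product of the per-channel normal CDFs. The function names and the example channel parameters below are hypothetical; the mean and variance are obtained by a simple midpoint integration over the CDF.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def makespan_cdf(t, mus, sigmas):
    """P(max_i T_i <= t) for independent normal channel times."""
    p = 1.0
    for mu, sigma in zip(mus, sigmas):
        p *= phi((t - mu) / sigma)
    return p


def makespan_moments(mus, sigmas, steps=20000):
    """Mean and variance of the makespan via midpoint integration of the CDF."""
    lo = min(m - 8.0 * s for m, s in zip(mus, sigmas))
    hi = max(m + 8.0 * s for m, s in zip(mus, sigmas))
    dt = (hi - lo) / steps
    mean = second = 0.0
    prev = makespan_cdf(lo, mus, sigmas)
    for k in range(1, steps + 1):
        t = lo + k * dt
        cur = makespan_cdf(t, mus, sigmas)
        mass = cur - prev        # probability of landing in (t - dt, t]
        mid = t - 0.5 * dt
        mean += mid * mass
        second += mid * mid * mass
        prev = cur
    return mean, second - mean * mean


# Hypothetical two-channel example.
mean, var = makespan_moments(mus=[5.0, 6.0], sigmas=[1.0, 2.0])
print(f"E[T] = {mean:.3f}, Var[T] = {var:.3f}")
```

A quick sanity check is the case of two independent standard normal channels, for which the expected maximum is $1/\sqrt{\pi}$.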
2. Workload Partitioning Principles and Algorithms
Workload partitioning refers to the assignment of decomposed phases or subtasks to available resources with the goal of optimizing completion time, jitter, energy, or fairness. The core principle is to exploit heterogeneity and statistical variation across resources and requests, using explicit models to direct allocation.
Deterministic and Distribution-Aware Partitioning: For $n$ parallel phases (channels), finding the split vector $f = (f_1, \dots, f_n)$ involves computing the expected makespan $E[T]$ and its variance $\mathrm{Var}[T]$ as one-dimensional integrals over the completion time distribution, with $f$ parameterized over the unit simplex. Optimal splits $f^*$ are those that trace out the Pareto frontier in the $(E[T], \mathrm{Var}[T])$ plane (Huberman et al., 2015).
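The frontier construction can be sketched for two channels. The example below sweeps the share given to the first channel, evaluates the mean and variance of the makespan for each split, and keeps the non-dominated points. It assumes normal, independent channel times whose mean and standard deviation scale linearly with the assigned share; that scaling rule, the channel parameters, and all function names are assumptions made for illustration, not the model of the cited work.

```python
import math


def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_moments(mus, sigmas, steps=4000):
    """Mean and variance of max_i T_i for independent normal T_i."""
    lo = min(m - 8.0 * s for m, s in zip(mus, sigmas))
    hi = max(m + 8.0 * s for m, s in zip(mus, sigmas))
    dt = (hi - lo) / steps

    def cdf(t):
        return math.prod(phi((t - m) / s) for m, s in zip(mus, sigmas))

    mean = second = 0.0
    prev = cdf(lo)
    for k in range(1, steps + 1):
        cur = cdf(lo + k * dt)
        mid = lo + (k - 0.5) * dt
        mean += mid * (cur - prev)
        second += mid * mid * (cur - prev)
        prev = cur
    return mean, second - mean * mean


# Hypothetical (mean, std) of each channel's time to process the whole job alone.
CHANNELS = [(10.0, 1.0), (14.0, 4.0)]


def sweep(n_points=49):
    """Evaluate interior splits (f, 1 - f) of the unit simplex."""
    points = []
    for k in range(1, n_points + 1):
        f = k / (n_points + 1)
        shares = (f, 1.0 - f)
        mus = [share * m for share, (m, _) in zip(shares, CHANNELS)]
        sigmas = [share * s for share, (_, s) in zip(shares, CHANNELS)]
        mean, var = max_moments(mus, sigmas)
        points.append((f, mean, var))
    return points


def pareto_front(points):
    """Keep splits not dominated in (mean, variance); lower is better for both."""
    front = []
    for f, mean, var in points:
        dominated = any(
            m2 <= mean and v2 <= var and (m2 < mean or v2 < var)
            for _, m2, v2 in points
        )
        if not dominated:
            front.append((f, mean, var))
    return front


points = sweep()
front = pareto_front(points)
for f, mean, var in front:
    print(f"f1={f:.2f}  E[T]={mean:.3f}  Var[T]={var:.3f}")
```

A risk-neutral operator would pick the frontier point with the smallest mean, while a tail-sensitive one would move along the frontier toward lower variance.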
Bayesian Partitioning: When resource performance is uncertain or nonstationary, Bayesian approaches infer not only the means and variances but also scaling exponents that capture sublinear speedups. In the model for each phase's completion time (Chua et al., 2015), these parameters are learned via hierarchical priors and Gibbs sampling from workflow traces. The optimal split $f^*$ is chosen to minimize expected completion time subject to variance or tail constraints.
Dynamic/Adaptive Partitioning: Online systems may implement Refine-and-Prune unsupervised partitioning, as in EWSJF, which clusters live request lengths via k-means and recursively splits or merges queues to yield near-homogeneous cost “phases.” The boundaries, as well as the prioritization policies for each group, are tuned by Bayesian meta-optimization (Sidik et al., 29 Jan 2026).
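The clustering step alone can be illustrated with plain one-dimensional k-means (Lloyd's algorithm) over request lengths. This is a simplified stand-in, not the EWSJF Refine-and-Prune procedure: it omits recursive splitting and merging as well as the Bayesian tuning, and the function names and sample lengths are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns sorted centroids and queue boundaries."""
    ordered = sorted(values)
    # Deterministic initialization: place centroids at evenly spaced quantiles.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    centroids.sort()
    # Boundaries are midpoints between neighboring centroids.
    boundaries = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return centroids, boundaries


def assign_queue(length, boundaries):
    """Index of the queue whose length range contains the request."""
    return sum(length > b for b in boundaries)


# Hypothetical prompt lengths (tokens) observed in a recent window.
prompt_lengths = [32, 40, 48, 64, 900, 1000, 1100, 7800, 8000, 8200]
centroids, boundaries = kmeans_1d(prompt_lengths, k=3)
print(centroids, boundaries)
print(assign_queue(50, boundaries), assign_queue(1200, boundaries))
```

Each resulting queue holds requests of similar cost, so a per-queue priority policy can then be applied to it.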
3. Disaggregated System Architectures
Disaggregated architectures physically separate each phase’s execution across independently scalable resource pools or hardware subsets. In LLM inference, the dominant paradigm involves splitting into prefill and decode pools (Mitra et al., 5 Jun 2025, Kumar et al., 16 Oct 2025, Basit et al., 21 Feb 2026):
- Prefill pool: Runs high-throughput, compute-optimized configurations with large batch sizes (e.g., chunked pipeline parallelism). Optimized for first-token latency (FTL/TTFT).
- Decode pool: Adopts memory-optimized, low-latency scheduling, high tensor parallelism, and small batches. Optimized for token-to-token latency (TTL/ITL).
A front-end load balancer routes requests. Prefill outputs (the KV caches) are transferred across a fabric or network, and decode starts as soon as sufficient context is available.
System-level optimization involves performance and bandwidth modeling:
- Prefill throughput per GPU and decode throughput are each modeled explicitly.
- The NVLink/PCIe bandwidth required for KV transfer is modeled precisely to avoid communication bottlenecks (Mitra et al., 5 Jun 2025).
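A back-of-the-envelope version of the KV-transfer estimate is sketched below. It uses the standard KV-cache size accounting (keys and values for every layer, KV head, and head dimension), but the model shape, request rate, and link speed are hypothetical, and the cited work's own bandwidth model may include terms omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 KV entries


def kv_bytes_per_token(m: ModelShape) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_elem


def required_bandwidth_gb_per_s(m, prompt_tokens, prompts_per_second):
    """Sustained bandwidth needed to ship every finished prefill's KV cache."""
    return kv_bytes_per_token(m) * prompt_tokens * prompts_per_second / 1e9


def transfer_time_ms(m, prompt_tokens, link_gb_per_s):
    """Time to move one prompt's KV cache over a link of the given speed."""
    return kv_bytes_per_token(m) * prompt_tokens / (link_gb_per_s * 1e9) * 1e3


# Hypothetical model shape and traffic; not taken from the cited papers.
shape = ModelShape(layers=80, kv_heads=8, head_dim=128)
need = required_bandwidth_gb_per_s(shape, prompt_tokens=8192, prompts_per_second=4)
delay = transfer_time_ms(shape, prompt_tokens=8192, link_gb_per_s=50)
print(f"required: {need:.2f} GB/s, per-prompt transfer: {delay:.1f} ms")
```

Comparing the required figure against the fabric's usable bandwidth indicates whether KV transfer, rather than compute, would limit the disaggregated deployment.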
Kernel-Granular Disaggregation: Tessera demonstrates even finer-grained partitioning, mapping individual kernels within a phase (rather than whole phases) to hardware devices best aligned for their operational intensities. Dependency analysis at the PTX level ensures correctness, and workload-aware MILP assignment optimizes for stage time (either throughput- or latency-oriented objectives) (Hu et al., 11 Apr 2026).
4. Scheduling, Resource Matching, and Rate Optimization
Ensuring phase and system efficiency requires dynamic rate matching and intelligent resource scheduling:
- Dynamic Rate Matching: The allocation of GPUs to prefill vs. decode pools is solved dynamically for each latency SLO target, adapting the ctx:gen ratio as workload or SLA requirements shift. Fixed ratios can degrade throughput by up to 20% compared to this dynamic allocation (Mitra et al., 5 Jun 2025).
- Density-Weighted and Urgency/Fairness-Aware Scheduling: EWSJF computes context-aware priority scores per queue, using scheduling utilities (cost/urgency/fairness) that can adapt per phase based on live metrics and feedback-optimized weights via Bayesian meta-optimization. This yields both improved throughput and improved tail latency for latency-sensitive (short prompt) queues (Sidik et al., 29 Jan 2026).
- Energy and DVFS-Aware Allocation: BiScale jointly optimizes placement and per-phase frequency control. At coarse timescales, an ILP solves for the phase-aware GPU/frequency mix. At finer timescales, prefill uses MPC to optimize frequency for energy while maintaining TTFT SLOs, whereas decode phase frequency is adapted per-batch to harvest slack energy, subject to TPOT SLOs (Basit et al., 21 Feb 2026).
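Rate matching in its simplest form balances the request rate the prefill pool can produce against the rate the decode pool can consume. The sketch below searches all GPU splits and keeps the one with the highest sustainable end-to-end rate. It assumes per-GPU rates that already satisfy the latency SLOs and scale linearly with pool size; the function name and all numbers are hypothetical, and the cited systems solve a richer problem.

```python
def rate_match(total_gpus, prefill_rps_per_gpu, decode_rps_per_gpu):
    """Return (n_prefill, n_decode, rate) maximizing the end-to-end request rate.

    The pipeline rate is limited by the slower of the two pools.
    """
    best = None
    for n_prefill in range(1, total_gpus):
        n_decode = total_gpus - n_prefill
        rate = min(n_prefill * prefill_rps_per_gpu, n_decode * decode_rps_per_gpu)
        if best is None or rate > best[2]:
            best = (n_prefill, n_decode, rate)
    return best


# Short prompts: prefill is cheap, so most GPUs go to decode.
short_prompts = rate_match(16, prefill_rps_per_gpu=3.0, decode_rps_per_gpu=1.0)

# Longer prompts reduce per-GPU prefill rate; the best ratio shifts.
long_prompts = rate_match(16, prefill_rps_per_gpu=1.0, decode_rps_per_gpu=1.0)

# Rate obtained if the split tuned for short prompts were kept unchanged.
stale_rate = min(short_prompts[0] * 1.0, short_prompts[1] * 1.0)

print(short_prompts, long_prompts, stale_rate)
```

In this toy setting, keeping the split tuned for short prompts after the workload shifts leaves the prefill pool as the bottleneck, which is the effect that re-solving the allocation is meant to avoid.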
In RL post-training (RollMux), phase-level synchronization bubbles are reclaimed via co-execution group abstraction and a two-level scheduler (inter-group placement and intra-group round-robin with provable utilization optimality), ensuring 100% SLO attainment while boosting cost efficiency by 1.84× (Wu et al., 12 Dec 2025).
5. Empirical Results, Benchmarks, and Trade-offs
Systematic experiments across diverse hardware, models, and workloads converge on several key findings (Mitra et al., 5 Jun 2025, Kumar et al., 16 Oct 2025, Hu et al., 11 Apr 2026):
- Effectiveness of Disaggregation: Disaggregation is most effective for prefill-heavy traffic patterns (ISL ≫ OSL) and for large models (>10B parameters), yielding up to a 2× area improvement in the throughput–interactivity Pareto space. Decode-heavy patterns show minimal (or negative) gains.
- Model and Hardware Sensitivity: Larger and more complex models, such as Llama-405B and DeepSeek-R1, exhibit greater benefit as their resource requirements and parallelizability differ markedly across the prefill and decode phases.
- Dynamic Allocation and Adaptation: Adaptive, dynamic rate matching and queue sizing are essential for retaining Pareto optimality, as statically chosen ctx:gen ratios can lose up to 20% of throughput in non-target regimes.
- Fine-Grained Partitioning: Kernel-level assignment (Tessera) can deliver 2.3× throughput and 1.6× cost efficiency gains over phase/block-level disaggregation, frequently outperforming homogeneous high-end clusters at lower cost (Hu et al., 11 Apr 2026).
- Energy Optimization: Joint phase-aware placement and DVFS (BiScale) yield up to 39% prefill and 48% decode energy savings compared to static or monolithic scheduling, while maintaining strict latency SLOs (Basit et al., 21 Feb 2026).
- RL Post-Training: Avoiding idle bubbles by co-executing disaggregated phases via group-based round-robin achieves 1.84× cost efficiency gain and cuts idle time on specialized GPU clusters by over 24% and 43% for rollout and training clusters, respectively (Wu et al., 12 Dec 2025).
A table of qualitative regime guidance:
| Scenario | Phase Disaggregation Recommended? | Partitioning Notes |
|---|---|---|
| Prefill-heavy traffic, large ISL (>10K) | Yes | Use chunked PP for prefill |
| Large models (>10B params) | Yes | Dynamic rate matching needed |
| Decode-heavy traffic, small models | No | Prefer co-located |
| Limited GPUs, poor NVLink/PCIe fabric | No | Co-located is optimal |
6. Methodological Insights and Best Practices
Methodological advances underlying modern phase disaggregation and workload partitioning include:
- Pareto Frontier Construction: Exhaustively simulating hundreds of thousands of design points (model, partitioning, batch size, allocation) and filtering out those that violate the SLA to construct empirical Pareto frontiers in throughput–interactivity or completion time–variance space (Mitra et al., 5 Jun 2025, Huberman et al., 2015).
- Risk-Aware Partitioning: Selecting workload splits not just for expected makespan but also for tail or variance reduction using the efficient frontier. Empirically, optimal splits can reduce both mean and variance below that of any single-channel allocation (Huberman et al., 2015, Chua et al., 2015).
- Real-Time Adaptation: Bayesian or online learning-based partitioning and scheduling enables the system to track workload drift, heterogeneity, and burstiness, achieving robust and high-efficiency operation even under workload uncertainty (Sidik et al., 29 Jan 2026, Basit et al., 21 Feb 2026).
- Data-Driven Placement: Kernel profiling at offline and per-batch timescales, together with PTX-level dependency extraction, ensures that fine-grained assignment aligns hardware strengths with actual workload bursts and operational intensities (Hu et al., 11 Apr 2026).
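The Pareto-frontier construction described above can be sketched as a filter-then-sweep routine: discard design points that violate the SLA, then keep those not dominated in throughput and interactivity. The `DesignPoint` fields, the TTFT-based SLA check, and the sample values are hypothetical and stand in for the output of a simulator.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    name: str
    throughput: float     # e.g. tokens/s per GPU, higher is better
    interactivity: float  # e.g. tokens/s per user, higher is better
    ttft_ms: float        # time to first token


def pareto_frontier(points, ttft_slo_ms):
    """Drop SLA violators, then keep points not dominated on both axes."""
    feasible = [p for p in points if p.ttft_ms <= ttft_slo_ms]
    # Sort by throughput (descending); ties broken by interactivity (descending).
    feasible.sort(key=lambda p: (-p.throughput, -p.interactivity))
    frontier, best_interactivity = [], float("-inf")
    for p in feasible:
        # A point survives only if it beats every higher-throughput point
        # on interactivity.
        if p.interactivity > best_interactivity:
            frontier.append(p)
            best_interactivity = p.interactivity
    return frontier


# Hypothetical simulated design points.
candidates = [
    DesignPoint("A", 1000.0, 10.0, 300.0),
    DesignPoint("B", 800.0, 20.0, 400.0),
    DesignPoint("C", 700.0, 15.0, 350.0),   # dominated by B
    DesignPoint("D", 1200.0, 5.0, 900.0),   # violates the SLA
    DesignPoint("E", 500.0, 40.0, 450.0),
]
frontier = pareto_frontier(candidates, ttft_slo_ms=500.0)
print([p.name for p in frontier])
```

The sort-and-sweep form runs in O(n log n), which matters when the candidate set contains hundreds of thousands of simulated points.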
7. Application Domains and Broader Significance
These techniques generalize well beyond deep learning and LLM inference. Classical applications include parallel optimization (e.g., distributed logistic regression), multipath file transmission, distributed database operations, and networking (Huberman et al., 2015). In reinforcement learning, phase-level disaggregation combined with structure-aware scheduling enables near-optimal cluster utilization during synchronous post-training (Wu et al., 12 Dec 2025).
The disaggregation paradigm—treating each phase as a microservice, optimizing its assignment, and scaling its resource pool independently—has become the de facto design strategy for large, latency-sensitive, or resource-diverse distributed workloads. It unlocks improved performance, reduced energy, and finer-grained failure isolation. However, it introduces its own scheduling, synchronization, and data-management challenges, requiring principled, data-driven, and adaptive algorithms to realize the potential benefits at scale.