Disaggregated Inference in LLMs
- Disaggregated inference is a paradigm that separates LLM processing into distinct prefill and decode phases to optimize resource allocation and performance.
- It employs phase-specific hardware, dynamic scheduling, and elastic scaling to enhance throughput and reduce latency in inference operations.
- The approach mitigates compute-memory contention and enables efficient deployments across heterogeneous, cost-sensitive environments.
Disaggregated inference refers to a paradigm in machine learning systems—especially LLMs—where distinct phases of inference (typically prefill and decode) are isolated onto separate hardware, processes, or architectural components. This separation is driven by the diverging resource requirements and performance bottlenecks of each phase when scaling inference across modern data center and edge deployments. Disaggregated inference is now a foundational design in efficient LLM serving, supporting heterogeneous clusters, cost-efficiency, specialized accelerators, and sophisticated software scheduling techniques.
1. Motivation and Foundational Principles
The motivation for disaggregated inference begins with the observation that LLM inference is inherently multi-phase. In particular, transformer-based autoregressive models have:
- Prefill phase: Processes the user prompt, computing attention over the entire input in parallel, populating the KV cache, and generating the first output token. This stage is heavily compute-bound and benefits from high peak arithmetic throughput but has moderate memory requirements.
- Decode phase: Generates one token at a time, reading the full KV cache and appending a new entry at each step. This process is serial, memory-bound, and highly sensitive to bandwidth and latency, but typically less demanding in FLOP/s than prefill. A toy sketch of this two-phase structure follows the list.
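To make the contrast concrete, the toy sketch below (plain NumPy, a single attention head, invented dimensions; not any production engine) shows prefill building the KV cache for the whole prompt in one batched pass, while decode extends it one token per step:

```python
# Toy illustration of the prefill/decode split (hypothetical model, not any
# specific system): prefill builds the KV cache for the whole prompt in one
# compute-bound pass; decode appends one entry per step and is memory-bound.
import numpy as np

D = 16  # hidden/head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def prefill(prompt_embs):
    """Process the full prompt in parallel; return (last_hidden, kv_cache)."""
    Q, K, V = prompt_embs @ Wq, prompt_embs @ Wk, prompt_embs @ Wv
    scores = Q @ K.T / np.sqrt(D)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)  # causal mask
    scores[mask] = -np.inf
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = probs @ V
    return out[-1], (K, V)  # last position feeds the first decode step

def decode_step(x, kv_cache):
    """Generate one token's hidden state, reading and extending the KV cache."""
    K, V = kv_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K, V = np.vstack([K, k]), np.vstack([V, v])   # append to cache
    scores = q @ K.T / np.sqrt(D)                 # attend over all cached keys
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V, (K, V)

prompt = rng.standard_normal((8, D))    # 8-token toy prompt
x, cache = prefill(prompt)              # compute-bound: whole prompt at once
for _ in range(4):                      # memory-bound: one token per step
    x, cache = decode_step(x, cache)
print("cache length after 4 decode steps:", cache[0].shape[0])  # 8 + 4 = 12
```

In a disaggregated deployment, `prefill` and `decode_step` would run on separate worker pools, with the returned cache transferred between them.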
Traditional monolithic deployments, where both phases run on the same hardware, suffer inefficiencies:
- Computation/memory resource contention between phases
- Hardware underutilization, as modern GPUs/TPUs cannot be simultaneously optimal for compute- and memory-bound workloads
- Poor scaling and elevated cost as model size and workload scale increase
Disaggregation—separating prefill and decode across different hardware pools or processes—addresses these by enabling phase-specialized hardware utilization, elastic scaling, better queueing, and avoidance of cross-phase interference (Zhang et al., 9 Oct 2025, Jiang et al., 11 Feb 2025).
2. System Architectures and Techniques
2.1. Phase Separation Models
- P/D Disaggregation (Prefill/Decode): The canonical approach is to deploy independent worker pools for prefill and decode, with explicit KV cache (or hidden state) transfer between them; a minimal handoff sketch follows this list.
- Partial Disaggregation: Cronus introduces partially disaggregated prefill for heterogeneous clusters: splitting the prefill phase at a layer boundary, assigning early (light) layers to weaker GPUs, late (heavy) layers to stronger GPUs, then passing intermediate activations between nodes (Liu et al., 22 Sep 2025).
- Temporal Disaggregation: TD-Pipe temporally separates prefill and decode across the entire pipeline, using intelligent switching so that bulk prefill operations are grouped, then followed by decode, removing pipeline "bubbles" and maximizing hardware utilization on commodity (PCIe) clusters (Zhang et al., 12 Jun 2025).
- Component Disaggregation in MoE: In mixture-of-experts architectures, disaggregation may occur at the module level, e.g., separating attention from FFN (expert) blocks, each assigned to tailored hardware and with custom parallelism (Zhu et al., 3 Apr 2025).
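A minimal process-level sketch of the canonical P/D handoff is given below. The worker and message names (`Request`, `KVHandoff`, `prefill_worker`, `decode_worker`) are illustrative assumptions, and Python queues stand in for the RDMA/network transfer that real systems use:

```python
# Minimal sketch (assumed names, not a real framework API) of the canonical
# prefill/decode split: a prefill worker builds the KV cache and hands it to a
# separate decode worker over an explicit transfer step.
import queue, threading
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    prompt: str
    max_new_tokens: int = 8

@dataclass
class KVHandoff:
    req_id: int
    kv_cache: list = field(default_factory=list)  # stand-in for per-layer KV tensors
    first_token: str = ""

def prefill_worker(in_q: queue.Queue, handoff_q: queue.Queue):
    """Compute-bound stage: process the full prompt, emit KV cache + first token."""
    while (req := in_q.get()) is not None:
        kv = [f"kv[{req.req_id}][layer{l}]" for l in range(4)]  # placeholder "tensors"
        handoff_q.put(KVHandoff(req.req_id, kv, first_token="<t0>"))

def decode_worker(handoff_q: queue.Queue, out_q: queue.Queue):
    """Memory-bound stage: receive the transferred cache, generate tokens serially."""
    while (h := handoff_q.get()) is not None:
        tokens = [h.first_token] + [f"<t{i}>" for i in range(1, 4)]
        out_q.put((h.req_id, tokens))

in_q, handoff_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
threads = [threading.Thread(target=prefill_worker, args=(in_q, handoff_q)),
           threading.Thread(target=decode_worker, args=(handoff_q, out_q))]
for t in threads: t.start()
in_q.put(Request(0, "hello")); in_q.put(None)
print(out_q.get())          # (0, ['<t0>', '<t1>', '<t2>', '<t3>'])
handoff_q.put(None)
for t in threads: t.join()
```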
2.2. Hardware Specialization
SPAD demonstrates that separate, phase-specialized hardware can yield significant cost and energy efficiency (Zhang et al., 9 Oct 2025); a back-of-the-envelope arithmetic-intensity estimate after the list below shows why the two phase profiles diverge:
- Prefill Chips: Large systolic arrays, GDDR7 memory (cheaper than HBM), minimized non-tensor logic, focused on maximizing throughput for short, compute-intensive bursts.
- Decode Chips: Minimized compute (fewer/lower-width matrix engines), maximized HBM3 bandwidth/capacity, ideal for high-concurrency, memory-bound sequential decode.
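As a rough sanity check on why the two designs diverge, the following estimate (illustrative numbers only: a 7B-parameter dense FP16 model, KV-cache traffic ignored) compares the arithmetic intensity of batched prefill and single-token decode:

```python
# Back-of-the-envelope roofline check (illustrative numbers only) for why prefill
# favors compute-heavy chips and decode favors bandwidth-heavy chips.
# Assumed model: 7B-parameter dense transformer, 2 bytes/param (FP16 weights).
params = 7e9
bytes_per_param = 2
flops_per_token = 2 * params           # ~2 FLOPs per parameter per token (matmuls)

# Prefill: a 2048-token prompt processed in one batch reuses each weight 2048 times.
prompt_len = 2048
prefill_flops = flops_per_token * prompt_len
prefill_bytes = params * bytes_per_param          # weights streamed once (ignoring KV)
print("prefill arithmetic intensity ~%.0f FLOPs/byte" % (prefill_flops / prefill_bytes))

# Decode: one token per forward pass re-reads all weights for ~2 FLOPs/param.
decode_flops = flops_per_token
decode_bytes = params * bytes_per_param
print("decode arithmetic intensity  ~%.0f FLOPs/byte" % (decode_flops / decode_bytes))
# Typical GPU ridge points sit at hundreds of FLOPs/byte, so decode at ~1 FLOP/byte
# is firmly bandwidth-bound while batched prefill is compute-bound.
```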
2.3. Heterogeneity and Transmission Modules
As LLM clusters trend toward heterogeneity (multi-vendor GPUs, mixed generations), disaggregated systems must:
- Handle cross-architecture data transfer (precision differences, VRAM block management)
- Use RDMA or custom communication libraries for high-throughput, low-latency KV cache or activation transfer (Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025).
- Provide parallel-strategy reconciliation (e.g., differing tensor parallelism between prefill and decode pools). A minimal layout-repacking sketch follows this list.
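The hypothetical helper below (not any system's actual transfer module) illustrates one piece of this reconciliation: casting KV precision and re-chunking a paged cache into the receiving pool's block size before transmission:

```python
# Hypothetical sketch (assumed helper, not a specific system's API) of reconciling
# KV-cache layout between heterogeneous prefill and decode pools.
import numpy as np

def repack_kv_blocks(kv_blocks, dst_dtype, dst_block_tokens):
    """Concatenate sender blocks, cast to the receiver's dtype, re-chunk to its block size."""
    flat = np.concatenate(kv_blocks, axis=0).astype(dst_dtype)
    return [flat[i:i + dst_block_tokens]          # receiver-aligned blocks
            for i in range(0, flat.shape[0], dst_block_tokens)]

# Sender pool uses 16-token float16 blocks; receiver pool expects 32-token float32
# blocks (bf16 would be more realistic but is not a base NumPy dtype).
sender_blocks = [np.zeros((16, 8, 128), dtype=np.float16) for _ in range(4)]  # (tokens, heads, head_dim)
recv_blocks = repack_kv_blocks(sender_blocks, np.float32, dst_block_tokens=32)
print(len(recv_blocks), recv_blocks[0].shape, recv_blocks[0].dtype)  # 2 (32, 8, 128) float32
```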
3. Scheduling, Load Balancing, and Elastic Scaling
Effective disaggregated inference requires advanced scheduling and resource allocation mechanisms due to nonstationary traffic and fluctuating request characteristics:
- Dynamic Layer Partitioning: Cronus dynamically determines the prefill-layer split point per request, continuously balancing queue depth, device speed, and expected tail latency (Liu et al., 22 Sep 2025).
- Adaptive Instance Sizing: Arrow replaces static prefill/decode ratios with elastic, stateless instance pools and SLO-aware real-time scheduling, using direct TTFT/TPOT measurements for feedback (Wu et al., 17 May 2025); a simplified SLO-feedback sketch follows this list.
- Workload-Aware Autoscaling: HeteroScale coordinates P/D pool scaling across heterogeneous clusters using decode TPS as the robust metric, maintaining architectural balance and maximizing utilization under fluctuating load (Li et al., 27 Aug 2025).
- Temporal and Batch Scheduling: TD-Pipe applies phase-aware, AI-driven switching and greedy batching to pack prefill and decode work efficiently given memory and latency constraints (Zhang et al., 12 Jun 2025).
- Dynamic Resource Scheduling: HexGen-2 and similar systems cast the cross-cluster disaggregation and scheduling problem as NP-hard optimization—using graph partitioning and max-flow—to match resource, network, and parallelism constraints (Jiang et al., 11 Feb 2025).
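The control loop below is a deliberately simplified sketch of SLO-feedback scheduling in the spirit of Arrow's elastic pools; the thresholds, names, and one-instance-per-interval policy are assumptions, not any paper's actual algorithm:

```python
# Simplified sketch (assumed thresholds and names) of SLO-feedback instance
# balancing: shift stateless instances between the prefill and decode pools based
# on which SLO (TTFT for prefill, TPOT for decode) is under pressure.
from dataclasses import dataclass

@dataclass
class SLOTargets:
    ttft_ms: float = 500.0    # first-token latency target
    tpot_ms: float = 50.0     # per-output-token latency target

def rebalance(n_prefill: int, n_decode: int,
              measured_ttft_ms: float, measured_tpot_ms: float,
              slo: SLOTargets = SLOTargets()) -> tuple[int, int]:
    """Return new (prefill, decode) instance counts; move at most one instance per interval."""
    ttft_pressure = measured_ttft_ms / slo.ttft_ms
    tpot_pressure = measured_tpot_ms / slo.tpot_ms
    if ttft_pressure > 1.0 and ttft_pressure > tpot_pressure and n_decode > 1:
        return n_prefill + 1, n_decode - 1        # prefill queue is the bottleneck
    if tpot_pressure > 1.0 and tpot_pressure > ttft_pressure and n_prefill > 1:
        return n_prefill - 1, n_decode + 1        # decode bandwidth is the bottleneck
    return n_prefill, n_decode                    # both SLOs met: keep current split

print(rebalance(4, 4, measured_ttft_ms=900.0, measured_tpot_ms=30.0))  # (5, 3)
print(rebalance(4, 4, measured_ttft_ms=300.0, measured_tpot_ms=80.0))  # (3, 5)
```

Production schedulers add hysteresis, request routing, and cluster-level signals such as decode TPS (as in HeteroScale), but the basic loop is similar: measure phase-specific SLO pressure, then shift capacity toward the constrained pool.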
4. KV Cache Transfer and Communication Optimizations
A critical system bottleneck is efficient communication of KV cache and activations between disaggregated phases:
- Tensor-Centric Transmission: KVDirect introduces metadata-driven, RDMA-based, pull-mode cache transfer, minimizing CPU/GPU synchronization and enabling dynamic node allocation; this achieves 22+ GB/s effective bandwidth, reducing per-request latency by up to 55% versus baselines (Chen et al., 13 Dec 2024).
- Cache Alignment and Aggregation: FlowKV reshapes fragmented memory and aligns sender/receiver blocks to enable segment-level (O(1)) kernel launches rather than O(n) fine-grained ones, achieving up to 96% reduction in transfer times (Li et al., 3 Apr 2025); a simplified coalescing sketch follows this list.
- Homomorphic KV Computation: HACK proposes performing attention matrix multiplications directly on quantized (2-bit) data, removing all decode dequantization burdens and reducing JCT by up to 70.9% over baseline quantization (Zhang et al., 5 Feb 2025).
- Distributed/Elastic Memory and Caching: MemServe provides MemPool, a distributed, elastic KV cache enabling context caching across requests and disaggregated nodes, maximizing transfer efficiency via huge-page aggregation and cluster-level prompt tree scheduling (Hu et al., 25 Jun 2024).
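As a simplified illustration of the block-alignment idea (not FlowKV's actual implementation), the helper below coalesces a request's scattered physical cache blocks into contiguous spans so the transfer layer can issue a few bulk copies instead of one launch per block:

```python
# Minimal sketch (illustrative only) of coalescing a fragmented KV-cache block
# list into contiguous spans, so transfer needs a handful of large copies
# instead of one launch per block.
def coalesce_blocks(block_ids):
    """Group sorted physical block IDs into (start, length) contiguous spans."""
    spans, start, prev = [], None, None
    for b in sorted(block_ids):
        if start is None:
            start = prev = b
        elif b == prev + 1:
            prev = b
        else:
            spans.append((start, prev - start + 1))
            start = prev = b
    if start is not None:
        spans.append((start, prev - start + 1))
    return spans

# A fragmented cache for one request: 9 scattered blocks collapse into 3 spans,
# i.e. 3 bulk copies/RDMA writes instead of 9 per-block ones.
fragmented = [7, 8, 9, 10, 42, 43, 44, 100, 101]
print(coalesce_blocks(fragmented))   # [(7, 4), (42, 3), (100, 2)]
```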
5. Empirical Performance, Cost, and Practical Impact
Quantitative experiments in recent works highlight the real-world advantages and trade-offs:
- Latency: Cronus cuts P99 TTFT in half relative to data-/pipeline-parallel (DP/PP) baselines and achieves up to 1.36× throughput improvement vs. disaggregated prefill (Liu et al., 22 Sep 2025). FlowKV and KVDirect reduce KV transfer to negligible latency overhead (0.045 s instead of >1 s) (Li et al., 3 Apr 2025, Chen et al., 13 Dec 2024). HACK reduces JCT by 41–71% vs. the best prior quantization approach (Zhang et al., 5 Feb 2025).
- Throughput: Arrow achieves up to 7.78× higher request rate over static disaggregated baselines on production traces (Wu et al., 17 May 2025). MegaScale-Infer provides 1.9× throughput improvement per GPU for large-scale MoE (Zhu et al., 3 Apr 2025).
- Cost/Power: SPAD reduces end-to-end hardware cost by 19–41% and TDP by up to 17% at equivalent performance compared to current datacenter GPU clusters (Zhang et al., 9 Oct 2025). HexGen-2 attains 2× throughput and 1.5× lower latency at the same or 30% lower price budget relative to state-of-the-art (Jiang et al., 11 Feb 2025).
- Robustness and Adaptivity: HeteroScale increases average utilization across multi-thousand-GPU deployments by 26.6 percentage points, with no SLO breaches under extreme traffic (Li et al., 27 Aug 2025).
- Heterogeneous Support: Multi-vendor, cross-generation GPU support is demonstrated with cross-precision and tensor block alignment modules (Chen et al., 22 Sep 2025), and autoscaling is robust to continually shifting GPU/NIC pools (Li et al., 27 Aug 2025).
6. Challenges, Trade-offs, and Effective Regimes
6.1. Complexity and Coordination
- The search space for optimal disaggregated allocation is combinatorially large: partitioning, hardware mapping, batch sizing, parallelism selection, and bandwidth matching must be optimized jointly (Mitra et al., 5 Jun 2025); the toy configuration search after this list illustrates the coupling.
- System-level coordination is nontrivial: the KV cache must be transferred efficiently, and SLOs must be met for both prefill (first-token latency, TTFT) and decode (per-output-token latency, TPOT), with independent queueing and resource allocation.
- Disaggregation may introduce new failure modes or overhead when traffic is decode-heavy, when KV cache transfer bandwidth is insufficient, or in batch/generation regimes where phase demands are not well separated (Mitra et al., 5 Jun 2025).
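Even a heavily simplified version of the allocation problem shows how the knobs couple. The toy search below (invented latency model and constants, far short of the graph-partitioning/max-flow formulations used in practice) enumerates the prefill/decode GPU splits that satisfy both SLOs at a given load:

```python
# Toy configuration search over prefill/decode GPU splits (illustrative latency
# model and constants only) showing how SLOs couple the two pool sizes.
TOTAL_GPUS = 8

def ttft_ms(n_prefill, load_rps):   # toy model: prefill latency shrinks with more prefill GPUs
    return 800.0 * load_rps / max(n_prefill, 1)

def tpot_ms(n_decode, load_rps):    # toy model: decode latency shrinks with more decode GPUs
    return 60.0 * load_rps / max(n_decode, 1)

def feasible_splits(load_rps, ttft_slo=500.0, tpot_slo=60.0):
    """Return all (prefill, decode) splits of TOTAL_GPUS meeting both latency SLOs."""
    return [(p, TOTAL_GPUS - p) for p in range(1, TOTAL_GPUS)
            if ttft_ms(p, load_rps) <= ttft_slo
            and tpot_ms(TOTAL_GPUS - p, load_rps) <= tpot_slo]

print(feasible_splits(load_rps=2.0))   # [(4, 4), (5, 3), (6, 2)] with these toy constants
```

Adding hardware heterogeneity, parallelism degrees, batch sizes, and interconnect bandwidth to this picture is what turns the real problem into the NP-hard optimization that systems such as HexGen-2 attack with graph partitioning and max-flow.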
6.2. Actionable Deployment Insights
| Scenario | Disaggregation Value |
|---|---|
| Large LLMs (≥10B params), Prefill-heavy | Strong throughput/interactivity gain (up to 2×); recommended (Mitra et al., 5 Jun 2025) |
| Heterogeneous GPU clusters | Resource and cost efficiency; enables multi-vendor deployments (Chen et al., 22 Sep 2025) |
| Dynamic or bursty traffic | Requires elastic scaling and adaptive scheduling (Wu et al., 17 May 2025, Li et al., 27 Aug 2025) |
| Context-caching/multi-turn conversation | Demands unified memory pool and prompt-aware routing (Hu et al., 25 Jun 2024) |
| Small models or decode-dominant traffic | Limited gains; colocated serving or piggybacked scheduling may be preferable |
The optimal phase/hardware allocation must track workload dynamics; static ratios and naive scaling schemes quickly drift off the cost-performance Pareto frontier.
7. Broader Implications and Future Directions
Disaggregated inference is reshaping how LLM and deep learning inference systems are architected and deployed at scale. Implications include:
- Hardware/Software Co-Design: Workload-aware, phase-specific accelerator design (SPAD), matched with advanced system software, can replace general-purpose, overprovisioned GPUs/TPUs.
- Heterogeneity as First Principle: LLM serving ecosystems are moving toward native support for cross-vendor, cross-generation hardware, driven by cost and supply chain realities (Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025).
- Unified State Management: Distributed, elastic memory/KV caching and context management platforms (e.g., MemServe) are essential for stateful, sessionful, and multi-user serving.
- Statistical Optimization: Both scheduling and system tuning increasingly leverage optimization, empirical modeling, and predictive control, integrating reinforcement learning, integer programming, and graph/network algorithms.
Disaggregated inference is now a defining architectural and algorithmic challenge in large-scale deep learning deployment, and remains an active area of systems, algorithms, and hardware research (Liu et al., 22 Sep 2025, Mitra et al., 5 Jun 2025, Zhang et al., 9 Oct 2025, Li et al., 27 Aug 2025, Jiang et al., 11 Feb 2025, Zhu et al., 3 Apr 2025, Chen et al., 13 Dec 2024, Hu et al., 25 Jun 2024).