Disaggregated Serving Systems

Updated 3 April 2026
  • Disaggregated serving systems are architectures that logically separate compute, memory, storage, and network resources to improve SLO attainment and resource utilization.
  • They employ pooling, phase-oriented separation, and dynamic scheduling to mitigate resource skew and interference in high-throughput workloads.
  • Empirical results show significant improvements, including up to 9.8× latency reduction and enhanced cost-efficiency in LLM inference and recommendation systems.

A disaggregated serving system is an architectural paradigm in which heterogeneous compute, memory, and other resources are logically and often physically separated, then orchestrated together using highly optimized scheduling and data movement mechanisms. The principal motivation is to address resource skew, phase interference, and inflexibility inherent in monolithic and tightly coupled systems—allowing independent scaling, improved resource utilization, and strict adherence to service-level objectives (SLOs) for complex, high-throughput workloads such as LLM inference, recommendation serving, and multimodal pipelines. Modern disaggregated serving systems underpin production-scale AI, cloud, and telecom infrastructures, employing dynamic orchestration, rack-scale shared memory, profile-based simulation, and fine-grained migration mechanisms.

1. Foundational Concepts and Motivation

Disaggregated serving systems decompose monolithic server designs into distinct resource pools—compute nodes (CPUs, GPUs, NPUs), memory nodes (DRAM, CXL-attached memory), storage nodes, and network components—connected by high-speed interconnects or fabrics. Inference pipelines are further decomposed along phase or operator boundaries (e.g., prefill, decoding, and expert stages for LLMs). This decoupling allows each resource type to be managed, scheduled, and scaled independently to match workload demands and hardware heterogeneity (Ke et al., 2022, Zhong et al., 2024, Cho et al., 26 Feb 2026, Yoon et al., 20 Dec 2025).

In the context of transformer-based LLM serving, disaggregated deployment most commonly refers to the separation of the prefill phase—which is highly compute-bound, as it requires a full forward pass over a long prompt—and the decode phase—which is memory-bandwidth bound, as it involves autoregressive generation using a progressively growing key/value (KV) cache. Disaggregated serving eliminates the negative interference between these phases previously observed in colocated systems, where prefill jobs would monopolize compute, starving latency-sensitive decoding and causing tail SLO violations (Zhong et al., 2024).
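
To make the asymmetry concrete, here is a back-of-the-envelope roofline comparison (all model and hardware numbers below are illustrative assumptions, not figures from the cited papers):

```python
# Back-of-the-envelope roofline: why prefill is compute-bound and decode
# is memory-bandwidth-bound. All hardware and model numbers are assumed
# for illustration, not taken from the cited papers.

params = 13e9          # assumed model size: 13B parameters
bytes_per_param = 2    # fp16 weights
prompt_len = 4096      # tokens processed in one prefill pass

peak_flops = 1e15      # assumed accelerator peak: 1 PFLOP/s (fp16)
peak_bw = 2e12         # assumed HBM bandwidth: 2 TB/s

# Prefill: ~2 FLOPs per parameter per token; one weight read is amortized
# over the whole prompt, so arithmetic intensity grows with prompt length.
prefill_flops = 2 * params * prompt_len
prefill_time = prefill_flops / peak_flops            # compute-limited

# Decode: each generated token re-reads all weights (plus the growing KV
# cache) but performs only ~2*params FLOPs -> intensity near 1 FLOP/byte.
decode_bytes = params * bytes_per_param
decode_time_per_token = decode_bytes / peak_bw       # bandwidth-limited

print(f"prefill pass:     {prefill_time * 1e3:6.1f} ms (compute-bound)")
print(f"decode per token: {decode_time_per_token * 1e3:6.1f} ms (bandwidth-bound)")
```

Under these assumptions a single prefill pass saturates compute for ~100 ms, while each decode step is limited by weight and KV-cache reads, which is exactly the interference profile that motivates separating the two phases.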

Disaggregation extends beyond LLMs. In large-scale recommendation systems, separation of compute engines and high-capacity memory nodes for embedding table sharding produces significant Total Cost of Ownership (TCO) reductions, resource pooling flexibility, and robust failure isolation (Ke et al., 2022). In modern datacenters and edge/fog deployments, resource boards for computation, memory, storage, and network are individually manageable, with composable allocation and fine-grained maintenance (Ekane et al., 2021, Ajibola et al., 2019).

2. Key Architectural Patterns

Across domains, several recurring patterns characterize disaggregated serving systems:

a. Pooling and Composable Resource Fabrics:

Resources of each type (compute, memory, storage) are provisioned in pools and aggregated through software-defined control planes or hardware fabrics (e.g., PCIe, InfiniBand, CXL). Work is mapped to components based on available capacity, real-time load, and data dependencies. Pools may be physically dispersed but present a unified logical namespace (Yoon et al., 20 Dec 2025, Ke et al., 2022).
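
A minimal sketch of pooled allocation behind a unified logical namespace; the Node/ResourcePool types and the best-fit policy are expository assumptions, not an API from the cited systems:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kind: str                 # "compute", "memory", or "storage"
    capacity: float           # abstract capacity units
    used: float = 0.0

    @property
    def free(self) -> float:
        return self.capacity - self.used

class ResourcePool:
    """One logical namespace over physically dispersed nodes of one kind."""
    def __init__(self, kind: str):
        self.kind = kind
        self.nodes: list[Node] = []

    def register(self, node: Node) -> None:
        assert node.kind == self.kind
        self.nodes.append(node)

    def allocate(self, demand: float) -> Node:
        # Best-fit: pick the node whose free capacity most tightly fits
        # the demand, preserving large headroom elsewhere in the pool.
        candidates = [n for n in self.nodes if n.free >= demand]
        if not candidates:
            raise RuntimeError(f"{self.kind} pool exhausted")
        best = min(candidates, key=lambda n: n.free)
        best.used += demand
        return best

# Usage: compute and memory scale independently of each other.
gpus = ResourcePool("compute")
mem = ResourcePool("memory")
for i in range(4):
    gpus.register(Node(f"gpu{i}", "compute", capacity=100.0))
mem.register(Node("cxl0", "memory", capacity=4096.0))  # one pooled CXL node
print(gpus.allocate(30.0).name, mem.allocate(512.0).name)
```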

b. Phase-Oriented Separation (e.g., Prefill–Decode):

LLM serving disaggregates the prefill and decode phases, either at the server level (distinct GPU workers), inside a single device via streaming-multiprocessor (SM) or compute-unit (CU) masking (e.g., semi-PD (Hong et al., 28 Apr 2025), RAPID-Serve (Masood et al., 16 Jan 2026)), or temporally (EcoServe's epochs (Du et al., 25 Apr 2025)). Decode workers operate on precomputed KV caches, achieving high memory-bandwidth efficiency and batch utilization without compute-heavy interference from prefill jobs.
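
A minimal sketch of the server-level pattern, with an in-process queue standing in for the KV transfer fabric (the worker structure and handoff format are illustrative assumptions):

```python
import queue
import threading

# Sketch of server-level prefill/decode disaggregation: a prefill worker
# produces KV caches, a decode worker consumes them. Real systems replace
# the queue with RDMA/CXL transfers and run continuous batching.

kv_handoff: "queue.Queue[tuple[str, list]]" = queue.Queue()

def prefill_worker(prompts):
    for req_id, prompt in prompts:
        kv_cache = [f"kv({tok})" for tok in prompt.split()]  # stand-in for real KV
        kv_handoff.put((req_id, kv_cache))                   # the "KV transfer"
    kv_handoff.put(None)                                     # sentinel: done

def decode_worker():
    while (item := kv_handoff.get()) is not None:
        req_id, kv_cache = item
        # The autoregressive loop would extend kv_cache token by token here;
        # it never competes with prefill for the same device's compute.
        print(f"{req_id}: decoding with {len(kv_cache)} cached tokens")

p = threading.Thread(target=prefill_worker,
                     args=([("r1", "a b c"), ("r2", "hello world")],))
d = threading.Thread(target=decode_worker)
p.start(); d.start(); p.join(); d.join()
```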

c. Layer/Operator Disaggregation:

Full “Transformerless” architectures (e.g., xDeepServe (Xiao et al., 4 Aug 2025)) go further by splitting attention, feedforward, and MoE operator execution into modular, schedulable subgraphs across distributed NPUs, with dedicated communication libraries (XCCL) and inter-module load-balancing and fault isolation.

d. Disaggregated Caching and Data Management:

KV caches and embedding tables can reside in rack- or cluster-wide shared memory, host DRAM/SSD, or pooled memory nodes, with direct GPU or CPU access mediated via custom memory management, two-tier synchronization, and prefix-aware indexing (e.g., TraCT (Yoon et al., 20 Dec 2025), Mooncake (Qin et al., 2024), BanaServe (He et al., 15 Oct 2025)).
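
A sketch of the prefix-aware indexing idea: KV blocks keyed by a hash over the full prompt prefix, so requests sharing a prefix locate reusable blocks in a shared store. The block size and hash scheme are assumptions, not the exact layout of TraCT or Mooncake:

```python
import hashlib

BLOCK = 16  # tokens per KV block (assumed)

def prefix_block_keys(token_ids: list[int]) -> list[str]:
    """One key per full KV block; each key hashes the entire prefix up to
    that block, so a key match implies the whole prefix matches."""
    keys, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for start in range(0, full, BLOCK):
        for tok in token_ids[start:start + BLOCK]:
            h.update(tok.to_bytes(4, "little"))
        keys.append(h.copy().hexdigest()[:16])
    return keys

shared_store: dict[str, bytes] = {}  # stands in for pooled/CXL memory

def lookup_reusable_prefix(token_ids: list[int]) -> int:
    """Return how many tokens of KV can be reused from the shared store."""
    reused = 0
    for i, key in enumerate(prefix_block_keys(token_ids)):
        if key not in shared_store:
            break
        reused = (i + 1) * BLOCK
    return reused
```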

e. Dynamic Orchestration and Scheduling:

Router and scheduler logic must coordinate resource allocation dynamically, often via load-aware, SLO-driven, or prefix-contention-agnostic policies. Techniques include global KV store with overlapped transmission, per-task micro-request splitting, proactive phase rolling, bottleneck-based batch composition, and online/offline planning (He et al., 15 Oct 2025, Ruan et al., 12 Apr 2025, Wu et al., 26 Nov 2025, Du et al., 25 Apr 2025).

3. Operational Mechanisms: Synchronization, Scheduling, and Data Movement

Disaggregated serving requires explicit orchestration of inter-phase transitions, data movement, and fine-grained resource balance:

a. Synchronization and Consistency:

Modification of global or distributed caches introduces metadata races and visibility issues. Systems such as TraCT address non-coherent CXL shared memory by introducing two-tier (local/global) lock hierarchies: mutexes on each node, and a lock manager thread that arbitrates access to shared lock slots in memory. All metadata updates are followed by explicit cacheline flushes to guarantee coherence semantics for other nodes, while large KV transfers exploit DMA engine semantics for direct, cache-bypassed data movement (Yoon et al., 20 Dec 2025).
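
The two-tier idea can be sketched schematically; Python locks stand in for what TraCT implements with shared CXL lock slots, so the explicit cacheline flushes appear only as comments:

```python
import threading

# Two-tier lock sketch: tier 1 serializes threads within a node; tier 2 is
# a per-object "lock slot" in shared memory, arbitrated by a lock-manager
# thread. In a real non-coherent CXL system, acquiring/releasing the slot
# and every metadata update would be followed by explicit cacheline
# flushes (e.g., clflush/clwb) so peer nodes observe the change.

class LockManager:
    """Stands in for the per-node thread that arbitrates shared lock slots."""
    def __init__(self, n_slots: int):
        self.slots = [threading.Lock() for _ in range(n_slots)]

    def acquire_slot(self, i: int):
        self.slots[i].acquire()   # real version: CAS on CXL memory + flush

    def release_slot(self, i: int):
        self.slots[i].release()   # real version: store + cacheline flush

class TwoTierLock:
    def __init__(self, slot_id: int, lock_manager: LockManager):
        self.local = threading.Lock()   # tier 1: intra-node mutex
        self.slot_id = slot_id          # tier 2: shared-memory lock slot
        self.mgr = lock_manager

    def __enter__(self):
        self.local.acquire()                 # win locally first
        self.mgr.acquire_slot(self.slot_id)  # then arbitrate across nodes
        return self

    def __exit__(self, *exc):
        self.mgr.release_slot(self.slot_id)
        self.local.release()

mgr = LockManager(64)
with TwoTierLock(slot_id=7, lock_manager=mgr):
    pass  # update shared KV metadata, then flush its cachelines
```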

b. Cache and State Management:

KV caches are indexed by prefix hashes in fixed-size, cache-aligned hash tables or managed via LRU lists with reference counting. Dynamic eviction, layer-wise overlapping, or hot-spot replication (as in Mooncake) are employed to optimize memory capacity and minimize stale fetches. Global KV stores (e.g., BanaServe) decouple compute allocation from state, enabling routers to perform load-aware scheduling independent of local cache content (He et al., 15 Oct 2025, Qin et al., 2024).
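
A minimal sketch of LRU eviction with reference counting, where blocks pinned by in-flight requests cannot be evicted (the interface names are assumptions):

```python
from collections import OrderedDict

# KV-block eviction with reference counting: LRU order over blocks, where
# blocks pinned by in-flight requests (refcount > 0) are skipped by the
# evictor. A simplification of the mechanisms in the cited systems.

class RefCountedLRU:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks: "OrderedDict[str, int]" = OrderedDict()  # key -> refcount

    def get(self, key: str) -> bool:
        if key not in self.blocks:
            return False
        self.blocks.move_to_end(key)      # mark most-recently used
        self.blocks[key] += 1             # pin for the caller
        return True

    def release(self, key: str) -> None:
        self.blocks[key] -= 1             # unpin when the request finishes

    def insert(self, key: str) -> None:
        while len(self.blocks) >= self.capacity:
            # Iteration order is coldest-first; evict first unpinned block.
            victim = next((k for k, rc in self.blocks.items() if rc == 0), None)
            if victim is None:
                raise RuntimeError("all blocks pinned; cannot evict")
            del self.blocks[victim]
        self.blocks[key] = 1              # inserted pinned by its producer
```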

c. Adaptive Routing and Load Balancing:

AMDP (He et al., 16 Feb 2026) and related frameworks use real-time statistics (TTFT, ITL) at both prefill and decode workers to adaptively route incremental workloads, optimizing for SLO fulfillment. Adaptive reordering of prefill queues maximizes SLO rate by permuting tasks within a look-ahead window. Macro-schedulers at the instance or cluster level orchestrate rolling phase transitions (EcoServe (Du et al., 25 Apr 2025)), elastically adjusting capacity through mitosis-like splitting and merging.
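
The reordering step can be sketched as a greedy pass over a bounded look-ahead window, running requests that can still meet their TTFT deadlines in tightest-deadline-first order (the window size and cost estimates are assumed):

```python
# SLO-aware prefill reordering within a look-ahead window: greedily pick,
# among the next `window` queued requests, those that can still meet
# their TTFT deadlines, tightest deadline first. Window size and the
# per-request cost estimates are illustrative assumptions.

def reorder_window(pending, now, window=8):
    """pending: list of (req_id, est_prefill_cost, deadline), FIFO order.
    Returns an execution order for the first `window` entries."""
    order, t = [], now
    hopeless = []
    # Tightest deadlines first; requests that can no longer make their
    # deadline are deferred rather than blocking feasible ones.
    for req_id, cost, deadline in sorted(pending[:window], key=lambda r: r[2]):
        if t + cost <= deadline:
            order.append(req_id)
            t += cost
        else:
            hopeless.append(req_id)
    return order + hopeless

q = [("a", 30, 100), ("b", 10, 45), ("c", 50, 400), ("d", 20, 70)]
print(reorder_window(q, now=0))  # -> ['b', 'd', 'a', 'c']
```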

d. Data Movement Acceleration:

KV transfers, traditionally over RDMA with host-NIC-DRAM hops, become the dominant performance bottleneck as model and context sizes scale. TraCT eliminates this by leveraging CXL shared memory for direct, single-hop load/store and GPU DMA between devices (640 ns per hop, ~10 GB/s), yielding 4.2× speedup for 1 MB blocks and a 9.8× reduction in TTFT over RDMA baselines (Yoon et al., 20 Dec 2025). P/D-Serve (Jin et al., 2024) implements block-free, contiguous-buffer RDMA with per-scenario dynamic prefill/decode ratio tuning to further reduce bandwidth and latency overhead.
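
Using the CXL figures reported for TraCT, a quick transfer-time estimate (the single-hop model below is a simplification for illustration):

```python
# Back-of-the-envelope KV transfer time using the CXL figures reported
# for TraCT (~640 ns per hop, ~10 GB/s sustained). Multi-hop RDMA paths
# add host-DRAM staging copies, which is where multi-x speedups such as
# the reported 4.2x for 1 MB blocks come from.

def transfer_time(n_bytes, hops, hop_latency_s, bandwidth_Bps):
    return hops * hop_latency_s + n_bytes / bandwidth_Bps

block = 1 * 2**20   # 1 MB KV block
cxl = transfer_time(block, hops=1, hop_latency_s=640e-9, bandwidth_Bps=10e9)
print(f"CXL single-hop, 1 MB: {cxl * 1e6:.0f} us")  # ~105 us, bandwidth-dominated
```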

4. Performance Modeling, Evaluation, and Trade-Offs

Performance analysis in disaggregated serving systems rigorously models compute, memory, and network components, and is tightly linked to workload-aware scheduling:

a. Latency and Throughput Models:

For phase-wise disaggregation (DistServe (Zhong et al., 2024), semi-PD (Hong et al., 28 Apr 2025)), TTFT and TPOT are modeled as queueing processes (e.g., M/D/1, M/M/1), decomposing request latency into queuing delay, compute, and network transfer components:

$$\mathrm{TTFT} = T_{\mathrm{prefill}} + T_{\mathrm{KV\,transfer}} + T_{\mathrm{decode}}$$

Prefix-aware caching and fast, in-fabric KV transfers deliver transformative reductions in TTFT (up to 9.8×) and P99 latency (up to 6.2×) (Yoon et al., 20 Dec 2025).
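
The queueing component can be made concrete with the standard M/D/1 mean-wait formula; the arrival rate and service times below are assumed values, not DistServe's measured parameters:

```python
# TTFT under an M/D/1 prefill queue: deterministic prefill service time,
# Poisson arrivals. W_q = rho / (2 * mu * (1 - rho)) is the standard
# M/D/1 mean waiting time. All numeric inputs are assumed values.

def ttft_md1(arrival_rate, prefill_s, kv_transfer_s, first_decode_s):
    mu = 1.0 / prefill_s            # service rate of the prefill worker
    rho = arrival_rate / mu         # utilization; must be < 1 for stability
    assert rho < 1, "prefill worker overloaded"
    wait = rho / (2 * mu * (1 - rho))          # M/D/1 mean queueing delay
    return wait + prefill_s + kv_transfer_s + first_decode_s

# 6 req/s against a 120 ms prefill, 1 ms in-fabric KV transfer,
# 15 ms first decode step:
print(f"TTFT ~= {ttft_md1(6.0, 0.120, 0.001, 0.015) * 1e3:.1f} ms")  # ~290 ms
```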

b. Resource Utilization and Efficiency:

Fine-grained disaggregation enables independent scaling and maximizes utilization. BanaServe (He et al., 15 Oct 2025) keeps both compute and memory utilization within 5 percentage points of 95% and maintains linear throughput scaling as load increases, in contrast to static disaggregated and monolithic designs, which saturate early or waste resources.

c. Trade-Offs:

Disaggregation introduces overheads (e.g., KV cache transfers, synchronization costs, and increased average hop counts with marginal added power and network traffic (Ajibola et al., 2019)) that must be balanced against gains in SLO adherence, concurrency, and cost savings. Notably, design choices such as block size in RDMA, layer-wise versus full-KV transfer, CXL capacity limits, and local cache pinning each influence aggregate system efficiency.

5. Extensions: Multimodal, Multi-Model, and Heterogeneous Serving

Recent advances generalize disaggregated serving across modality, workload, and hardware boundaries:

a. Multimodal and Stage-Level Disaggregation:

vLLM-Omni (Yin et al., 2 Feb 2026) and EPD-Serve (Bai et al., 5 Jan 2026) present “stage abstraction” and three-way “EPD” decomposition (Encode, Prefill, Decode), enabling construction of any-to-any pipelines for complex multimodal models. Each stage functions as an independent execution engine with individualized resource allocation, batching, and data connectors. Stage-level disaggregation supports flexible deployment, horizontal and vertical scaling, and optimized SLO management for mixed text, audio, and vision workloads.
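
A sketch of the stage abstraction, composing an Encode-Prefill-Decode pipeline from independently scaled stages (the Stage/connect names are illustrative, not the vLLM-Omni or EPD-Serve APIs):

```python
from typing import Callable

# Each stage is an independent execution engine with its own resources;
# pipelines are composed any-to-any by wiring stages together. Names and
# structure here are expository assumptions.

class Stage:
    def __init__(self, name: str, fn: Callable, replicas: int = 1):
        self.name, self.fn, self.replicas = name, fn, replicas  # scaled per stage

    def run(self, payload):
        return self.fn(payload)

def connect(*stages: Stage):
    """Chain stages into a pipeline; each hop is a data-connector boundary
    where batching and placement can differ per stage."""
    def pipeline(payload):
        for s in stages:
            payload = s.run(payload)
        return payload
    return pipeline

encode = Stage("encode", lambda x: f"emb({x})", replicas=2)    # vision/audio
prefill = Stage("prefill", lambda x: f"kv({x})", replicas=4)   # compute-heavy
decode = Stage("decode", lambda x: f"tokens({x})", replicas=8) # bandwidth-heavy

epd = connect(encode, prefill, decode)   # three-way EPD decomposition
print(epd("image+prompt"))               # tokens(kv(emb(image+prompt)))
```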

b. Multi-Model and Multi-Agent Sharing:

PrefillShare (Woo et al., 12 Feb 2026) factorizes the prefill module as a frozen, shared component, allowing multiple specialized decode modules to reuse a computed KV cache for identical prompt prefixes. This eliminates redundant KV computation and memory allocation across models, reducing p95 tail latency by up to 4.5× and enhancing throughput and hit ratios in multi-agent serving.
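
A sketch of the sharing pattern: one frozen prefill module populates a prefix-keyed KV store, and multiple decode modules reuse the same entry (the store layout and names are assumptions):

```python
import hashlib

# Cross-model KV reuse in the spirit of PrefillShare: a frozen, shared
# prefill module fills the store once per prompt prefix; specialized
# decode modules reuse the same entry for identical prefixes.

kv_store: dict[str, str] = {}   # prefix key -> shared KV-cache handle

def prefix_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()[:16]

def serve(prompt: str, decode_model: str) -> str:
    key = prefix_key(prompt)
    if key not in kv_store:
        kv_store[key] = f"kv<{key}>"        # one shared prefill pass
        print(f"prefill computed for {key}")
    else:
        print(f"prefill reused for {key}")  # no redundant KV computation
    return f"{decode_model} decodes from {kv_store[key]}"

# Two agents with different decode modules share one prefill result:
print(serve("system prompt + tools", "coder-decoder"))
print(serve("system prompt + tools", "planner-decoder"))
```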

c. Heterogeneous Hardware and Simulation:

LLMServingSim 2.0 (Cho et al., 26 Feb 2026) provides a unified simulation environment to jointly analyze disaggregated serving and hardware heterogeneity (e.g., GPUs, NPUs, DRAM, CXL, PIM). The simulator models runtime scheduling, data movement, network contention, and power, achieving sub-1% error vs. real deployments—a tool for co-designing next-generation infrastructure.

6. Limitations, Challenges, and Future Directions

Despite its benefits, disaggregated serving poses significant challenges:

  • Shared memory scale and coherence: Current CXL devices offer TB-scale capacity but lack hardware cache coherence, requiring explicit synchronization and incurring CPU overhead (Yoon et al., 20 Dec 2025).
  • System bottlenecks and scalability: Single lock managers, metadata contention, and central schedulers may become critical bottlenecks in very large racks or multi-rack deployments, necessitating hierarchical or distributed approaches.
  • Trade-off selection and tuning: Determining optimal prefill/decode ratios, buffer sizes, eviction policies, and communication protocols requires workload-aware modeling and online adaptation (Jin et al., 2024, Du et al., 25 Apr 2025).
  • Applicability to new workloads: Stage-wise, multi-agent, and multi-modal extensions (e.g., vLLM-Omni, EPD-Serve) introduce fragmentation, tail-latency risk, and the need for automated graph optimization across stages (Yin et al., 2 Feb 2026).

Open questions include the integration of hardware-coherent memory for simplified synchronization, cross-platform peer DMA for KV sharing, extension to hybrid cloud/edge or federated environments, and the generalization of disaggregated orchestration mechanisms to other data-intensive domains (databases, analytics engines).

7. Representative Quantitative Outcomes

The operational impact of disaggregated serving systems is substantial across application areas. Select performance results include:

  • TraCT (Yoon et al., 20 Dec 2025): TTFT up to 9.8× lower; P99 latency up to 6.2× lower; 1.6× higher throughput
  • DistServe (Zhong et al., 2024): 2–4.5× higher peak goodput; supports 7.4–10.2× tighter SLOs
  • BanaServe (He et al., 15 Oct 2025): 1.2–3.9× higher throughput; 1.4–70.1% lower latency vs. leading baselines
  • EcoServe (Du et al., 25 Apr 2025): 65–85% goodput gain vs. non-disaggregated serving on commodity clusters
  • Mooncake (Qin et al., 2024): up to 525% throughput gain on long-context workloads
  • DisaggRec (Ke et al., 2022): 21–49.3% TCO reduction with near-monolithic throughput and flexible scaling
  • PrefillShare (Woo et al., 12 Feb 2026): 4.5× lower p95 latency; 3.9× higher throughput for multi-model agents

These results demonstrate the transformative effect of disaggregation on both resource efficiency and user-perceived latency, especially under tight SLOs and scalable, heterogeneous deployments.


In sum, the disaggregated serving system paradigm is a foundational architectural principle for next-generation large-scale AI serving, characterized by logical and physical phase/operator separation, dynamic orchestration, and high-performance intra/inter-node communication. It enables robust, cost-effective, and SLO-compliant deployment of LLMs, recommendation engines, and multimodal inferencing in both datacenter and cloud-edge environments (Yoon et al., 20 Dec 2025, Ke et al., 2022, Zhong et al., 2024, Cho et al., 26 Feb 2026, Du et al., 25 Apr 2025, He et al., 15 Oct 2025, Jin et al., 2024, Woo et al., 12 Feb 2026).
