Disaggregated AI GPUs Overview
- Disaggregated AI GPUs are architectures that decouple GPU compute, memory, and host systems to enable flexible, scalable allocation of resources.
- They employ system-level, resource-level, and function-level disaggregation to optimize performance, reduce energy consumption, and lower operational costs.
- Practical implementations enhance resource pooling, dynamic scaling, and workload scheduling, driving sustainable and efficient AI deployments.
Disaggregated AI GPUs are systems and architectural paradigms in which the tight coupling between GPU compute, memory, and server hosts is systematically broken to enable independent pooling, allocation, and dynamic scaling of GPU resources across datacenters or clusters. This approach has emerged as a key technique to address performance, utilization, sustainability, and cost challenges accompanying the accelerated growth of large AI models and the increasing heterogeneity of available GPU hardware.
1. Principles and Definitions of Disaggregated AI GPUs
Disaggregation in the context of AI GPUs refers to the logical and/or physical separation of resources traditionally deployed together: GPU compute (Streaming Multiprocessors, cores), GPU memory (HBM/DRAM), storage, and often the CPU host. The resulting architecture exposes these resources as individually allocatable units—decoupling assignment and scaling per application or workload phase.
Disaggregated AI GPUs can be classified into several levels:
- System-level disaggregation: Decoupling GPUs from host servers, enabling network-attached "GPU pools" provisioned on demand across the datacenter (He et al., 2023).
- Resource-level disaggregation: Separating compute and memory (e.g., detached memory nodes) and/or decomposing GPU internal architecture (e.g., via on-package disaggregation) (Fu et al., 2021, Moazeni, 2023).
- Function-level disaggregation: Partitioning complex AI serving/training pipelines into phases that are mapped independently to pooled GPU resources, e.g., prefill/decode in LLM inference (Liu et al., 22 Sep 2025, Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025).
A key motivation is addressing architectural and operational inefficiencies arising from static, monolithic configurations, especially in heterogeneous or multi-vendor hardware environments (Chen et al., 22 Sep 2025, Shi et al., 29 Dec 2024).
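To make the "individually allocatable units" framing above concrete, the following minimal Python sketch models a pool of typed resource units that can be reserved independently of one another; the class and field names are purely illustrative and do not correspond to any particular system.

```python
from dataclasses import dataclass, field
from enum import Enum

class Level(Enum):
    SYSTEM = "system"      # GPUs decoupled from hosts, network-attached
    RESOURCE = "resource"  # compute and memory pooled separately
    FUNCTION = "function"  # pipeline phases (e.g., prefill/decode) mapped separately

@dataclass
class ResourceUnit:
    kind: str        # e.g., "gpu-compute", "hbm-pool", "prefill-slot"
    capacity: float  # TFLOPS, GB, or requests/s depending on kind
    level: Level

@dataclass
class Pool:
    units: list = field(default_factory=list)

    def allocate(self, kind: str, amount: float):
        """Reserve `amount` of capacity from any unit of the requested kind."""
        for unit in self.units:
            if unit.kind == kind and unit.capacity >= amount:
                unit.capacity -= amount
                return unit
        return None  # insufficient capacity; the pool must be scaled out
```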
2. Architectural Patterns and Systems
2.1 System-Level Disaggregation
Datacenter-scale GPU disaggregation is enabled by decoupling GPU devices from host servers, providing dynamic binding via a network fabric and dedicated hardware proxies. DxPU, for instance, introduces hardware modules that translate PCIe TLPs to network packets, presenting GPUs as "hot-pluggable" PCIe devices to any host anywhere in the datacenter. This enables transparent pooling and flexible allocation at massive scale, with observed performance overheads of <10% for most AI workloads (He et al., 2023). Representative approaches are compared below.
| System | Scope | Compatibility | Overhead (%) |
|---|---|---|---|
| PCIe-fabric | Rack-scale | Unlimited | ≤1 |
| DxPU | DC-scale | Unlimited | 1–10 |
| User-mode SW | DC-scale | Limited | 5–20 |
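To make the overhead column concrete, the back-of-envelope sketch below estimates how much extra pooled capacity would be needed to offset each approach's worst-case overhead, under the simplifying assumption that overhead translates linearly into lost throughput; the fleet size is illustrative.

```python
# Back-of-envelope: extra capacity needed to absorb disaggregation overhead,
# using the worst-case overhead figures from the table above.
approaches = {
    "PCIe-fabric (rack-scale)": 0.01,
    "DxPU (DC-scale)":          0.10,   # upper end of the 1-10% range
    "User-mode SW (DC-scale)":  0.20,   # upper end of the 5-20% range
}

baseline_gpus = 1000  # hypothetical fleet sized without disaggregation

for name, overhead in approaches.items():
    effective = 1.0 - overhead          # per-GPU throughput retained
    needed = baseline_gpus / effective  # GPUs to match baseline throughput
    print(f"{name}: ~{needed - baseline_gpus:.0f} extra GPUs "
          f"to offset {overhead:.0%} overhead")
```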
2.2 Resource-Level Disaggregation
Emerging hardware architectures further allow disaggregation below the GPU device level:
- Composable On-Package Architecture (COPA-GPU) proposes multi-chip module (MCM) designs separating the GPU core (compute, L1/L2) from memory controller and large L3 cache (MSM module), supporting domain-specialization (DL vs HPC). This approach yields up to 31–35% performance improvements, 3.4× energy reduction, and up to 50% fewer GPU instances required for scale-out training or inference (Fu et al., 2021).
- Co-packaged optics enable chip-to-chip and chip-to-memory bandwidth to reach ≥1 Tb/s with <1 pJ/b energy and ∼10–100 ns latency, facilitating physical disaggregation of compute and memory/storage pools at rack or datacenter scale (Moazeni, 2023).
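A quick back-of-envelope calculation using the optics figures quoted above (≥1 Tb/s per link, <1 pJ/b, up to ~100 ns latency) shows why physically disaggregated memory pools become plausible at this operating point; the payload size is an illustrative assumption.

```python
# Back-of-envelope using the co-packaged optics figures quoted above.
payload_bytes = 16 * 2**30           # e.g., a 16 GiB activation/KV working set (assumed)
bits = payload_bytes * 8

link_bw_bps = 1e12                   # 1 Tb/s
energy_per_bit_j = 1e-12             # 1 pJ/b (upper bound)
link_latency_s = 100e-9              # 100 ns (upper bound)

transfer_s = bits / link_bw_bps + link_latency_s
energy_j = bits * energy_per_bit_j

print(f"transfer time ~ {transfer_s * 1e3:.1f} ms, energy ~ {energy_j:.2f} J")
```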
2.3 Function/Phase-Level Disaggregation
A highly effective pattern in AI serving, especially for LLMs, is decoupling inference pipeline phases across GPUs or GPU slices:
- Prefill/Decode Disaggregation: Partitioning the computationally intensive prefill phase from the memory-bound decode phase allows each to be scheduled on optimally matched hardware (Liu et al., 22 Sep 2025, Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025).
- Partially Disaggregated Prefill: Cronus introduces heterogeneity-aware workload partitioning, in which model layers are assigned to GPUs in proportion to their compute power, achieving 2.1× the throughput and 2× lower batch-completion latency versus fully disaggregated or data/pipeline-parallel baselines (Liu et al., 22 Sep 2025).
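The heterogeneity-aware partitioning idea can be illustrated with a minimal sketch that assigns contiguous model layers to GPUs in proportion to their relative compute power; this is not the exact Cronus algorithm, and the TFLOPS ratings in the example are assumed.

```python
def split_layers_by_compute(num_layers, gpu_tflops):
    """Assign contiguous layer ranges to GPUs proportional to compute power.

    Illustrative heterogeneity-aware partitioning (not the Cronus algorithm):
    faster GPUs receive proportionally more prefill layers.
    """
    total = sum(gpu_tflops)
    shares = [tf / total * num_layers for tf in gpu_tflops]
    counts = [int(s) for s in shares]
    remainder = num_layers - sum(counts)
    # Give leftover layers to the GPUs with the largest fractional share.
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    # Convert per-GPU layer counts into (start, end) ranges.
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# Example: an 80-layer model over a fast and a slow GPU (assumed ratings).
print(split_layers_by_compute(80, [989, 312]))  # -> [(0, 61), (61, 80)]
```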
3. Scheduling, Placement, and Resource Pooling
3.1 Scheduling and Mapping Algorithms
Disaggregated AI GPU usage requires advanced algorithms for partitioning, placement, and dynamic scaling:
- Constraint and Graph-Based Optimization: HexGen-2 formalizes the assignment problem as a graph partitioning and max-flow optimization, incorporating both hardware heterogeneity and KV cache transfer bandwidth as scheduling constraints. Spectral partitioning and the preflow-push max-flow algorithm enable near-optimal throughput (up to 2× vs. SOTA) and 1.5× latency reduction, with cost-equivalent performance at 30% lower budget (Jiang et al., 11 Feb 2025).
- Joint Optimization on Heterogeneous, Multi-Vendor GPUs: Parallel strategy and instance allocation are jointly optimized, e.g., subject to throughput, time-to-first-token (TTFT), and time-per-output-token (TPOT) constraints, with compatibility modules (numerical, VRAM, and parallelism alignment) ensuring robust operation across vendors (Chen et al., 22 Sep 2025).
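A minimal sketch of such joint, SLO-constrained selection: enumerate candidate (parallel strategy, instance allocation) configurations with profiled throughput, TTFT, and TPOT, discard those that violate the SLOs, and keep the best throughput per unit cost. The candidate names and numbers are hypothetical, not taken from the cited systems.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    throughput_tps: float   # output tokens/s across all instances
    ttft_ms: float          # time to first token
    tpot_ms: float          # time per output token
    cost_per_hour: float

def pick_config(candidates, ttft_slo_ms, tpot_slo_ms):
    """Keep SLO-feasible configs; maximize throughput per unit cost."""
    feasible = [c for c in candidates
                if c.ttft_ms <= ttft_slo_ms and c.tpot_ms <= tpot_slo_ms]
    return max(feasible, key=lambda c: c.throughput_tps / c.cost_per_hour,
               default=None)

# Hypothetical profiled candidates on a mixed-vendor pool.
candidates = [
    Config("TP4 prefill + TP2 decode", 5200, 380, 42, 96.0),
    Config("TP8 colocated",            4100, 520, 55, 96.0),
    Config("TP2 prefill + TP2 decode", 3600, 410, 47, 64.0),
]
best = pick_config(candidates, ttft_slo_ms=500, tpot_slo_ms=50)
print(best.name if best else "no feasible configuration")
```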
3.2 Autoscaling and Dynamic Pool Management
- Coordinated Autoscaling (HeteroScale): P/D-disaggregated LLM serving requires metric-driven policies keyed to throughput rather than naive GPU utilization. Resource pools for prefill and decode are scaled jointly to maintain empirically optimal phase ratios, and topology- and affinity-aware placement avoids transfer bottlenecks (20% bandwidth loss if ignored), yielding 26.6 percentage points higher GPU utilization and substantial cost savings at production scale (Li et al., 27 Aug 2025).
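In the spirit of the coordinated policy described above, the sketch below sizes the decode pool from a token-throughput signal and then sizes the prefill pool to preserve a fixed prefill:decode GPU ratio; the ratio and per-GPU throughput are illustrative placeholders, not HeteroScale's actual parameters.

```python
import math

def scale_pd_pools(decode_tokens_per_s, tokens_per_s_per_decode_gpu,
                   pd_ratio=0.5, min_gpus=1):
    """Size the decode pool from a throughput signal, then size the prefill
    pool to preserve an empirically chosen prefill:decode GPU ratio.

    Illustrative policy: scaling is driven by token throughput rather than
    GPU utilization, and the two pools are always scaled together.
    """
    decode_gpus = max(min_gpus,
                      math.ceil(decode_tokens_per_s / tokens_per_s_per_decode_gpu))
    prefill_gpus = max(min_gpus, math.ceil(decode_gpus * pd_ratio))
    return prefill_gpus, decode_gpus

# Example: 1.2M output tokens/s, ~20k tokens/s per decode GPU, 1:2 P:D ratio.
print(scale_pd_pools(1_200_000, 20_000, pd_ratio=0.5))  # -> (30, 60)
```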
3.3 Multi-tenancy and Fine-Grained Disaggregation
- Multi-Instance GPU (MIG) Partitioning and Placement: Multi-tenant slicing (MIG on NVIDIA A100/H100) provides hardware-isolated partitions; utilization is maximized through dynamic workload placement, optimized via mixed-integer programming (MIP) or heuristics to minimize GPU usage and wastage. Up to a 2.85× reduction in GPUs used and a 70% reduction in wastage are observed over heuristic first-fit, and these schemes are production-viable at cluster scale (Turkkan et al., 10 Sep 2024).
- Dynamic Partitioning Algorithms: MISO demonstrates that MPS-based profiling can accurately predict performance across all MIG slice types, supporting frequent repartitioning and optimal mapping as workloads change, with up to 49% lower job completion time than unpartitioned GPUs and a 16% improvement over static partitioning (Li et al., 2022).
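As a simplified stand-in for the placement optimizers discussed above, the sketch below packs MIG slice requests onto GPUs with a first-fit-decreasing heuristic, counting capacity in compute-slice units (7 per A100/H100-class GPU). It is not the MIP formulation from the cited work and ignores MIG's profile-placement validity rules.

```python
def pack_mig_requests(slice_requests, slices_per_gpu=7):
    """First-fit-decreasing packing of MIG slice requests onto whole GPUs.

    `slice_requests` are compute-slice counts (e.g., 1, 2, 3, 4, or 7 of the
    7 slices on an A100/H100-class GPU). Returns per-GPU assignments.
    Simple heuristic only: it ignores MIG's placement-validity constraints.
    """
    gpus = []  # list of (free_slices, [assigned request sizes])
    for req in sorted(slice_requests, reverse=True):
        for i, (free, assigned) in enumerate(gpus):
            if free >= req:
                gpus[i] = (free - req, assigned + [req])
                break
        else:
            gpus.append((slices_per_gpu - req, [req]))
    return [assigned for _, assigned in gpus]

# Example: seven inference jobs with mixed slice needs.
print(pack_mig_requests([3, 1, 2, 4, 1, 2, 7]))
# -> [[7], [4, 3], [2, 2, 1, 1]]  (3 GPUs instead of 7)
```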
4. Application Domains and Workflows
4.1 LLM and AI Inference
- Elastic and Unified Execution: Systems such as DynaServe move beyond static prefill/decode partitioning, splitting requests at arbitrary points and batch-adapting sub-requests for optimal compute and memory utilization. This model exhibits up to 4.34× higher goodput and consistently low P99 latency in hybrid workloads (Ruan et al., 12 Apr 2025).
- KV Cache Transfer Optimization: In disaggregated architectures, transfer of large KV caches can be a major bottleneck. FlowKV's memory reshaping, segment-based allocation, and single-call transfer techniques yield up to a 96–98% reduction in transfer latency, obtaining 1.15–2× higher throughput over baselines (Li et al., 3 Apr 2025); a back-of-envelope transfer-size sketch follows the table below.
- Unified Storage, Asynchronous Computation: The semi-PD architecture removes the overhead of KV cache transfer by keeping prefill and decode as independent, asynchronous workers that share a unified storage pool on a single- or multi-GPU host, adjusting compute resource slices via dynamic SLO-aware partitioning. Latency is reduced 1.27–2.58× at high request rates, and SLO adherence improves 1.55–1.72× (Hong et al., 28 Apr 2025).
- Sustainable Deployment: GreenLLM demonstrates that functional disaggregation (prefill on new GPU, decode on old GPU; or speculative decoding splits) reduces LLM-serving carbon emissions by up to 40.6% while meeting latency SLOs for >90% of requests (Shi et al., 29 Dec 2024).
| Functional Disaggregation | Key Mechanism | Benefit |
|---|---|---|
| P/D Split | Phases to matched hardware | Latency, cost, & utilization |
| Micro-request (DynaServe) | Arbitrary phase boundary | Elasticity, utilization, SLOs |
| P/D + Speculative Decode | Draft/target models split | Sustainability, low bandwidth req. |
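To gauge why KV cache movement matters in P/D disaggregation (see the FlowKV and semi-PD entries above), the following back-of-envelope sketch computes per-request KV cache size with the standard 2 × layers × KV heads × head dim × tokens × bytes formula, plus the resulting transfer time; the model dimensions and link bandwidth are illustrative assumptions.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens, dtype_bytes=2):
    """Per-request KV cache size: K and V per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens * dtype_bytes

# Illustrative 70B-class dense model with grouped-query attention (assumed dims).
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, tokens=4096)
link_gbps = 200  # assumed inter-node fabric bandwidth

transfer_ms = size * 8 / (link_gbps * 1e9) * 1e3
print(f"KV cache ~ {size / 2**20:.0f} MiB, transfer ~ {transfer_ms:.1f} ms")
```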
4.2 Multimodal LLM Training
- Disaggregated Model Orchestration: DistTrain explicitly decomposes large multimodal LLMs into modules (e.g., encoders, backbone, generators), allocating independently optimized DP/TP/PP parallelism and resources to each. Combined with batch/data reordering, this strategy raises Model FLOPs Utilization (MFU) to 54.7% (on 1172 GPUs with a 72B model), achieving up to 2.2× the throughput of Megatron-LM (Zhang et al., 8 Aug 2024).
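For reference, MFU can be computed from first principles with the common approximation of ~6 × parameters FLOPs per training token (forward plus backward) for dense transformers; the token throughput and per-GPU peak rating below are assumed purely to illustrate the formula, not figures from DistTrain.

```python
def model_flops_utilization(params, tokens_per_s, num_gpus, peak_tflops_per_gpu):
    """MFU = achieved model FLOPs per second / aggregate peak FLOPs per second.

    Uses the standard ~6 * params FLOPs-per-token approximation for
    dense-transformer training (forward + backward).
    """
    achieved = 6 * params * tokens_per_s
    peak = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved / peak

# Illustrative numbers only (72B model; assumed token throughput and peak rating).
mfu = model_flops_utilization(params=72e9, tokens_per_s=1.5e6,
                              num_gpus=1172, peak_tflops_per_gpu=989)
print(f"MFU ~ {mfu:.1%}")
```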
4.3 HPC and Specialized Workloads
- Remote AI Acceleration for HPC: For cognitive simulation (CogSim), in-the-loop inference with small batch sizes and tight latency SLOs can favor disaggregated, network-attached AI accelerators over node-local GPUs, as shown by up to 3× better throughput at small batch sizes in real surrogates (Hermit, MIR) using RDMA-attached dataflow engines (II et al., 2021).
- Graph Analytics: TEGRA organizes compute and memory as separate, composable pools, using hardware-active messages and CXL-accessed memory for terascale graphs, improving memory and compute utilization compared to traditional scale-out clusters (Shaddix et al., 4 Apr 2024).
5. Technological Enablers and Challenges
5.1 High-Bandwidth, Low-Latency Interconnects
- Co-Packaged Optics present the most plausible path to enable memory and compute disaggregation at scale, promising >1 Tb/s bandwidth, <1 pJ/b energy, and <100 ns latency for chip-to-chip links (Moazeni, 2023).
- PCIe-level Disaggregation, as in DxPU, leverages PCIe TLP-NIC conversion and up to 2×100GbE links, maintaining transparency and high compatibility (He et al., 2023).
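For a rough sense of what the up-to-2×100 GbE attachment implies for bulk transfers (e.g., staging weights onto a pooled GPU), the sketch below ignores protocol overhead; the payload size is an illustrative assumption.

```python
# Rough time to push a bulk payload over a DxPU-style 2 x 100 GbE attachment,
# ignoring protocol overhead; payload size is an illustrative assumption.
payload_gib = 16                       # e.g., weights for one pipeline stage
link_gbps = 2 * 100                    # 2 x 100 GbE aggregate
seconds = payload_gib * 2**30 * 8 / (link_gbps * 1e9)
print(f"~{seconds:.2f} s to stage {payload_gib} GiB over {link_gbps} Gb/s")
```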
5.2 Scheduling and Data Movement Overheads
The effectiveness of disaggregated AI GPU systems is contingent on optimized scheduling (phase assignment, batch formation, instance placement) and minimizing communication overhead (e.g., through KV cache reshaping (Li et al., 3 Apr 2025)). Limitations arise if workload or micro-batch granularity is too fine or if memory bandwidth and interconnects are insufficient.
5.3 Resource Utilization, Cost, and Energy
Disaggregation, by enabling more granular pooling and dynamic mapping, is empirically shown to reduce required GPU count (up to 2.85× fewer in optimal MIG placement (Turkkan et al., 10 Sep 2024)), eliminate storage/compute imbalances (Hong et al., 28 Apr 2025), and support sustainable AI deployments using older hardware (Shi et al., 29 Dec 2024). Specialized architectures (COPA-GPU) further reduce datacenter cost and energy consumption (Fu et al., 2021).
6. Implications, Limitations, and Future Directions
Disaggregated AI GPU architectures underpin the next generation of scalable, efficient, and sustainable AI infrastructure. Future directions include:
- Programmable and Composable Hardware: Wider adoption of MCM designs, smart optical I/O, and composable GPU/memory pools (Fu et al., 2021, Moazeni, 2023).
- SLO-aware, Adaptive Scheduling: Increased sophistication in placement and control algorithms for dynamic workloads and heterogeneous fleets (Li et al., 27 Aug 2025, Jiang et al., 11 Feb 2025).
- Fine-Grained Partitioning: Expansion of hardware/software support for MIG-like slicing, potentially extending to per-streaming-multiprocessor or functional subunit allocation.
- Cross-vendor, Multi-modal Support: Generalized compatibility across an increasingly diverse set of accelerator types, memory hierarchies, and workload compositions (Chen et al., 22 Sep 2025, Zhang et al., 8 Aug 2024).
- Sustainability Objectives: Systematic integration of embodied carbon, lifetime amortization, and e-waste reduction in resource allocation frameworks (Shi et al., 29 Dec 2024).
Notably, while disaggregation delivers significant gains for many classes of AI workloads, scenarios with extremely tight host-device communication (short kernels, rapid context-switching) can still expose performance limitations due to network-induced latency (He et al., 2023). The appropriate design point thus depends on workload characteristics, hardware capabilities, and operational objectives.
References: For primary results, technical discussions, and architectural diagrams consult (Liu et al., 22 Sep 2025, Chen et al., 22 Sep 2025, Moazeni, 2023, Li et al., 27 Aug 2025, Hong et al., 28 Apr 2025, Jiang et al., 11 Feb 2025, He et al., 2023, Shi et al., 29 Dec 2024, Fu et al., 2021, Turkkan et al., 10 Sep 2024, Zhang et al., 8 Aug 2024, II et al., 2021, Shaddix et al., 4 Apr 2024, Li et al., 2022, Li et al., 3 Apr 2025, Ruan et al., 12 Apr 2025, Byun et al., 29 Oct 2024).