
Hardware Disaggregation Architectures

Updated 13 November 2025
  • Hardware disaggregation is an architectural paradigm that decouples traditional server-bound resources into specialized, shared resource pools, enabling flexible provisioning.
  • Key advancements in interconnect technologies like PCIe, CXL, and optical fabrics, along with dynamic orchestration, improve performance and lower TCO.
  • Challenges include managing latency overhead, control plane scalability, and maintaining memory coherency, driving further research in hardware-software co-design.

Hardware disaggregation is an architectural paradigm in which traditional server-bound resources—such as CPUs, DRAM, non-volatile memory, accelerators, and network interfaces—are decoupled from their local enclosures and reorganized into shared, composable resource pools connected by high-performance fabrics. In an idealized disaggregated datacenter, these resource-specific nodes are dynamically allocated and re-composed on demand, enabling flexible, fine-grained provisioning that better aligns resource allocation with workload requirements, reduces resource stranding, and supports hardware heterogeneity (Guo et al., 6 Nov 2025). The transition to disaggregated hardware reflects advances in interconnect technologies (PCIe, CXL, optical), operating system co-design, and system-level orchestration, fundamentally altering the structure and operation of data center ecosystems.

1. Architectural Principles and Taxonomy

Hardware disaggregation transforms the canonical “server” into a set of specialized resource nodes—compute blades (CPU, local cache), memory blades (DRAM/PMEM), accelerator nodes (GPUs, TPUs, FPGAs, DPUs), and storage devices—linked by low-latency, high-bandwidth fabrics such as PCIe (Gen4/Gen5/Gen6), CXL (Compute Express Link), RDMA-capable Ethernet (RoCE), InfiniBand, or photonic DWDM (Wang et al., 26 Mar 2025, Michelogiannakis et al., 2023, Ghandeharizadeh et al., 2 Nov 2024). Architecturally, these systems are classified by composition scale:

| Scale | Description | Network/Fabric | Typical Latency (ns) |
|---|---|---|---|
| Rack-scale | Per-rack pools, tightly coupled | CXL 2.0/3.0, PCIe | 100–400 |
| Pod/cluster-scale | Cross-rack pooling | Optical, Gen-Z | 400–2,000+ |
| Hierarchical | Node/rack/cluster multipooling | CXL + PCIe + RDMA | 100–2,000+ |

A key taxonomy dimension is the degree of resource specialization: “CPU-only,” “memory-only,” and “accelerator-only” nodes exist, but emerging systems add hybrid nodes (e.g., FPGA+NIC, DPU+SSD) to maximize locality for certain application patterns. Disaggregated architectures are also distinguished by pooling and allocation granularity, e.g., block/page sizes for memory, or accelerators allocated per task or job (Guo et al., 6 Nov 2025, Wang et al., 26 Mar 2025, Michelogiannakis et al., 2023).
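To make the taxonomy concrete, the following is a minimal sketch that models resource nodes and pools as Python dataclasses; the type names, fields, and granularity encoding are illustrative assumptions of ours, not drawn from any of the cited systems.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeKind(Enum):
    COMPUTE = "compute"          # CPU blades with local cache
    MEMORY = "memory"            # DRAM/PMEM blades
    ACCELERATOR = "accelerator"  # GPU/TPU/FPGA/DPU nodes
    STORAGE = "storage"
    HYBRID = "hybrid"            # e.g., FPGA+NIC or DPU+SSD nodes

@dataclass
class ResourceNode:
    kind: NodeKind
    capacity: int    # e.g., bytes for memory, device count for accelerators
    fabric: str      # e.g., "CXL 2.0", "PCIe Gen5", "RoCE"
    latency_ns: int  # one-way fabric latency to this node

@dataclass
class ResourcePool:
    scale: str        # "rack", "pod", or "hierarchical"
    granularity: int  # allocation unit, e.g., 4 KiB pages for memory
    nodes: list[ResourceNode] = field(default_factory=list)

    def free_capacity(self) -> int:
        return sum(n.capacity for n in self.nodes)
```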

2. Interconnect Technologies and Performance Models

The emergence of hardware disaggregation has been driven by rapid advances in interconnects that balance bandwidth and latency constraints. PCIe Gen4/Gen5 x16 provides 32–64 GB/s per link, and CXL 2.0 reaches 70 GB/s on the same physical links with sub-500 ns one-way latencies; CXL 3.0 and Gen-Z promise further improvements (Wang et al., 26 Mar 2025, Yelam, 2022, Guo et al., 6 Nov 2025). Optical DWDM with AWGR topologies can deliver per-node escape bandwidths exceeding 6.4 Tbps, yielding conservative DRAM-to-chip round-trips of ≈35 ns at the intra-rack scale (Michelogiannakis et al., 2023).

Formal models describe the end-to-end latency for remote memory or device access as:

$$L_\mathrm{total} = L_\mathrm{local} + d \cdot L_\mathrm{remote}$$

where $L_\mathrm{local}$ is the CPU cache/memory-controller delay (e.g., 100 ns), $d$ is the number of fabric traversals (switch hops), and $L_\mathrm{remote}$ is the one-way network/fabric delay (300–2,000 ns depending on the fabric) (Wang et al., 26 Mar 2025). Peak achievable throughput is similarly modeled as:

$$T = f_\mathrm{pkt} \cdot S_\mathrm{pkt}, \qquad \eta = \frac{G}{C}$$

where $f_\mathrm{pkt}$ is the packet rate, $S_\mathrm{pkt}$ is the packet size, $G$ is the offered load, $C$ is the link capacity, and $\eta$ is link utilization (Wang et al., 26 Mar 2025).
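As a quick plausibility check, the sketch below evaluates these two models for representative fabric parameters; the numeric values come from the figures quoted in this section, while the function names and the example packet rate are our own assumptions.

```python
def remote_latency_ns(l_local_ns: float, hops: int, l_remote_ns: float) -> float:
    """L_total = L_local + d * L_remote."""
    return l_local_ns + hops * l_remote_ns

def throughput_gbps(pkt_rate_mpps: float, pkt_size_bytes: int) -> float:
    """T = f_pkt * S_pkt, returned in Gb/s."""
    return pkt_rate_mpps * 1e6 * pkt_size_bytes * 8 / 1e9

def utilization(offered_load_gbps: float, capacity_gbps: float) -> float:
    """eta = G / C."""
    return offered_load_gbps / capacity_gbps

# Representative one-way fabric delays, taken from the survey numbers above.
fabrics = {"CXL 2.0": 400, "RDMA/100GbE": 1500, "photonic DWDM": 35}
for name, l_remote in fabrics.items():
    total = remote_latency_ns(l_local_ns=100, hops=1, l_remote_ns=l_remote)
    print(f"{name:>14}: L_total = {total:.0f} ns")

# 8 Mpps of 1500 B packets approaches line rate on a 100 Gb/s link.
t = throughput_gbps(pkt_rate_mpps=8, pkt_size_bytes=1500)
print(f"T = {t:.1f} Gb/s, eta = {utilization(t, 100):.2f}")  # 96.0 Gb/s, 0.96
```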

Bandwidth and latency constraints dominate performance—for instance, CXL-attached DRAM sees read latencies of 250–500 ns; RDMA swap or remote cache over 100 GbE delivers bandwidths of 12–50 GB/s and latencies of 1–2 μs. Photonic fabrics further compress these bounds, enabling practically node-local bandwidth with small (<35 ns) latency additions (Michelogiannakis et al., 2023).

3. System Integration and Orchestration

For hardware disaggregation to be practical, seamless cross-layer system integration is essential. This includes:

  • Hardware/OS Interface: Disaggregated resources are presented as NUMA nodes (CXL DRAM), or as remote block devices (NVMeoF), with support in Linux SLAB/page allocators, frontswap, and hypervisor hot-plug APIs (Wang et al., 26 Mar 2025, Ghandeharizadeh et al., 2 Nov 2024).
  • Orchestration Layer: Centralized or distributed orchestrators monitor resource utilization (CPU load, memory footprint, RDMA queue depths, tail-latency percentiles) and dynamically compose workloads from resource pools based on workload-specific or SLA-driven policies (Ghandeharizadeh et al., 2 Nov 2024). Typical monitoring periods are 5 s, with thresholds for upper/lower utilization (e.g., 70%/30%) triggering resource scaling.
  • Scheduling Model: Resource allocation is often cast as an integer linear program (ILP): for example, assign request $r$ exactly one compute node $n \in \mathcal{N}^C_p$, $M_r$ units of memory from $\mathcal{N}^M_p$, and at most one accelerator, all from the same pool $p$, while minimizing weighted total resource cost and respecting local/remote latency and service constraints (Guo et al., 6 Nov 2025); a minimal sketch follows this list.
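Below is a minimal sketch of such an assignment ILP using the open-source PuLP modeler; the pool sizes, costs, capacities, and latency bound are invented for illustration and do not reproduce the formulation in (Guo et al., 6 Nov 2025).

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Hypothetical single pool p: 3 compute nodes, 2 memory nodes, 2 accelerators.
cpus, mems, accs = range(3), range(2), range(2)
cpu_cost = [1.0, 1.2, 0.9]
mem_cost = [0.5, 0.7]         # cost per memory unit on each memory node
acc_cost = [3.0, 2.5]
mem_latency_ns = [350, 900]   # access latency of each memory node
mem_cap = [3, 8]              # capacity of each memory node (units)
M_r, L_MAX = 4, 1000          # request r: memory demand and latency bound (ns)

prob = LpProblem("compose_request_r", LpMinimize)
x = [LpVariable(f"cpu_{c}", cat=LpBinary) for c in cpus]
y = [LpVariable(f"mem_{m}", lowBound=0, upBound=mem_cap[m]) for m in mems]
z = [LpVariable(f"acc_{a}", cat=LpBinary) for a in accs]

# Objective: minimize weighted total resource cost.
prob += (lpSum(cpu_cost[c] * x[c] for c in cpus)
         + lpSum(mem_cost[m] * y[m] for m in mems)
         + lpSum(acc_cost[a] * z[a] for a in accs))

prob += lpSum(x) == 1      # exactly one compute node for request r
prob += lpSum(y) == M_r    # all requested memory allocated from the pool
prob += lpSum(z) <= 1      # at most one accelerator
for m in mems:             # exclude memory nodes beyond the latency bound
    if mem_latency_ns[m] > L_MAX:
        prob += y[m] == 0

prob.solve()
print({v.name: v.value() for v in prob.variables() if v.value()})
```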

Dynamic resource orchestration allows for workload placement, rapid scale-out/scale-in, and failure isolation (distinct domains for compute, memory, accelerators). This elasticity is critical for handling both diurnal load variations and the rapidly evolving demands of AI/ML and analytics workloads (Ke et al., 2022, Ghandeharizadeh et al., 2 Nov 2024).
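The elasticity loop can be as simple as the threshold policy sketched below, using the 5 s monitoring period and 70%/30% utilization thresholds mentioned above; the `sample_utilization`, `grow_pool`, and `shrink_pool` hooks are hypothetical stand-ins for fabric- and orchestrator-specific calls.

```python
import time

UPPER, LOWER, PERIOD_S = 0.70, 0.30, 5.0

def sample_utilization(pool: str) -> float:
    """Hypothetical hook: current utilization of a resource pool in [0, 1]."""
    raise NotImplementedError

def grow_pool(pool: str) -> None:
    """Hypothetical hook: compose an additional node into the pool."""

def shrink_pool(pool: str) -> None:
    """Hypothetical hook: drain and release one node from the pool."""

def autoscale(pools: list[str]) -> None:
    """Threshold-driven scaling loop over disaggregated resource pools."""
    while True:
        for pool in pools:
            util = sample_utilization(pool)
            if util > UPPER:    # pool is running hot: scale out
                grow_pool(pool)
            elif util < LOWER:  # pool is underused: scale in
                shrink_pool(pool)
        time.sleep(PERIOD_S)    # 5 s monitoring period

# autoscale(["memory", "accelerator"])  # example invocation
```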

4. Performance, Efficiency, and Trade-offs

Disaggregated architectures yield substantial improvements in resource efficiency by enabling fine-grained allocation and independent resource scaling, but their performance is fundamentally co-determined by interconnect characteristics and system-level orchestration.

  • Throughput and Latency: Microbenchmarks show that CXL 1.1 delivers read latency of ≈330 ns, while advanced RDMA (Infiniswap) achieves up to 10 GB/s swap bandwidth, a 2× improvement over disk-based swap (Wang et al., 26 Mar 2025). In photonic racks, average slowdowns of 15–22% for CPUs and 5.3% for GPUs relative to non-disaggregated baselines are reported, orders of magnitude better than electronic-switched rack disaggregation (Michelogiannakis et al., 2023).
  • Energy and Cost: Power models attribute ≈3 W to a CXL DIMM, ≈1 W per 100 Gb/s to optical links, and ≈20 W to RDMA NICs. System-level energy per far-memory operation is $E_\mathrm{op} = P_\mathrm{avg} \cdot t_\mathrm{op}$ (see the worked example after this list). Large-scale evaluations find 30–50% reductions in total cost of ownership (TCO) over monolithic clusters, owing to improved utilization and elastic redundancy (Ke et al., 2022, Ghandeharizadeh et al., 2 Nov 2024).
  • Resource Efficiency: Pooling yields consolidation: experiments show 4× fewer memory modules and 2× fewer NICs compared to tightly coupled, scale-up racks at iso-performance (Michelogiannakis et al., 2023). Pools can be dynamically grown, shrunk, or rebalanced according to workload needs and are protected from cross-resource failure cascades (Ke et al., 2022).
  • Trade-offs: Disaggregation introduces a latency penalty ($\Delta t_\mathrm{overhead} = t_\mathrm{CXL} - t_\mathrm{DDR} \in [30, 170]$ ns per access) and may require 6–15% extra CPUs/GPUs to reach baseline throughput under worst-case locality (Guo et al., 6 Nov 2025, Michelogiannakis et al., 2023). Composition and dynamic orchestration add control-plane complexity and scheduling overhead, particularly in heterogeneous, multi-tenant environments.
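To illustrate the energy and overhead models above, here is a short worked example; the average power and operation time are arbitrary placeholders chosen from within the ranges quoted in this section.

```python
# Energy per far-memory operation: E_op = P_avg * t_op.
p_avg_w = 3.0    # assumed: a CXL DIMM drawing ~3 W
t_op_s = 400e-9  # assumed: one ~400 ns remote access
e_op_j = p_avg_w * t_op_s
print(f"E_op = {e_op_j * 1e6:.2f} uJ per access")  # 1.20 uJ

# Latency penalty of CXL over local DDR, per access.
t_cxl_ns, t_ddr_ns = 330, 200  # assumed values; overhead lands in [30, 170] ns
print(f"overhead = {t_cxl_ns - t_ddr_ns} ns")      # 130 ns
```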

5. Exemplary Implementations and Case Studies

Published systems span design points and domains:

  • Memory Disaggregation: Systems such as Clio implement virtual memory translation, permission checking, and DMA logic entirely on FPGA-based memory nodes, achieving a median one-way remote-memory latency of 2.5 μs at 100 Gb/s bandwidth and energy per byte ≈0.33× that of CPU/NIC-based paths (Guo et al., 2021). DaeMon transparently adapts data-movement granularities to network conditions, achieving a 2.4× geometric-mean speedup and a 3× reduction in remote-access latency compared to pure page migration (Giannoula et al., 2023).
  • Storage and NVM Disaggregation: Composable NVMe-over-Fabrics (NVMeoF) supports distributed “memory blades” for data-intensive genomics, with per-SSD utilization monitored and orchestrated at central SDI controllers, and throughput scaling linearly with RAID-0 composition until flash-internal parallelism saturates (Call et al., 2020).
  • Network Disaggregation: Evolution of OS abstractions (e.g., LegoOS, disaggregated nComponents) and device-side logic (SmartNICs, DPUs) enables direct storage/network/accelerator orchestration, transparent loopback, and fine-grained connection setup, with performance near native monolithic servers for small packets and improved resource utilization (Ekane et al., 2021, Park et al., 2023).
  • Co-packaged Photonics: Intra-rack photonic fabrics using AWGR switches and comb lasers achieve 44% chip-count reduction at <35 ns latency penalty, supporting 6.4 Tbps per MCM and matching node-local bandwidth (Michelogiannakis et al., 2023).

6. Challenges, Open Problems, and Future Directions

Current research highlights several unresolved issues that limit broader adoption of hardware disaggregation:

  • Control Plane Scalability: Centralized resource managers may bottleneck above 1,000 memory nodes or at fine temporal reallocation granularity; scalable, federated controllers and in-network directories are under investigation (Wang et al., 26 Mar 2025, Guo et al., 6 Nov 2025).
  • Elasticity, Power, and Cooling: Power gating, dynamic sleep states, and rack-level cooling systems must be coordinated with resource allocation to ensure efficiency under rapidly shifting utilization (Guo et al., 6 Nov 2025, Michelogiannakis et al., 2023).
  • Coherence and Consistency: Maintaining memory coherency, QoS isolation, and access security with multiple compute nodes caching and modifying remote data is non-trivial; approaches include hardware-enforced permission tables, transactionally-safe “memory grant/steal” APIs, and region-based hardware isolation (Angel et al., 2019, Heo et al., 2021, Guo et al., 2021).
  • Heterogeneity: Supporting diverse xPU clusters with mixed bandwidth/latency needs and dynamic workload patterns requires multi-tiered fabrics and unified orchestration frameworks capable of cross-layer co-optimization (Wang et al., 26 Mar 2025, Ghandeharizadeh et al., 2 Nov 2024).
  • Software Infrastructure: Exposing disaggregation details to applications and higher-level orchestrators (e.g., via Kubernetes plugins) allows for performance-aware scheduling but increases complexity (Guo et al., 6 Nov 2025, Angel et al., 2019).
  • Formal Verification and Correctness: Composing microservices and resource pools in real-time necessitates formal verification to preserve transactional and consistency guarantees during re-composition (Ghandeharizadeh et al., 2 Nov 2024).

A plausible implication is that future datacenter design will require cross-disciplinary co-design spanning hardware, control-plane scheduling, physical infrastructure (cooling, power), and the software stack to fully exploit the gains from hardware disaggregation while bounding complexity and power consumption, especially as system scale and resource diversity continue to grow.
