Compute–Memory Disaggregation

Updated 5 December 2025

Compute–memory disaggregation is the architectural separation of compute and memory resources, enabling a logical shared-memory system across physical boundaries.
Key methodologies include object-, page-, and cache-line-level strategies combined with high-speed fabrics such as RDMA, CXL, and InfiniBand to minimize latency and optimize scalability.
Empirical results reveal benefits like up to 63% DRAM reduction and significant TCO savings, underscoring its impact on HPC performance and flexible resource provisioning.

Compute–memory disaggregation refers to the architectural separation of compute resources (CPUs, accelerators) and memory resources (DRAM, NVM) across physical boundaries, interconnecting them via high-speed networks to form a logical shared-memory system. This model has become fundamental for large-scale datacenter efficiency, high-performance computing (HPC) scalability, and flexible resource provisioning. Disaggregation aims to address underutilization in memory capacity, total cost of ownership (TCO) inefficiencies, and to support elastic scaling by dynamically allocating memory to match diverse and variable compute workloads (Maruf et al., 2023, Zheng et al., 2 Dec 2025, Ding et al., 2023). Key implementations leverage scalable fabrics (InfiniBand, RDMA, CXL, Ethernet), object- or page-level remote access abstractions, and precise software/hardware co-design to minimize the inherent penalties of increased latency and bandwidth heterogeneity.

1. Architectural Principles and System Models

Compute–memory disaggregation decouples DRAM from compute nodes, making memory available either from dedicated memory blades or from DRAM pooled across other servers (Maruf et al., 2023, Ding et al., 2023, Ke et al., 2022). Architectures can be divided into:

Physical disaggregation: compute blades attach via PCIe/CXL/Gen-Z fabrics to pure memory blades; all off-node requests traverse the fabric (Maruf et al., 2023).
Logical disaggregation: DRAM sharing via RDMA or Ethernet among peer servers, requiring minimal hardware changes but relying on OS/runtimes for memory pooling (Maruf et al., 2023, Abrahamse et al., 2022).
Hybrid/cache-coherent: cache-coherent domain spanning multiple nodes using protocols such as CXL or Gen-Z, presenting a single address space for the CPU and memory pool, including support for accelerators and dynamic hot-plug (Maruf et al., 2023, Yang et al., 22 Mar 2025).

Fundamental performance parameters include local DRAM latency $L_\ell \approx 50$ –100 ns and bandwidth $B_\ell \approx 100$ –200 GB/s, versus remote memory access latency $L_r = 1$ –5 μs and bandwidth $B_r = 50$ –100 GB/s, depending on the interconnect and protocol (Zheng et al., 2 Dec 2025, Yang et al., 22 Mar 2025).

A typical performance model for HPC applications in such a system relates the runtime $T$ to local and remote accesses:

$T = T_\text{compute} + D_\text{local}/B_\text{local} + D_\text{remote}/\min(B_\text{inj}, B_\text{bis})$

where $D_\text{local}, D_\text{remote}$ are data volumes and $B_\text{inj}, B_\text{bis}$ are injection and bisection bandwidths (Ding et al., 2023).

2. Object-, Page-, and Cache-Line-Level Disaggregation Methods

Granularity of remote memory management is a critical dimension. Early solutions relied on OS-level page swap (paging to remote DRAM), incurring large slowdowns for poorly clustered access patterns (Ding et al., 2023, Zheng et al., 2 Dec 2025). Advanced strategies introduced:

Object-level placement: DOLMA partitions individual application data objects (matrix, vector, particle array) between local and remote DRAM, using a cost model $T = \sum_j f_j [x_j t_\ell + (1 - x_j) t_r]$ and a knapsack-style optimization to minimize execution time under a fixed local DRAM constraint (Zheng et al., 2 Dec 2025).
Cache-line/block-level access: Hardware disaggregated schemes (e.g., DRackSim) can forward individual cache misses via the fabric, modeling remote access as queuing in M/M/1 servers for NIC and memory pool (Puri et al., 2023).
Multi-granularity migration and adaptive selection: DaeMon leverages hardware support for both page migrations and fine-grained 64 B cache-line moves, partitioning link bandwidth to favor latency-critical traffic and using buffer-based thresholds for adaptive granularity decisions (Giannoula et al., 2023).

Empirical studies show that object/block-level partitioning regimes vastly outperform page-only migration for bandwidth- and latency-sensitive workloads (Zheng et al., 2 Dec 2025, Giannoula et al., 2023, Puri et al., 2023).

3. Network Fabrics, Latency, and Bandwidth Optimizations

Disaggregated memory demands ultra-low-latency, high-bandwidth networks; protocol and switch-level designs critically influence end-to-end performance.

RDMA/InfiniBand/Ethernet: One-sided RDMA over InfiniBand or Ethernet provides 1–10 μs remote access but incurs software and protocol overhead (Puri et al., 2023, Su et al., 2024).
CXL-based tiered memory: PCIe/CXL enables cache-coherent expansion with 150–200 ns device latency and tens of GB/s per device; CXL-aware OS policies are essential to balance concurrency and avoid bandwidth collapse (Yang et al., 22 Mar 2025, Sun et al., 2023).
PHY-layer fabric innovations: EDM shifts the entire protocol stack for remote memory access into the Ethernet PHY, eliminating MAC/min-frame/IFG overhead, and introduces an in-PHY centralized matching scheduler to deliver 300 ns single-hop latency and $<$ 1.3 × unloaded latency under load, an order of magnitude improvement over RoCEv2 and TCP/IP (Su et al., 2024).
Optical interconnects: Solutions like OCM use silicon photonics with micro-ring resonators, delivering up to 615 Gb/s over two fibers at 1.07 pJ/bit, outperforming traditional PCIe-NIC approaches by up to $5.5\times$ in latency benchmarks (Gonzalez et al., 2020).

Network-level congestion and request scheduling are addressed via hardware throttling (MIKU), dynamic bandwidth partitioning, and link-level compression for page migrations (Giannoula et al., 2023, Yang et al., 22 Mar 2025).

4. Consistency, Concurrency, and Synchronization Protocols

Ensuring correct sharing of data across compute nodes without central CPUs in memory blades requires lightweight software/hardware coherence:

Software-coherence protocols: Approaches such as SELCC introduce an RDMA-based shared–exclusive latch protocol that embeds coherence metadata into a 64-bit latch word. The protocol aligns with MSI coherence state transitions and leverages one-sided atomics for scalable, strongly consistent memory sharing, achieving $2\times$ – $6\times$ throughput improvements over RPC-based protocols (Wang et al., 2024).
In-network coherence/directory: MIND centralizes coherence metadata, permissions, and address translation in a programmable switch ASIC, reducing common-case access to a single one-way RDMA and achieving strong scaling up to 12 cores and rack-scale elasticity without per-node metadata broadcasts (Lee et al., 2021).
Hybrid threading and buffer management: Mechanisms like DOLMA’s dual-buffer pipeline (per-object double-buffering with overlap via RDMA completion callbacks) and per-thread prefetch queues maintain multi-threaded concurrency without global locks while hiding remote-access latency (Zheng et al., 2 Dec 2025).
Cache-line vs. page coherence trade-off: Fine-grained coherence strictly increases metadata and invalidation messaging, which can be mitigated by partitioned directory structures or buffer-based adaptive protocols (Maruf et al., 2023, Wang et al., 2024).

Relaxed-consistency memory (e.g., PRAM, FIFO, or future hardware in CXL 3.0) is an open area for further latency reduction (Wang et al., 2024).

5. Application Domains and Empirical Effectiveness

Compute–memory disaggregation has been systematically evaluated across HPC, data analytics, recommendation systems, and big-data frameworks:

HPC workloads: In DOLMA, eight representative kernels (NAS BT, FT, SP; Laghos; stencils; HPL; HPCG; molecular dynamics) experience $<$ 16% slowdown with an average $63\%$ reduction in local DRAM and $>90\%$ prefetch overlap. Stencil and LU factorization workloads show highest DRAM savings (Zheng et al., 2 Dec 2025).
Datacenter workloads: DisaggRec for large-scale recommendation serving (DenseNet, SparseNet split) achieves up to $49.3\%$ TCO reduction compared to monolithic server clusters by enabling independent scaling and decoupled failure domains. Sequential aggregation at memory nodes limits per-query data transfer to $<8\%$ of monolithic, meeting strict tail-latency SLAs (Ke et al., 2022).
Big-data object stores: Apache Arrow/Plasma combined with ThymesisFlow delivers $88\%$ of local DRAM bandwidth remotely and ~12% bandwidth tax, showing disaggregation is compatible with transparent mmap-based object access (Abrahamse et al., 2022).
Database systems and in-memory analytics: Solutions such as Farview demonstrate that operator offloading to DRAM-backed FPGAs can outperform even local buffer caches in predicate-heavy OLAP queries by pushing filtering, projection, and even encryption to the disaggregated memory (Korolija et al., 2021).
Empirical design-space studies: A comprehensive analysis of 13 HPC workloads showed that $85\%$ of application classes run with zero performance penalty given $100$ GB/s injection and $50$ GB/s rack bisection, with streaming or vector benchmarks as exceptions due to bandwidth limits (Ding et al., 2023).

A selection of empirical performance results is summarized below:

System/Paper	Workload/Metric	Local DRAM Baseline	Disaggregated: Penalty/Benefit
DOLMA (Zheng et al., 2 Dec 2025)	8 HPC Kernels	100% perf.	Slowdown $<$ 16%; 63% DRAM reduction
DRackSim (Puri et al., 2023)	16 Erleb. workloads	Local only	Block: up to 2× slow, Page: up to 5.8× slow
DisaggRec (Ke et al., 2022)	Personalization models	N/A	Up to 49.3% TCO savings
Farview (Korolija et al., 2021)	OLAP DB queries	Xeon local DRAM	2×–3× speedup in filtering, 12 GB/s remote
OCM (Gonzalez et al., 2020)	Mixed benchmarks	PCIe+DRAM	Up to 5.5× faster than 40 G NIC
EDM (Su et al., 2024)	Remote mem. accesses	N/A	~300 ns RTT, $<$ 1.3× latency under load

6. System Bottlenecks and Optimization Techniques

Disaggregated designs face several systemic challenges and remedies:

Bandwidth bottlenecks: Injection bandwidth ( $B_\text{inj}$ ) and bisection bandwidth ( $B_\text{bis}$ ) set fundamental bounds for memory-bound and streaming kernels. Systems that exceed the $L$ : $R \geq B_\text{local} / B_\text{inj}$ access ratio threshold incur significant slowdowns (Ding et al., 2023).
Tiered memory/queuing: CXL-based systems must manage limited CXL module parallelism ( $P_\text{CXL}$ ) vs. DDR ( $P_\text{DDR}$ ); MIKU dynamically throttles CXL traffic based on observed ToR service time, preserving DDR throughput within 5% of solo peak even as CXL load increases (Yang et al., 22 Mar 2025).
Latency hiding/prefetch: Dual-buffer design (DOLMA) and link-compression (DaeMon) allow hiding remote memory fetch time for predictable access patterns or compressible page flows (Zheng et al., 2 Dec 2025, Giannoula et al., 2023).
Dynamic page allocation: Caption’s OS-level dynamic tuning policy automatically steers optimal page placement in multi-tiered CXL+DDR systems, boosting throughput by up to 24% versus static NUMA policies (Sun et al., 2023).

7. Practical Guidelines, Trade-offs, and Future Directions

Effective deployment of compute–memory disaggregation hinges on nuanced hardware, software, and workload-aware strategies:

Granularity and locality: Object- or block-level strategies minimize wasted bandwidth and exploit spatial locality better than page-based schemes, especially for irregular HPC or data analytic patterns (Zheng et al., 2 Dec 2025).
Network engineering: Direct PHY-layer scheduling or photonic interconnects may be warranted at rack or cluster scale to minimize protocol overhead and tight-view latency (Su et al., 2024, Gonzalez et al., 2020).
Resource provisioning: For mixed-workload environments, maintaining at least $100$ GB/s per-node injection and $50$ GB/s rack bisection is necessary to cover the majority of scientific and analytical kernels without performance cliffs (Ding et al., 2023).
Consistency/atomicity: Strong consistency (sequential consistency) protocols are feasible with one-sided RDMA atomics and lightweight in-switch or embedded latch directories, but extremely high numbers of sharers or churn may require further innovation (Wang et al., 2024, Lee et al., 2021).
Adaptivity: Runtime profiling, miss-ratio curves, and performance-driven page/object placement adapt the memory hierarchy to workload demands, ensuring optimal use of both local and remote tiers (Zheng et al., 2 Dec 2025, Sun et al., 2023).
Co-location and near-memory compute: Offloading selected computation (aggregation, filtering, irregular traversals) onto near-memory or CXL-attached logic offers significant end-to-end performance improvements in bandwidth-constrained scenarios (Hermes et al., 2024, Korolija et al., 2021).

Open research remains on multi-host disaggregation at datacenter scale, fine-grained security and isolation, energy-aware memory placement, hardware-coherent fabrics, and dynamic, programmable switch-assisted memory management (Maruf et al., 2023, Lee et al., 2021, Gonzalez et al., 2020).

References:

(Zheng et al., 2 Dec 2025, Puri et al., 2023, Hermes et al., 2024, Wang et al., 2024, Yang et al., 22 Mar 2025, Su et al., 2024, Ding et al., 2023, Giannoula et al., 2023, Gonzalez et al., 2020, Wang et al., 2023, Ke et al., 2022, Wang et al., 2022, Abrahamse et al., 2022, Maruf et al., 2023, Lee et al., 2021, Korolija et al., 2021, Sun et al., 2023).