RDMA-Enabled Memory Disaggregation
- RDMA-enabled memory disaggregation is the technique of decoupling server memory from compute nodes, enabling direct, low-latency remote memory access.
- The approach leverages one-sided RDMA verbs and fine-grained security to achieve high bandwidth and near-local DRAM performance while mitigating interference.
- Diverse architectures, including hybrid DRAM+DCPM systems and FPGA-assisted designs, are optimized through advanced transport modes and workload-aware scheduling.
RDMA-enabled memory disaggregation refers to the architectural and systems approach of decoupling server memory from compute nodes and exporting it transparently, elastically, and with low latency over a network via Remote Direct Memory Access (RDMA). In this paradigm, a portion of the memory on one or more "memory nodes" can be directly and efficiently accessed by other "compute nodes" in the cluster, leveraging RDMA's zero-copy, one-sided verbs to achieve high bandwidth and low latency comparable to local DRAM under favorable conditions. This topic spans hardware system architecture, OS support, fine-grained management and security, performance isolation, and workload-aware optimization strategies, as evidenced by a diverse body of recent research.
1. Hardware and System Architectures
RDMA-enabled disaggregated memory systems materialize through a separation of traditional server memory resources and compute resources, resulting in distinct but interconnected physical nodes. The hardware architectures fall into several categories:
- Hybrid DRAM+DCPM (e.g., Intel Optane DCPM) Servers: Commodity servers equipped with both DDR4 DRAM and high-capacity Intel Optane DC Persistent Memory modules, sharing the same memory controller channel, and exposing the DCPM region via InfiniBand RDMA (Oe, 2020). Experiments were conducted using Cascade-Lake Xeon nodes with Mellanox HCAs, mapping DCPM as App-Direct regions via device-dax.
- Pure Memory Pool Nodes: Standalone hosts with large DRAM pools and minimal or no CPU, equipped with RNICs. These are accessed over a high-speed network (InfiniBand, RoCEv2, or Ethernet) from compute nodes (Zhang et al., 23 May 2025, Ding et al., 2023).
- FPGA-assisted Designs for Security and Fine Granularity: Per-node FPGA boards (e.g., Xilinx Alveo U50) with on-board HBM and a trusted hardware-software protocol, supporting secure, fine-grained (4 KB-page) remote allocation and validation isolated at the hardware level (Heo et al., 2021).
- Advanced Physical-Layer Fabrics: A custom fabric integrated into the Ethernet PHY (EDM) sidesteps the traditional MAC/IP stack by implementing remote-memory messaging and circuit-based in-network scheduling at the 66-bit block level, eliminating most per-hop latency (Su et al., 13 Nov 2024).
Typical deployments assume high-bandwidth (25–200 Gbps) links, direct point-to-point topologies for minimal hop count, and programmable NIC or FPGA agents for flexible protocol offload. Network-layer considerations include per-node injection bandwidth (b_inj), rack or system bisection bandwidth (b_bisect), and sophisticated congestion and QoS management.
2. RDMA Transport Modes, Protocols, and Fine-Grained Management
RDMA Transport and Primitives: All surveyed systems utilize one-sided RDMA verbs (READ, WRITE, sometimes FETCH-AND-ADD/CAS) for direct, CPU-bypassing access to memory regions exported from the memory nodes. Key protocol parameters affecting performance and interference include message size (2 KiB–64 KiB), access offset pattern (random or sequential), and concurrency degree (number of parallel QPs).
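As a concrete illustration of the one-sided access pattern described above, the following C sketch posts a single RDMA READ work request with libibverbs. The queue pair, registered local buffer, remote virtual address, and rkey are assumed to have been exchanged out of band beforehand; this is a minimal sketch of the verb, not a complete data path.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA READ that pulls `len` bytes from an exported
 * region on the memory node into a locally registered buffer, bypassing
 * the remote CPU. Connection setup (QP, MR, rkey exchange) is assumed
 * to have happened elsewhere. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, uint64_t remote_addr,
                          uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided verb        */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion  */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;        /* exported remote VA    */
    wr.wr.rdma.rkey        = rkey;               /* remote protection key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```

Message size, offset pattern, and the number of queue pairs issuing such requests are exactly the parameters listed above as the main levers on performance and interference.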
Registration and Memory Management: Traditional RDMA requires pinning and registration of the full participating memory region, constraining OS paging and elastic resource allocation. NP-RDMA eliminates this necessity by leveraging MMU-notifier callbacks and IOMMU indirection, allowing non-pinned, swappable remote memory with fallback to two-sided RDMA on page faults, and achieving near-native operation latency (+0.1–2 µs) except when faults occur (Shen et al., 2023).
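To make the registration constraint concrete, the sketch below contrasts conventional pinned registration with on-demand paging (ODP) registration in libibverbs. ODP is the NIC-resident analogue of non-pinned remote memory, not NP-RDMA's MMU-notifier mechanism itself, and its availability depends on RNIC and driver support.

```c
#include <stddef.h>
#include <infiniband/verbs.h>

/* Conventional registration: the entire region is pinned for the lifetime
 * of the MR, which is exactly what constrains OS paging and elasticity. */
static struct ibv_mr *reg_pinned(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}

/* On-demand paging (ODP) registration: pages are faulted in by the NIC as
 * they are touched, so the region remains swappable. Availability should
 * be checked via ibv_query_device_ex() odp_caps before relying on it. */
static struct ibv_mr *reg_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}
```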
Granularity and Security: Trusted disaggregation designs (TDMem (Heo et al., 2021)) manage memory at native page (4 KB) granularity, enforcing machine-level ownership in hardware permission tables resident in FPGA HBM. This enables secure, fine-grained export without trusting the OS on either side.
Obliviousness: Hardware mechanisms randomize page assignment and translation to hide memory access patterns; concrete mechanisms include random page selection on store, hardware-maintained translation tables, and coalesced invalidation operations that reduce bandwidth amplification (as in the TDMem scheme).
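Purely as an illustration of machine-level, page-granular ownership checks, the following C sketch shows what one entry of a hardware-resident permission table might look like; the field names, widths, and layout are assumptions for exposition, not the published TDMem format.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout of one 4 KB-page permission entry kept in FPGA HBM.
 * Field names and widths are illustrative only. */
struct perm_entry {
    uint16_t owner_node;  /* machine allowed to access this page         */
    uint8_t  perms;       /* bit 0: read, bit 1: write                   */
    uint8_t  valid;       /* entry currently maps an allocated page      */
    uint32_t phys_page;   /* randomized HBM page chosen at store time
                             (oblivious placement)                       */
};

/* Check the FPGA datapath would perform before servicing an RDMA access. */
static bool access_allowed(const struct perm_entry *e,
                           uint16_t requester, bool is_write)
{
    if (!e->valid || e->owner_node != requester)
        return false;
    return is_write ? (e->perms & 0x2) != 0 : (e->perms & 0x1) != 0;
}
```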
3. Interference, Performance Isolation, and Quantitative Models
The fundamental performance challenge arises from interference between local memory accesses and remote RDMA operations sharing the same physical memory channels:
- In hybrid DRAM+DCPM platforms, a remote RDMA write stream directed to DCPM drastically increases DCPM write latency (from 0.7 µs to 0.9 µs), which in turn throttles local DRAM throughput by over 50% once DCPM write latency exceeds 0.8 µs, even for purely local DRAM workloads (Oe, 2020).
- Interference is request-rate dependent and can be captured analytically with a linear contention model, T_local(r) = T_0 - α·r, where T_0 is the baseline local throughput, r is the RDMA request rate, and α is an empirically determined contention coefficient (a numeric illustration follows this list).
- A critical "knee" is observed, beyond which write-heavy RDMA traffic (≥4 QPs) collapses local throughput (e.g., the W5 sequential pattern on DCPM drops to <20% of baseline).
- In multi-tenant (shared-fabric) deployments, throttling RDMA based on monitored memory-controller channel occupancy, or restricting RDMA traffic to a subset of HCAs, greatly mitigates this effect. Throttling RDMA to a single HCA recovers local throughput from 18% to >85% of baseline in write-only scenarios, while read-only RDMA at ≤4 QPs runs almost interference-free.
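The short program below illustrates the linear contention model above with hypothetical numbers; T0, alpha, and the 50% throttling threshold are assumptions for illustration, not measured values from the cited study.

```c
#include <stdio.h>

/* Linear contention model from the text: T_local(r) = T0 - alpha * r.
 * T0    : baseline local throughput (GB/s)          -- assumed value
 * r     : remote RDMA request rate (Mops/s)
 * alpha : empirically fitted contention coefficient -- assumed value */
static double local_throughput(double t0, double alpha, double r)
{
    double t = t0 - alpha * r;
    return t > 0.0 ? t : 0.0;
}

int main(void)
{
    const double t0 = 80.0, alpha = 6.0;   /* hypothetical figures */
    for (double r = 0.0; r <= 10.0; r += 2.0) {
        double t = local_throughput(t0, alpha, r);
        /* A scheduler might throttle RDMA (e.g., bind it to one HCA)
         * once predicted local throughput falls below 50% of baseline. */
        printf("r = %4.1f Mops/s -> T_local = %5.1f GB/s%s\n",
               r, t, t < 0.5 * t0 ? "  [throttle]" : "");
    }
    return 0;
}
```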
Performance metrics in system evaluation encompass: per-thread/aggregate throughput (MB/s–GB/s), read/write latencies (ns–µs), tail-latency distributions, and bandwidth scaling with increasing request concurrency and network load. In fine-grained secure environments, an end-to-end far-memory page miss involves 10–20 µs total (host SW, PCIe, net, HW checks), with bandwidth in TDMem reaching 1358 MB/s (plain, HBM), 91.7% of the “fastswap” RDMA baseline.
4. Scalability, Locking Protocols, and Workload Placement
As the number of clients or distributed locks rises (e.g., for database indexes, shared-memory services), naive RDMA primitives become the limiting factor:
- Traditional RDMA spinlocks: Each client performs repeated one-sided CAS retries, and the average number of retries grows with the number of contending clients, so the aggregate CAS load on the MN NIC saturates its IOPS budget (e.g., a 16M-IOPS NIC) at moderate client counts (Zhang et al., 23 May 2025); a minimal sketch of this acquire loop appears after this list.
- MCS/ShiftLock: Chained or hierarchical lock handoff reduces the mean number of operations per acquire but still does not prevent MN-NIC overload under contention.
- DecLock: Decouples lock queuing (single FAA enqueue + publish) from notification, using at most two one-sided MN ops per acquire/release, and pushes the transfer of ownership to clients using two-sided (SEND/RECV) notifications, yielding strict FIFO fairness. Application benchmarks show up to 43.4× throughput improvement over spinlocks, with 99th percentile latency reduced by 98.2% (Zhang et al., 23 May 2025).
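For concreteness, here is a minimal sketch of the naive one-sided spinlock acquire loop that the lock designs above improve upon: each failed compare-and-swap is a full round trip to the memory-node NIC, so retries multiply under contention. Queue-pair and memory-region setup are assumed to exist already, and the completion helper is reduced to a busy-poll.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Busy-poll the CQ until one completion arrives. */
static int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                        /* spin */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}

/* Naive RDMA spinlock acquire: repeatedly post a one-sided atomic CAS
 * (0 -> 1) on the 8-byte lock word exported by the memory node. The old
 * value is written into result_buf; every retry costs another NIC round
 * trip, which is what saturates the MN NIC under contention. */
static void spinlock_acquire(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *result_mr, uint64_t *result_buf,
                             uint64_t lock_addr, uint32_t rkey)
{
    for (;;) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)result_buf,
            .length = sizeof(uint64_t),
            .lkey   = result_mr->lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
        wr.send_flags            = IBV_SEND_SIGNALED;
        wr.sg_list               = &sge;
        wr.num_sge               = 1;
        wr.wr.atomic.remote_addr = lock_addr;    /* lock word on the MN */
        wr.wr.atomic.rkey        = rkey;
        wr.wr.atomic.compare_add = 0;            /* expect unlocked (0) */
        wr.wr.atomic.swap        = 1;            /* take the lock (1)   */

        if (ibv_post_send(qp, &wr, &bad) || wait_for_completion(cq))
            continue;                            /* retry on error      */
        if (*result_buf == 0)                    /* old value was 0     */
            return;                              /* lock acquired       */
        /* otherwise someone holds it: spin, one more round trip per try */
    }
}
```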
Workload placement and scheduling guidelines strongly favor partitioning memory channels (or at least isolating them at the firmware/OS level), scheduling write-heavy RDMA tasks during system off-peak, and limiting write ratio for mixed workloads co-located with latency-critical tasks.
5. Advanced Fabrics, Next-Generation Protocols, and Hardware Trends
Efforts to further minimize remote memory access latency go beyond standard RDMA-on-Ethernet/InfiniBand:
- PHY-level Disaggregated Fabrics (EDM): EDM (Su et al., 13 Nov 2024) implements the entire remote-memory protocol in the Ethernet PHY, bypassing MAC/IP/RDMA to achieve per-read latencies of 299.4 ns versus 2.03 µs for RoCEv2, and uses in-switch parallel iterative matching (the PIM algorithm) for circuit scheduling, keeping latency within 1.2× of the unloaded value at loads up to 95% of link saturation.
- CXL-over-Ethernet Hybrids: FPGA-based designs that combine native CXL load/store semantics locally with an RDMA-inspired Ethernet transport (with on-FPGA cache and ARQ/congestion FSM), yielding average access latencies as low as 415 ns on cache hits and 1.97 µs on misses—about 37% lower than standard RDMA (Wang et al., 2023).
- Programmable In-memory Compute Fabrics (NetDAM): Exporting memory directly from an FPGA or ASIC, bypassing the host, and supporting a SIMD-style in-memory ISA for compute-heavy data movement or reductions in a single pass (e.g., MPI_Allreduce), attaining sub-microsecond access latency and completing a 2 GiB Allreduce in roughly 0.4 s (versus 2.1–2.8 s for CPU+RoCE) (Fang et al., 2021).
6. Methodologies, Application Domains, and Design Space Exploration
General deployment methodology in HPC and data-intensive domains (Ding et al., 2023) considers:
- Injection bandwidth (b_inj) per node: must be provisioned according to worst-case local:remote (L:R) access ratios. For HBM3-local bandwidth of 1 TB/s and a 100 GB/s PCIe-6 NIC, intra-rack remote memory can serve workloads with L:R > 10 without incurring bandwidth penalties (see the sketch after this list).
- Bisection bandwidth (b_bisect): defines the scaling penalty for global (cross-rack) memory usage; only applications with very large remote fractions (e.g., SuperLU, GEMM) suffer system-wide taper penalties.
- Workload balance, page size, and software integration (libfabric/Verbs API, user-level paging libraries) inform configuration for both explicit remote memory access and transparent NUMA-domain expansion.
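A back-of-the-envelope check of the injection-bandwidth rule above, using the HBM3 and PCIe-6 figures quoted; the helper function and the loop over L:R ratios are illustrative, not part of the cited methodology.

```c
#include <stdio.h>

/* Provisioning check: remote traffic is roughly local_bw / LR for a
 * local:remote access ratio LR; the NIC suffices when that remote demand
 * fits under the per-node injection bandwidth b_inj. */
static int nic_sufficient(double local_bw_gbs, double lr, double b_inj_gbs)
{
    return (local_bw_gbs / lr) <= b_inj_gbs;
}

int main(void)
{
    const double local_bw = 1000.0;   /* HBM3-local, ~1 TB/s */
    const double b_inj    = 100.0;    /* 100 GB/s PCIe-6 NIC */

    for (double lr = 2.0; lr <= 16.0; lr *= 2.0)
        printf("L:R = %4.1f -> remote demand %6.1f GB/s -> %s\n",
               lr, local_bw / lr,
               nic_sufficient(local_bw, lr, b_inj) ? "fits b_inj"
                                                   : "exceeds b_inj");
    return 0;
}
```

Consistent with the text, remote demand drops to the 100 GB/s NIC budget only once L:R reaches about 10.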
Application domains include OLTP/OLAP databases with DSM-DB layers (Wang et al., 2022), where RDMA primitives map directly to distributed heaps, lock metadata, and function offloads. Challenges here include address translation, durability, software cache coherence, and distributed concurrency control.
7. Limitations, Future Work, and Open Challenges
Key open issues identified across studies involve:
- Interference-aware scheduling and hardware QoS: Need for memory-controller-level admission and isolation mechanisms for RDMA flows.
- Multi-tenant security and trust: Hardware-hardened access tables (TDMem), secure enclaves, and oblivious allocation to resist side-channel exposure.
- Scalable page-fault and swap handling: NP-RDMA-style MMU/IOMMU notification models versus on-NIC ODP (hardware paging), with extensions needed for multi-tenant scale-out and reduced per-page metadata cost.
- Analytical and empirical modeling for resource procurement: Systematic profiling of L:R, workload memory footprints, and cost/performance modeling are indispensable for cluster architecture (Ding et al., 2023).
- Integration of CXL, RDMA, and emerging fabrics: Potential convergence as CXL devices and ultra-low-latency Ethernet fabrics evolve towards unified, coherent, and secure remote memory pools.
Memory disaggregation with RDMA, especially when enhanced with hardware-layer scheduling, security, and architectural isolation, offers a rigorous path towards high-performance, multiplexed memory pools in modern datacenters. However, interference, security, and tail-latency effects remain persistent research frontiers, necessitating continued system-level innovation and cross-layer design.