
Remote Direct Memory Access (RDMA)

Updated 20 December 2025
  • Remote Direct Memory Access (RDMA) is a high-performance networking technology that bypasses the CPU and OS to enable direct memory operations over high-speed networks.
  • It supports both one-sided and two-sided communication primitives, offering sub-microsecond latencies and multi-GB/s throughput through protocols like InfiniBand and RoCE.
  • Recent advances in RDMA include efficient synchronization mechanisms, offload programmability with SmartNICs, and optimized resource management for scalable datacenter applications.

Remote Direct Memory Access (RDMA) is a high-performance networking technology enabling direct memory operations—read, write, and atomic updates—between servers without CPU or OS intervention on the remote side. This protocol underpins modern datacenter, HPC, and AI/ML workloads, supporting sub-microsecond latency and multi-GB/s throughput by leveraging zero-copy DMA transport and host networking bypass. RDMA operates on registered memory regions and exposes both one-sided (hardware-driven, CPU-bypassing) and two-sided (CPU-involving, RPC-style) primitives, with associated protection domains, queue pairs, and completion queues. The following sections survey RDMA fundamentals, programming paradigms, synchronization challenges and solutions, data-structure design, offload programmability, and recent advances in control-plane orchestration.

1. RDMA Fundamentals and Transport Architecture

RDMA achieves network-accelerated memory access by pinning host buffers and directly DMA-ing data over high-speed fabrics (InfiniBand, RoCE v2, iWARP, custom UDP) via host channel adapters (HCAs) or SmartNICs (Heer et al., 27 Jul 2025, Lenkiewicz et al., 2017). Memory regions (MRs) are made accessible by registration, yielding local (lkey) and remote (rkey) keys; queue pairs (QPs) for send/receive operations are set up within protection domains that enforce isolation.
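As a concrete illustration, the following is a minimal sketch of this setup sequence using the standard libibverbs API. Error handling is abbreviated, and the QP state transitions (INIT → RTR → RTS) and the out-of-band exchange of the buffer address and rkey, which any real application needs, are elided; buffer sizes and queue depths are arbitrary choices for the example.

```c
/* Minimal sketch: register a memory region and create an RC queue pair with
 * libibverbs. Device selection, QP state transitions, and rkey exchange are
 * assumed to happen elsewhere. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);           /* protection domain */

    /* Pin and register a buffer; the NIC can now DMA it directly. */
    size_t len = 4096;
    void *buf = aligned_alloc(4096, len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC);

    /* lkey authorizes local SGE use; rkey is shipped to peers out of band. */
    printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .qp_type = IBV_QPT_RC,                       /* reliable connected */
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }
    /* ... transition QP to RTS and exchange {addr, rkey} with the peer ... */
    return 0;
}
```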

Key primitives:

  • RDMA_READ/WRITE: One-sided verbs for direct remote buffer access. Hardware executes transfers from origin to target memory without waking the target CPU.
  • Atomic Verbs: 64-bit compare-and-swap (CAS) and fetch-and-add (FAA), exposed as one-sided verbs for lock management, counters, and distributed synchronization (Chung et al., 2015, Baran et al., 27 Apr 2024, Nelson-Slivon et al., 2022).
  • SEND/RECV: Two-sided, CPU-involving verbs, akin to RPC or active messages.
  • Completion Queues: Work request (WR) completions are posted to completion queues (CQs), which applications poll to learn whether an operation succeeded or failed (see the sketch following this list).
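
Building on the setup sketch above, a one-sided write is issued by posting a work request and then polling the CQ. This is again a hedged sketch: the QP is assumed to be connected, and remote_addr/rkey are assumed to have been exchanged out of band.

```c
/* Minimal sketch: one-sided RDMA_WRITE followed by CQ polling. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *local, size_t len,
                      uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr = (uintptr_t)local, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id = 1, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,        /* no remote CPU involvement */
        .send_flags = IBV_SEND_SIGNALED,    /* request a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad)) return -1;

    /* Busy-poll the completion queue for the signaled work request. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) /* spin */;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```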

High-performance stacks (e.g., RoCE BALBOA on FPGA, SoftiWARP-UDP for energy efficiency) demonstrate line-rate throughput (100 Gb/s), hardware CRC offload, dynamic queue-pair (QP) management, and direct packetization (Heer et al., 27 Jul 2025, Lenkiewicz et al., 2017).

2. Programming Models: One-Sided, Two-Sided, and Active Extensions

RDMA is programmed at three main abstraction levels:

  • One-Sided RMA/RDMA: Direct Put/Get/Atomic to remote memory, used in MPI-3 one-sided communication, UPC, and coarrays; zero-buffer protocols scale to half a million cores (Gerstenberger et al., 2020). This model achieves low latency (~6.5 μs off-node put/get), overlap of computation and communication, and fine-grained memory placement (see the MPI-3 sketch after this list).
  • Two-Sided SEND/RECV/RPC: RPC-style operations where the remote process posts a receive request and executes handler code. This offers high expressiveness but incurs server CPU intervention and increased latency (~6–8 μs for request + handler + reply) (Brock et al., 2019).
  • Active Messaging/Active Access: Extends one-sided RMA with IOMMU-triggered handlers (Active Access) or fully programmable eBPF or custom offloads (NAAM, RedN), enabling arbitrary server-side logic upon memory access without host interrupts (Besta et al., 2019, Reda et al., 2021, Rahaman et al., 9 Sep 2025). NAAM supports portable eBPF handlers dynamically steerable between host, NIC, or client CPUs and scales to 128 tenant functions per NIC (Rahaman et al., 9 Sep 2025).
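
The one-sided model is easiest to see in MPI-3 RMA, which maps onto RDMA put/get on capable fabrics. In the sketch below, rank 0 writes directly into a window exposed by rank 1, whose CPU takes no part in the transfer; the fence-based synchronization shown is the simplest of MPI's epoch mechanisms.

```c
/* Minimal sketch of MPI-3 one-sided RMA. Compile with mpicc and run with at
 * least two ranks (e.g., mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank;             /* each rank exposes one int */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);        /* open an access epoch */
    if (rank == 0) {
        int value = 42;
        /* One-sided put: the target CPU takes no part in the transfer. */
        MPI_Put(&value, 1, MPI_INT, /*target=*/1, /*disp=*/0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);        /* close the epoch; transfer is complete */

    if (rank == 1) printf("rank 1 window now holds %d\n", local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```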

Table: Comparison of Key Programming Models

| Model | CPU Involvement | Expressiveness | Typical Latency |
|---|---|---|---|
| One-Sided (RDMA) | Remote: none | Primitive | 3–4 μs |
| Two-Sided (RPC) | Required | Arbitrary | 6–8 μs |
| Active Messages | Handler-only | Arbitrary | ~7 μs, scalable |
| NAAM (eBPF offload) | Selective | Arbitrary | 15–38 μs (tail) |

3. Synchronization and Mutual Exclusion Mechanisms

Synchronization over RDMA is challenging due to operation asymmetry and weak atomicity between local (CPU) and remote (RDMA) accesses. One-sided atomic operations (CAS, FAA) are atomic only with respect to other RDMA operations on the same cache line, not with CPU instructions, complicating the construction of correct distributed locks (Baran et al., 27 Apr 2024, Nelson-Slivon et al., 2022). The classic workaround, forcing both local and remote threads to use RDMA loopback, induces severe PCIe congestion and QP cache thrashing. A CAS-based acquire loop of the kind these designs build on is sketched below.
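The sketch assumes a connected RC QP, a registered 8-byte local buffer, and an 8-byte lock word on the server whose address and rkey (lock_addr, lock_rkey; hypothetical names) were obtained out of band. It is a naive spinlock, not any cited design.

```c
/* Minimal sketch: naive spinlock acquired with one-sided RDMA
 * compare-and-swap. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int remote_lock_acquire(struct ibv_qp *qp, struct ibv_cq *cq,
                               uint64_t *old_val, struct ibv_mr *mr,
                               uint64_t lock_addr, uint32_t lock_rkey) {
    for (;;) {
        struct ibv_sge sge = { .addr = (uintptr_t)old_val,
                               .length = sizeof(uint64_t), .lkey = mr->lkey };
        struct ibv_send_wr wr = {
            .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_ATOMIC_CMP_AND_SWP,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.atomic.remote_addr = lock_addr;   /* must be 8-byte aligned */
        wr.wr.atomic.rkey = lock_rkey;
        wr.wr.atomic.compare_add = 0;           /* expect: unlocked */
        wr.wr.atomic.swap = 1;                  /* set: locked */

        struct ibv_send_wr *bad = NULL;
        if (ibv_post_send(qp, &wr, &bad)) return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0) /* spin */;
        if (wc.status != IBV_WC_SUCCESS) return -1;

        /* NIC writes the prior lock value into old_val (on some devices in
         * big-endian; production code should check device caps). */
        if (*old_val == 0) return 0;            /* CAS succeeded: lock held */
        /* else retry: exactly the remote spinning that hierarchical designs
         * such as ALock bound to O(1) RDMA calls */
    }
}
```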

Recent advances include asymmetric hierarchical locks (ALock) that partition contenders into local (shared-memory) and remote (RDMA) cohorts, each using MCS queue locks, with a modified Peterson lock mediating cohort arbitration. This approach removes the need for loopback on local threads, limits remote spinning to O(1) RDMA calls per entry/exit, and achieves up to 29× higher throughput and 20× lower latency under mixed workloads (Baran et al., 27 Apr 2024). Verification frameworks formalize weak RDMA memory consistency and guarantee mutual exclusion and fairness (Ambal et al., 12 Oct 2025, Nelson-Slivon et al., 2022).

4. High-Performance Data Structures and Consistency Optimization

RDMA enables distributed data structures that leverage hardware primitives for efficient put/get, atomic updates, and reduced server CPU load. For hash tables, queues, and key-value stores, careful composition of RPUT, RGET, and atomics yields superior performance versus RPC approaches when operations can be encoded in at most two round trips (Brock et al., 2019, Chung et al., 2015, Liu et al., 2019).

Write-optimized systems (e.g., Erda) combine log-structured memory layouts, 8-byte atomic metadata updates, and CRC32 checksums for remote data atomicity and failure recovery. Experimental validation shows a 50% reduction in NVM writes and substantial throughput/latency improvements over redo-logging or read-after-write schemes (Liu et al., 2019).
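To make the idiom concrete, here is an illustrative sketch of the general pattern, not Erda's actual layout (the struct fields and names are hypothetical): write a checksummed record into unpublished log space, then make it visible with a single 8-byte atomic metadata update, so recovery sees either the whole record or none of it.

```c
/* Illustrative sketch: checksummed log append published by one 8-byte
 * atomic store. Layout and names are hypothetical. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>   /* crc32() */

struct log_region {
    _Atomic uint64_t tail;   /* 8-byte commit word: offset of valid log end */
    char data[1 << 20];
};

static int log_append(struct log_region *log, const void *payload, uint32_t len) {
    uint64_t off = atomic_load_explicit(&log->tail, memory_order_acquire);
    if (off + sizeof(uint32_t) * 2 + len > sizeof log->data) return -1;

    /* 1. Write checksum, length, and payload into the unpublished region. */
    uint32_t crc = (uint32_t)crc32(0L, (const Bytef *)payload, len);
    memcpy(log->data + off, &crc, sizeof crc);
    memcpy(log->data + off + sizeof crc, &len, sizeof len);
    memcpy(log->data + off + sizeof crc + sizeof len, payload, len);

    /* 2. Publish with one atomic 8-byte store: recovery ignores bytes past
     *    tail, and the CRC catches torn writes within the record itself. */
    atomic_store_explicit(&log->tail, off + sizeof crc + sizeof len + len,
                          memory_order_release);
    return 0;
}
```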

For lock management, pure client-centric one-sided designs (CAS/FAA directly on lock arrays) eliminate all server threads and achieve >2M locks/s throughput, versus <40K on TCP (Chung et al., 2015). The Remote Fetching Paradigm (RFP) exploits the observed asymmetry between inbound RDMA-read and outbound RDMA-write IOPS, shifting result fetching to clients and improving key-value store IOPS by 160–310% (Su et al., 2015).

Table: RDMA Primitives for Data Structures

| Primitive | Example Usage | Latency (μs) |
|---|---|---|
| RPUT | Queue/Insert/Write | ~3.0 |
| RGET | Find/Lookup/Read | ~3.7 |
| CAS/FAA | Lock/Atomic counter | ~3.8–3.9 |

5. Offload Programmability, Turing-Completeness, and SmartNIC Integration

Programmable offloads push application logic into network and NIC hardware without host CPU or OS involvement. RedN demonstrates that the existing verbs interface (Read/Write/CAS/Wait/Enable) on standard RNICs is Turing-complete: self-modifying work request chains allow conditional execution, loops, and memory traversal, supporting arbitrary data-dependent control flow without hardware modification (Reda et al., 2021). Integrated into Memcached, this yields up to 2.6× lower latency than baseline one-sided RDMA designs and 35× lower tail latency under CPU contention.

NAAM and RoCE BALBOA further expand offload capability by running eBPF handlers or ML/DPI pipelines on SmartNICs, delivering multi-million ops/sec for hash/B-tree lookups, line-rate encryption, and deep packet inspection with negligible impact on latency (Rahaman et al., 9 Sep 2025, Heer et al., 27 Jul 2025). BALBOA matches commercial NIC performance at 100Gb/s while supporting protocol-level compute extensions and flexible queue management.

6. Resource Management, Page Fault Handling, and Elastic/Serverless Control Planes

Efficient resource orchestration is critical for elasticity and serverless paradigms. Prior assumptions that user-space RDMA control planes are slow and that resources (QPs/PDs/MRs) cannot be shared have been overturned: caching internal libibverbs calls reduces connection setup from ~26.5 ms to ~2.18 ms, and fork-based resource sharing (via ibv_fork_init, sketched below) enables sub-millisecond handoffs, facilitating high-throughput transient workloads in containers without kernel modules or patched OSs (Zhang et al., 31 Jan 2025).
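A minimal sketch of the fork-safe call sequence follows; the actual handoff protocol and caching layer from the cited work are elided. ibv_fork_init() is a real libibverbs call that must run before any memory registration; everything else here is schematic.

```c
/* Minimal sketch: enabling fork-safe use of verbs resources. */
#include <infiniband/verbs.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Must run before any ibv_reg_mr(): marks pages registered afterwards
     * (MADV_DONTFORK) so a later fork does not corrupt in-flight DMA. */
    if (ibv_fork_init()) {
        fprintf(stderr, "ibv_fork_init failed\n");
        return 1;
    }
    /* ... open device, alloc PD, register MRs as usual ... */
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: inherits the parent's verbs file descriptors and objects;
         * the cited work builds its fast handoff on this path (see the
         * paper for the exact protocol). */
    }
    return 0;
}
```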

Page fault handling in virtual-address RDMA is addressed with integrated hardware-software mechanisms using ARM SMMU interrupt-driven fault detection, FIFO tracking, and selective retransmission. Compared to pinning or pre-faulting, this approach reduces programming complexity and memory overhead while maintaining near-zero-copy performance for steady-state transfers (Psistakis, 26 Nov 2025).

7. Advanced Active Access, Virtualization, and Verification Frameworks

Active Access schemes minimally extend IOMMU translation and logging infrastructure to bind handler functions to memory pages, allowing computation to be invoked on data arrival (Active Put/Get). This approach achieves up to 3× higher throughput for data-centric distributed workloads and enables protocols for logging, checkpointing, and large-scale virtualized address spaces (Besta et al., 2019). Extended page tables and hardware CAM buffers pave the way for disaggregated, software-transparent global address spaces, augmenting conventional PGAS and MMU virtualization (Besta et al., 2019).

Verification frameworks (e.g., Mowgli) model the weak RDMA memory model and compose modular correctness proofs for distributed object libraries (LOCO), supporting efficient barrier, broadcast, and key-value objects with formal safety and performance guarantees. These abstractions yield performance on par with or exceeding hand-tuned systems, with modest overhead (<20%) and predictable semantics (Ambal et al., 12 Oct 2025).


In summary, RDMA is a cornerstone technology for high-throughput, low-latency distributed memory access, enabling scalable parallel computation, data-structure acceleration, and rich offload programmability. Significant advances in memory consistency formalization, mutual exclusion mechanisms, hardware/software co-design, and resource management continue to extend RDMA’s applications in next-generation datacenter and HPC platforms.
