
RDMA Heterogeneous P2P Networking

Updated 26 April 2026
  • RDMA-based Heterogeneous P2P is a framework enabling direct zero-copy data transfers between heterogeneous devices like GPUs and CPUs using specialized hardware and protocols.
  • It integrates GPU memory into RDMA operations to bypass CPU staging, resulting in lower latency and improved throughput in petascale HPC applications.
  • Systems like APEnet+ employ FPGA-based NICs and a 3D torus topology to achieve scalable, efficient, and fault-tolerant communication across cluster nodes.

Remote Direct Memory Access (RDMA)-based Heterogeneous Peer-to-Peer (P2P) networking encompasses direct, zero-copy, low-latency data transfers between heterogeneous compute devices (notably GPUs and CPUs) in cluster environments, orchestrated through RDMA semantics and adapted hardware/software layers. This approach eliminates CPU staging and main memory bottlenecks, facilitating direct memory accesses across devices—often spanning multiple nodes—using cluster network adapters equipped with specialized FPGA logic. The principal research focus has been on extending RDMA from traditional host DRAM to GPU memory, with fundamental advances realized in the APEnet+ project, leveraging a 3D toroidal network architecture, bespoke FPGA NICs, and a CUDA/OpenMPI-aware software stack for petascale HPC workloads such as Lattice QCD and graph analytics (Ammendola et al., 2010, Ammendola et al., 2013).

1. System and Network Architecture

RDMA-based heterogeneous P2P is exemplified by the APEnet+ platform, where each compute node—potentially containing both multicore CPUs and GPUs—interfaces with custom APElink+ FPGA-based NICs. These NICs feature six full-duplex, bidirectional channels that implement a 3D-torus topology, with the total nodes given by

N = X \times Y \times Z

where $X$, $Y$, and $Z$ are the node counts along the three Cartesian axes (Ammendola et al., 2010). Each link consists of four bonded 8.5 Gb/s lanes, yielding a raw per-link bandwidth of $4 \times 8.5 = 34$ Gb/s (approximately 4 GB/s per direction); after accounting for protocol overhead, sustained payload bandwidth per direction is approximately 3.8 GB/s.

Routing employs dimension-ordered wormhole switching with two virtual channels per link to guarantee deadlock freedom. Bisection bandwidth grows as $N^{2/3}$, while cost scales linearly in $N$, requiring only one adapter and three QSFP+ cables per node. The architecture supports direct NIC–NIC communication for RDMA operations targeting both host and device memory.
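The dimension-ordered discipline can be illustrated with a short routing sketch. The following next-hop rule for a 3D torus is purely illustrative (it is not APEnet+ firmware; the coordinate encoding and link numbering are assumptions):

    /* Illustrative dimension-ordered (X, then Y, then Z) next-hop selection
     * on a 3D torus, taking the shorter wraparound direction on each ring.
     * The two virtual channels mentioned above resolve the wraparound
     * dependency cycle; they are not modeled here. */
    typedef struct { int x, y, z; } coord3;

    /* Signed shortest displacement from a to b on a ring of size n. */
    static int ring_delta(int a, int b, int n)
    {
        int d = (b - a + n) % n;
        return (d <= n / 2) ? d : d - n;
    }

    /* Outgoing link encoded as 0..5 (+X,-X,+Y,-Y,+Z,-Z); -1 means arrived. */
    int next_hop(coord3 here, coord3 dest, coord3 dims)
    {
        int dx = ring_delta(here.x, dest.x, dims.x);
        if (dx) return dx > 0 ? 0 : 1;      /* resolve X first */
        int dy = ring_delta(here.y, dest.y, dims.y);
        if (dy) return dy > 0 ? 2 : 3;      /* then Y */
        int dz = ring_delta(here.z, dest.z, dims.z);
        if (dz) return dz > 0 ? 4 : 5;      /* then Z */
        return -1;                          /* delivered */
    }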

2. RDMA Semantics and GPU Memory Integration

Extending RDMA to heterogeneous P2P communication requires specialized hardware and protocol support on both the NIC and device driver levels. In the APEnet+ system, two principal hardware pipelines exist:

  • The injection pipeline handles scatter–gather descriptors, fragments large transfers, packetizes payloads, computes CRC-32, and queues them for network transmission.
  • The ejection pipeline receives RDMA PUT/GET responses and performs DMA writes into pre-registered host (or, with P2P extensions, GPU) buffers (Ammendola et al., 2010, Ammendola et al., 2013).

To enable GPU memory as the RDMA target/source, the NIC maintains a "GPU_V2P" (virtual-to-physical) translation unit for each GPU, using a four-level page table indexed by 64 KB "P2P pages." Upon memory registration—triggered by extended RDMA driver verbs (e.g., ibv_reg_mr_gpu)—per-page physical addresses and NVIDIA P2P tokens are extracted (via CUDA APIs such as cuPointerGetAttribute) and pushed to the NIC via the onboard NIOS II microcontroller, which also manages rkey/lkey namespaces.
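The registration path can be sketched as follows. The P2P-token query is the documented CUDA driver-API mechanism; the ibv_reg_mr_gpu() prototype shown here is a hypothetical rendering of the extended verb named above, since the source does not give its exact signature:

    #include <cuda.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical extended-verb prototype (assumed, for illustration only). */
    struct gpu_mr { uint32_t lkey, rkey; };
    extern struct gpu_mr *ibv_reg_mr_gpu(void *pd, CUdeviceptr addr, size_t len,
                                         unsigned long long p2p_token,
                                         unsigned int va_space_token);

    struct gpu_mr *register_gpu_buffer(void *pd, CUdeviceptr d_buf, size_t len)
    {
        /* Query the NVIDIA P2P tokens the NIC needs to address this
         * allocation over PCIe (CUDA driver API, GPUDirect RDMA). */
        CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tok;
        if (cuPointerGetAttribute(&tok, CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                                  d_buf) != CUDA_SUCCESS) {
            fprintf(stderr, "pointer is not a P2P-capable device address\n");
            return NULL;
        }
        /* Hand the tokens and the 64-bit GPU virtual address to the NIC
         * driver, which fills the GPU_V2P page table in 64 KB P2P pages
         * and returns the lkey/rkey pair. */
        return ibv_reg_mr_gpu(pd, d_buf, len, tok.p2pToken, tok.vaSpaceToken);
    }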

Incoming or outgoing RDMA operations involving device memory use extended descriptors and opcodes (IBV_SEND_GPU_WRITE, IBV_SEND_GPU_READ), and retain the 64-bit GPU virtual address throughout. The NIC validates rkeys, translates addresses via hardware in four cycles, and directly issues PCIe reads/writes to the specified GPU memory.

3. Programming Models and Software Stack

The APEnet+ software stack comprises four layers:

  • NIOS II firmware on the FPGA NIC for buffer key lookup, page-table management, and evolving collective offload functionality.
  • Linux kernel module exposing memory registration IOCTLs, event queues, and work queues for posting RDMA commands.
  • User-level C library offering standard and GPU-aware RDMA verbs:
    
    int rdma_put(int peer, uint32_t lkey, uint64_t laddr,
                 uint32_t rkey, uint64_t raddr, size_t size);
    int cuda_rdma_put(cudaBuffer_t *d_buf, uint32_t lkey_remote,
                      uint64_t raddr_remote, size_t nbytes,
                      cudaStream_t stream);
    with register_buffer()/unregister_buffer() for both host and device memory (Ammendola et al., 2010).
  • OpenMPI and MVAPICH2 integrations ("CUDA-aware" MPI) that detect GPU pointers and transparently invoke GPU-enabled RDMA verbs, supporting nonblocking, zero-copy, peer-to-peer MPI_Isend/MPI_Irecv operations (Ammendola et al., 2013).
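From the application side, the CUDA-aware MPI path looks like ordinary nonblocking MPI with device pointers. The following is a minimal sketch assuming a CUDA-aware MPI build; the buffer size and ring-exchange pattern are illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const size_t n = 1 << 20;                 /* 1 Mi floats per buffer */
        float *d_send, *d_recv;
        cudaMalloc((void **)&d_send, n * sizeof(float));
        cudaMalloc((void **)&d_recv, n * sizeof(float));

        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        MPI_Request reqs[2];

        /* Device pointers are passed directly; a CUDA-aware MPI detects them
         * and drives the GPU-enabled RDMA verbs, avoiding host staging. */
        MPI_Irecv(d_recv, (int)n, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(d_send, (int)n, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }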

Events can be delivered from the NIC to either host threads or device-resident CUDA kernels (by mapping event queues into GPU address space), minimizing kernel launch and host-interrupt latency.
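Delivering completions into a running kernel amounts to letting device code poll a completion word that the NIC updates by DMA. A minimal device-side sketch, assuming the relevant event-queue word has already been mapped into the GPU address space (the mapping step itself is NIC-specific and not shown):

    #include <cuda_runtime.h>

    /* One thread per block polls the NIC-updated sequence counter; the rest
     * of the block waits at the barrier, then consumes the received data.
     * cq_flag must point at memory the NIC can write and the GPU can read. */
    __global__ void consume_after_rdma(volatile unsigned int *cq_flag,
                                       unsigned int seq, float *buf, int n)
    {
        if (threadIdx.x == 0) {
            while (*cq_flag < seq)
                ;                       /* volatile read: re-poll the flag */
            __threadfence_system();     /* order payload reads after the flag */
        }
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            buf[i] *= 2.0f;             /* placeholder work on received data */
    }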

4. Performance Characterization

Microbenchmark results demonstrate that P2P RDMA halves latency for small transfers relative to host-staged approaches. Representative data for APEnet+ (Ammendola et al., 2013):

| Method | Latency (µs) | Speedup vs. Host-Staged | Max BW (GB/s) |
|---|---|---|---|
| P2P PUT (APEnet+) | 8.2 | ≈2.0× | 1.1 |
| Staged (cudaMemcpy) | 16.8 | 1.0× (baseline) | 0.6–0.7 |
| InfiniBand (MV2) | 17.4 | ≈1.0× | 1.1 |

For larger messages (>32 KB), host-staged methods may recover some bandwidth because P2P PCIe reads are capped (1.5 GB/s on NVIDIA Fermi; 1.6 GB/s via Kepler's BAR1, albeit with BAR1 aperture size restrictions).

Projected large-scale performance draws from the effective-bandwidth model

B_{\rm eff}(s, N_h) = \frac{s}{L_0 + N_h\,\Delta_{\rm hop} + s/B_{\rm link}}

where $s$ is the message size, $N_h$ the number of torus hops, $L_0$ the startup latency, $\Delta_{\rm hop}$ the per-hop delay, and $B_{\rm link}$ the sustained link bandwidth.
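As a concrete illustration, the model can be evaluated numerically; in the sketch below only the 8.2 µs latency and 34 Gb/s link rate are taken from the text, while the per-hop delay and message size are assumed values:

    #include <stdio.h>

    /* B_eff(s, N_h) = s / (L_0 + N_h * d_hop + s / B_link) */
    static double b_eff(double s_bytes, int n_hops,
                        double l0_us, double d_hop_us, double b_link_gbps)
    {
        double b_link = b_link_gbps * 1e9 / 8.0;        /* bytes/s */
        double t = (l0_us + n_hops * d_hop_us) * 1e-6   /* latency term (s) */
                 + s_bytes / b_link;                    /* serialization (s) */
        return s_bytes / t;                             /* bytes/s */
    }

    int main(void)
    {
        /* 8.2 us startup latency and 34 Gb/s raw link rate come from the
         * text; the 0.1 us/hop delay and 128 KiB message are assumptions. */
        for (int hops = 0; hops <= 6; hops += 2)
            printf("128 KiB over %d hops: %.2f GB/s\n", hops,
                   b_eff(128.0 * 1024.0, hops, 8.2, 0.1, 34.0) / 1e9);
        return 0;
    }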

For application-level benchmarks, full P2P yields modest performance improvements (10–28%) on multi-GPU stencil (Heisenberg Spin Glass) codes, with greater advantages for communication-dominated workloads (Ammendola et al., 2013). Breadth-first search (Graph500) tests reveal up to 60% throughput improvement over InfiniBand at small node counts; network topology and routing become limiting factors at larger scales.

5. Scalability and Fault Tolerance

The torus topology with direct NIC–NIC connections enables high bisectional bandwidth, linear cost scaling, and direct adaptation to regular communication patterns such as those in LQCD stencil algorithms. Fault-tolerant features include:

  • Dimension-ordered routing for automatic bypass of failed links.
  • Credit-based flow control, preventing head-of-line blocking.
  • Planned on-NIC Ethernet multicast/reduction engines for scalable collective operations (Ammendola et al., 2010).

Limitations emerge for irregular communication patterns (e.g., graph-based workloads), motivating preprocessing via space-filling curve mapping to improve locality alignment on the torus fabric.
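One common space-filling-curve choice is a Morton (Z-order) key, sketched below; the cited work does not specify which curve is used, so this is purely an illustration of the mapping idea (Hilbert curves are a frequent alternative):

    #include <stdint.h>

    /* Spread the low 21 bits of v so each bit is followed by two zero bits
     * (standard 64-bit Morton bit-interleaving constants). */
    static uint64_t spread3(uint64_t v)
    {
        v &= 0x1fffff;
        v = (v | (v << 32)) & 0x001f00000000ffffULL;
        v = (v | (v << 16)) & 0x001f0000ff0000ffULL;
        v = (v | (v <<  8)) & 0x100f00f00f00f00fULL;
        v = (v | (v <<  4)) & 0x10c30c30c30c30c3ULL;
        v = (v | (v <<  2)) & 0x1249249249249249ULL;
        return v;
    }

    /* Interleave (x, y, z): nearby torus coordinates get nearby keys, so
     * sorting graph vertices or domain blocks by this key improves locality. */
    uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
    {
        return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
    }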

6. Practical Constraints and Current Limitations

Peer-to-peer RDMA functionality is constrained by both hardware and system architecture:

  • Peer-to-peer operation requires colocation of GPUs and NIC under the same PCIe root complex/switch; dual-socket x86 systems may experience inter-socket latency or outright incompatibility.
  • Fermi-class GPU P2P-read protocols are latency intensive and capped at 1.5 GB/s.
  • Kepler BAR1 addresses improve bandwidth but restrict available addressable space.
  • Microcontroller bottlenecks (e.g., NIOS II on APEnet+) currently limit RX throughput to ~1.2 GB/s, with ongoing hardware redesign toward fully hardwired page-table logic.
  • Registration overhead ($O_{\mathrm{reg}} \sim 500\,\mu\mathrm{s}$) and CUDA context-switching may dominate costs for short or frequent transfers. Persistent registration caching and advances in GPU DMA engines are under active development (Ammendola et al., 2013).

7. Future Prospects and Research Trajectories

Key areas of ongoing and future investigation include:

  • Full exposure of GPU memory as an RDMA target for both GET and PUT, with the goal of complete host bypass for intra- and inter-node GPU–GPU data movement.
  • Hardware delivery of RDMA completion events into CUDA kernels for sub-microsecond kernel–kernel handshakes.
  • NIC-offloaded collective operations tailored to stencil codes and global reductions, amortizing toroidal network latency and congestion effects.
  • Expansion of BAR1 and on-GPU buffer registration mechanisms, and co-design of next-generation network adapters for deep integration with future exascale server GPUs and their memory models (Ammendola et al., 2010, Ammendola et al., 2013).

These developments collectively define the trajectory toward “zero-copy” and “zero-overhead” RDMA-based heterogeneous P2P at scale, supporting both tightly coupled scientific simulation and data-driven workloads.
