
GDAKI Backend: Direct GPU-Initiated RDMA

Updated 7 March 2026
  • GDAKI Backend is a hardware-optimized network layer that enables direct GPU-initiated one-sided RDMA using DOCA GPUNetIO and NVIDIA ConnectX-6 Dx or newer NICs, eliminating CPU involvement after initialization.
  • It constructs work queue entries directly in GPU memory to support asynchronous, low-latency communication ideal for high-performance AI workloads.
  • Designed for systems with CUDA 12.2+ and compatible NICs, GDAKI delivers around 16.7 μs ping-pong latency and scalable throughput across GPU clusters.

The GPUDirect Async Kernel-Initiated (GDAKI) backend is the hardware-optimized network layer within the GPU-Initiated Networking (GIN) extension of NCCL 2.28, enabling GPU-resident CUDA kernels to issue one-sided RDMA operations directly to remote peers with no CPU involvement after initialization. By leveraging DOCA GPUNetIO and NVIDIA ConnectX-6 Dx or later NICs, GDAKI delivers minimal-latency, high-throughput communication within the NCCL (NVIDIA Collective Communications Library) runtime, targeting advanced AI workloads such as Mixture-of-Experts (MoE) at scale (Hamidouche et al., 19 Nov 2025).

1. Architectural Overview

GDAKI resides in the GIN plugin layer, implementing the device-to-NIC data path critical for device-initiated RDMA. The GIN architecture consists of three stacked layers:

  • NCCL Core (Host-side): Responsible for communicator creation (ncclDevCommCreate, ncclCommInitRank) and collective memory window registration (ncclCommWindowRegister). This layer disseminates RDMA metadata and abstracts communicator state.
  • Device GIN API (CUDA-callable): Exposed via the ncclGin C++ class, providing methods for remote memory operations (e.g., put, signal, flush) and synchronization primitives callable directly from CUDA device code.
  • GIN Network Plugin: At the lowest level, GDAKI links GPU memory directly to the NIC, managing work queue entries (WQEs) over memory-mapped PCIe BARs via the DOCA GPUNetIO library's device-side verbs. The alternative Proxy backend instead routes GPU RDMA requests through lock-free GPU→CPU queues and a CPU proxy thread, incurring higher latency.

During communicator initialization, NCCL probes for DOCA GPUNetIO support and conditionally loads the GDAKI backend if appropriate hardware and software (CUDA ≥12.2, GPUNetIO, Linux GPUDirect modules) are present.
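
For orientation, the host-side sequence that precedes any device-initiated traffic can be sketched as follows. This is a minimal illustration built from the entry points named above; the exact NCCL 2.28 signatures, window flags, and requirement arguments are assumptions here and should be checked against the Device API headers, and the variables nRanks, uniqueId, dataSize, and myRank are assumed to be set up by the caller (error handling elided):

// Minimal host-side setup sketch (assumed signatures; verify against the
// NCCL 2.28 Device API headers). Error handling elided for brevity.
ncclComm_t comm;
ncclCommInitRank(&comm, nRanks, uniqueId, myRank);   // standard NCCL communicator

void *sendBuf, *recvBuf;
cudaMalloc(&sendBuf, nRanks * dataSize);
cudaMalloc(&recvBuf, nRanks * dataSize);

// Collectively register each buffer as an RDMA-addressable memory window.
ncclWindow_t sendWin, recvWin;
ncclCommWindowRegister(comm, sendBuf, nRanks * dataSize, &sendWin, /*winFlags=*/0);
ncclCommWindowRegister(comm, recvBuf, nRanks * dataSize, &recvWin, /*winFlags=*/0);

// Create the device-side communicator handle passed into CUDA kernels; the
// GDAKI backend is selected during this initialization when DOCA GPUNetIO
// and a ConnectX-6 Dx or newer NIC are detected.
ncclDevComm_t devComm;
ncclDevCommCreate(comm, /*requirements=*/nullptr, &devComm);

// Launch a kernel that issues device-initiated RDMA, e.g. the ringExchange
// kernel shown in Section 3.
ringExchange<<<1, 128>>>(devComm, sendWin, recvWin, dataSize, myRank);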

2. GDAKI Dataflow and Mechanisms

With GDAKI, CUDA device code constructs WQEs in GPU memory, encoding source/destination addresses, transfer sizes, and remote keys. The kernel notifies the NIC using a direct store to a memory-mapped doorbell register in the device BAR. The NIC then autonomously polls for new WQEs, executes the programmed RDMA PUT/GET operation, and writes completions to GPU-visible queues. This architecture:

  • Removes all CPU involvement for posting or progressing network operations after setup.
  • Allows high message rates and fine-grained communication to be tightly interleaved with GPU-side computation.
  • Requires: ConnectX-6 Dx (or newer), CUDA 12.2+, DOCA SDK, and compatible kernel modules.

This direct mechanism underpins the minimal-latency, high-message-rate capability of GDAKI, supporting fine-grained and asynchronous GPU-GPU communication.
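
The posting path can be pictured with the following conceptual CUDA sketch. It is illustrative only, not the actual DOCA GPUNetIO interface: the WQE layout, queue handles, and doorbell pointer are hypothetical stand-ins for the structures GPUNetIO manages internally:

// Conceptual sketch of the GDAKI posting path (NOT the DOCA GPUNetIO API):
// a device thread fills a WQE in a GPU-memory send queue and rings a
// doorbell that is a memory-mapped NIC register in the device BAR.
struct Wqe {                      // illustrative layout only
  uint64_t laddr, raddr;          // local/remote virtual addresses
  uint32_t lkey, rkey;            // memory registration keys
  uint32_t length;                // transfer size in bytes
  uint32_t opcode;                // e.g. RDMA_WRITE
};

__device__ void postRdmaWrite(Wqe* sq, uint32_t* sqTail, uint32_t sqMask,
                              volatile uint32_t* doorbell, const Wqe& wqe) {
  uint32_t slot = atomicAdd(sqTail, 1) & sqMask;  // claim a send-queue slot
  sq[slot] = wqe;                                 // build the WQE in GPU memory
  __threadfence_system();                         // make the WQE visible to the NIC
  *doorbell = slot + 1;                           // MMIO store: NIC picks up the WQE
}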

3. API and Programming Model

The programming surface exposed by GDAKI through NCCL's Device API centers on the ncclGin C++ class, which embodies the full set of device-callable communication and synchronization primitives:

  • Data movement: put and putValue issue remote writes (PUT) of arbitrary size, addressing remote buffers through memory window handles (ncclWindow_t) shared collectively across the communicator.
  • Synchronization: signal, flush, and associated counter/signal reading and waiting (readCounter, waitCounter, readSignal, waitSignal) enable precise ordering and completion semantics.
  • Barriers: The ncclGinBarrierSession class provides team-wide collective synchronization.

PUT operations to the same peer/context are unordered; a matching SIGNAL operation supplies ordering, where the arrival of a signal value ≥N ensures visibility of all preceding PUTs. All calls return immediately for asynchronous progression, with remote completion signaled explicitly.

A representative kernel invoking a device-initiated ring exchange is as follows:

__global__ void ringExchange(ncclDevComm_t devComm, ncclWindow_t sendWin, ncclWindow_t recvWin, size_t dataSize, int myRank) {
  ncclGin gin(devComm, 0);                   // bind GIN context 0 of this device communicator
  int peer = (myRank + 1) % devComm.nRanks;  // right-hand neighbor in the ring
  // One-sided PUT into the peer's receive window; the trailing
  // ncclGin_SignalInc{0} increments signal 0 on the peer upon delivery.
  gin.put(ncclTeamWorld(devComm), peer, recvWin, myRank*dataSize, sendWin, peer*dataSize, dataSize, ncclGin_SignalInc{0});
  gin.waitSignal(ncclCoopCta(), 0, 1);       // wait until the left neighbor's PUT has landed
  gin.resetSignal(0);                        // re-arm signal 0 for the next iteration
}
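
Here each rank deposits its dataSize-byte slice into the slot indexed by its own rank in the right neighbor's receive window; because the attached ncclGin_SignalInc{0} increments signal 0 on the target only upon delivery, waitSignal returning guarantees the left neighbor's data is visible before the signal is reset for the next iteration.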

4. Performance and Scaling Characteristics

GDAKI achieves the lowest observed GPU-to-GPU latencies among supported NCCL GIN backends, with ping-pong round-trips measured at approximately 16.7 μs for 4–128 byte messages, and effective RDMA bandwidth approaching 54 GB/s at scale (Hamidouche et al., 19 Nov 2025). The two-parameter latency model,

L(S) = α + S/B

where α ≈ 8.3 μs and B ≈ 50 GB/s, models the dependence of latency on message size S.
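
As a quick sanity check of this model (a minimal sketch; the α and B values are the fitted figures quoted above):

#include <cstdio>

// Evaluate L(S) = alpha + S/B with the fitted parameters reported above:
// alpha ~ 8.3 us and B ~ 50 GB/s (= 50,000 bytes per microsecond).
int main() {
  const double alpha_us = 8.3;
  const double B_bytes_per_us = 50e3;
  for (double s : {128.0, 65536.0, 1048576.0}) {
    std::printf("S = %8.0f B  ->  L ~ %6.2f us\n", s, alpha_us + s / B_bytes_per_us);
  }
  // A small-message round trip is ~2*alpha ~ 16.6 us, consistent with the
  // measured ~16.7 us ping-pong latency.
  return 0;
}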

In high-throughput MoE workloads (e.g., DeepEP “dispatch” BF16):

  • On 2 nodes (16 GPUs): 84.36 GB/s RDMA bandwidth for NCCL GIN, comparable to NVSHMEM-IBRC.
  • At 8 nodes (64 GPUs), both sustain about 53 GB/s of RDMA bandwidth.

For the low-latency hybrid NVLink + RDMA path, dispatch times for 14 KB tokens range from 40.6 μs to 142.5 μs on one to two nodes, whereas pure RDMA yields 160.8 μs to 230 μs for dispatch/combine over the same node counts. In all cases, GDAKI matches NVSHMEM IBGDA/IBRC or outperforms it by 1–10%.

5. Comparison with the Proxy Backend

A direct comparison highlights the architectural tradeoffs:

| Characteristic | GDAKI (GPUDirect Async Kernel-Initiated) | Proxy Backend |
|---|---|---|
| Communication path | GPU ↔ NIC via DOCA GPUNetIO | GPU → CPU proxy thread → NIC |
| CPU involvement | None after initialization | One dedicated CPU thread |
| Progress model | NIC polls GPU-resident WQEs | CPU polls GPU-visible queues |
| Latency overhead | Minimal (~16.7 μs RT, small messages) | +1–2 μs penalty (~18.0 μs RT) |
| Hardware requirements | ConnectX-6 Dx or newer, CUDA 12.2+, DOCA | Any RDMA-capable NIC/GPU/CUDA stack |
| Portability | NVIDIA NICs with GPUNetIO only | Universal across vendors |
| Use case | Production HPC/AI clusters | Development, fallback, mixed fabrics |

GDAKI delivers 10–15% lower latency and eliminates all CPU runtime overhead, at the cost of requiring cutting-edge network hardware and software. Proxy mode provides functionally equivalent semantics on legacy or mixed fabrics at a modest latency increase.

6. DeepEP Integration and Communication Patterns

The integration of GDAKI into DeepEP, an MoE-optimized communication library, demonstrates the mapping from NVSHMEM/IBGDA pointer-based primitives onto GIN's window/offset addressing and signal-based synchronization:

  • put_nbi and atomic head/tail updates become put and signal operations (see the sketch after this list).
  • Buffer registration occurs per communicator, with window handles distributed as global GPU arrays indexed by kernel context.
  • DeepEP high-throughput kernels require multi-context QP mapping: given 4 GIN contexts per communicator, multiple communicators are created so that up to 24 QPs can be simultaneously driven, maximizing parallelism.
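
The flavor of this mapping can be sketched as below. This is a hypothetical illustration reusing the ncclGin calls from Section 3, not DeepEP's actual code; the function name, slot layout, and signal numbering are assumptions chosen for the example:

// Hypothetical DeepEP-style dispatch step: where the NVSHMEM/IBGDA version
// would issue a non-blocking put plus an atomic head/tail update, the GIN
// port issues a single put carrying a signal increment.
__device__ void dispatchToken(ncclGin& gin, ncclDevComm_t devComm,
                              int expertPeer, ncclWindow_t recvWin,
                              size_t slot, ncclWindow_t sendWin,
                              size_t srcOff, size_t tokenBytes,
                              uint32_t signalId) {
  // One-sided write into the remote expert's staging slot; the attached
  // SignalInc replaces the separate atomic tail bump, so the receiver
  // polls the signal instead of a tail pointer.
  gin.put(ncclTeamWorld(devComm), expertPeer,
          recvWin, slot * tokenBytes,
          sendWin, srcOff, tokenBytes,
          ncclGin_SignalInc{signalId});
}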

Operational recommendations include co-locating NICs and GPUs on the same PCIe root complex, pre-registering RDMA buffers, assigning communication contexts per expert/channel, using the ncclGin_SignalInc facility for efficient in-batch ordering, balancing QP utilization, and explicitly resetting signals and counters across iterations.

7. Implications and Deployment Considerations

GDAKI empowers fully device-driven networking for NCCL workloads, eliminating CPU bottlenecks and harmonizing with collective algorithm semantics and production infrastructure. Its performance, architectural integration, and direct device-verbs support are suited to high-end HPC and AI clusters equipped with the requisite NVIDIA hardware and software stack. The Proxy backend affords broader compatibility for development and legacy systems, trading off some latency for universality.

A plausible implication is that, as network and GPU architectures continue to evolve, adoption of direct GPU-initiated RDMA as exemplified by GDAKI will become increasingly prevalent for scalable distributed AI, particularly wherever fine-grained communication and low tail latency are critical (Hamidouche et al., 19 Nov 2025).

References (1)
