NCCL GIN: GDAKI Backend for GPU Networking
- NCCL GIN (GDAKI Backend) is a GPU-initiated networking extension that enables one-sided RDMA using DOCA GPUNetIO, eliminating CPU intervention in GPU-to-GPU communication.
- It features a three-layer architecture combining a host-side NCCL core, a device-side GIN API with non-blocking primitives, and a network plugin supporting both direct and proxy modes.
- Empirical benchmarks demonstrate sub-17 μs latency and multi-tens of GB/s bandwidth, making it ideal for modern AI workloads such as Mixture-of-Experts applications.
NCCL GIN (GDAKI Backend) refers to the GPU-Initiated Networking (GIN) extension in NCCL 2.28, which enables fully device-driven, one-sided RDMA using the GPUDirect Async Kernel-Initiated (GDAKI) backend. This architecture eliminates CPU intervention in GPU-to-GPU network communication, permitting CUDA kernels to initiate, progress, and synchronize fine-grained transfers over InfiniBand or RoCE directly from the device. Developed to address the requirements of modern AI workloads such as Mixture-of-Experts (MoE), GIN GDAKI delivers low-latency and high-bandwidth communication that is tightly integrated with NCCL’s established collective operation infrastructure (Hamidouche et al., 19 Nov 2025).
1. Three-Layer Architecture of NCCL GIN and GDAKI Placement
NCCL GIN is structured as a three-layer architecture:
- NCCL Core (host-side): Manages communicator creation (via `ncclDevCommCreate`), collective memory window registration (`ncclCommWindowRegister`), resource exchange, and GIN backend selection.
- Device-GIN API (GPU-side): Provides primitives (`put`, `putValue`, `signal`, local and remote completion counters/signals, `flush`, and `barrier`) callable directly from CUDA kernels.
- GIN Network Plugin (backend): Implements the active RDMA transport in one of two plugin modes:
  - GDAKI: Employs DOCA GPUNetIO to enable direct construction and posting of InfiniBand or RoCE work queue entries (WQEs), doorbell notification, and completion queue management, all from device code.
  - Proxy: Relays device requests as 64-byte descriptors via lock-free queues to a host thread, which posts verbs and updates completions.
GDAKI operates entirely within the network-plugin layer. If DOCA GPUNetIO is detected at communicator initialization, the GDAKI plugin (libnccl-net-gdaki.so) is loaded, ensuring all device API calls issue pure device-initiated communication with no CPU roundtrips.
2. Device-Side GDAKI API Semantics
GDAKI device API operations are strictly non-blocking, returning immediately to encourage computation/communication overlap. Completion is tracked via local counters or remote signals, all accessible from device memory. The key API elements include:
- Remote put: `put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, sig, ctr)`
- Zero-byte remote notification: `signal(team, peer, signalId)`
- Local completion management: counters and `flush`
- Remote completion management: signals and their polling/resetting
- Example (ring exchange in CUDA): a peer-to-peer put followed by waiting on a remote signal and resetting it.
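The ring-exchange example can be sketched as follows. This is a hedged reconstruction, not the paper's verbatim kernel: the `ncclGin` constructor and method signatures (`put`, `waitSignal`, `resetSignal`, and the `ncclGin_signal`/`ncclGin_noCounter` helpers) are assumptions based on the primitive names listed above.

```cuda
// Hedged sketch of the ring-exchange pattern: put to the next rank,
// then wait on (and reset) the signal raised by the previous rank.
// All ncclGin* signatures below are assumed, not verified NCCL 2.28 API.
__global__ void ringExchange(ncclDevComm devComm, ncclWindow_t win,
                             int rank, int nRanks, size_t bytes) {
  ncclGin gin(devComm, /*ctx=*/0);           // assumed device-side handle
  int next = (rank + 1) % nRanks;
  const int sigId = 0;
  // Non-blocking one-sided put into the next rank's window,
  // raising signal `sigId` there once the data has landed.
  gin.put(ncclTeamWorld(devComm), next,
          win, /*dstOff=*/0, win, /*srcOff=*/bytes, bytes,
          ncclGin_signal(sigId), ncclGin_noCounter());
  gin.waitSignal(sigId);                     // wait for the previous rank's put
  gin.resetSignal(sigId);                    // rearm for the next iteration
}
```

Because `put` returns immediately, any independent computation placed between the `put` and the `waitSignal` overlaps with the transfer.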
Semantically, a put generates a WQE in GPU-addressable BAR memory, followed by a doorbell write to the NIC’s register space. Completion is established either locally, through counter increments as completion queue entries (CQEs) are consumed, or remotely, when the corresponding signal value becomes visible to the recipient GPU.
3. GDAKI Plugin Internals and DOCA GPUNetIO
The GDAKI plugin, distributed as libnccl-net-gdaki.so, implements NCCL’s network API in "direct" device mode using DOCA GPUNetIO:
- Resource Allocation: Each context allocates specific NIC-side resources (queue pairs, completion queues) for the GPU via DOCA device verbs.
- WQE Construction: CUDA kernels format RDMA work requests into a BAR-mapped, GPU-accessible ring buffer.
- Doorbell Mechanism: Device code stores to the mapped NIC doorbell register to initiate DMA.
- Completion Processing: NIC hardware autonomously executes WQEs, conducts RDMA transfers, and writes CQEs into a completion queue mappable into GPU address space.
- Signal/Counter Updates: Local and remote completions are tracked by 64-bit counters (either on local or peer GPU BAR-mapped memory), which are polled or signaled by kernels.
In steady-state, this pipeline requires no active host thread involvement. Control flow, synchronization, and completion signaling are entirely managed by GPU code in conjunction with BAR-mapped device/NIC resources.
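The steady-state device-side post path can be sketched schematically. This is not DOCA GPUNetIO code: every type and helper below (`QpRing`, `WQE`, `CQE`, `writeRdmaWriteWqe`, `cqeReady`) is a hypothetical placeholder illustrating the WQE-ring, doorbell, and CQ-polling pattern the steps above describe.

```cuda
// Schematic device-side RDMA post path; all identifiers are illustrative.
struct QpRing {                  // BAR-mapped, GPU-visible queue-pair state
  WQE*               wqes;      // ring of work queue entries
  volatile uint32_t* doorbell;  // NIC doorbell register, mapped into BAR
  CQE*               cqes;      // completion queue, also GPU-visible
  uint32_t           head, cqHead, size;
};

__device__ void postPut(QpRing* qp, uint64_t raddr, uint64_t laddr,
                        uint32_t bytes, uint32_t rkey, uint32_t lkey) {
  uint32_t slot = qp->head++ % qp->size;
  writeRdmaWriteWqe(&qp->wqes[slot], raddr, rkey, laddr, lkey, bytes);
  __threadfence_system();            // make the WQE visible to the NIC
  *qp->doorbell = qp->head;          // ring the doorbell: NIC starts the DMA
}

__device__ void pollLocalCompletion(QpRing* qp) {
  // Spin until the NIC writes the next CQE, then consume it.
  while (!cqeReady(&qp->cqes[qp->cqHead % qp->size])) { /* spin */ }
  qp->cqHead++;
}
```

The system-scope fence before the doorbell store mirrors the ordering requirement in the real pipeline: the WQE contents must be globally visible before the NIC is told to fetch them.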
4. Performance Modeling and Empirical Results
RDMA latency and throughput under GDAKI are analytically and empirically characterized. Latency for an $m$-byte transfer is modeled as

$$T(m) = \alpha + \beta m,$$

where $\alpha$ denotes round-trip doorbell/NIC/network fixed costs and $\beta$ is the inverse-bandwidth slope.
- Microbenchmarks: For 4–128 B ping-pong between two H100s, GDAKI yields sub-17 μs round-trip latency.
- Bandwidth: Bulk transfers saturate the NIC’s port rate (e.g., 50 GB/s).
- DeepEP MoE Kernels: Effective throughput is measured as total bytes moved by `put` calls over total kernel execution time, $BW_{\text{eff}} = \left(\sum_i m_i\right) / T_{\text{kernel}}$.
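As a worked illustration of the linear latency model $T(m) = \alpha + \beta m$: the specific $\alpha$ and $\beta$ values below are illustrative assumptions chosen to be consistent with the sub-17 μs latency and ~50 GB/s port-rate figures, not measured constants.

```latex
% Illustrative values only: alpha ~ fixed round-trip cost,
% beta = inverse of an assumed 50 GB/s port rate.
\alpha \approx 16\,\mu\mathrm{s}, \qquad
\beta = \frac{1}{50\,\mathrm{GB/s}} = 2 \times 10^{-5}\,\mu\mathrm{s/B} \\[4pt]
T(1\,\mathrm{MiB}) \approx 16\,\mu\mathrm{s}
  + 1{,}048{,}576\,\mathrm{B} \times 2 \times 10^{-5}\,\mu\mathrm{s/B}
  \approx 37\,\mu\mathrm{s}
```

The model captures the usual small-message/large-message split: tiny transfers are dominated by the fixed cost $\alpha$, while bulk transfers are governed by $\beta$, i.e., the NIC's line rate.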
Benchmark Table
| Test Scenario | NCCL GIN (GDAKI) | NVSHMEM IBGDA |
|---|---|---|
| 2-node dispatch BW (16 GPUs, BF16) | 84.36 GB/s | 84.97 GB/s |
| 8-node combine BW (64 GPUs, BF16) | 53.1 GB/s | 54.0 GB/s |
| Single-node dispatch latency (8 GPUs, 14 KB) | 40.62 μs | 41.43 μs |
| Two-node dispatch latency (14 KB) | 142.5 μs | 157.0 μs |
All benchmarks were conducted on EOS DGX H100 clusters (ConnectX-7 InfiniBand, CUDA 12.x, NCCL 2.28, NVSHMEM 3.4.5).
5. Practical Integration Steps and Supported Workflows
For the GDAKI backend, system and integration requirements are:
- Build: CUDA 12.2+, ConnectX-6 Dx+ NICs, OFED/MOFED, and kernel modules (nv_peer_mem or dmabuf).
- Runtime: Set `NCCL_GIN_BACKEND=gdaki` to force direct device mode.
- Communication Path Setup:
  - Host-side: Create the device communicator with `ncclDevCommCreate` and register memory windows collectively.
  - Device-side: In CUDA kernels, instantiate `ncclGin` objects and invoke `put`, `signal`, and `waitSignal` methods.
- Cleanup: `ncclCommWindowUnregister`, `ncclDevCommDestroy`, and plugin teardown.
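End to end, the host and device steps can be sketched as follows. This is a hedged illustration, not verbatim NCCL 2.28 code: the exact argument lists of `ncclDevCommCreate`, `ncclCommWindowRegister`, `ncclDevCommDestroy`, and the device-side `ncclGin` methods are assumptions based on the names given in this section.

```cuda
// Hedged sketch of a GDAKI workflow; exact NCCL 2.28 signatures may differ.
#include <nccl.h>
#include <nccl_device.h>   // assumed header exposing the device-GIN API

__global__ void exchangeKernel(ncclDevComm devComm, ncclWindow_t win,
                               int peer, size_t bytes, int sigId) {
  ncclGin gin(devComm, /*ctx=*/0);            // assumed device-side handle
  // One-sided put into the peer's registered window, raising remote
  // signal `sigId` on completion; offsets identify buffers, not pointers.
  gin.put(ncclTeamWorld(devComm), peer,
          win, /*dstOff=*/0, win, /*srcOff=*/bytes, bytes,
          ncclGin_signal(sigId), ncclGin_noCounter());
  gin.waitSignal(sigId);                      // block until the peer's put lands
}

void runStep(ncclComm_t comm, void* buf, size_t size, int peer, cudaStream_t s) {
  ncclDevComm devComm;
  ncclDevCommCreate(comm, /*requirements=*/nullptr, &devComm); // assumed args
  ncclWindow_t win;
  ncclCommWindowRegister(comm, buf, size, &win, /*flags=*/0);  // collective call
  exchangeKernel<<<1, 1, 0, s>>>(devComm, win, peer, size / 2, /*sigId=*/0);
  cudaStreamSynchronize(s);
  ncclCommWindowUnregister(comm, win);
  ncclDevCommDestroy(comm, devComm);          // assumed teardown call
}
```

Note that window registration is collective: every rank must register, and (per the constraint in Section 6) with identical window sizes under NCCL 2.28.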
For MoE libraries such as DeepEP:
- Each expert-parallel communication channel maps to a dedicated GIN context (typically 4 per communicator).
- Kernels use `(window, offset)` pairs rather than raw pointers to identify buffers for RDMA.
- Circular buffer management is accomplished with `put` and `signalAdd`, coordinated through remote head/tail signals.
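The circular-buffer pattern can be sketched as follows, again with assumed signatures: `signalAdd` is taken from the name above, while `readSignal` and the head/tail bookkeeping are illustrative placeholders for whatever polling primitive the real API provides.

```cuda
// Hedged sketch: producer puts a chunk into the consumer's ring buffer,
// then advances the remote tail with signalAdd. Signatures are assumed.
__global__ void pushChunk(ncclDevComm devComm, ncclWindow_t ring,
                          int peer, size_t slotBytes, int nSlots,
                          int tailSig, int headSig, size_t srcOff) {
  ncclGin gin(devComm, /*ctx=*/0);
  uint64_t tail = gin.readSignal(tailSig);      // slots we have produced
  uint64_t head = gin.readSignal(headSig);      // slots the consumer freed
  if (tail - head >= (uint64_t)nSlots) return;  // ring full: caller retries
  size_t dstOff = (tail % nSlots) * slotBytes;  // target slot in peer's ring
  gin.put(ncclTeamWorld(devComm), peer,
          ring, dstOff, ring, srcOff, slotBytes,
          ncclGin_noSignal(), ncclGin_noCounter());
  gin.flush();                                  // data must precede the tail bump
  gin.signalAdd(ncclTeamWorld(devComm), peer, tailSig, /*delta=*/1);
}
```

The `flush` before `signalAdd` enforces the producer/consumer invariant: the consumer may only observe an advanced tail after the corresponding slot data is visible.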
6. Constraints, Compatibilities, and Roadmap
- Hardware/Software: GDAKI requires NICs that export GPU-accessible doorbells and BAR-mapped WQE/CQ buffers (ConnectX-6 Dx or newer), CUDA 12.2+, and optimal results when NIC and GPU share a PCIe root complex.
- Fallback: The Proxy backend supports a broader range of NICs and GPUs (Volta+), at the cost of modest additional latency.
- Current limitations:
- Identical window size enforcement across ranks (NCCL 2.28).
- No WQE batching; each `put` rings the NIC doorbell individually.
- Static, bounded number of signal/counter IDs per communicator.
- Planned enhancements:
- Batched doorbell notifications to reduce per-operation cost.
- New one-sided primitives (e.g., remote atomics, TMA offload).
- Context/QP multiplexing and adaptive NIC load balancing.
- Fused collectives using device-primitives within NCCL’s algorithmic structures.
7. Significance and Areas of Application
GIN’s GDAKI backend brings zero-CPU-overhead, direct device-to-network offload to NCCL, achieving sub-17 μs single-message latencies and multi-tens of GB/s bandwidth. It is immediately applicable to MoE communication libraries (e.g., DeepEP), as well as to compiler-generated CUDA kernels that require native, inline network communication. The hardware/software design facilitates new computational paradigms tightly coupling computation and communication without host synchronization bottlenecks, while maintaining interoperability with NCCL’s production infrastructure (Hamidouche et al., 19 Nov 2025).