NCCL GIN: GDAKI Backend for GPU Networking
- NCCL GIN (GDAKI Backend) is a GPU-initiated networking extension that enables one-sided RDMA using DOCA GPUNetIO, eliminating CPU intervention in GPU-to-GPU communication.
- It features a three-layer architecture combining a host-side NCCL core, a device-side GIN API with non-blocking primitives, and a network plugin supporting both direct and proxy modes.
- Empirical benchmarks demonstrate sub-17 μs latency and multi-tens of GB/s bandwidth, making it ideal for modern AI workloads such as Mixture-of-Experts applications.
NCCL GIN (GDAKI Backend) refers to the GPU-Initiated Networking (GIN) extension in NCCL 2.28, which enables fully device-driven, one-sided RDMA using the GPUDirect Async Kernel-Initiated (GDAKI) backend. This architecture eliminates CPU intervention in GPU-to-GPU network communication, permitting CUDA kernels to initiate, progress, and synchronize fine-grained transfers over InfiniBand or RoCE directly from the device. Developed to address the requirements of modern AI workloads such as Mixture-of-Experts (MoE), GIN GDAKI delivers low-latency and high-bandwidth communication that is tightly integrated with NCCL’s established collective operation infrastructure (Hamidouche et al., 19 Nov 2025).
1. Three-Layer Architecture of NCCL GIN and GDAKI Placement
NCCL GIN is structured as a three-layer architecture:
- NCCL Core (host-side): Manages communicator creation (via `ncclDevCommCreate`), collective memory window registration (`ncclCommWindowRegister`), resource exchange, and GIN backend selection.
- Device-GIN API (GPU-side): Provides primitives (`put`, `putValue`, `signal`, local and remote completion counters/signals, `flush`, and `barrier`) callable directly from CUDA kernels.
- GIN Network Plugin (backend): Implements the active RDMA transport in one of two plugin modes:
  - GDAKI: Employs DOCA GPUNetIO to enable direct construction and posting of InfiniBand or RoCE work queue entries (WQEs), doorbell notification, and completion queue management, all from device code.
  - Proxy: Relays device requests as 64-byte descriptors via lock-free queues to a host thread, which posts verbs and updates completions.
GDAKI operates entirely within the network-plugin layer. If DOCA GPUNetIO is detected at communicator initialization, the GDAKI plugin (libnccl-net-gdaki.so) is loaded, ensuring all device API calls issue pure device-initiated communication with no CPU roundtrips.
2. Device-Side GDAKI API Semantics
GDAKI device API operations are strictly non-blocking, returning immediately to encourage computation/communication overlap. Completion is tracked via local counters or remote signals, all accessible from device memory. The key API elements include:
- Remote put: `put(team, peer, dstWin, dstOff, srcWin, srcOff, bytes, sig, ctr)`
- Zero-byte remote notification: `signal(team, peer, signalId)`
- Local completion management: counters and `flush`
- Remote completion management: signals and their polling/resetting
- Example (ring exchange in CUDA): a peer-to-peer put followed by waiting on a remote signal and resetting it.
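The ring-exchange example can be sketched as follows. This is a hedged reconstruction, not the paper's verbatim kernel: the `ncclGin` constructor and method signatures (`put`, `waitSignal`, `resetSignal`, and the `ncclGin_signal`/`ncclGin_noCounter` helpers) are assumptions based on the primitive names listed above.

```cuda
// Hedged sketch of the ring-exchange pattern: put to the next rank,
// then wait on (and reset) the signal raised by the previous rank.
// All ncclGin* signatures below are assumed, not verified NCCL 2.28 API.
__global__ void ringExchange(ncclDevComm devComm, ncclWindow_t win,
                             int rank, int nRanks, size_t bytes) {
  ncclGin gin(devComm, /*ctx=*/0);           // assumed device-side handle
  int next = (rank + 1) % nRanks;
  const int sigId = 0;
  // Non-blocking one-sided put into the next rank's window,
  // raising signal `sigId` there once the data has landed.
  gin.put(ncclTeamWorld(devComm), next,
          win, /*dstOff=*/0, win, /*srcOff=*/bytes, bytes,
          ncclGin_signal(sigId), ncclGin_noCounter());
  gin.waitSignal(sigId);                     // wait for the previous rank's put
  gin.resetSignal(sigId);                    // rearm for the next iteration
}
```

Because `put` returns immediately, any independent computation placed between the `put` and the `waitSignal` overlaps with the transfer.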
Semantically, a put generates a WQE in GPU-addressable BAR memory, followed by a doorbell write to the NIC’s register space. Completion is established either locally, through counter increments as completion queue entries (CQEs) are consumed, or remotely, when the corresponding signal value becomes visible to the recipient GPU.
3. GDAKI Plugin Internals and DOCA GPUNetIO
The GDAKI plugin, distributed as libnccl-net-gdaki.so, implements NCCL’s network API in "direct" device mode using DOCA GPUNetIO:
- Resource Allocation: Each context allocates specific NIC-side resources (queue pairs, completion queues) for the GPU via DOCA device verbs.
- WQE Construction: CUDA kernels format RDMA work requests into a BAR-mapped, GPU-accessible ring buffer.
- Doorbell Mechanism: Device code stores to the mapped NIC doorbell register to initiate DMA.
- Completion Processing: NIC hardware autonomously executes WQEs, conducts RDMA transfers, and writes CQEs into a completion queue mappable into GPU address space.
- Signal/Counter Updates: Local and remote completions are tracked by 64-bit counters (either on local or peer GPU BAR-mapped memory), which are polled or signaled by kernels.
In steady-state, this pipeline requires no active host thread involvement. Control flow, synchronization, and completion signaling are entirely managed by GPU code in conjunction with BAR-mapped device/NIC resources.
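The steady-state device-side post path can be sketched schematically. This is not DOCA GPUNetIO code: every type and helper below (`QpRing`, `WQE`, `CQE`, `writeRdmaWriteWqe`, `cqeReady`) is a hypothetical placeholder illustrating the WQE-ring, doorbell, and CQ-polling pattern the steps above describe.

```cuda
// Schematic device-side RDMA post path; all identifiers are illustrative.
struct QpRing {                  // BAR-mapped, GPU-visible queue-pair state
  WQE*               wqes;      // ring of work queue entries
  volatile uint32_t* doorbell;  // NIC doorbell register, mapped into BAR
  CQE*               cqes;      // completion queue, also GPU-visible
  uint32_t           head, cqHead, size;
};

__device__ void postPut(QpRing* qp, uint64_t raddr, uint64_t laddr,
                        uint32_t bytes, uint32_t rkey, uint32_t lkey) {
  uint32_t slot = qp->head++ % qp->size;
  writeRdmaWriteWqe(&qp->wqes[slot], raddr, rkey, laddr, lkey, bytes);
  __threadfence_system();            // make the WQE visible to the NIC
  *qp->doorbell = qp->head;          // ring the doorbell: NIC starts the DMA
}

__device__ void pollLocalCompletion(QpRing* qp) {
  // Spin until the NIC writes the next CQE, then consume it.
  while (!cqeReady(&qp->cqes[qp->cqHead % qp->size])) { /* spin */ }
  qp->cqHead++;
}
```

The system-scope fence before the doorbell store mirrors the ordering requirement in the real pipeline: the WQE contents must be globally visible before the NIC is told to fetch them.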
4. Performance Modeling and Empirical Results
RDMA latency and throughput under GDAKI are analytically and empirically characterized. Latency for an $m$-byte transfer is modeled as

$$T(m) = \alpha + \beta m,$$

where $\alpha$ denotes round-trip doorbell/NIC/network fixed costs and $\beta$ is the inverse-bandwidth slope.
- Microbenchmarks: For 4–128 B ping-pong between two H100s, GDAKI yields sub-17 μs round-trip latency.
- Bandwidth: Bulk transfers saturate the NIC’s port rate (e.g., 50 GB/s).
- DeepEP MoE Kernels: Effective throughput is measured as total bytes moved by `put` calls over total kernel execution time, $BW_{\text{eff}} = \left(\sum_i m_i\right) / T_{\text{kernel}}$.
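As a worked illustration of the linear latency model $T(m) = \alpha + \beta m$: the specific $\alpha$ and $\beta$ values below are illustrative assumptions chosen to be consistent with the sub-17 μs latency and ~50 GB/s port-rate figures, not measured constants.

```latex
% Illustrative values only: alpha ~ fixed round-trip cost,
% beta = inverse of an assumed 50 GB/s port rate.
\alpha \approx 16\,\mu\mathrm{s}, \qquad
\beta = \frac{1}{50\,\mathrm{GB/s}} = 2 \times 10^{-5}\,\mu\mathrm{s/B} \\[4pt]
T(1\,\mathrm{MiB}) \approx 16\,\mu\mathrm{s}
  + 1{,}048{,}576\,\mathrm{B} \times 2 \times 10^{-5}\,\mu\mathrm{s/B}
  \approx 37\,\mu\mathrm{s}
```

The model captures the usual small-message/large-message split: tiny transfers are dominated by the fixed cost $\alpha$, while bulk transfers are governed by $\beta$, i.e., the NIC's line rate.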
Benchmark Table
| Test Scenario | NCCL GIN (GDAKI) | NVSHMEM IBGDA |
|---|---|---|
| 2-node dispatch BW (16 GPUs, BF16) | 84.36 GB/s | 84.97 GB/s |
| 8-node combine BW (64 GPUs, BF16) | 53.1 GB/s | 54.0 GB/s |
| Single-node dispatch latency (8 GPUs, 14 KB) | 40.62 μs | 41.43 μs |
| Two-node dispatch latency (14 KB) | 142.5 μs | 157.0 μs |
All benchmarks were conducted on EOS DGX H100 clusters (ConnectX-7 InfiniBand, CUDA 12.x, NCCL 2.28, NVSHMEM 3.4.5).
5. Practical Integration Steps and Supported Workflows
For the GDAKI backend, system and integration requirements are:
- Build: CUDA 12.2+, ConnectX-6 Dx+ NICs, OFED/MOFED, and kernel modules (nv_peer_mem or dmabuf).
- Runtime: Set `NCCL_GIN_BACKEND=gdaki` to force direct device mode.
- Communication Path Setup:
  - Host-side: Create the device communicator with `ncclDevCommCreate` and register memory windows collectively.
  - Device-side: In CUDA kernels, instantiate `ncclGin` objects and invoke `put`, `signal`, and `waitSignal` methods.
- Cleanup: `ncclCommWindowUnregister`, `ncclDevCommDestroy`, and plugin teardown.
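End to end, the host and device steps can be sketched as follows. This is a hedged illustration, not verbatim NCCL 2.28 code: the exact argument lists of `ncclDevCommCreate`, `ncclCommWindowRegister`, `ncclDevCommDestroy`, and the device-side `ncclGin` methods are assumptions based on the names given in this section.

```cuda
// Hedged sketch of a GDAKI workflow; exact NCCL 2.28 signatures may differ.
#include <nccl.h>
#include <nccl_device.h>   // assumed header exposing the device-GIN API

__global__ void exchangeKernel(ncclDevComm devComm, ncclWindow_t win,
                               int peer, size_t bytes, int sigId) {
  ncclGin gin(devComm, /*ctx=*/0);            // assumed device-side handle
  // One-sided put into the peer's registered window, raising remote
  // signal `sigId` on completion; offsets identify buffers, not pointers.
  gin.put(ncclTeamWorld(devComm), peer,
          win, /*dstOff=*/0, win, /*srcOff=*/bytes, bytes,
          ncclGin_signal(sigId), ncclGin_noCounter());
  gin.waitSignal(sigId);                      // block until the peer's put lands
}

void runStep(ncclComm_t comm, void* buf, size_t size, int peer, cudaStream_t s) {
  ncclDevComm devComm;
  ncclDevCommCreate(comm, /*requirements=*/nullptr, &devComm); // assumed args
  ncclWindow_t win;
  ncclCommWindowRegister(comm, buf, size, &win, /*flags=*/0);  // collective call
  exchangeKernel<<<1, 1, 0, s>>>(devComm, win, peer, size / 2, /*sigId=*/0);
  cudaStreamSynchronize(s);
  ncclCommWindowUnregister(comm, win);
  ncclDevCommDestroy(comm, devComm);          // assumed teardown call
}
```

Note that window registration is collective: every rank must register, and (per the constraint in Section 6) with identical window sizes under NCCL 2.28.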
For MoE libraries such as DeepEP:
- Each expert-parallel communication channel maps to a dedicated GIN context (typically 4 per communicator).
- Kernels use `(window, offset)` pairs rather than raw pointers to identify buffers for RDMA.
- Circular buffer management is accomplished with `put` and `signalAdd`, coordinated through remote head/tail signals.
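The circular-buffer pattern can be sketched as follows, again with assumed signatures: `signalAdd` is taken from the name above, while `readSignal` and the head/tail bookkeeping are illustrative placeholders for whatever polling primitive the real API provides.

```cuda
// Hedged sketch: producer puts a chunk into the consumer's ring buffer,
// then advances the remote tail with signalAdd. Signatures are assumed.
__global__ void pushChunk(ncclDevComm devComm, ncclWindow_t ring,
                          int peer, size_t slotBytes, int nSlots,
                          int tailSig, int headSig, size_t srcOff) {
  ncclGin gin(devComm, /*ctx=*/0);
  uint64_t tail = gin.readSignal(tailSig);      // slots we have produced
  uint64_t head = gin.readSignal(headSig);      // slots the consumer freed
  if (tail - head >= (uint64_t)nSlots) return;  // ring full: caller retries
  size_t dstOff = (tail % nSlots) * slotBytes;  // target slot in peer's ring
  gin.put(ncclTeamWorld(devComm), peer,
          ring, dstOff, ring, srcOff, slotBytes,
          ncclGin_noSignal(), ncclGin_noCounter());
  gin.flush();                                  // data must precede the tail bump
  gin.signalAdd(ncclTeamWorld(devComm), peer, tailSig, /*delta=*/1);
}
```

The `flush` before `signalAdd` enforces the producer/consumer invariant: the consumer may only observe an advanced tail after the corresponding slot data is visible.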
6. Constraints, Compatibilities, and Roadmap
- Hardware/Software: GDAKI requires NICs that export GPU-accessible doorbells and BAR-mapped WQE/CQ buffers (ConnectX-6 Dx or newer), CUDA 12.2+, and optimal results when NIC and GPU share a PCIe root complex.
- Fallback: The Proxy backend supports a broader range of NICs and GPUs (Volta+), at the cost of modest additional latency.
- Current limitations:
- Identical window size enforcement across ranks (NCCL 2.28).
- No WQE batching; each `put` rings the NIC doorbell individually.
- Static, bounded number of signal/counter IDs per communicator.
- Planned enhancements:
- Batched doorbell notifications to reduce per-operation cost.
- New one-sided primitives (e.g., remote atomics, TMA offload).
- Context/QP multiplexing and adaptive NIC load balancing.
- Fused collectives using device-primitives within NCCL’s algorithmic structures.
7. Significance and Areas of Application
GIN’s GDAKI backend brings zero-CPU-overhead, direct device-to-network offload to NCCL, achieving sub-17 μs single-message latencies and multi-tens of GB/s bandwidth. It is immediately applicable to MoE communication libraries (e.g., DeepEP), as well as to compiler-generated CUDA kernels that require native, inline network communication. The hardware/software design facilitates new computational paradigms tightly coupling computation and communication without host synchronization bottlenecks, while maintaining interoperability with NCCL’s production infrastructure (Hamidouche et al., 19 Nov 2025).