Comparative evaluation with NCCL GPU‑Initiated Networking

Determine the comparative communication performance of the stream‑triggered MPI GPU communication API and its HPE Slingshot 11 implementation relative to NVIDIA NCCL GPU‑Initiated Networking. Doing so requires bringing both systems onto a common interconnect, for example by porting the MPI stream‑triggered implementation to NVIDIA InfiniBand or by porting NCCL GPU‑Initiated Networking to HPE Slingshot, and then conducting controlled benchmarks across message sizes and communication patterns.

Background

The paper introduces a stream‑triggered MPI GPU communication API and a CPU‑free implementation targeting HPE Slingshot 11 via libfabric deferred work queues and counters. NVIDIA's NCCL recently added GPU‑Initiated Networking, which similarly aims to provide CPU‑free GPU communication.
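The deferred-work mechanism can be sketched as follows. This is schematic pseudocode in C syntax modeled on libfabric's fi_trigger(3) interface (struct fi_deferred_work queued with FI_QUEUE_WORK), not the paper's actual implementation; endpoint, counter, and message-descriptor setup are elided, and the GPU-stream steps are assumptions shown as comments.

```
/* Schematic: the host queues a send that the NIC executes once a trigger
 * counter reaches a threshold; the GPU stream later bumps that counter,
 * so no CPU involvement is needed at communication time. */
struct fi_op_msg send_op = { .ep = ep, .msg = msg_desc, .flags = 0 };

struct fi_deferred_work work = {
    .threshold       = 1,            /* fire when trigger_cntr >= 1   */
    .triggering_cntr = trigger_cntr, /* bumped from the GPU stream    */
    .completion_cntr = done_cntr,    /* NIC bumps this on completion  */
    .op_type         = FI_OP_SEND,
    .op.msg          = &send_op,
};
fi_control(&domain->fid, FI_QUEUE_WORK, &work);  /* CPU's role ends here */

/* Later, enqueued on the GPU stream (schematic):
 *   stream op 1: write trigger_cntr's doorbell -> NIC issues the send
 *   stream op 2: poll done_cntr                -> wait for completion  */
```

A comparable NCCL GPU-Initiated Networking path would instead issue the network operation from GPU-resident code, which is what makes a head-to-head measurement on common hardware informative.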

A direct performance comparison between the stream‑triggered MPI implementation and NCCL GPU‑Initiated Networking has not been performed because the MPI implementation has not been ported to InfiniBand and NCCL GPU‑Initiated Networking is not available on HPE Slingshot. Establishing such a comparison would clarify relative strengths across message sizes and communication patterns.

References

Finally, NCCL has recently implemented CPU-free communication; we have not been able to compare our performance with this system because our API has not been ported to Infiniband, and NCCL GPU-Initiated Networking has not been ported to HPE Slingshot.

Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation  (2602.15356 - Bridges et al., 17 Feb 2026) in Related Work (Section 6), final paragraph