GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Published 24 Apr 2026 in cs.DC | (2604.22126v1)

Abstract: Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three, GPU kernels cannot autonomously drive distributed coordination: existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations. On InfiniBand, GPU-initiated communication is possible, but current implementations incur unnecessary synchronization and locking overheads. This paper presents GICC, a framework that enables GPU kernels to directly trigger NIC-level operations without host involvement on the fast path. In stencils, GPU threads initiate halo exchanges as soon as boundary regions are computed, enabling fine-grained overlap between interior computation and boundary transfer. GICC decouples coordination semantics from data movement and introduces asynchronous resource reclamation: the NIC signals completion to both GPU and host memory, letting a lightweight host thread recycle NIC resources concurrently with GPU execution without injecting latency into the coordination path. This sustains GPU-driven coordination under finite NIC state, absent from existing OFI-based runtimes. We implement GICC on NVIDIA and AMD GPUs over InfiniBand and Slingshot. On Slingshot, GICC reduces per-coordination latency by up to 229x and improves weak scaling efficiency by up to 25%. On InfiniBand, it achieves up to 1.95x lower put latency than NVSHMEM by eliminating unnecessary locking and synchronization. On an industrial stencil proxy on 64 AMD MI250X GCDs, GPU-aware MPI incurs over 52% higher communication time than GICC, which achieves 42% parallel efficiency versus MPI's 35.4%.


Authors (3)

Summary

The paper presents GICC, a GPU-driven runtime that eliminates host intervention to enable device-side synchronization and communication in distributed HPC systems.
It leverages pre-staged NIC work for OFI fabrics and direct RDMA on InfiniBand, achieving up to 229× lower coordination latency compared to host-driven approaches.
GICC demonstrates practical benefits with reduced communication time in stencil computations and improved weak-scaling efficiency over traditional MPI implementations.


Introduction and Motivation

The transition toward exascale high-performance computing (HPC) has made distributed, heterogeneous GPU clusters the dominant architecture in many domains. Although hardware advances deliver immense computational throughput, distributed GPU applications still struggle to coordinate efficiently across nodes because of persistent architectural and runtime constraints. GPU-initiated coordination remains underdeveloped, especially on OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three. Existing communication paradigms, including MPI and SHMEM derivatives, impose high latency and force kernels to synchronize at host intervention points, impeding fine-grained compute–communication overlap.

GICC (GPU-Initiated Communication and Coordination) directly addresses these deficiencies by introducing a GPU-driven, resource-resilient runtime that enables device-side distributed synchronization and communication, decoupled from host control. GICC is explicitly engineered for both InfiniBand and OFI-based fabrics, with special treatment of Slingshot/CXI NICs, whose resource limitations and provider semantics had previously precluded sustained GPU-driven coordination.

Structural Analysis of Host-Driven Coordination

Existing approaches are rooted in host-initiated orchestration. In the typical workflow, a kernel ends at each phase boundary and hands control to the host CPU, which then invokes collectives and drives progress. Applied to workloads with frequent coordination or many kernel phases (e.g., iterative solvers, stencil computations), this model incurs prohibitive overhead: empirical results show up to 32% of execution time spent in coordination for a 200-phase workload.

The triggering path for communication can be categorized as follows:

Figure 1: Four common communication modes between GPU and NIC, illustrating host-driven, host-mediated, GPU-initiated, and GPU-triggered workflows.

Host-driven and host-mediated modes, which typify NVSHMEM and MPI implementations on Slingshot, cannot give the device autonomous control over coordination. InfiniBand, in contrast, enables direct GPU-initiated communication, but even there, locking and API design introduce non-trivial inefficiencies.

Quantitative microbenchmarks confirm that as the number of phase boundaries $N$ increases, host-driven coordination consumes a growing fraction of total runtime:

Figure 2: Execution time breakdown showing how coordination overhead grows with the phase count $N$.
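
The trend can be illustrated with a simple additive cost model. This is a sketch under stated assumptions, not the paper's measurement methodology: it assumes total runtime is a fixed compute budget plus $N$ times a constant per-coordination latency. The 10 ms compute budget is hypothetical; the two latencies are the per-coordination figures this summary reports for the host-driven baseline and for GICC.

```python
def coordination_fraction(n_phases, compute_us, per_coord_us):
    """Fraction of runtime spent coordinating under T(N) = C + N * t."""
    coord_us = n_phases * per_coord_us
    return coord_us / (compute_us + coord_us)


# Hypothetical fixed compute budget (10 ms) split across N phases.
COMPUTE_US = 10_000.0
HOST_DRIVEN_US = 25.2  # per-coordination latency reported for the host-driven baseline
GPU_DRIVEN_US = 0.11   # per-coordination latency reported for GICC

for n in (10, 50, 200, 1000):
    host = coordination_fraction(n, COMPUTE_US, HOST_DRIVEN_US)
    gpu = coordination_fraction(n, COMPUTE_US, GPU_DRIVEN_US)
    print(f"N={n:5d}  host-driven={host:6.1%}  gpu-driven={gpu:6.2%}")
```

Under this model the host-driven fraction rises steadily with $N$, while the GPU-driven fraction stays small because the per-coordination term is two orders of magnitude lower.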

The GICC Runtime: Design Principles and Execution Model

GICC is built around three tenets: (1) permitting device-side coordination initiation within GPU kernels, (2) minimizing both latency and CPU involvement on the critical path, and (3) bounding NIC-resident state usage to ensure liveness and progress under finite resources.

On InfiniBand, GICC leverages device-accessible doorbells/UARs, fully eliminating host mediation for RDMA and synchronization. On OFI/CXI, the host pre-stages all necessary NIC work, which the GPU then triggers via thresholded counter updates. Explicit host threads asynchronously recycle and re-arm NIC state to sidestep resource exhaustion while keeping the coordination path device-resident.

The execution model on Slingshot can be summarized as:

Figure 3: Boundary conditions for GPU-driven coordination on Slingshot (CXI) under libfabric manual progress, capturing pre-staging, triggering, GPU-visible completion, and host-mediated progress.
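
The pre-stage-then-trigger pattern described above can be mimicked with a toy model. This is a minimal sketch, not GICC's or libfabric's API: `TriggerCounter`, `PreStagedOp`, `pre_stage`, and `gpu_add` are hypothetical names, and the rule that an operation fires once the counter reaches its threshold is an assumption about thresholded-counter semantics.

```python
from dataclasses import dataclass, field


@dataclass
class PreStagedOp:
    name: str
    threshold: int       # fires once the counter reaches this value
    fired: bool = False


@dataclass
class TriggerCounter:
    """Toy stand-in for a NIC counter with operations armed against it."""
    value: int = 0
    armed: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def pre_stage(self, name, threshold):
        # Host side: arm the work before the kernel runs.
        self.armed.append(PreStagedOp(name, threshold))

    def gpu_add(self, amount=1):
        # Device side: a single counter update; no descriptor is built here.
        self.value += amount
        for op in self.armed:
            if not op.fired and self.value >= op.threshold:
                op.fired = True
                self.log.append(op.name)


cntr = TriggerCounter()
cntr.pre_stage("put halo north", threshold=1)
cntr.pre_stage("put halo south", threshold=2)
cntr.pre_stage("barrier signal", threshold=3)

cntr.gpu_add()    # first boundary region computed
cntr.gpu_add(2)   # remaining work signalled in one update
print(cntr.log)
```

The point of the sketch is the division of labor: everything that requires building NIC work happens in `pre_stage` on the host, and the device-side fast path is reduced to a counter update.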

GICC’s API exposes device-side synchronization, active messages, and put/get semantics, with kernel-level coordination points decoupled from host round-trips. RDMA operations and collectives, including barriers, are realized as finite sequences of host-armed, device-triggerable work, managed through a double-buffered, epoch-based slotting mechanism. This avoids blocking flushes and maintains low pre-staged work footprints regardless of usage frequency or scale.
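
One way to picture double-buffered, epoch-based slotting is the sequential toy model below. It is a sketch under assumptions, not GICC's implementation: the `EpochSlots` class and its method names are hypothetical, the bank is chosen by epoch parity, and the host re-arm is shown as a plain function call even though the source describes it running concurrently with GPU execution.

```python
class EpochSlots:
    """Toy model: two banks of pre-staged slots, alternated by epoch parity."""

    def __init__(self, slots_per_bank):
        self.slots_per_bank = slots_per_bank
        self.armed = [slots_per_bank, slots_per_bank]  # both banks armed up front
        self.epoch = 0

    def gpu_trigger(self):
        # Device side: consume one armed slot from the current epoch's bank.
        bank = self.epoch % 2
        if self.armed[bank] == 0:
            raise RuntimeError("bank exhausted before epoch advance")
        self.armed[bank] -= 1

    def advance_epoch(self):
        # Device moves to the other bank; the old bank awaits host re-arming.
        self.epoch += 1

    def host_rearm(self):
        # Host side: refill the bank the device is NOT currently using.
        idle = (self.epoch + 1) % 2
        self.armed[idle] = self.slots_per_bank

    def staged_footprint(self):
        return sum(self.armed)


slots = EpochSlots(slots_per_bank=4)
peak = 0
for _ in range(1000):
    for _ in range(4):
        slots.gpu_trigger()
    slots.advance_epoch()
    slots.host_rearm()
    peak = max(peak, slots.staged_footprint())
print("peak pre-staged slots:", peak)
```

In this model the pre-staged footprint never exceeds two banks' worth of slots, however many epochs run, which is the bounded-state property the paragraph above describes.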

Microbenchmark Evaluation and Performance Analysis

In the reported microbenchmarks, GICC outperforms both host-driven MPI and hybrid runtimes. Measurements on Tioga (Slingshot + AMD MI250X) and Maple (InfiniBand HDR + NVIDIA GH200) show a substantial reduction in overhead from direct GPU-driven coordination.

For high-coordination-frequency workloads, GICC exhibits up to $229\times$ lower per-coordination latency ($0.11\,\mu\text{s}$ vs. $25.2\,\mu\text{s}$), maintaining flat execution times as $N$ increases.

Figure 4: End-to-end slowdown and per-coordination latency as $N$ increases, comparing GICC to host-driven baselines.
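
As a quick check on the reported figures, the latency ratio and the cumulative coordination cost can be worked out directly. The choice of $N = 200$ below is only an illustration that mirrors the 200-phase workload mentioned earlier; it assumes a constant per-coordination latency, which the measured workload may not exhibit.

$$
\frac{25.2\,\mu\text{s}}{0.11\,\mu\text{s}} \approx 229, \qquad
200 \times 25.2\,\mu\text{s} = 5.04\,\text{ms}, \qquad
200 \times 0.11\,\mu\text{s} = 22\,\mu\text{s}.
$$

Under that assumption, 200 coordination points cost about 5 ms on the host-driven path and about 22 µs on the GPU-driven path.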

Point-to-point put latency on both InfiniBand and Slingshot confirms GICC's streamlined device path. On InfiniBand, for small messages, GICC achieves up to $1.95\times$ lower put latency than NVSHMEM, primarily by eliminating proxy threads, locking, and redundant synchronization.

Figure 5: (Left) P2P put latency on Tioga (Slingshot + AMD MI250X). (Right) P2P put latency across different scopes on Maple (InfiniBand HDR + GH200).

Application-Level Benchmarks

GICC’s impact extends to concrete applications:

Weak scaling on 2D Jacobi: GICC sustains >93% intra-node efficiency and improves weak-scaling efficiency over MPI by up to 25% on 64 GPUs, which is attributed to efficient GPU-driven halo exchange and reduced host mediation.
Figure 6: Weak-scaling efficiency of a 2D Jacobi stencil on Tioga, highlighting GICC's superior scaling.
Distributed Matrix Multiplication (Cannon): GICC yields up to 6.4% speedup on small matrices; for larger sizes, GICC and MPI converge, reflecting reduced relative overhead when fewer fine-grained coordination points exist.
Figure 7: GICC speedup over MPI for distributed matrix multiplication, evidencing the major gains for phase-intensive workloads.
Minimod (industrial stencil proxy): On 64 AMD MI250X GCDs, GPU-aware MPI incurs over 52% higher communication time than GICC, and GICC achieves 42% parallel efficiency versus MPI’s 35.4%. This is a direct result of coordinating complex, multi-kernel phases from the GPU side while hiding host-NIC synchronization latency.
Figure 8: Parallel efficiency and communication time for Minimod on Tioga, contrasting GICC and MPI across the scaling range.
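
The Minimod percentages are easy to misread, so a short conversion may help. The abstract states the communication gap as MPI being over 52% higher than GICC, which is not the same as GICC being 52% lower. The sketch below converts between the two framings and compares the reported efficiencies. The helper names are hypothetical, and the textbook strong- and weak-scaling definitions of efficiency are included as assumptions, since this summary does not say which one the paper uses for Minimod.

```python
def percent_lower(baseline_higher_by):
    """If A is `baseline_higher_by` (e.g. 0.52) above B, how far is B below A?"""
    return 1.0 - 1.0 / (1.0 + baseline_higher_by)


def parallel_efficiency(t_one, t_n, n, weak=False):
    """Textbook definitions: strong scaling T1 / (n * Tn); weak scaling T1 / Tn."""
    return t_one / t_n if weak else t_one / (n * t_n)


gicc_vs_mpi_comm = percent_lower(0.52)
efficiency_ratio = 0.42 / 0.354  # reported GICC vs. MPI parallel efficiency

print(f"GICC communication time is about {gicc_vs_mpi_comm:.1%} below MPI's")
print(f"GICC parallel efficiency is about {efficiency_ratio - 1:.1%} higher in relative terms")
```

Read this way, "MPI is 52% higher" corresponds to GICC spending roughly a third less time in communication, and the efficiency gap of 42% versus 35.4% is 6.6 percentage points in absolute terms.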

Implications, Limitations, and Future Directions

GICC resolves a critical gap in device-driven control for distributed GPU applications, particularly under resource-constrained OFI/CXI fabrics. The clear separation of coordination from communication, and the enforcement of bounded pre-staged work with resource-aware reclamation, enables sustainable GPU-native control at scale. This brings practical benefits to multi-phase, communication-heavy scientific codes, supporting tighter compute–communication overlap and unlocking new algorithmic strategies.

However, a fundamental limitation remains: on OFI/CXI, GPUs cannot dynamically enqueue new NIC operations. GICC’s resource model, while robust, is best suited to semi-regular (predictable) coordination patterns in which the set of possible actions can be staged ahead of time. Fully dynamic or unpredictable communication graphs may suffer back-pressure or require a defensive host fallback. In addition, the differing resource semantics of InfiniBand and OFI introduce subtle, non-portable behavior in progress guarantees and resource consumption.

Theoretically, these advances imply that further hardware–runtime co-design is necessary to fully bridge the gap between device-centric programming and tightly constrained NIC state. Practically, many high-value scientific workloads stand to benefit immediately from GICC, particularly those with iterative, fine-grained synchronization boundaries.

Future work should extend GICC to commercial cloud fabrics (e.g., AWS EFA), generalize the resource-pipelining model, and integrate with compiler toolchains to enable architectural abstraction without sacrificing backend-specific performance.

Conclusion

GICC establishes a new paradigm for sustaining fine-grained, GPU-initiated coordination and communication in modern distributed HPC clusters. By explicitly addressing both interconnect-specific and generic resource management issues, decoupling coordination semantics from data movement, and providing practical, high-performance implementations on both AMD and NVIDIA hardware, GICC enables device-resident distributed control that was previously unattainable on the majority of large-scale systems. The framework demonstrates large reductions in latency, improved scaling, and direct application-level benefits, especially for communication-intensive, phase-heavy computations, setting a clear trajectory for future design in distributed accelerator runtimes.
