HetCCL: Cross-Vendor Collective Library
- HetCCL is a cross-vendor collective communication library that unifies vendor-native backends for efficient LLM training on heterogeneous GPU clusters.
- It leverages RDMA-based peer-to-peer transfers and native optimizations to perform collective operations without modifying deep learning applications.
- Experimental evaluations show near-native performance in mixed configurations, enabling scalable and cost-effective training across NVIDIA and AMD GPUs.
HetCCL is a cross-vendor collective communication library designed to enable high-performance LLM training across heterogeneous GPU clusters composed of NVIDIA and AMD devices. HetCCL unifies vendor-specific backends—NVIDIA NCCL and AMD RCCL—while providing efficient Remote Direct Memory Access (RDMA)-based communication between GPUs from different vendors, without requiring modifications to device drivers or existing deep learning applications. By maintaining vendor-native optimizations and introducing two novel mechanisms for cross-vendor collectives, HetCCL achieves near-native performance in both homogeneous and heterogeneous environments, uniquely supporting scalable mixed-vendor LLM workloads (Kim et al., 30 Jan 2026).
1. Motivation and Limits of Existing Solutions
The expansion of LLM workloads often leads organizations to procure additional GPU resources opportunistically, resulting in clusters containing a mix of NVIDIA and AMD devices. Large-scale LLM training relies heavily on fine-grained, frequent collective operations—including All-Reduce, All-Gather, and Reduce-Scatter—for synchronizing gradients and model states. Existing vendor collectives, such as NCCL (CUDA) and RCCL (HIP), are each optimized for their respective hardware and do not interoperate. Thus, deep learning frameworks historically select one backend per job, precluding simultaneous use of both vendor types in a single collective operation.
Workarounds, such as relaying data through host memory or sharding jobs across vendor-homogeneous partitions, result in significant inefficiency and increased complexity. Multi-vendor CCL implementations (e.g., MSCCL++, TorchComms) and MPI-based (CUDA-aware/ROCm-aware) solutions suffer similar limitations, as they either discard the non-chosen vendor code path at compile time or lack the performance optimizations of native collectives for large tensor work. Additionally, vendor protocol differences, such as NVLink vs. xGMI, prevent direct peer-to-peer memory access between different vendor GPUs (Kim et al., 30 Jan 2026).
2. Architecture and Design Principles
HetCCL is built on three core principles: preserving vendor-native optimizations, introducing a transparent orchestration layer, and leveraging RDMA for direct GPU-GPU transfers irrespective of vendor. Its architecture is as follows:
- Library Deployment: HetCCL is delivered as a shared library (LD_PRELOADed to replace nccl.so/rccl.so). Existing frameworks (e.g., PyTorch+DeepSpeed) invoke standard collective calls, which HetCCL intercepts, invoking either native libraries or RDMA as appropriate.
- Runtime API Abstraction (TACC): The Thin Abstraction of CUDA/HIP Collectives (TACC) exports unified APIs for memory allocation, device-to-device copies, and stream synchronization. At initialization, TACC selects the CUDA or HIP runtime per process.
- Dual Device Code Packaging: Device code for additional kernels is segregated into two shared objects, built with vendor toolchains (nvcc for NVIDIA, hipcc for AMD). These are dynamically linked and resolved at runtime.
- Backend Selection and Execution: For intra-vendor communication, HetCCL delegates to NCCL/RCCL. For cross-vendor, it invokes RDMA-based mechanisms (detailed below).
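The runtime-selection idea behind TACC can be sketched as a dispatch table bound once at process initialization. The names below are illustrative stand-ins for the real CUDA/HIP bindings, not HetCCL's actual API:

```python
# Sketch of a TACC-style runtime abstraction: a single dispatch table is
# bound to either the CUDA or the HIP runtime when the process starts.
# All names here are illustrative; HetCCL's real API may differ.

class RuntimeBackend:
    """Unified view over vendor runtimes (malloc, D2D copy, stream sync)."""
    def __init__(self, name, malloc, memcpy_d2d, stream_sync):
        self.name = name
        self.malloc = malloc
        self.memcpy_d2d = memcpy_d2d
        self.stream_sync = stream_sync

def _make_stub(vendor, op):
    # Stand-in for a real binding such as cudaMalloc or hipMalloc.
    return lambda *args: f"{vendor}:{op}"

CUDA = RuntimeBackend("cuda", _make_stub("cuda", "malloc"),
                      _make_stub("cuda", "memcpyD2D"),
                      _make_stub("cuda", "streamSync"))
HIP = RuntimeBackend("hip", _make_stub("hip", "malloc"),
                     _make_stub("hip", "memcpyD2D"),
                     _make_stub("hip", "streamSync"))

def tacc_init(device_vendor):
    """Select the vendor runtime once per process, as TACC does at init."""
    return {"nvidia": CUDA, "amd": HIP}[device_vendor]

backend = tacc_init("amd")
print(backend.malloc(1 << 20))   # -> "hip:malloc"
```

After `tacc_init`, every unified call is a single indirect dispatch, which is consistent with the paper's claim that the abstraction layer adds negligible overhead.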
3. Novel Cross-Vendor Communication Mechanisms
HetCCL introduces two mechanisms enabling efficient heterogeneous collectives:
- RDMA-Based Heterogeneous Peer-to-Peer (P2P): Once GPU buffers are allocated via cudaMalloc or hipMalloc and registered with an RDMA-capable NIC, InfiniBand (or RoCE) can perform DMA directly between these buffers without regard to vendor. HetCCL creates an IB-Verbs queue pair for each GPU peer and uses ibv_post_send/ibv_post_recv to transfer tensor chunks between NVIDIA and AMD GPUs, eliminating the need for host-based relaying and maintaining high throughput.
- Vendor-Local Collectives with Cross-Vendor Orchestration: For a collective operation among GPUs, HetCCL partitions them by vendor. Each group performs a vendor-native collective (e.g., NCCL or RCCL AllReduce). Partial results are then exchanged across vendor groups using RDMA and aggregated, either in host-pinned memory or via device-side reduction kernels launched through TACC.
A high-level pseudocode for HetCCL's All-Reduce algorithm demonstrates this orchestration by partitioning ranks, leveraging vendor-native collectives for intra-group reductions, and using RDMA for inter-group data exchange and final reduction (Kim et al., 30 Jan 2026).
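The orchestration described above can be illustrated with a small NumPy simulation: ranks are partitioned by vendor, each group reduces locally (standing in for a NCCL/RCCL AllReduce), the per-group partial sums are exchanged (standing in for the RDMA step), and the final result is returned to every rank. This is a behavioral sketch, not HetCCL's implementation:

```python
import numpy as np

def hierarchical_allreduce(tensors, vendors):
    """Simulate a HetCCL-style All-Reduce: vendor-local reduction, then a
    cross-vendor exchange of partial sums. tensors[i] is rank i's input,
    vendors[i] its vendor tag ("nvidia" or "amd")."""
    groups = {}
    for t, v in zip(tensors, vendors):
        groups.setdefault(v, []).append(t)

    # Step 1: vendor-local collective (stands in for NCCL/RCCL AllReduce).
    partials = {v: np.sum(ts, axis=0) for v, ts in groups.items()}

    # Step 2: cross-vendor exchange + reduction (stands in for the RDMA
    # transfer of partials and the final device-side reduction kernel).
    total = np.sum(list(partials.values()), axis=0)

    # Step 3: every rank ends up with the fully reduced tensor.
    return [total.copy() for _ in tensors]

ranks = [np.ones(4) * i for i in range(4)]          # inputs 0, 1, 2, 3
out = hierarchical_allreduce(ranks, ["nvidia", "nvidia", "amd", "amd"])
print(out[0])   # -> [6. 6. 6. 6.]
```

Note that only one partial tensor per vendor group crosses the RDMA link, rather than one tensor per rank; this is the data-volume reduction the design attributes to vendor-local collectives.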
4. Communication Model and Performance Implications
HetCCL's performance model builds on classic collective latency/bandwidth parameters:
- For homogeneous collectives using vendor libraries:

  $T_{\text{homo}}(n) = \alpha + \beta n$,

  where $\alpha$ is startup latency and $\beta$ is reciprocal bandwidth.
- For RDMA-enabled cross-vendor transfers:

  $T_{\text{rdma}}(n) = \alpha_{\text{qp}} + \beta_{\text{rdma}}\, n$.

  Here, $\alpha_{\text{qp}}$ is the NIC's queue-pair setup cost (amortized), and $1/\beta_{\text{rdma}}$ is line-speed bandwidth (experimentally, ~18 GB/s on HDR IB, compared to host-staged ~8 GB/s).
- For ring-based collectives over $p$ peers:

  $T_{\text{ring}}(n) = 2(p-1)\,\alpha + \tfrac{2(p-1)}{p}\,\beta n$.

  Substituting HetCCL's RDMA parameters for cross-vendor hops yields similar scaling to homogeneous collectives, with bottleneck bandwidth determined by the slower vendor.
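To make the model concrete, a quick alpha-beta calculation (using the ~18 GB/s direct-RDMA and ~8 GB/s host-staged figures reported in the evaluation, and an assumed startup latency) shows why the startup term vanishes for large tensors and bandwidth dominates:

```python
def transfer_time(n_bytes, alpha_s, bw_gbps):
    """Alpha-beta model: T(n) = alpha + beta * n, with beta = 1 / bandwidth."""
    return alpha_s + n_bytes / (bw_gbps * 1e9)

N = 1 << 30      # 1 GiB tensor
alpha = 5e-6     # assumed 5 us startup latency (illustrative, not measured)

t_rdma = transfer_time(N, alpha, 18.0)  # direct GPU-GPU RDMA path
t_host = transfer_time(N, alpha, 8.0)   # host-staged relay path

print(f"RDMA: {t_rdma*1e3:.1f} ms, host-staged: {t_host*1e3:.1f} ms")
# At 1 GiB the startup term is negligible; the ratio approaches 18/8 = 2.25x.
```

The same substitution into the ring formula shows why mixed-vendor throughput tracks the slower vendor's bandwidth rather than degrading further.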
Empirical results confirm that, in homogeneous scenarios, HetCCL matches the throughput of NCCL (~16 GB/s for 8×NVIDIA) and RCCL (~14 GB/s for 8×AMD). In mixed configurations (8×NVIDIA + 8×AMD) over four nodes, HetCCL sustains ~13–14 GB/s, matching the lower of the homogeneous baselines and scaling linearly with node/GPU count (Kim et al., 30 Jan 2026).
5. Implementation Insights
- RDMA Integration: GPU buffers are pinned and registered with ibv_reg_mr(). Remote keys and addresses are exchanged via out-of-band channels (e.g., MPI or TCP). For each cross-vendor GPU pair, an InfiniBand queue pair is established and used for direct send/receive operations.
- Intra-Node Optimizations: Within a node, and when communicating across GPUs of the same vendor, HetCCL wholly delegates to NCCL or RCCL. Native libraries select the optimal path (NVLink, PCIe P2P, or CPU staging) without intervention.
- Buffer Management: All collective buffers are pre-allocated and reused per communicator. Host-pinned buffers are employed as necessary, and device-side reduction kernels are compiled for each vendor and dispatched via TACC.
- Drop-in Compatibility: User applications and deep learning frameworks require no modification; HetCCL operates as a transparent replacement for vendor-specific collective libraries (Kim et al., 30 Jan 2026).
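The pre-allocate-and-reuse discipline for collective buffers can be sketched as a per-communicator pool. This is an illustrative pattern, not HetCCL's actual data structure:

```python
class BufferPool:
    """Per-communicator pool: allocate (and, conceptually, register with
    the NIC via ibv_reg_mr) each staging buffer once, then reuse it across
    collective calls instead of paying registration cost every time."""
    def __init__(self, alloc=bytearray):
        self._alloc = alloc          # stands in for pinned/GPU allocation
        self._buffers = {}
        self.allocations = 0

    def get(self, size):
        buf = self._buffers.get(size)
        if buf is None:
            buf = self._alloc(size)  # allocate + register exactly once
            self._buffers[size] = buf
            self.allocations += 1
        return buf

pool = BufferPool()
a = pool.get(4096)
b = pool.get(4096)                   # second call reuses the same buffer
print(a is b, pool.allocations)      # -> True 1
```

Amortizing registration this way matters because memory registration pins pages and programs the NIC's translation tables, which is far slower than the transfers themselves.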
6. Experimental Evaluation
The experimental platform consists of four nodes connected by Mellanox ConnectX-6 HDR InfiniBand: 2 nodes with 4× NVIDIA V100-PCIe GPUs each (CUDA 12.4), and 2 nodes with 4× AMD W7800 GPUs each (ROCm 6.4.0). Representative LLMs used for evaluation include GPT (125M and 355M parameters, seq_len=1024) and LLaMA (1B and 3B parameters, seq_len=8192).
- Bandwidth Evaluation: In homogeneous settings, HetCCL matches NCCL and RCCL bandwidths; heterogeneous HetCCL reaches ~18 GB/s (limited by V100 PCIe3), compared to ~8 GB/s for host-staging.
- Collective Throughput: For 1 GB All-Reduce/All-Gather/Reduce-Scatter, mixed-vendor HetCCL achieves ~14 GB/s, scaling to 16 GPUs with sustained ~13 GB/s.
- End-to-End LLM Training Efficiency: On mixed 4×NVIDIA + 4×AMD, GPT-125M throughput is ≈1.48× that of NVIDIA-only and 2.97× AMD-only. LLaMA-3B achieves ~90% aggregate efficiency relative to the sum of individual vendor throughput on 16 GPUs. Model convergence, in terms of loss curves (bf16), is identical across all library backends (Kim et al., 30 Jan 2026).
7. Implications, Limitations, and Future Work
HetCCL preserves native vendor performance in homogeneous environments because all collective calls are forwarded directly to NCCL/RCCL, introducing negligible overhead. The TACC layer operates purely as a low-latency function pointer dispatch, and device kernel indirection is minimal.
In heterogeneous clusters, HetCCL outperforms host-staged or MPI-based solutions by leveraging GPUDirect RDMA, thereby eliminating the cost of double host-device copies and bypassing limited CPU-NIC bandwidth. Shrinking cross-vendor data volume via local collectives further improves efficiency, and reliance on optimized vendor kernels for reductions avoids the performance penalties of generic or CPU-side implementations.
Operationally, HetCCL enables unified training jobs spanning both NVIDIA and AMD GPUs, increasing hardware utilization and improving cost-effectiveness. Scaling is near-linear to 16 GPUs; the design suggests similar behavior at larger cluster sizes under InfiniBand or equivalent RDMA fabrics.
Current limitations include the assumption that each node is single-vendor (cross-vendor GPU communication within a single node is future work), static load balancing via offline profiling of micro-batch ratios, and experimental validation limited to 16 GPUs. Plans are noted to explore dynamic, per-step profiling and to extend the model to larger, more diverse clusters (Kim et al., 30 Jan 2026).
In summary, HetCCL provides a transparent, RDMA-enabled, cross-vendor collective library that leverages vendor-native libraries and communicates efficiently across heterogeneous GPU clusters, enabling scalable, high-performance LLM training on mixed NVIDIA and AMD infrastructure without requiring modifications to user code or deep learning frameworks (Kim et al., 30 Jan 2026).