Thin Abstraction of CUDA/HIP Collectives (TACC)
- Thin Abstraction of CUDA/HIP Collectives (TACC) is a runtime API layer that unifies NVIDIA and AMD GPU collectives with minimal overhead.
- It dynamically dispatches memory, stream, and kernel operations to native NCCL and RCCL libraries while leveraging RDMA for cross-vendor communication.
- Performance analyses show that TACC achieves near-native efficiency, with heterogeneous setups reaching up to 97% of the combined homogeneous baseline.
The Thin Abstraction of CUDA/HIP Collectives (TACC) is the runtime API abstraction layer at the core of HetCCL, a collective communication library designed for unified, high-performance, vendor-agnostic communication across heterogeneous NVIDIA and AMD GPUs. TACC provides a minimal, dynamically dispatched interface for memory, stream, and kernel operations. This lets HetCCL orchestrate the existing vendor-native libraries (NCCL for NVIDIA, RCCL for AMD) while transparently supporting cross-vendor transfers through RDMA over InfiniBand. As a result, deep learning applications run without code changes, obtain efficient collectives regardless of the GPU mix, and retain the microarchitectural optimizations of the vendor libraries (Kim et al., 30 Jan 2026).
1. Architectural Overview
TACC sits between deep learning frameworks (such as PyTorch) and the vendor-provided collective communication libraries (NCCL and RCCL). The architecture comprises:
- Runtime API abstraction (TACC): Offers a single, transparent interface abstracted from both CUDA and HIP runtimes for device, stream, memory, and kernel operations.
- Device-code packaging scheme: Separately compiles CUDA and HIP kernels into shared objects. These are loaded dynamically at runtime, ensuring that both NVIDIA and AMD GPUs are supported natively.
- Shim layer and orchestration: When a collective (e.g., All-Reduce) is invoked, a HetCCL-injected shim calls TACC to configure device context and dispatches the collective call to NCCL or RCCL for intra-vendor operations. For cross-vendor GPU communication, TACC manages transfers using RDMA paths but never rewrites vendor library code.
This structure allows HetCCL to orchestrate calls across native vendor libraries plus InfiniBand Verbs, enabling true cross-vendor collectives without requiring modifications to NVIDIA or AMD drivers.
2. TACC API and Runtime Dispatch
TACC mirrors the subset of CUDA and HIP runtime APIs required by NCCL and RCCL. Key aspects include:
- Platform selection: Initialization routines such as `taccSetPlatformAuto()` (which inspects the `.so` binary name or an environment variable) and `taccGetPlatform()` automatically choose the backend at first use and expose the platform identity.
- Device management: Direct mapping of device selection and context management (`taccGetDevice(int*)`, `taccSetDevice(int)`).
- Memory operations: Unified allocation, registration, and inter-process communication (IPC) memory handle management.
- Stream and synchronization management: Creation, association, and synchronization of streams and device-wide barriers, mapping to native CUDA or HIP semantics.
- Kernel launch: User kernels are invoked through `taccLaunchKernel()` and `taccExtLaunchKernel()`, dynamically selecting the function pointer from the active vendor's function table.
The core mechanism relies on two function-pointer tables mapping to cuda*.so and hip*.so routines, with each call dynamically dispatched to the selected backend.
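The table-based dispatch described above can be illustrated with a small simulation. This is a minimal sketch, not HetCCL's implementation: the `Tacc` class, the `RuntimeTable` type, the `"cuda"`/`"hip"` platform labels, the `TACC_PLATFORM` environment variable, and the default of `"cuda"` are all hypothetical, and the backends are mocks that return the name of the vendor call they would forward to instead of calling a real runtime.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class RuntimeTable:
    """One vendor's function table (mock stand-in for resolved runtime symbols)."""
    set_device: Callable[[int], str]
    malloc: Callable[[int], str]
    launch_kernel: Callable[[str], str]


def _make_table(prefix: str) -> RuntimeTable:
    # Each entry only reports which vendor call it would forward to.
    return RuntimeTable(
        set_device=lambda dev: f"{prefix}SetDevice({dev})",
        malloc=lambda nbytes: f"{prefix}Malloc({nbytes})",
        launch_kernel=lambda name: f"{prefix}LaunchKernel({name})",
    )


# Two tables, one per vendor runtime.
TABLES = {"cuda": _make_table("cuda"), "hip": _make_table("hip")}


class Tacc:
    """Hypothetical dispatcher: every call goes through the active table."""

    def __init__(self) -> None:
        self.platform: Optional[str] = None
        self._table: Optional[RuntimeTable] = None

    def set_platform(self, platform: str) -> None:
        if platform not in TABLES:
            raise ValueError(f"unknown platform: {platform!r}")
        self.platform = platform
        self._table = TABLES[platform]

    def set_platform_auto(self, env: Mapping[str, str] = os.environ) -> None:
        # Assumption: an environment variable named TACC_PLATFORM selects
        # the backend, falling back to "cuda" when it is unset.
        self.set_platform(env.get("TACC_PLATFORM", "cuda"))

    def _active(self) -> RuntimeTable:
        if self._table is None:
            self.set_platform_auto()  # backend is chosen at first use
        return self._table

    def set_device(self, dev: int) -> str:
        return self._active().set_device(dev)

    def malloc(self, nbytes: int) -> str:
        return self._active().malloc(nbytes)

    def launch_kernel(self, name: str) -> str:
        return self._active().launch_kernel(name)
```

Because callers only ever touch the `Tacc` methods, switching vendors changes which table is consulted and nothing else, which is the property that lets the layer stay thin.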
3. Cross-Vendor Mechanisms
TACC enables two key mechanisms for bridging vendor boundaries:
3.1 Vendor-Agnostic RDMA
- Any buffer allocated via `cudaMalloc` or `hipMalloc` and registered with `ibv_reg_mr` is exposed as a PCIe BAR address, so InfiniBand Verbs cannot distinguish which vendor owns the underlying memory.
- TACC allocates device memory and registers it for RDMA at initialization. Data transfers then go directly from GPU to network interface and back (GPU → NIC → network → NIC → GPU), bypassing the CPU. No modifications to vendor drivers are necessary.
3.2 Orchestration of Cross-Vendor Collectives
- The MPI communicator (or NCCL ring) is split into NVIDIA and AMD subgroups.
- HetCCL executes a four-step algorithm for cross-vendor collectives:
  1. Vendor-local All-Reduce (NCCL or RCCL) within each subgroup.
  2. Pairwise RDMA exchange of partial results between designated subgroup leaders.
  3. Local accumulation of partial sums on each leader.
  4. Vendor-local broadcast (NCCL or RCCL).
The approach preserves vendor-optimized local collectives and only orchestrates cross-vendor data movement at the boundary.
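The four steps can be traced with a pure-Python simulation in which plain lists stand in for GPU buffers. This is only a sketch of the data flow under simplifying assumptions: the function names are hypothetical, the vendor-local All-Reduce and broadcast are modeled as in-process list operations, and the RDMA exchange is modeled as reading the other leader's partial result.

```python
from typing import Dict, List

Vector = List[float]


def local_all_reduce(buffers: List[Vector]) -> Vector:
    """Element-wise sum over buffers (stand-in for an NCCL/RCCL All-Reduce)."""
    return [sum(values) for values in zip(*buffers)]


def hetero_all_reduce(groups: Dict[str, List[Vector]]) -> Dict[str, List[Vector]]:
    """Simulate the four-step cross-vendor All-Reduce.

    `groups` maps a vendor label to the list of per-GPU buffers in that
    subgroup; the return value has the same shape with reduced buffers.
    """
    # Step 1: vendor-local All-Reduce within each subgroup.
    partial = {vendor: local_all_reduce(bufs) for vendor, bufs in groups.items()}

    totals: Dict[str, Vector] = {}
    for vendor in groups:
        # Step 2: the leader receives the other subgroups' partial results
        # (stand-in for the pairwise RDMA exchange between leaders).
        received = [partial[other] for other in groups if other != vendor]
        # Step 3: the leader accumulates its own and the received partials.
        totals[vendor] = local_all_reduce([partial[vendor]] + received)

    # Step 4: vendor-local broadcast of the total to every GPU in the subgroup.
    return {
        vendor: [list(totals[vendor]) for _ in bufs]
        for vendor, bufs in groups.items()
    }


result = hetero_all_reduce({
    "nvidia": [[1.0, 2.0], [3.0, 4.0]],
    "amd": [[10.0, 20.0]],
})
print(result["nvidia"][0], result["amd"][0])
```

Only steps 2 and 3 cross the vendor boundary; steps 1 and 4 stay entirely inside one vendor's library, which is why the vendor-tuned code paths are left untouched.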
4. Performance Analysis
Performance modeling in the context of TACC follows the classic α–β network model, in which α denotes the per-message latency and β the per-byte transfer time (inverse bandwidth). Key bandwidth and latency parameters are defined as:
Vendor-local and cross-vendor communications:
- Heterogeneous All-Reduce cost is given by:
- Empirical results show:
  - Homogeneous (NVIDIA-only or AMD-only) HetCCL matches NCCL/RCCL within 2% for up to 8 GPUs.
  - Heterogeneous HetCCL scales efficiently (12–16 GPUs) with cross-vendor RDMA bandwidth of ~14 GiB/s, while intra-vendor bandwidth reaches ~15 GiB/s (NVIDIA, PCIe Gen3) and ~22 GiB/s (AMD, PCIe Gen4).
  - End-to-end LLM training with 8 AMD + 8 NVIDIA GPUs achieves up to 1.48× speedup versus 8 NVIDIA GPUs with NCCL and up to 2.97× versus 8 AMD GPUs with RCCL, with heterogeneous efficiency reaching 90–97% of the combined homogeneous baseline (Kim et al., 30 Jan 2026).
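To make the latency-bandwidth (α–β) style of reasoning concrete, the sketch below estimates the cost of a four-step cross-vendor All-Reduce. It is an illustrative model under stated assumptions and not the cost formula from the paper: it assumes a textbook ring All-Reduce inside each vendor subgroup, a single full-size leader-to-leader exchange, a binomial-tree broadcast, and a negligible accumulation step. All function names are hypothetical, and the 5 µs latency is a made-up placeholder; only the bandwidth figures come from the measurements quoted above.

```python
from typing import NamedTuple

GIB = 2**30


class Link(NamedTuple):
    alpha: float  # per-message latency, seconds
    beta: float   # transfer time per byte, seconds (1 / bandwidth)


def link_from_bandwidth(latency_s: float, gib_per_s: float) -> Link:
    return Link(alpha=latency_s, beta=1.0 / (gib_per_s * GIB))


def ring_all_reduce_time(p: int, n_bytes: float, link: Link) -> float:
    """Textbook ring All-Reduce: 2(p-1) steps, each moving n/p bytes."""
    if p <= 1:
        return 0.0
    return 2 * (p - 1) * (link.alpha + (n_bytes / p) * link.beta)


def tree_broadcast_time(p: int, n_bytes: float, link: Link) -> float:
    """Binomial-tree broadcast: ceil(log2 p) rounds of a full-size message."""
    rounds = (p - 1).bit_length()  # integer ceil(log2(p)) for p >= 1
    return rounds * (link.alpha + n_bytes * link.beta)


def hetero_all_reduce_time(p_nvidia: int, p_amd: int, n_bytes: float,
                           nvidia: Link, amd: Link, cross: Link) -> float:
    # Step 1: both subgroups reduce concurrently; the slower one dominates.
    local = max(ring_all_reduce_time(p_nvidia, n_bytes, nvidia),
                ring_all_reduce_time(p_amd, n_bytes, amd))
    # Step 2: leaders swap full-size partial results over the cross-vendor link.
    exchange = cross.alpha + n_bytes * cross.beta
    # Step 3 (local accumulation) is assumed negligible in this sketch.
    # Step 4: both subgroups broadcast concurrently.
    bcast = max(tree_broadcast_time(p_nvidia, n_bytes, nvidia),
                tree_broadcast_time(p_amd, n_bytes, amd))
    return local + exchange + bcast


if __name__ == "__main__":
    latency = 5e-6  # hypothetical placeholder, not a measured value
    nvidia = link_from_bandwidth(latency, 15.0)
    amd = link_from_bandwidth(latency, 22.0)
    cross = link_from_bandwidth(latency, 14.0)
    t = hetero_all_reduce_time(8, 8, 64 * 2**20, nvidia, amd, cross)
    print(f"estimated 64 MiB All-Reduce on 8N+8A: {t * 1e3:.2f} ms")
```

A model of this shape makes one structural point visible: the cross-vendor term is paid once per collective, while the vendor-local terms are bounded by the slower subgroup.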
5. Implementation and Engineering Considerations
Several practical challenges are addressed via TACC's design:
- Device selection: TACC intercepts native device queries and maps `taccSetDevice` directly to vendor runtime calls.
- Stream association: Internal TACC streams are opaque and mapped to `cudaStream_t` or `hipStream_t` at runtime; collectives are launched on the platform-native handle.
- Synchronization: HetCCL invokes `cudaStreamSynchronize` and its HIP equivalent via TACC before and after RDMA to maintain correct operation ordering. A global barrier (MPI/NCCL group) aligns nodes across vendors.
- Kernel binary management: CUDA and HIP kernels are compiled into separate shared objects (`libccl_cuda_kernels.so`, `libccl_hip_kernels.so`); these are resolved dynamically for each platform.
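The per-platform resolution of kernel binaries can be sketched as a lookup followed by a dynamic load. The two shared-object names are the ones listed above; the `"cuda"`/`"hip"` platform labels and both function names are hypothetical, and the loader here simply returns `None` when the library is not installed, which is an assumption about error handling rather than documented behavior.

```python
import ctypes
from typing import Optional

# Shared-object names as given in the text; the dictionary keys are
# hypothetical platform labels.
KERNEL_LIBS = {
    "cuda": "libccl_cuda_kernels.so",
    "hip": "libccl_hip_kernels.so",
}


def kernel_library_name(platform: str) -> str:
    """Return the device-code shared object for the given platform."""
    try:
        return KERNEL_LIBS[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform!r}") from None


def load_kernel_library(platform: str) -> Optional[ctypes.CDLL]:
    """Load the platform's kernel library, or return None if it is missing."""
    try:
        return ctypes.CDLL(kernel_library_name(platform))
    except OSError:
        # ctypes raises OSError when the shared object cannot be found.
        return None
```

Keeping the device code in separate shared objects means a host process only loads the binary that matches its GPU, so a node with one vendor's hardware never needs the other vendor's toolchain at runtime.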
6. Evaluated Results and Efficiency
Benchmarks highlight the efficacy of TACC:
| Setup | Bandwidth / Speedup | Notes |
|---|---|---|
| NVIDIA (PCIe Gen3) | ≈ 15 GiB/s (HetCCL ≈ NCCL) | Point-to-point |
| AMD (PCIe Gen4) | ≈ 22 GiB/s (HetCCL ≈ RCCL) | Point-to-point |
| N↔A RDMA | ≈ 14 GiB/s | Cross-vendor, bounded by slowest link |
| Homogeneous | ≤2% overhead vs. NCCL/RCCL (up to 8 GPUs) | Collective bandwidth |
| Heterogeneous | Scales to 12–16 GPUs (4N+4A, 8N+8A) | Stable bandwidth |
| 8A+8N vs. 8N | up to 1.48× | LLM training speedup |
| 8A+8N vs. 8A | up to 2.97× | LLM training speedup |
| Efficiency | 90–97% of sum of homogeneous runs | Heterogeneous setup |
These results confirm that TACC preserves vendor-level optimizations and delivers high efficiency in multi-vendor settings (Kim et al., 30 Jan 2026).
7. Insights, Limitations, and Future Perspectives
- Delegation to vendor-native libraries is crucial, as re-implementing collectives independently of NCCL/RCCL introduces prohibitive performance penalties.
- Minimal abstraction is advantageous; TACC restricts itself to low-level runtime features (allocation, streams, kernel launch), leaving high-level algorithms and optimizations to existing vendor or HetCCL orchestrators.
- RDMA-based cross-vendor communication is enabled with standard InfiniBand Verbs and GPUDirect/DirectGMA, with no required driver modifications.
- Static load-balancing via short profiling phases enables >90% overall efficiency, assigning LLM micro-batches by observed token/s rate to prevent stragglers.
- Current limitations include a focus on inter-node heterogeneity; intra-node setups with mixed vendors on the same PCIe root domain remain an open problem.
- Dynamic load-balancing (re-profiling to account for thermal variation or resource contention) is identified as a potential future extension.
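The static load-balancing idea mentioned above can be sketched as a proportional split of micro-batches by profiled throughput. The text only states that micro-batches are assigned by observed token/s rate; the largest-remainder rounding used here, the function name, and the sample rates are assumptions made for illustration.

```python
from typing import Dict


def assign_micro_batches(tokens_per_s: Dict[str, float], total: int) -> Dict[str, int]:
    """Split `total` micro-batches in proportion to each group's token/s rate.

    Uses largest-remainder rounding so the shares always sum to `total`.
    """
    rate_sum = sum(tokens_per_s.values())
    if rate_sum <= 0:
        raise ValueError("at least one positive rate is required")

    # Ideal (fractional) share for each group.
    ideal = {name: total * rate / rate_sum for name, rate in tokens_per_s.items()}
    # Start from the floor of each share.
    shares = {name: int(value) for name, value in ideal.items()}
    # Hand leftover micro-batches to the largest fractional remainders.
    leftover = total - sum(shares.values())
    by_remainder = sorted(ideal, key=lambda name: ideal[name] - shares[name], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical profiled rates: one group is three times faster than the other.
print(assign_micro_batches({"nvidia": 3000.0, "amd": 1000.0}, total=16))
```

With shares proportional to throughput, each group needs roughly the same wall-clock time per step, which is what keeps the slower devices from becoming stragglers.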
TACC exemplifies a generic strategy for unified GPU runtime abstraction: maintain a minimal interface (malloc/free, stream/event, memcpy, kernel launch), leverage dynamic linking for device-specific code, orchestrate cross-vendor operations only at defined boundaries, and preserve all performance-critical native library paths (Kim et al., 30 Jan 2026).