- The paper introduces SpeCL, a novel GPU communication stack that uses a layered API design to address limitations in existing communication libraries.
- The paper demonstrates a low-level primitive API that offers asynchronous, fine-grained control over GPU hardware, achieving up to 3.8x speedup versus NCCL.
- The paper shows that SpeCL significantly enhances performance in collective operations and end-to-end AI workloads on NVIDIA and AMD GPUs.
This paper introduces SpeCL (Microsoft Collective Communication Library ++), a novel GPU communication stack designed to address the limitations of existing libraries like NCCL, RCCL, and MSCCL in the context of rapidly evolving AI hardware and application-specific optimization needs (2504.09014). The authors argue that current libraries often lack the flexibility required for cutting-edge performance, forcing developers to create custom, non-portable communication stacks, leading to redundant effort.
Core Idea: Separation of Concerns
SpeCL's central design principle is the separation of concerns, offering hierarchical interfaces:
- SpeCL Primitive API: A low-level, in-kernel API providing fundamental communication building blocks: `put`, `get`, `signal`, `wait`, and `flush`. These primitives are designed to be zero-copy, one-sided, and asynchronous, offering fine-grained control close to the hardware. This minimal hardware abstraction aims to serve as common ground for software and hardware developers, facilitating quick adaptation to new hardware features.
- SpeCL DSL API: Re-implements and extends MSCCLang [cowan2023mscclang] over the SpeCL Primitive API. It allows developers to define custom collective communication algorithms in a high-level Python-based language, which are then executed by a GPU kernel (DSL Executor). This targets users needing custom algorithms without deep hardware expertise.
- SpeCL Collective API: Provides an NCCL-compatible API built on top of the SpeCL stack. This allows applications that use NCCL/RCCL to adopt SpeCL with minimal code changes (see the snippet after this list), while still letting users plug in custom algorithms developed with the DSL API for better performance.
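To make the drop-in claim concrete, here is a minimal sketch: since the Collective API is NCCL-compatible, an existing NCCL call site like the one below should, per the paper's claim, run on SpeCL without source changes. The communicator, stream, and device buffers are assumed to be set up elsewhere.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// An existing NCCL call site; per the paper, such code adopts SpeCL with
// minimal changes because the Collective API mirrors NCCL's interface.
// comm, stream, and the device buffers are assumed to be set up elsewhere.
void allreduceStep(const float* sendbuf, float* recvbuf, size_t count,
                   ncclComm_t comm, cudaStream_t stream) {
  // Sum-reduce `count` floats across all ranks participating in `comm`.
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
}
```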
Communication Abstractions
SpeCL abstracts underlying hardware communication mechanisms into channels that can be invoked from GPU kernels:
- PortChannel: Abstracts port-mapped I/O (PMIO), typically involving dedicated hardware engines such as DMA engines or RDMA NICs. Initiation often requires CPU assistance (e.g., via `ibv_post_send` for InfiniBand or `cudaMemcpy` for DMA), but SpeCL handles this transparently using helper CPU threads and request queues. Used for DMA-copy over NVLink, xGMI, PCIe, and InfiniBand.
- MemoryChannel: Abstracts memory-mapped I/O (MMIO), allowing direct peer-to-peer GPU memory access using GPU threads (thread-copy). Offers two protocols: High-Bandwidth (HB) for large messages (amortized synchronization) and Low-Latency (LL) for small messages (frequent, fine-grained synchronization). Used for thread-copy over NVLink, xGMI, and PCIe.
- SwitchChannel: Abstracts switch-based collective operations, such as NVIDIA NVSwitch with NVLink SHARP (NVLS) technology [nvls], enabling hardware-accelerated reductions and broadcasts using multimem instructions (see the sketch after this list).
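As an illustration (not SpeCL's actual implementation), the following device helpers show the kind of NVLS multimem access a SwitchChannel can issue. `mc` is assumed to be a multicast address already mapped over the corresponding buffers on all peer GPUs; these instructions require sm_90 and PTX ISA 8.1+.

```cuda
// Illustrative only: load-reduce and broadcast-store via NVLS multimem
// instructions. `mc` is assumed to be a multicast pointer spanning the
// corresponding element in every peer GPU's buffer.
__device__ float multimemLoadReduceAdd(const float* mc) {
  float v;
  // Reads the element from all peers and returns their in-switch sum.
  asm volatile("multimem.ld_reduce.relaxed.sys.global.add.f32 %0, [%1];"
               : "=f"(v) : "l"(mc) : "memory");
  return v;
}

__device__ void multimemStoreBroadcast(float* mc, float v) {
  // Writes `v` to the corresponding element on every peer GPU.
  asm volatile("multimem.st.relaxed.sys.global.f32 [%0], %1;"
               :: "l"(mc), "f"(v) : "memory");
}
```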
Primitives
- `put(dstOffset, srcOffset, size, ...)`: Asynchronously initiates a data transfer from the source buffer (local GPU) to the destination buffer (remote GPU) associated with the channel.
- `get(...)`: Implicitly supported through `read`/`write` in MemoryChannel; allows fetching data from a remote GPU.
- `signal()`: Asynchronously signals the remote GPU, typically to indicate data availability or completion. Ordered relative to preceding `put` operations.
- `wait()`: Synchronously waits for a signal from the remote GPU (e.g., waits for a semaphore to reach an expected value).
- `flush()`: Synchronizes locally, ensuring that all preceding asynchronous `put` and `signal` operations on the calling GPU have been initiated or completed (depending on the channel type), making local source buffers safe to reuse.
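Putting these together, here is a hypothetical kernel fragment (the channel type and its construction are not shown in this summary, so they are left generic) that pushes a local chunk to a peer, signals it, and waits for the peer's chunk in return:

```cuda
// Hypothetical usage sketch, not SpeCL's exact API: a one-sided exchange
// built from put/signal/wait, issued from inside a GPU kernel. `Channel`
// stands in for whichever channel type (Memory/Port) the code was built on.
template <typename Channel>
__device__ void exchangeChunk(Channel& chan, size_t dstOff, size_t srcOff,
                              size_t bytes) {
  if (threadIdx.x == 0) {
    chan.put(dstOff, srcOff, bytes);  // async, zero-copy write to the peer
    chan.signal();                    // ordered after the preceding put
    chan.wait();                      // block until the peer signals back
  }
  __syncthreads();  // all threads may now consume the received chunk
}
```

Because the primitives run in-kernel, a computation stage can be fused directly before or after such an exchange, which is exactly the co-design opportunity highlighted under Advantages below.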
Advantages Highlighted
- Flexibility & Customization: The Primitive API allows developers to implement highly customized communication patterns fused with computation kernels, enabling optimizations not easily achievable with NCCL's host-called, synchronous primitives.
- Performance: Achieved through asynchronous operations, batched synchronization (`signal`/`wait`), specialized kernels with lower overhead (e.g., fewer register spills than NCCL), the ability to choose the optimal data transfer mode (thread-copy vs. DMA-copy via MemoryChannel/PortChannel), and exploitation of hardware features such as NVLS (via SwitchChannel).
- Hardware Adaptability: The low-level primitive interface simplifies supporting new hardware. The paper cites rapid bring-up for NVIDIA H100 NVLS (8 weeks) and AMD MI300x (7 weeks, <10 lines of AMD-specific code in the core library).
- Reduced Development Effort: Provides reusable primitive building blocks, reducing the need to write custom stacks from scratch.
Implementation Details
- Initialization: Uses a `Bootstrap` mechanism (default: POSIX sockets) for inter-process metadata exchange and a `Communicator` object to manage buffer registration and channel creation.
- PortChannel Implementation: Uses managed-memory queues for GPU-to-CPU requests and helper CPU threads to invoke the underlying transfer/atomic operations (e.g., `ibv_post_send`, `ibv_atomic_add`).
- MemoryChannel Implementation: Directly uses GPU threads for copies. The HB protocol synchronizes large chunks via `signal`/`wait`; the LL protocol embeds flags alongside the data for finer-grained synchronization via `read`/`write` (see the sketch after this list).
- SwitchChannel Implementation: Leverages PTX multimem instructions (`multimem.ld_reduce`, `multimem.st`) for NVLS.
- Memory Consistency: Relies primarily on atomic operations (compliant with the C++11 memory model) rather than weaker hardware/compiler features, for robustness and portability across evolving hardware and compilers.
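To make the LL protocol's flag embedding concrete, here is a minimal sketch (an illustration under assumed details, not SpeCL's actual code): each 8-byte payload travels with an 8-byte flag in a single 16-byte access over peer-mapped memory, so the receiver detects arrival by polling the flag and needs no separate `signal`.

```cuda
#include <cstdint>

// Illustrative LL-style (low-latency) transfer: data and flag are packed
// into one 16-byte vectorized access so they arrive together, letting the
// receiver synchronize by spinning on the flag embedded with the data.
struct LLPacket {
  uint64_t data;
  uint64_t flag;  // epoch value; matching the expected flag means "valid"
};

__device__ void llWrite(LLPacket* dst, uint64_t data, uint64_t flag) {
  // One 16-byte store keeps data and flag indivisible over the peer mapping.
  asm volatile("st.volatile.global.v2.u64 [%0], {%1,%2};"
               :: "l"(dst), "l"(data), "l"(flag) : "memory");
}

__device__ uint64_t llRead(LLPacket* src, uint64_t expectedFlag) {
  uint64_t data, flag;
  do {  // poll until the embedded flag shows the payload has landed
    asm volatile("ld.volatile.global.v2.u64 {%0,%1}, [%2];"
                 : "=l"(data), "=l"(flag) : "l"(src));
  } while (flag != expectedFlag);
  return data;
}
```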
Evaluation
- Environments: NVIDIA A100 (40G/80G), H100, AMD MI300x GPUs with NVLink/Infinity Fabric and InfiniBand.
- Baselines: NCCL, RCCL, MSCCL.
- Collective Microbenchmarks (AllReduce):
- SpeCL significantly outperforms baselines across various GPU types, node counts (1, 2, 4), and message sizes.
- Speedups up to 3.8x for small messages (latency) and 2.2x for large messages (algorithmic bandwidth) were observed compared to NCCL/RCCL/MSCCL.
- Demonstrates the effectiveness of SpeCL-specific algorithms (e.g., 1PA, 2PA, 2PH, 2PAM) and the efficiency gains from the primitive API implementation even when using the same high-level algorithm as MSCCL.
- For H100 NVLS (2PAM algorithm), SpeCL significantly outperforms NCCL/MSCCL using the same underlying hardware feature, indicating lower software overhead in SpeCL primitives.
- LLM Inference (Llama2-70b on vLLM):
- Showed a 4%-15% end-to-end speedup for the decode phase (dominated by small AllReduce operations from tensor parallelism) on a single A100-80G node compared to NCCL.
Conclusion
SpeCL proposes a rethinking of GPU communication abstractions by providing a layered approach centered around a flexible, low-level Primitive API. This design facilitates hardware-specific optimization, enables co-design of communication and computation, accelerates the adoption of new hardware, and demonstrably improves performance for collective operations and end-to-end AI workloads compared to state-of-the-art libraries (2504.09014). SpeCL is open source, is in production use within Microsoft Azure, and has been adopted by AMD's RCCL.