- The paper conducts a comprehensive evaluation of GPU-to-GPU communication across three exascale supercomputers, revealing key performance bottlenecks in both intra-node and inter-node transfers.
- It uses targeted benchmarks with technologies like GPU-Aware MPI and NCCL/RCCL to assess point-to-point and collective communication performance.
- The results underscore the importance of tailored tuning and noise mitigation strategies for optimizing interconnect performance in next-generation supercomputers.
Insights into GPU-to-GPU Communication in Exascale Supercomputers
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects by Daniele De Sensi et al. presents an in-depth analysis of GPU-to-GPU communication in three state-of-the-art exascale supercomputers: Alps, Leonardo, and LUMI. The paper meticulously evaluates the interconnect performance and scalability, focusing on both intra-node and inter-node communication. This summary aims to distill the key findings and implications of the research for a specialized audience, highlighting the performance characteristics, optimization opportunities, and future directions in multi-GPU supercomputing.
Methodology and Systems
The research analyzes three major supercomputers, each varying in architecture and interconnect technologies:
- Alps: Deployed by CSCS, built from NVIDIA GH200 Grace Hopper superchips (H100 GPUs) connected intra-node via NVLink 4.0 and across nodes via HPE Cray Slingshot 11.
- Leonardo: Operated by CINECA, featuring NVIDIA A100 GPUs connected intra-node via NVLink 3.0 and across nodes via NVIDIA InfiniBand HDR.
- LUMI: Hosted by CSC, equipped with AMD Instinct MI250X GPUs connected intra-node via AMD Infinity Fabric and across nodes via HPE Cray Slingshot 11.
The analysis covers both point-to-point and collective communications using various technologies such as NCCL/RCCL, GPU-Aware MPI, and explicit device-to-device copies.
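To make the GPU-Aware MPI mechanism concrete, the following is a minimal point-to-point sketch, not the authors' benchmark code: the device pointer returned by cudaMalloc is handed directly to MPI_Send/MPI_Recv, and an MPI library built with CUDA support moves the data without an explicit staging copy through host memory. The buffer size, tag, and two-rank layout are illustrative assumptions; on LUMI the analogous path would use HIP allocations.

```cpp
// Minimal GPU-aware MPI ping-pong sketch (illustrative, not the paper's code).
// Ranks 0 and 1 exchange a device-resident buffer; a GPU-aware MPI build lets
// us pass the device pointer directly to MPI_Send/MPI_Recv.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t bytes = 1 << 20;          // 1 MiB payload (arbitrary choice)
    char* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);             // buffer lives in GPU memory
    cudaMemset(d_buf, rank, bytes);

    const int peer = 1 - rank;             // assumes exactly two ranks
    if (rank == 0) {
        MPI_Send(d_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(d_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(d_buf, (int)bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("ping-pong of %zu bytes completed\n", bytes);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

A sketch like this would typically be compiled with the system's MPI wrapper, linked against the CUDA (or HIP) runtime, and launched with two ranks placed either on the same node (intra-node) or on different nodes (inter-node).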
Intra-Node Performance
Point-to-Point Transfers
The paper's point-to-point benchmarks reveal consistent differences in goodput and latency across the systems:
- NVIDIA-based Systems: Approach nominal NVLink bandwidth for large transfers and incur only minimal latency penalties with GPU-Aware MPI for small messages.
- AMD-based System (LUMI): Shows goodput that varies with which GPU pair communicates, a consequence of the heterogeneous bandwidth of the Infinity Fabric links; the device-to-device copy sketch below illustrates how such pair-by-pair differences can be probed.
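As a rough illustration of how explicit device-to-device copies can expose pair-dependent bandwidth, the sketch below times a single cudaMemcpyPeerAsync between two GPUs of one node. The device IDs and transfer size are arbitrary assumptions, and a real benchmark would add warm-up and repetition; on LUMI the equivalent HIP calls (hipMemcpyPeerAsync, hipEvent*) would be used.

```cpp
// Sketch: time an explicit device-to-device copy between two GPUs in a node.
// Repeating this for every (src, dst) pair is one way to probe pair-dependent
// goodput such as that reported on LUMI's Infinity Fabric.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int src = 0, dst = 1;            // assumed device IDs
    const size_t bytes = 256ull << 20;     // 256 MiB transfer (arbitrary)

    void *d_src = nullptr, *d_dst = nullptr;
    cudaSetDevice(src);
    cudaMalloc(&d_src, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);    // enable direct src -> dst access
    cudaSetDevice(dst);
    cudaMalloc(&d_dst, bytes);
    cudaDeviceEnablePeerAccess(src, 0);

    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyPeerAsync(d_dst, dst, d_src, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d -> GPU %d: %.1f GB/s\n", src, dst,
           (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```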
Collective Operations
For collective operations such as alltoall and allreduce:
- NCCL/RCCL: Achieves higher goodput than GPU-Aware MPI, particularly for larger messages, thanks to optimizations tailored to intra-node topologies (a minimal allreduce sketch follows this list).
- MPI Collectives: Deliver noticeably lower performance, particularly for operations involving data aggregation, potentially because GPU kernel execution and data-transfer synchronization are not coordinated as tightly as in NCCL/RCCL.
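For readers unfamiliar with how NCCL/RCCL collectives are driven, here is a minimal single-process, multi-GPU allreduce sketch. It is illustrative rather than the paper's benchmark; the device count and element count are assumptions, and RCCL exposes the same API on AMD GPUs.

```cpp
// Sketch: a single-process, multi-GPU NCCL allreduce.
// One communicator, buffer pair, and stream per local GPU.
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);                       // e.g., 4 GPUs per node
    const size_t count = 1 << 24;                    // 16M floats (arbitrary)

    std::vector<ncclComm_t> comms(ndev);
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);
    std::vector<cudaStream_t> streams(ndev);

    ncclCommInitAll(comms.data(), ndev, nullptr);    // nullptr = devices 0..ndev-1
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Launch the allreduce on every GPU inside a single group call.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);           // wait for completion
    }
    printf("allreduce of %zu floats across %d GPUs done\n", count, ndev);
    return 0;
}
```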
Inter-Node Performance
Point-to-Point Communications
Experiments on inter-node point-to-point communication show that GPU-Aware MPI achieves higher bandwidth and lower latency than NCCL/RCCL, owing to its lower per-message overhead. Notably:
- Leonardo: Experiences significant latency spikes and bandwidth drops when different groups of nodes communicate, an effect attributed to network noise and congestion (the latency-distribution sketch below shows one way to expose such spikes).
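One simple way to surface noise-induced spikes, not specific to the paper's methodology, is to repeat a small GPU-aware ping-pong many times and inspect the tail of the measured latencies. In the sketch below, the iteration count, message size, and placement of the two ranks on distinct nodes are assumptions.

```cpp
// Sketch: expose latency variability by repeating a small ping-pong and
// comparing the median with the 99th percentile and maximum.
// Assumes two ranks on different nodes and a GPU-aware MPI build.
#include <mpi.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000, bytes = 8;    // tiny message, many repetitions
    char* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    std::vector<double> lat(iters);
    const int peer = 1 - rank;
    for (int i = 0; i < iters; ++i) {
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(d_buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(d_buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
        lat[i] = (MPI_Wtime() - t0) / 2.0 * 1e6;   // one-way estimate, microseconds
    }

    if (rank == 0) {
        std::sort(lat.begin(), lat.end());
        printf("median %.2f us, p99 %.2f us, max %.2f us\n",
               lat[iters / 2], lat[(int)(iters * 0.99)], lat.back());
    }
    MPI_Finalize();
    return 0;
}
```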
Network Congestion and Noise
The research identifies network noise as a significant limiter of scalability, causing up to a 50% degradation in performance for large-scale collectives:
- Service Level Selection: Explored as a mitigation strategy on InfiniBand systems such as Leonardo. Mapping traffic onto different service levels (and hence separate virtual lanes) can reduce interference, but it is not a comprehensive solution, particularly under heavy network load; a configuration sketch follows below.
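As an illustration of what service-level selection can look like in practice, the sketch below sets the UCX_IB_SL environment variable before MPI initialization, assuming a UCX-based MPI stack on an InfiniBand system. Whether the chosen level is actually less congested depends on the fabric's SL-to-virtual-lane mapping, and in practice the variable is usually exported in the job script rather than set in code.

```cpp
// Sketch: requesting a specific InfiniBand service level for a job's traffic.
// Assumes a UCX-based MPI stack that honors UCX_IB_SL; treat this as an
// illustrative configuration knob, not a guaranteed noise remedy.
#include <mpi.h>
#include <cstdlib>

int main(int argc, char** argv) {
    // Must be set before MPI_Init so the transport layer picks it up;
    // exporting it in the job script achieves the same effect.
    setenv("UCX_IB_SL", "1", /*overwrite=*/1);

    MPI_Init(&argc, &argv);
    // ... communication as usual; UCX should tag IB traffic with the requested SL ...
    MPI_Finalize();
    return 0;
}
```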
Implications and Future Directions
This comprehensive performance characterization underscores several critical implications:
- Optimization of MPI Software Stacks: The paper identifies clear opportunities for enhancing GPU-Aware MPI, particularly in its handling of collective operations.
- Better Noise Mitigation Strategies: While service levels offer some respite, more robust methods to manage adaptive routing and minimize noise impact are essential.
- System-Specific Tuning: The research emphasizes the necessity of customized tuning for each supercomputing system, suggesting that one-size-fits-all solutions are inadequate for maximizing performance across diverse architectures.
Conclusion
The findings presented by De Sensi et al. are instructive for both current users of exascale supercomputers and those involved in designing next-generation systems. The delineation of performance bottlenecks and potential optimization paths provides a foundation for informed improvements in supercomputing infrastructure. Future research must continue to home in on the intricate dynamics of multi-GPU communication, addressing the variability and scaling challenges observed in this thorough investigation.
This paper serves as a crucial reference for system architects and software developers, offering pragmatic insights that can be leveraged to improve both the efficiency and scalability of GPU-to-GPU communications in cutting-edge supercomputing environments.