- The paper presents a detailed investigation of inter-APU communication strategies on AMD MI300A systems, leveraging Infinity Fabric to maximize GPU performance.
- The paper evaluates various programming models and memory allocation techniques using micro-benchmarks and case studies to measure latency and bandwidth improvements.
- The paper identifies key optimizations, demonstrating that hardware utilization and specific programming choices can significantly enhance data movement in high-performance computing.
Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive
This paper explores and evaluates data movement strategies on high-performance computing (HPC) nodes built around AMD MI300A systems, which integrate CPUs and GPUs within a single package. Four of these APUs are interconnected in a compute node via Infinity Fabric, and using this interconnect efficiently is necessary to harness the full computing power of the node's GPUs.
Architecture of AMD MI300A Systems
The AMD MI300A integrates CPU and GPU components behind a unified memory, in contrast to NVIDIA's NVLink-based approach, which maintains separate memory spaces. Each MI300A APU incorporates 24 CPU cores and 228 GPU compute units, interconnected through Infinity Fabric (Figure 1).
Figure 1: Node Architecture (right) with four MI300A APUs, and detailed APU architecture (left). The Infinity Fabric (in blue) interconnects the four APUs. From the user perspective, each APU is a NUMA node in this cache-coherent NUMA system.
The Infinity Fabric interconnects the APUs, providing a bandwidth of 128 GB/s per direction, essential for designing efficient communication strategies (Figure 2).
Figure 2: A taxonomy of communication on multi-APU systems, associated data movement categories and programming interfaces and libraries.
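Before diving into the taxonomy, a minimal sketch helps make the node layout concrete. Assuming each of the four APUs is exposed as one HIP device (the device indices and printout are illustrative, not taken from the paper), the following code enumerates the devices and queries whether each pair can reach the other's memory over Infinity Fabric:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Hedged sketch: list the APUs visible as HIP devices and check whether each
// pair can access the other's memory (peer access over Infinity Fabric).
// Assumes a single node where each of the four APUs appears as one device.
int main() {
    int deviceCount = 0;
    hipGetDeviceCount(&deviceCount);   // expected to report 4 on a 4-APU node
    std::printf("HIP devices (APUs) visible: %d\n", deviceCount);

    for (int src = 0; src < deviceCount; ++src) {
        for (int dst = 0; dst < deviceCount; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            hipDeviceCanAccessPeer(&canAccess, src, dst);
            std::printf("APU %d -> APU %d peer access: %s\n",
                        src, dst, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```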
Communication Taxonomy
The research defines a taxonomy of communication mechanisms on the MI300A platform:
- Direct Memory Access (DMA): Facilitates in-kernel memory access from GPU compute units across APUs, ensuring cache coherence and high-bandwidth data transfer.
- Explicit Data Movement: Uses APIs such as HIP memory copies and standard libraries to move data between the memory spaces of different APUs, often leveraging the SDMA engines for parallelization and efficiency (a sketch of both mechanisms follows this list).
- Point-to-Point and Collective Communications: Managed via MPI for distributed processing, leveraging specialized libraries such as RCCL for enhanced GPU-GPU interconnect efficiency.
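To make the first two categories concrete, here is a minimal sketch, assuming a coherent MI300A node where two APUs are visible as HIP devices 0 and 1 (buffer names and sizes are illustrative, not from the paper): it performs an explicit peer copy, which the runtime may hand to the SDMA engines, and then has a kernel on one APU load the other APU's memory directly.

```cpp
#include <hip/hip_runtime.h>

// Hedged sketch of explicit data movement and direct in-kernel access,
// assuming a coherent MI300A node where APUs 0 and 1 are HIP devices 0 and 1.

__global__ void readRemote(const double* remote, double* local, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) local[i] = remote[i];   // direct load across Infinity Fabric
}

int main() {
    const size_t n = 1 << 24;          // illustrative buffer size
    double *buf0 = nullptr, *dst0 = nullptr, *buf1 = nullptr;

    hipSetDevice(0);
    hipMalloc(&buf0, n * sizeof(double));
    hipMalloc(&dst0, n * sizeof(double));
    hipSetDevice(1);
    hipMalloc(&buf1, n * sizeof(double));

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);   // likely a no-op on a coherent MI300A node

    // Explicit data movement: copy APU 1's buffer into APU 0's buffer.
    // The runtime may offload this transfer to the SDMA engines.
    hipMemcpyPeerAsync(buf0, /*dstDev=*/0, buf1, /*srcDev=*/1,
                       n * sizeof(double), /*stream=*/0);
    hipDeviceSynchronize();

    // Direct (in-kernel) access: a kernel running on APU 0 loads APU 1's memory.
    readRemote<<<(n + 255) / 256, 256>>>(buf1, dst0, n);
    hipDeviceSynchronize();

    hipFree(buf0); hipFree(dst0);
    hipSetDevice(1); hipFree(buf1);
    return 0;
}
```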
Evaluation Methodology
The research employs micro-benchmarks and real-world application tests to evaluate data movement efficiency on the MI300A platform:
- Latency and Bandwidth Testing: Benchmarks such as a GPU-adapted STREAM and a pointer-chasing kernel measure performance for both local and remote memory accesses, capturing access latency and the maximum achievable bandwidth (latency results in Figures 3 and 4; an illustrative pointer-chasing sketch follows this subsection's figure captions).
Figure 3: GPU memory access latency, measured with a pointer-chasing approach, for data located locally, or on a neighbour APU.
Figure 4: CPU memory access latency, measured with a pointer-chasing approach, for data either located locally or on a neighbour APU.
- Application Case Studies: Optimizing the HPC applications Quicksilver and CloverLeaf demonstrates improved inter-APU communication efficiency and reduced runtime when the identified optimization strategies are applied (Figures 5 and 6).
Figure 5: End-to-end runtime measured in Quicksilver for all input problems, comparing the impact of XNACK settings and allocators.
Figure 6: End-to-end runtime (in seconds) of CloverLeaf using the original implementation and our implementation with various memory allocators. All adapted versions are marked with an asterisk (*). Average over five runs.
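Returning to the latency measurements of Figures 3 and 4: the pointer-chasing idea can be sketched as below. This is an illustrative reconstruction under assumptions (chain length, step count, and device placement are invented for the example), not the paper's benchmark code. A single GPU thread walks a randomly permuted cyclic list, so every load depends on the previous one, and the elapsed time divided by the number of steps approximates the raw access latency; allocating the chain on the local APU or on a neighbour APU distinguishes the two curves.

```cpp
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Illustrative pointer-chasing kernel: a single thread follows a dependent
// chain of indices, so each load must complete before the next can start.
__global__ void chase(const size_t* next, size_t steps, size_t* sink) {
    size_t idx = 0;
    for (size_t s = 0; s < steps; ++s)
        idx = next[idx];          // dependent load: no latency hiding possible
    *sink = idx;                  // keep the compiler from removing the loop
}

int main() {
    const size_t n = 1 << 22;     // chain length (illustrative)
    const size_t steps = 1 << 20; // number of dependent loads timed

    // Build a random cyclic permutation on the host so prefetching cannot help.
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937{42});
    std::vector<size_t> next(n);
    for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    // Remote case: allocate the chain on APU 1 but run the kernel on APU 0.
    // For the local case, allocate and run on the same device.
    size_t *d_next = nullptr, *d_sink = nullptr;
    hipSetDevice(1);
    hipMalloc(&d_next, n * sizeof(size_t));
    hipMemcpy(d_next, next.data(), n * sizeof(size_t), hipMemcpyHostToDevice);

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);   // likely unnecessary on a coherent node
    hipMalloc(&d_sink, sizeof(size_t));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start, 0);
    chase<<<1, 1>>>(d_next, steps, d_sink);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("average access latency: %.1f ns\n", ms * 1.0e6f / steps);
    return 0;
}
```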
Key Observations and Optimizations
- Allocator Impact: Communication performance depends heavily on the memory allocation strategy; allocating buffers with hipMalloc achieves the maximum interconnect bandwidth across the various MPI and RCCL configurations (a hedged sketch follows this list).
- Programming Model Efficiency: GPU-centric interfaces (RCCL) tend to outperform CPU-centric approaches (MPI) for large message sizes due to efficient Infinity Fabric usage. For small messages, MPI's CPU-staging offers lower latency.
- Hardware Utilization: Disabling SDMA improves bandwidth for certain MPI communication patterns, although RCCL consistently leverages hardware to achieve full bandwidth.
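As a concrete illustration of the allocator observation, the sketch below assumes a GPU-aware MPI build with one rank per APU (the rank-to-device mapping and message size are illustrative): the communication buffer is allocated with hipMalloc and its device pointer is passed directly to MPI, avoiding host staging.

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>

// Hedged sketch, assuming a GPU-aware MPI build with one rank per APU.
// The communication buffer is allocated with hipMalloc (coarse-grained device
// memory) and its device pointer is handed directly to MPI, so the transfer
// can travel over Infinity Fabric without staging through the host.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hipSetDevice(rank % 4);                // illustrative rank-to-APU mapping

    const int n = 1 << 26;                 // illustrative message size
    double* buf = nullptr;
    hipMalloc(&buf, (size_t)n * sizeof(double));   // vs. hipMallocManaged / malloc

    if (rank == 0)
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    hipFree(buf);
    MPI_Finalize();
    return 0;
}
```

Swapping hipMalloc for hipMallocManaged or plain malloc (with XNACK enabled) is the kind of allocator change whose impact the paper quantifies; whether the SDMA engines are used for such transfers can typically be toggled with ROCm's HSA_ENABLE_SDMA environment variable, which corresponds to the SDMA setting discussed above.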
Practical Implications and Future Work
These findings have practical implications for optimizing data movements in high-density GPU environments, such as those in emerging supercomputing platforms. Efficient data movement sustains application throughput, crucial for simulations and workloads in scientific computing and neural networks. Future work could explore integrating these optimized strategies into middleware frameworks to abstract and automate optimizations in multi-APU systems.
In conclusion, the paper delineates guidelines and strategies for optimizing APU-APU communication on AMD MI300A systems, enabling researchers and engineers to better leverage their computational infrastructure for scalable and efficient performance.