
Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive (2508.11298v2)

Published 15 Aug 2025 in cs.DC

Abstract: The ever-increasing compute performance of GPU accelerators drives up the need for efficient data movements within HPC applications to sustain performance. Proposed as a solution to alleviate CPU-GPU data movement, the AMD MI300A Accelerated Processing Unit (APU) combines CPU, GPU, and high-bandwidth memory (HBM) within a single physical package. Leadership supercomputers, such as El Capitan, group four APUs within a single compute node, using the Infinity Fabric interconnect. In this work, we design specific benchmarks to evaluate direct memory access from the GPU, explicit inter-APU data movement, and collective multi-APU communication. We also compare the efficiency of HIP APIs, MPI routines, and the GPU-specialized RCCL library. Our results highlight key design choices for optimizing inter-APU communication on multi-APU AMD MI300A systems with Infinity Fabric, including programming interfaces, allocators, and data movement. Finally, we optimize two real HPC applications, Quicksilver and CloverLeaf, and evaluate them on a four-MI300A-APU system.

Summary

  • The paper presents a detailed investigation of inter-APU communication strategies on AMD MI300A systems, leveraging Infinity Fabric to maximize GPU performance.
  • The paper evaluates various programming models and memory allocation techniques using micro-benchmarks and case studies to measure latency and bandwidth improvements.
  • The paper identifies key optimizations, demonstrating that allocator choice, programming interface, and hardware utilization (e.g., the SDMA engines) significantly affect data movement performance in high-performance computing.

Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive

This paper explores and evaluates data movement strategies in high-performance computing (HPC) nodes built on AMD MI300A systems, which integrate CPUs and GPUs within a single package. Four of these APUs are interconnected in a compute node using Infinity Fabric, an interconnect technology whose efficient use is necessary to harness the full computing power of modern GPUs.

Architecture of AMD MI300A Systems

The AMD MI300A integrates CPU and GPU components behind a single unified memory space, in contrast with NVIDIA's NVLink approach, which maintains separate memory spaces. Each MI300A APU incorporates 24 CPU cores and 228 GPU compute units, and the four APUs in a node are interconnected through Infinity Fabric (Figure 1).

Figure 1: Node Architecture (right) with four MI300A APUs, and detailed APU architecture (left). The Infinity Fabric (in blue) interconnects the four APUs. From the user perspective, each APU is a NUMA node in this cache-coherent NUMA system.
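
From a program's point of view, the four APUs appear as four HIP devices with mutual peer access. The following minimal sketch (an illustration, not code from the paper) enumerates the devices in a node and queries pairwise peer accessibility; on a four-APU MI300A node it should report four devices, each able to access the other three.

```cpp
// topology_probe.cpp: enumerate APUs and check pairwise peer access.
// Build (assumption): hipcc topology_probe.cpp -o topology_probe
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "no HIP devices found\n");
        return 1;
    }
    printf("%d HIP device(s) in this node\n", count);
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            hipDeviceCanAccessPeer(&canAccess, i, j);
            printf("device %d -> device %d: peer access %s\n",
                   i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```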

The Infinity Fabric interconnects the APUs with a bandwidth of 128 GB/s per direction, a constraint central to designing efficient communication strategies (Figure 2).

Figure 2: A taxonomy of communication on multi-APU systems, with the associated data-movement categories, programming interfaces, and libraries.

Communication Taxonomy

The research defines a taxonomy of communication mechanisms on the MI300A platform:

  1. Direct Memory Access (DMA): GPU compute units load and store memory resident on other APUs directly from within a kernel, relying on the cache-coherent Infinity Fabric for correctness and high-bandwidth transfer.
  2. Explicit Data Movement: APIs such as HIP's memory-copy routines move data between the memory of different APUs, often offloading to the SDMA engines for parallelism and efficiency (see the sketch after this list).
  3. Point-to-Point and Collective Communication: Managed via MPI for distributed processing, or via GPU-specialized libraries such as RCCL that exploit the GPU-GPU interconnect more efficiently.
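
As a concrete illustration of explicit data movement (category 2), here is a minimal sketch using HIP's peer-copy API. It is not code from the paper: the buffer size, device numbering, and build line are illustrative assumptions.

```cpp
// peer_copy.cpp: explicit inter-APU data movement with hipMemcpyPeer.
// Build (assumption): hipcc peer_copy.cpp -o peer_copy
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                             \
    do {                                                            \
        hipError_t err = (expr);                                    \
        if (err != hipSuccess) {                                    \
            fprintf(stderr, "HIP error %s at %s:%d\n",              \
                    hipGetErrorString(err), __FILE__, __LINE__);    \
            exit(1);                                                \
        }                                                           \
    } while (0)

int main() {
    const size_t bytes = 256 << 20;  // 256 MiB payload (arbitrary choice)

    // Allocate a source buffer in APU 0's HBM and a destination in APU 1's.
    void *src = nullptr, *dst = nullptr;
    HIP_CHECK(hipSetDevice(0));
    HIP_CHECK(hipMalloc(&src, bytes));
    HIP_CHECK(hipSetDevice(1));
    HIP_CHECK(hipMalloc(&dst, bytes));

    // Explicit APU-to-APU copy over Infinity Fabric; the runtime may
    // offload the transfer to the SDMA engines.
    HIP_CHECK(hipMemcpyPeer(dst, /*dstDevice=*/1, src, /*srcDevice=*/0, bytes));
    HIP_CHECK(hipDeviceSynchronize());

    printf("copied %zu bytes from APU 0 to APU 1\n", bytes);
    HIP_CHECK(hipFree(dst));
    HIP_CHECK(hipSetDevice(0));
    HIP_CHECK(hipFree(src));
    return 0;
}
```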

Evaluation Methodology

The research employs micro-benchmarks and real-world application tests to evaluate data movement efficiency on the MI300A platform:

  • Latency and Bandwidth Testing: GPU-adapted benchmarks in the style of STREAM, together with pointer-chasing latency tests, measure both local and remote memory accesses, capturing access latency and the maximum achievable bandwidth (Figures 3 and 4; a pointer-chasing sketch follows this list).

    Figure 3: GPU memory access latency, measured with a pointer-chasing approach, for data located locally, or on a neighbour APU.


    Figure 4: CPU memory access latency, measured with a pointer-chasing approach, for data either located locally or on a neighbour APU.

  • Application Case Studies: The HPC applications Quicksilver and CloverLeaf are optimized with the identified strategies, demonstrating improved inter-APU communication efficiency and reduced runtime (Figures 5 and 6).

    Figure 5: End-to-end runtime measured in Quicksilver for all input problems, comparing the impact of XNACK settings and allocators.


    Figure 6: End-to-end runtime (in seconds) of CloverLeaf using the original implementation and our implementation with various memory allocators. All adapted versions are marked with an asterisk (*). Average over five runs.
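
The pointer-chasing methodology behind Figures 3 and 4 can be sketched as below. This is a minimal illustration rather than the paper's benchmark code; the buffer size, step count, and launch configuration are assumptions. Local latency is measured as shown; remote latency follows by allocating the chase buffer on a neighbour APU (hipSetDevice before hipMalloc) while launching the kernel on the local one.

```cpp
// pointer_chase.cpp: sketch of the pointer-chasing idea behind Figures 3-4.
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// A single thread follows a random cyclic permutation; each load depends on
// the previous one, so total time / steps approximates per-access latency.
__global__ void chase(const unsigned* next, unsigned steps, unsigned* sink) {
    unsigned idx = 0;
    for (unsigned i = 0; i < steps; ++i)
        idx = next[idx];
    *sink = idx;  // keep the loop from being optimized away
}

int main() {
    const unsigned n = 1 << 24;   // 64 MiB of 4-byte indices: exceeds caches
    const unsigned steps = 1 << 20;

    // Build a random single-cycle permutation on the host.
    std::vector<unsigned> perm(n);
    std::iota(perm.begin(), perm.end(), 0u);
    std::shuffle(perm.begin() + 1, perm.end(), std::mt19937{42});
    std::vector<unsigned> next(n);
    for (unsigned i = 0; i + 1 < n; ++i) next[perm[i]] = perm[i + 1];
    next[perm[n - 1]] = perm[0];

    unsigned *d_next, *d_sink;
    hipMalloc(&d_next, n * sizeof(unsigned));
    hipMalloc(&d_sink, sizeof(unsigned));
    hipMemcpy(d_next, next.data(), n * sizeof(unsigned), hipMemcpyHostToDevice);

    hipEvent_t t0, t1;
    hipEventCreate(&t0); hipEventCreate(&t1);
    hipEventRecord(t0);
    hipLaunchKernelGGL(chase, dim3(1), dim3(1), 0, 0, d_next, steps, d_sink);
    hipEventRecord(t1);
    hipEventSynchronize(t1);

    float ms = 0.f;
    hipEventElapsedTime(&ms, t0, t1);
    printf("~%.1f ns per dependent load\n", ms * 1e6 / steps);

    hipFree(d_next); hipFree(d_sink);
    return 0;
}
```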

Key Observations and Optimizations

  1. Allocator Impact: Communication performance depends heavily on the memory allocation strategy. Buffers allocated with hipMalloc reach the maximum interconnect bandwidth across the MPI and RCCL configurations tested (a combined sketch follows this list).
  2. Programming Model Efficiency: GPU-centric interfaces (RCCL) tend to outperform CPU-centric approaches (MPI) for large message sizes thanks to more efficient Infinity Fabric usage; for small messages, MPI's CPU staging offers lower latency.
  3. Hardware Utilization: Disabling the SDMA engines improves bandwidth for certain MPI communication patterns, whereas RCCL consistently drives the interconnect to full bandwidth.
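
The following sketch combines observations 1 and 2: hipMalloc-allocated buffers fed to an RCCL all-reduce spanning the four APUs of a node. This is a hedged illustration, not the paper's benchmark; the element count, single-process multi-device setup, header path, and build line are assumptions.

```cpp
// rccl_allreduce.cpp: all-reduce across four APUs with hipMalloc buffers.
// Build (assumption): hipcc rccl_allreduce.cpp -lrccl -o rccl_allreduce
#include <rccl/rccl.h>  // header may be <rccl.h> on older ROCm installs
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const int nDev = 4;            // one communicator per MI300A APU
    const size_t count = 1 << 26;  // 64M floats = 256 MiB per APU (arbitrary)

    int devs[nDev] = {0, 1, 2, 3};
    float* bufs[nDev];
    hipStream_t streams[nDev];

    // hipMalloc returns device-resident, non-migratable memory; the paper
    // finds this allocator reaches full Infinity Fabric bandwidth.
    for (int i = 0; i < nDev; ++i) {
        hipSetDevice(devs[i]);
        hipMalloc(&bufs[i], count * sizeof(float));
        hipStreamCreate(&streams[i]);
    }

    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);  // single-process, multi-device setup

    // In-place sum across the four APUs over Infinity Fabric.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(bufs[i], bufs[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        hipSetDevice(devs[i]);
        hipStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
        hipFree(bufs[i]);
        hipStreamDestroy(streams[i]);
    }
    printf("all-reduce of %zu floats across %d APUs complete\n", count, nDev);
    return 0;
}
```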

Practical Implications and Future Work

These findings have practical implications for optimizing data movement in high-density GPU environments, such as emerging supercomputing platforms. Efficient data movement sustains application throughput, which is crucial for scientific simulations and neural-network workloads. Future work could integrate these optimized strategies into middleware frameworks that abstract and automate such optimizations on multi-APU systems.

In conclusion, the paper delineates guidelines and strategies for optimizing APU-APU communication on AMD MI300A systems, enabling researchers and engineers to better leverage their computational infrastructure for scalable and efficient performance.
