Infinity Fabric Interconnect Overview
- Infinity Fabric Interconnect is AMD’s modular, packet-based system fabric that efficiently links CPUs, GPUs, and memory, offering both explicit and implicit memory access.
- Architectural details include various link configurations (quad, dual, and single xGMI links) that achieve up to 81% of theoretical bandwidth and maintain low latency in HPC deployments.
- Performance relies on tailored software interfaces (hipMemcpy, RCCL, MPI) and NUMA-aware programming to optimize scheduling, data movement, and real-world application speedups.
The Infinity Fabric Interconnect (IFI) is AMD's high-bandwidth, low-latency system fabric connecting CPUs, GPUs, and memory subsystems within a node. In modern high-performance computing (HPC) and multi-GPU platforms, notably those featuring AMD Instinct MI250x GPUs and MI300A Accelerated Processing Units (APUs), IFI enables both die-to-die and device-to-device communication, supporting both explicit and implicit memory access models. The performance and utilization characteristics of this interconnect can be a limiting factor for a range of workloads, including scientific simulation and machine learning.
1. Architectural Overview of Infinity Fabric
The Infinity Fabric is architected as a modular, packet-based interconnect traversing CPUs, GPUs, and memory controllers. In the latest AMD platforms, the MI250x comprises two Graphics Compute Dies (GCDs) per GPU package, while the MI300A APU merges CPU cores, GPU compute, and high-bandwidth memory into a unified NUMA package.
- Link Types in MI250x Systems:
- Quad links: Intra-GPU GCD-to-GCD, bidirectional, 200+200 GB/s theoretical bandwidth.
- Dual links: Inter-GPU, 100+100 GB/s.
- Single links: Inter-GPU, 50+50 GB/s.
The physical medium rests on the xGMI protocol; each xGMI link in MI250x provides 16 bits per transaction at 25 GT/s, yielding 50 GB/s per direction.
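The per-link figure follows directly from the link width and transfer rate; the same arithmetic gives the 64 GB/s xGMI-3 figure quoted below:

$$
B_{\text{link}} = \frac{16\ \text{bit} \times 25\ \text{GT/s}}{8\ \text{bit/byte}} = 50\ \text{GB/s per direction}
$$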
- In MI300A Nodes:
- Links employ the xGMI-3 interface, 16 bits wide at 32 GT/s, yielding 64 GB/s unidirectional per link.
- Each pair of APUs is connected by two IF links (128 GB/s per direction).
- All four APUs per node are directly and symmetrically interlinked; NUMA diagrams in (Schieffer et al., 15 Aug 2025) illustrate this fully connected topology and the cache-coherent address space.
The schematic below reflects a typical MI300A node interconnect structure:
[APU0]-----[APU1]
  |    \   /   |
  |     \ /    |
  |      X     |
  |     / \    |
  |    /   \   |
[APU2]-----[APU3]
2. Bandwidth and Latency Characteristics
Bandwidth and latency across the IFI depend on link type, transfer method, and hierarchical topology.
- MI250x:
- Quad: theoretical 200 GB/s; explicit/DMA transfers yield roughly 25% of peak, i.e., 50 GB/s; implicit (kernel) access achieves 154 GB/s, about 77% (Pearson, 2023).
- Dual: 100 GB/s; explicit reaches 51 GB/s (51% of peak), implicit saturates at 77 GB/s (77%).
- Single: 50 GB/s; explicit 38 GB/s (76%), implicit 38 GB/s (77%).
- Latency:
- For MI250x GCD pairs, measured with point-to-point benchmarks, typical latencies range from roughly 8.7 μs upward, governed by link count and hop count (in some instances, software selects longer multi-hop routes for bandwidth optimization) (Schieffer et al., 1 Oct 2024).
- MI300A:
- Direct GPU kernel access: STREAM-like copies achieve 103–104 GB/s (roughly 81% of the 128 GB/s peak).
- Explicit hipMemcpy: for large transfers, about 90 GB/s using either SDMA or a blit kernel; for small transfers, a CPU-side memcpy (benefiting from CPU caching) minimizes latency.
- MPI point-to-point: with CPU staging, delivers the lowest small-message latencies.
- Collectives: RCCL achieves 5–38× lower collective latencies than MPI for moderate-to-large messages.
| Node Type | Link Type | Theoretical BW (GB/s) | Achieved BW, Explicit (GB/s) | Achieved BW, Kernel (GB/s) |
|---|---|---|---|---|
| MI250x | Quad | 200 | 50 (25%) | 154 (77%) |
| MI250x | Dual | 100 | 51 (51%) | 77 (77%) |
| MI250x | Single | 50 | 38 (76%) | 38 (77%) |
| MI300A | Any | 128 | 90 (~70%) | 104 (81%) |
These results demonstrate that achieved data-movement bandwidth is determined as much by the software interface and transfer method as by raw link capacity.
3. Programming Interfaces and Data Movement Methods
Data can be transferred across the IFI by several software pathways, with substantial variance in achieved bandwidth and latency (a minimal HIP sketch contrasting the explicit and implicit paths follows this list):
- Explicit (hipMemcpy, hipMemcpyPeer):
- Invoke SDMA engines (dedicated copy hardware).
- Effective for large transfers, but reach only a fraction of the theoretical link bandwidth (as low as 25% on the fastest links), with the DMA engines becoming the bottleneck on those links (Pearson, 2023, Schieffer et al., 1 Oct 2024).
- Implicit (kernel-level):
- GPU kernel accesses remote memory via mapped buffers or device pointers.
- Kernels exploit high concurrency and sustain higher bandwidth (roughly 77% of the theoretical value), especially on MI250x (Pearson, 2023).
- Managed (hipMallocManaged):
- Coarse-grained managed memory enables bandwidth comparable to implicit transfers (approximately 74% of peak or more).
- Prefetching (hipMemPrefetchAsync) is not effective; it produced slowdowns rather than speedups in the measurements (Pearson, 2023).
- MPI and Collective (MPI, RCCL):
- MPI point-to-point may provide lower latency for small (KB-scale) messages when CPU staging is used.
- RCCL (GPU-specialized) delivers the lowest latency and highest throughput for large collectives (AllReduce, etc.), outperforming MPI by factors of 5× to 38×, depending on operation and message size (Schieffer et al., 15 Aug 2025).
- Memory allocation: hipMalloc-allocated buffers yield the highest and most consistent device-to-device transfer rates regardless of communication API (Schieffer et al., 15 Aug 2025).
- SDMA engine tuning: Disabling SDMA (e.g., via HSA_ENABLE_SDMA=0) forces the runtime to use GPU "blit" copy kernels, which in some situations outperform the hardware SDMA engines (Schieffer et al., 15 Aug 2025).
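The following minimal HIP sketch (compiled with hipcc; error checking omitted for brevity) contrasts the two paths on a hypothetical two-device node: an explicit hipMemcpyPeer that engages the copy engines, and an implicit copy in which a kernel on device 0 dereferences a pointer into device 1's memory over the fabric. The kernel name, buffer size, and launch geometry are illustrative assumptions, not tuned values from the cited studies.

```cpp
// p2p_copy_sketch.cpp (hypothetical name) -- explicit vs. implicit peer copy.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void remote_copy(const double* __restrict__ src,
                            double* __restrict__ dst, size_t n) {
  // Each thread streams elements from the remote (peer) buffer over the fabric.
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  size_t stride = gridDim.x * blockDim.x;
  for (; i < n; i += stride) dst[i] = src[i];
}

int main() {
  const size_t n = 1 << 26;                 // 64 Mi doubles = 512 MiB (illustrative)
  const size_t bytes = n * sizeof(double);

  double *buf0 = nullptr, *buf1 = nullptr;
  hipSetDevice(0);
  hipDeviceEnablePeerAccess(1, 0);          // let device 0 reach device 1 memory
  hipMalloc(reinterpret_cast<void**>(&buf0), bytes);
  hipSetDevice(1);
  hipDeviceEnablePeerAccess(0, 0);
  hipMalloc(reinterpret_cast<void**>(&buf1), bytes);

  // Explicit path: the copy engine (SDMA) moves data device 1 -> device 0.
  hipMemcpyPeer(buf0, 0, buf1, 1, bytes);

  // Implicit path: a kernel on device 0 reads device 1's buffer directly over
  // Infinity Fabric; this is the variant that sustains higher bandwidth on
  // MI250x in the cited measurements.
  hipSetDevice(0);
  remote_copy<<<1024, 256>>>(buf1, buf0, n);
  hipDeviceSynchronize();

  hipFree(buf0);
  hipSetDevice(1);
  hipFree(buf1);
  printf("explicit and implicit peer copies issued\n");
  return 0;
}
```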
4. Topological Heterogeneity, NUMA Effects, and System-Level Implications
- Heterogeneous Connectivity:
- MI250x features highly variable topology; not all GCD pairs are “equal.” Critical application performance depends on matching bandwidth-heavy communication to faster quad/dual links and being aware that DMA engines may not saturate these paths (Pearson, 2023, Schieffer et al., 1 Oct 2024).
- MI300A offers uniform, symmetric connectivity; all APU–APU pairs are identically provisioned (fully connected), simplifying scheduling and placement (Schieffer et al., 15 Aug 2025).
- NUMA Effects:
- In MI250x platforms, no significant NUMA effect on CPU–GPU memory transfer was observed; bandwidth is largely independent of the host-memory region in which buffers are allocated (Pearson, 2023, Schieffer et al., 1 Oct 2024).
- In MI300A, the entire node appears as a cache-coherent NUMA shared address space, simplifying programming; pointer-chasing benchmark latencies for remote memory remain below 1 μs.
- Practical Impact for Scheduling and Placement:
- Multi-GPU workloads must be topology-aware, explicitly mapping bandwidth-intensive tasks to optimal interconnects for MI250x (Pearson, 2023).
- Schedulers and runtime systems must expose and leverage fabric topology information, especially on systems where not all device pairs are equally connected (Schieffer et al., 1 Oct 2024); a sketch of probing this topology through HIP peer attributes follows this list.
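As an illustration of exposing topology to the runtime, the sketch below probes peer accessibility and the driver-reported link performance rank for every device pair, assuming the HIP peer-attribute API (hipDeviceCanAccessPeer, hipDeviceGetP2PAttribute) is available on the target ROCm stack; a scheduler could use the resulting matrix to steer bandwidth-heavy exchanges onto the better-ranked pairs. The file name is hypothetical.

```cpp
// topology_probe.cpp (hypothetical name) -- probe peer access and relative
// link performance between all device pairs on the node.
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(call)                                                      \
  do {                                                                       \
    hipError_t err_ = (call);                                                \
    if (err_ != hipSuccess) {                                                \
      fprintf(stderr, "HIP error %s at %s:%d\n", hipGetErrorString(err_),    \
              __FILE__, __LINE__);                                           \
      return 1;                                                              \
    }                                                                        \
  } while (0)

int main() {
  int ndev = 0;
  HIP_CHECK(hipGetDeviceCount(&ndev));
  printf("src -> dst : peer-access  perf-rank (lower = better link)\n");
  for (int src = 0; src < ndev; ++src) {
    for (int dst = 0; dst < ndev; ++dst) {
      if (src == dst) continue;
      int can = 0, rank = -1;
      HIP_CHECK(hipDeviceCanAccessPeer(&can, src, dst));
      // The performance rank is a coarse, driver-reported ordering of link
      // quality; on MI250x-style nodes it can distinguish quad/dual/single paths.
      HIP_CHECK(hipDeviceGetP2PAttribute(&rank, hipDevP2PAttrPerformanceRank,
                                         src, dst));
      printf("%3d -> %-3d : %-11s  %d\n", src, dst, can ? "yes" : "no", rank);
    }
  }
  return 0;
}
```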
5. Benchmarking Methodology and Performance Evaluation
Methodologically rigorous characterization of IFI involves a suite of microbenchmarks and real-application profiling:
- STREAM and Comm|Scope: Employed for sustained bandwidth (kernel and memcpy variants) (Schieffer et al., 1 Oct 2024, Schieffer et al., 15 Aug 2025).
- p2pBandwidthLatencyTest: Used to gather matrices of GCD–GCD or APU–APU latency and bandwidth (Schieffer et al., 1 Oct 2024, Schieffer et al., 15 Aug 2025).
- OSU Micro-Benchmarks: For MPI point-to-point/collective latency and bandwidth (Schieffer et al., 1 Oct 2024).
- Collective Communication Tests: Comparing MPI collectives with RCCL; small (KB-scale) versus larger message-size regimes delineate the crossover point for interface advantage (Schieffer et al., 15 Aug 2025).
Performance results from MI300A (4-APU node) and MI250x (4-GPU node) show:
- Direct kernel‐level transfers (STREAM-like) give highest device-to-device bandwidth.
- Explicit device-to-device copies: hipMemcpy/hipMemcpyPeer, with SDMA/blit control, achieve variable bandwidth governed by buffer allocation, link configuration, and message size.
- MPI with CPU staging achieves lowest latency for very small messages but does not reach optimal bandwidth for large device-to-device transfers unless buffers are hipMalloc-allocated.
- RCCL collectives consistently outperform MPI collectives for moderate-to-large buffers and deliver high-throughput, balanced node-wide collective performance.
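As a concrete illustration of the RCCL path, the sketch below performs a single-process AllReduce across all visible GPUs using the NCCL-style RCCL API and hipMalloc-allocated buffers, in line with the allocation guidance above. The header path and link flag (-lrccl) may vary across ROCm installations, and the buffer size is an illustrative assumption.

```cpp
// rccl_allreduce_sketch.cpp (hypothetical name) -- single-process AllReduce
// across all visible GPUs with RCCL. Compile with hipcc and link -lrccl.
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>   // header location may differ between ROCm versions
#include <vector>
#include <cstdio>

int main() {
  int ndev = 0;
  hipGetDeviceCount(&ndev);

  std::vector<ncclComm_t> comms(ndev);
  std::vector<int> devs(ndev);
  for (int i = 0; i < ndev; ++i) devs[i] = i;
  ncclCommInitAll(comms.data(), ndev, devs.data());   // one communicator per GPU

  const size_t count = 1 << 24;                        // 16 Mi floats per rank
  std::vector<float*> send(ndev), recv(ndev);
  std::vector<hipStream_t> streams(ndev);
  for (int i = 0; i < ndev; ++i) {
    hipSetDevice(i);
    // hipMalloc buffers give the most consistent device-to-device bandwidth.
    hipMalloc(reinterpret_cast<void**>(&send[i]), count * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&recv[i]), count * sizeof(float));
    hipMemset(send[i], 0, count * sizeof(float));
    hipStreamCreate(&streams[i]);
  }

  // Issue the collective for every GPU inside one group so RCCL can schedule
  // the ring/tree over the Infinity Fabric links in a single shot.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i) {
    ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    hipSetDevice(i);
    hipStreamSynchronize(streams[i]);
    hipFree(send[i]); hipFree(recv[i]);
    hipStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("AllReduce completed on %d devices\n", ndev);
  return 0;
}
```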
6. Application-Level Optimization and Case Studies
Integration of the IFI into production HPC applications requires both architectural and software interface cognizance:
- Memory allocation policy: Substituting hipMalloc for standard malloc consistently exposes the full device bandwidth; additional combinations with hipHostRegister may provide flexibility when system-side allocation is required (Schieffer et al., 15 Aug 2025).
- SDMA engine disablement: Setting HSA_ENABLE_SDMA=0 lets "blit" GPU copy kernels outperform the SDMA hardware in particular message-size and allocation regimes; tuning HSA_ENABLE_SDMA is therefore an effective knob (Schieffer et al., 15 Aug 2025).
- XNACK state (GPU memory fault handling): Disabling XNACK yields up to 11% end-to-end application speedup by reducing kernel replay overhead (Schieffer et al., 15 Aug 2025). A minimal allocation sketch illustrating these knobs follows this list.
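The sketch below illustrates the allocation side of these optimizations under simple assumptions: a hipMalloc'd buffer for device-to-device traffic, and a system-allocated buffer made device-visible via hipHostRegister for cases where a CPU-side library owns the allocation. The environment knobs (HSA_ENABLE_SDMA, HSA_XNACK) appear only in comments, since they must be set before the application launches.

```cpp
// alloc_tuning_sketch.cpp (hypothetical name) -- hipMalloc vs. registered
// system allocation for communication buffers.
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <cstdio>

int main() {
  const size_t bytes = 256UL << 20;   // 256 MiB (illustrative)

  // Preferred for device-to-device traffic: hipMalloc gives the most
  // consistent bandwidth regardless of the communication API used.
  void* dev_buf = nullptr;
  hipMalloc(&dev_buf, bytes);

  // When the application must allocate on the system side (e.g., buffers owned
  // by a CPU library), registering the pages pins them and makes them
  // device-accessible without changing the allocator.
  void* host_buf = malloc(bytes);
  hipHostRegister(host_buf, bytes, hipHostRegisterMapped);

  void* host_buf_dev = nullptr;       // device-visible alias of the host pages
  hipHostGetDevicePointer(&host_buf_dev, host_buf, 0);

  // Runtime knobs from the text are set in the environment before launch,
  // e.g. HSA_ENABLE_SDMA=0 (blit kernels instead of SDMA) and HSA_XNACK=0.
  printf("device buffer %p, registered host buffer %p (device view %p)\n",
         dev_buf, host_buf, host_buf_dev);

  hipHostUnregister(host_buf);
  free(host_buf);
  hipFree(dev_buf);
  return 0;
}
```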
In Quicksilver and CloverLeaf (representing Monte Carlo particle transport and hydrodynamics applications, respectively), these optimizations resulted in:
- End-to-end speedups of up to 11% in Quicksilver and 1.5×–2.2× in CloverLeaf from communication-path and allocator tuning.
- Reduction of communication-computation overlap bottlenecks via improved data mover interface and layout (Schieffer et al., 15 Aug 2025).
7. Theoretical Modeling and Strategic Implications
A fundamental model for achievable interconnect bandwidth is
$B_{\text{achieved}} = \eta \cdot B_{\text{theoretical}}$,
where the fraction $\eta$ is scenario-dependent:
- Explicit transfers: $\eta \approx 0.25$–$0.76$, depending on link type and copy engine.
- Kernel-level transfers: $\eta \approx 0.77$–$0.81$ (see the worked example after this list).
- In real applications, performance modeling must factor in both software-driven and link-driven asymmetry.
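As a quick consistency check against the measurements above (MI250x quad link, kernel-level transfer):

$$
B_{\text{achieved}} \approx 0.77 \times 200\ \text{GB/s} \approx 154\ \text{GB/s},
$$

matching the kernel-level figure in the table of Section 2.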
System architects and HPC framework developers are advised to:
- Prefer implicit memory access (kernel-level) for bandwidth-bound exchanges.
- Use hipMalloc-allocated buffers across all APIs for consistent device-to-device performance.
- Carefully select and tune software interfaces (MPI, RCCL, HIP), SDMA/blit, and XNACK settings according to message size and access pattern.
- Acknowledge that empirical performance is strongly determined by the match between hardware fabric and API-level data movement mechanisms, not by peak published node numbers alone (Pearson, 2023, Schieffer et al., 1 Oct 2024, Schieffer et al., 15 Aug 2025).
Optimization strategies must be revisited for each generation: MI300A’s uniform connectivity simplifies scheduling and placement decisions, contrasting with MI250x’s heterogeneity, which demands fine-grained bandwidth-aware and topology-aware partitioning and mapping to fully realize the interconnect’s capacity.