Infinity Fabric: AMD Scalable Interconnect
- Infinity Fabric is AMD’s scalable interconnect architecture that links CPUs, GPUs, and APUs using diverse link tiers to enable efficient data movement.
- It features heterogeneous link configurations, with MI250x systems offering variable bandwidth tiers (quad, dual, and single links) and MI300A nodes providing uniform high-speed mesh connectivity.
- Optimizing programming models and memory allocation strategies is crucial for leveraging its topology-aware performance to enhance HPC application efficiency.
Infinity Fabric is AMD's scalable interconnect architecture that provides high-bandwidth, low-latency connectivity among CPUs, GPUs, and other computing resources within a node, and across multiple nodes in high-performance computing (HPC) systems. It is foundational in AMD's heterogeneous and unified memory systems, underpinning both MI250x multi-GPU configurations and the MI300A Accelerated Processing Unit (APU), where it serves as the primary data movement backbone. Infinity Fabric delivers communication rates up to 128 GB/s per APU-to-APU connection (MI300A) and exhibits link heterogeneity in MI250x systems, with distinct bandwidths across “single,” “dual,” and “quad” link tiers. Performance outcomes and software visibility of these hardware traits are central topics in recent research, with direct implications for programming models and HPC application optimization (Pearson, 2023, Schieffer et al., 1 Oct 2024, Schieffer et al., 15 Aug 2025).
1. Infinity Fabric Node Architecture and Topology
Infinity Fabric underpins the node-level architecture of AMD MI250x and MI300A systems, enabling interconnections between CPUs, GPUs (Graphics Compute Dies—GCDs or XCDs), and, in the case of MI300A, within single-chip APUs.
MI250x Systems
- Each node comprises one AMD EPYC CPU and four MI250x GPUs.
- Every MI250x GPU contains two GCDs, yielding 8 HIP-addressable GPUs per node.
- CPU-GPU links via Infinity Fabric provide 72+72 GB/s per GCD.
- GCD-to-GCD connections exhibit bandwidth heterogeneity:
- Quad links (intra-GPU): 200+200 GB/s.
- Dual links (inter-GPU): 100+100 GB/s.
- Single links (inter-GPU): 50+50 GB/s.
- The topology produces non-uniform connectivity. (Schematic: the CPU attaches to each GCD; GCD0 and GCD1 within the same MI250x package are joined by a quad link at 200 GB/s, while GCDs on different packages are joined by dual links at 100 GB/s or single links at 50 GB/s.)
MI300A Systems
- Each node has four MI300A APUs, each integrating:
- 24 Zen 4 CPU cores (3 CCDs)
- 1 GPU (6 XCDs)
- 128 GB HBM3 memory
- APUs are directly interconnected by Infinity Fabric (xGMI 3 protocol):
- Each IF link: 16 bits wide, 32 GT/s ⇒ 64 GB/s per direction.
- Two links per APU pair yield 128 GB/s of bandwidth in each direction.
- Topology: Fully connected mesh, each APU directly links to all others.
This uniform topology in MI300A contrasts with MI250x's non-uniform, multi-hop arrangements (Schieffer et al., 15 Aug 2025).
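As a concrete illustration of how this topology surfaces to software, the following HIP/C++ sketch enumerates the HIP-visible devices on a node and prints which pairs can enable peer access over Infinity Fabric. It is a minimal probe, not taken from the cited papers, and error handling is kept to the single case needed for repeated runs.

```cpp
// Minimal topology probe: enumerate HIP-visible devices and print which pairs
// can enable peer access over Infinity Fabric.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  int ndev = 0;
  hipGetDeviceCount(&ndev);   // 8 GCDs on an MI250x node, 4 APUs on an MI300A node
  printf("peer-access matrix for %d devices:\n", ndev);
  for (int src = 0; src < ndev; ++src) {
    hipSetDevice(src);
    for (int dst = 0; dst < ndev; ++dst) {
      int can = (src == dst);
      if (src != dst) {
        hipDeviceCanAccessPeer(&can, src, dst);
        if (can) {
          // Allow kernels on 'src' to load/store 'dst' memory directly over IF.
          hipError_t e = hipDeviceEnablePeerAccess(dst, 0);
          if (e != hipSuccess && e != hipErrorPeerAccessAlreadyEnabled) can = 0;
        }
      }
      printf("%d ", can);
    }
    printf("\n");
  }
  return 0;
}
```

Peer-access reachability alone does not reveal the link tier; on MI250x the quad/dual/single distinction must be taken from the node topology documentation or measured directly.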
2. Link Bandwidth, Latency, and Heterogeneity
Infinity Fabric's realized bandwidth and latency depend strongly on topology, hardware generation, and the transfer mechanism used.
MI250x Link Heterogeneity
- Performance varies by link type:
- Quad: theoretical 200 GB/s, measured ≈153 GB/s (implicit mapped, 0.77× peak).
- Dual: theoretical 100 GB/s, measured ≈77 GB/s.
- Single: theoretical 50 GB/s, measured ≈39 GB/s.
- Explicit DMA transfers (hipMemcpyAsync) saturate at ~51 GB/s across all link tiers, indicating that the DMA engine, not link capacity, is the limiting factor.
- Table of Fractions of Theoretical Peak Bandwidth:
| Transfer Method | Quad (200GB/s) | Dual (100GB/s) | Single (50GB/s) |
|---|---|---|---|
| Explicit (DMA) | 0.25 | 0.51 | 0.76 |
| Implicit Mapped | 0.77 | 0.77 | 0.78 |
| Implicit Managed | 0.74 | 0.76 | 0.76 |
| Prefetch Managed | 0.016 | 0.032 | 0.064 |
Only implicit mapped GPU accesses approach link saturation, especially for quad links (Pearson, 2023).
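The gap between the explicit DMA path and implicit mapped access can be reproduced with a short microbenchmark along the following lines. This is a sketch rather than the benchmark used in the cited work: the device pair, buffer size, and launch geometry are arbitrary choices, and warm-up iterations plus error checking are omitted for brevity.

```cpp
// GCD-to-GCD transfer sketch: explicit DMA copy vs. implicit kernel access.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void copy_kernel(const double* __restrict__ src,
                            double* __restrict__ dst, size_t n) {
  // Implicit path: ordinary loads that cross Infinity Fabric when src is remote.
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x)
    dst[i] = src[i];
}

int main() {
  const int src_dev = 0, dst_dev = 1;   // pick a quad-, dual-, or single-linked pair
  const size_t n = 256ull << 20;        // 2 GiB of doubles
  const size_t bytes = n * sizeof(double);

  double *src, *dst;
  hipSetDevice(src_dev); hipMalloc((void**)&src, bytes);
  hipSetDevice(dst_dev); hipMalloc((void**)&dst, bytes);
  hipDeviceEnablePeerAccess(src_dev, 0);  // let dst_dev kernels read src_dev memory

  hipEvent_t t0, t1; hipEventCreate(&t0); hipEventCreate(&t1); float ms = 0.f;

  // 1) Explicit DMA path (hipMemcpyPeerAsync): bounded by the SDMA engine,
  //    roughly 50 GB/s regardless of link tier in the measurements above.
  hipEventRecord(t0);
  hipMemcpyPeerAsync(dst, dst_dev, src, src_dev, bytes, 0);
  hipEventRecord(t1); hipEventSynchronize(t1);
  hipEventElapsedTime(&ms, t0, t1);
  printf("explicit copy : %.1f GB/s\n", (double)bytes / (ms * 1e6));

  // 2) Implicit mapped path: a copy kernel on dst_dev reading src_dev memory
  //    directly, which is what approaches link saturation.
  hipEventRecord(t0);
  hipLaunchKernelGGL(copy_kernel, dim3(1024), dim3(256), 0, 0, src, dst, n);
  hipEventRecord(t1); hipEventSynchronize(t1);
  hipEventElapsedTime(&ms, t0, t1);
  printf("implicit copy : %.1f GB/s\n", (double)bytes / (ms * 1e6));
  return 0;
}
```

Running it on a quad-linked pair versus a single-linked pair makes the asymmetry of the two paths visible: the explicit copy stays near the DMA ceiling while the kernel path tracks the link tier, consistent with the table above.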
MI300A Uniform Bandwidth
- Each direct APU-APU connection: theoretical 128 GB/s; measured up to 103–104 GB/s (81% of peak) for kernel remote access.
- Latency:
- Local HBM3: 346 ns (GPU), 240 ns (CPU)
- Remote via IF: 690 ns (GPU), 500 ns (CPU)
- Latency nearly doubles when crossing IF.
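Latency figures of this kind are typically obtained with a pointer-chase measurement; the sketch below is one minimal way to do it. The chain length, fixed device pair, and event-based timing are illustrative choices, and error checking is omitted; placing the chain on the local device instead of the remote one gives the corresponding local-HBM number.

```cpp
// Pointer-chase latency sketch across an Infinity Fabric link.
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

__global__ void chase(const unsigned* __restrict__ next, unsigned steps,
                      unsigned* out) {
  unsigned i = 0;
  for (unsigned s = 0; s < steps; ++s) i = next[i];  // serially dependent loads
  *out = i;  // keep the chain live so it is not optimized away
}

int main() {
  const unsigned n = 1u << 24;               // 64 MiB chain, larger than the caches
  std::vector<unsigned> perm(n), next(n);
  for (unsigned i = 0; i < n; ++i) perm[i] = i;
  std::shuffle(perm.begin(), perm.end(), std::mt19937(42));
  for (unsigned i = 0; i + 1 < n; ++i) next[perm[i]] = perm[i + 1];
  next[perm[n - 1]] = perm[0];               // close the cycle

  // Place the chain on device 1 but chase it from device 0, so every load
  // crosses Infinity Fabric; allocate on device 0 instead for the local case.
  unsigned *d_next, *d_out;
  hipSetDevice(1);
  hipMalloc((void**)&d_next, n * sizeof(unsigned));
  hipMemcpy(d_next, next.data(), n * sizeof(unsigned), hipMemcpyHostToDevice);
  hipSetDevice(0);
  hipMalloc((void**)&d_out, sizeof(unsigned));
  hipDeviceEnablePeerAccess(1, 0);

  const unsigned steps = 1u << 20;
  hipEvent_t t0, t1; hipEventCreate(&t0); hipEventCreate(&t1);
  hipEventRecord(t0);
  hipLaunchKernelGGL(chase, dim3(1), dim3(1), 0, 0, d_next, steps, d_out);
  hipEventRecord(t1); hipEventSynchronize(t1);
  float ms = 0.f; hipEventElapsedTime(&ms, t0, t1);
  printf("average load latency: %.0f ns\n", ms * 1e6 / steps);
  return 0;
}
```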
Practical Observations
- Bandwidth realized in applications depends on physical topology (MI250x) and transfer method.
- Latency and bandwidth are further modulated by software stack, memory allocator choice, and communication library (RCCL, MPI) (Schieffer et al., 15 Aug 2025).
3. Programming Models, API Visibility, and Memory Allocation Strategies
Infinity Fabric's hardware features are exposed to users through programming APIs, notably AMD HIP/ROCm, MPI, and RCCL. The topological and performance nuances require explicit consideration by library writers and application programmers.
- HIP/ROCm API: Device ordinal and communication calls map directly to GCD topology; heterogeneity is visible and not abstracted.
- Optimal performance demands topology-aware querying and explicit peer enablement (hipDeviceEnablePeerAccess).
- MPI vs. RCCL:
- RCCL leverages hardware topology more directly, providing lower latency and higher bandwidth for collectives (except for broadcast, where MPI/RCCL are comparable) (Schieffer et al., 1 Oct 2024, Schieffer et al., 15 Aug 2025).
- MPI overheads dominate for small messages or when multiple processes per node map poorly onto hardware.
- Memory Allocation:
- Pinned buffers (hipMalloc / hipHostMalloc) consistently yield the highest bandwidth for explicit transfers (illustrated in the sketch following this list).
- Managed/zero-copy memory offers competitive bandwidth for small transfers (<32 MB) but degrades for large sizes.
- On MI300A, RCCL is allocator-agnostic for collective bandwidth, whereas MPI requires hipMalloc-allocated buffers to reach peak performance.
No NUMA effects are observed in host-to-GPU transfer bandwidth in current hardware (allocation on any NUMA domain yields equivalent performance) (Pearson, 2023, Schieffer et al., 1 Oct 2024).
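A minimal sketch of how allocator choice shows up in explicit-transfer bandwidth is given below; the 1 GiB payload and single-shot timing are illustrative, and warm-up, error handling, and the managed-memory case are left out.

```cpp
// Allocator comparison for explicit host-to-device copies: pageable malloc()
// memory vs. pinned hipHostMalloc() memory.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static float time_h2d_ms(void* dst, const void* src, size_t bytes) {
  hipEvent_t t0, t1;
  hipEventCreate(&t0); hipEventCreate(&t1);
  hipEventRecord(t0);
  hipMemcpyAsync(dst, src, bytes, hipMemcpyHostToDevice, 0);
  hipEventRecord(t1); hipEventSynchronize(t1);
  float ms = 0.f;
  hipEventElapsedTime(&ms, t0, t1);
  hipEventDestroy(t0); hipEventDestroy(t1);
  return ms;
}

int main() {
  const size_t bytes = 1ull << 30;                  // 1 GiB payload
  void* dev = nullptr;
  hipMalloc(&dev, bytes);                           // HBM destination
  void* pageable = std::malloc(bytes);              // ordinary heap buffer
  void* pinned = nullptr;
  hipHostMalloc(&pinned, bytes, hipHostMallocDefault);  // pinned, DMA-friendly buffer
  std::memset(pageable, 0, bytes);                  // fault pages in before timing
  std::memset(pinned, 0, bytes);

  printf("pageable malloc : %.1f GB/s\n",
         (double)bytes / (time_h2d_ms(dev, pageable, bytes) * 1e6));
  printf("hipHostMalloc   : %.1f GB/s\n",
         (double)bytes / (time_h2d_ms(dev, pinned, bytes) * 1e6));
  return 0;
}
```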
4. Communication Patterns and Application-Level Performance
Infinity Fabric's impact shows up directly in the measured performance of point-to-point, peer-to-peer, and collective HPC workloads.
- Direct kernel access (STREAM-type or custom kernels) can achieve near maximum bandwidth for remote memory in both MI250x and MI300A.
- Explicit transfer APIs (hipMemcpy, hipMemcpyPeer) are DMA-bound, limiting bandwidth utilization on links with higher theoretical capacity.
- Collectives:
- RCCL delivers 88–90 GB/s for large messages, with latency scaling linearly with message size (a minimal RCCL sketch follows the summary table below).
- MPI outperforms RCCL for messages <4 KB due to lower baseline latency but falls behind for larger messages, where RCCL achieves 5–38× lower latency.
- Real Applications:
- Optimizing communication in Quicksilver via allocator selection yields a 5–11% speedup.
- In CloverLeaf, switching to RCCL collectives and hipMalloc-allocated buffers delivers up to a 2.15× speedup in the communication phase and a ~2.2× total-runtime improvement over MPI (Schieffer et al., 15 Aug 2025).
Quantitative Performance Summary
| Scenario | API/Lib | Allocator | Bandwidth (GB/s) | Latency | Comments |
|---|---|---|---|---|---|
| Direct kernel remote access | Kernel | hipMalloc | 103–104 | 690 ns | 81% of IF peak (MI300A) |
| hipMemcpy (large buffer) | HIP | hipMalloc | ~90 | >1 µs | SDMA/copy kernel equal |
| RCCL p2p/collective | RCCL | any | 88 | ~20 µs | Allocator-insensitive |
| MPI p2p (hipMalloc, SDMA off) | MPI | hipMalloc | 90.3 | ~4.8 µs | Direct GPU path |
| MPI p2p (malloc) | MPI | malloc | 11.7 | ~2 µs | CPU staging dominates |
| Collective (>4 KB) | RCCL | any | — | 5–38× lower than MPI | Linear scaling for large msgs |
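The RCCL collective path summarized above can be exercised with a single-process all-reduce over every HIP-visible device, as in the sketch below. It assumes ROCm with RCCL installed (the header path varies across releases between rccl.h and rccl/rccl.h), uses an arbitrary 256 MiB payload per device, and omits error checking.

```cpp
// Single-process RCCL all-reduce across every HIP-visible device on the node.
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>
#include <cstdio>
#include <vector>

int main() {
  int ndev = 0;
  hipGetDeviceCount(&ndev);            // 8 GCDs on MI250x, 4 APUs on MI300A
  const size_t count = 64 << 20;       // 64 Mi floats = 256 MiB per device

  std::vector<int> devs(ndev);
  for (int i = 0; i < ndev; ++i) devs[i] = i;
  std::vector<ncclComm_t> comms(ndev);
  ncclCommInitAll(comms.data(), ndev, devs.data());

  std::vector<float*> buf(ndev);
  std::vector<hipStream_t> stream(ndev);
  for (int i = 0; i < ndev; ++i) {
    hipSetDevice(i);
    hipMalloc((void**)&buf[i], count * sizeof(float));  // RCCL is allocator-agnostic
    hipStreamCreate(&stream[i]);                        // here, but hipMalloc is typical
  }

  // One in-place all-reduce; RCCL routes the rings over the Infinity Fabric links.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], stream[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    hipSetDevice(i);
    hipStreamSynchronize(stream[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce of %zu floats across %d devices completed\n", count, ndev);
  return 0;
}
```

Compiling with hipcc and linking against -lrccl is the usual route; in multi-process (one MPI rank per GCD) setups the equivalent is ncclCommInitRank with an ncclUniqueId broadcast over MPI.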
5. Design Implications, Optimization Strategies, and Forward-Looking Considerations
The deployment of Infinity Fabric in both MI250x and MI300A nodes necessitates rigorous topology- and communication-aware software design.
- On MI250x, developers must program with direct knowledge of the non-uniform link structure. Task mapping and peer memory accesses can yield up to 4× speed variation depending on pairwise device placement.
- On MI300A, fully uniform high-bandwidth links simplify programming. The presence of cache-coherent NUMA, shared HBM3, and Infinity Cache allows robust code mobility and efficient data placement.
- Best practices:
- Use implicit mapped access for device-to-device transfers; unlike DMA-bound explicit copies, it can approach link saturation.
- Employ RCCL for intra-node collectives, especially at scale, and hipMalloc-allocated buffers for MPI if high bandwidth is required.
- For explicit transfers <512 KB, memcpy is optimal; for larger transfers, prefer hipMemcpy with hipMalloc-allocated buffers (see the sketch after this list).
- Evaluate SDMA (enabled vs. disabled) and allocator choices in real-world codes to maximize throughput.
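The size-threshold and SDMA advice can be condensed into a small helper, sketched below with hypothetical names (copy_bytes, kSmallCopy); the 512 KB threshold is the one quoted above, managed allocations are used so the CPU memcpy path is legal on both platforms, and HSA_ENABLE_SDMA is the ROCm environment variable commonly used to toggle the SDMA engines.

```cpp
// Size-thresholded copy helper; copy_bytes and kSmallCopy are hypothetical names.
// Managed allocations keep the CPU memcpy path legal on both MI250x and MI300A.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstring>

constexpr size_t kSmallCopy = 512 * 1024;  // threshold quoted in the text above

void copy_bytes(void* dst, const void* src, size_t bytes) {
  if (bytes < kSmallCopy)
    std::memcpy(dst, src, bytes);                  // small transfer: plain CPU copy
  else
    hipMemcpy(dst, src, bytes, hipMemcpyDefault);  // large transfer: runtime copy path
}

int main() {
  // Setting HSA_ENABLE_SDMA=0 in the environment before launch forces blit-kernel
  // copies instead of the SDMA engines, one of the knobs the text suggests evaluating.
  const size_t bytes = 8ull << 20;   // 8 MiB, so the hipMemcpy branch is taken
  void *a, *b;
  hipMallocManaged(&a, bytes);
  hipMallocManaged(&b, bytes);
  std::memset(a, 1, bytes);
  copy_bytes(b, a, bytes);
  hipDeviceSynchronize();
  printf("copied %zu bytes\n", bytes);
  return 0;
}
```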
As hardware generations shift towards tighter CPU-GPU integration (e.g., MI300), the abstraction of communication, link uniformity, and memory coherency through Infinity Fabric may evolve, impacting programming paradigms and performance portability.
6. Common Misconceptions and Technical Limitations
A frequent misconception is that all device pairs in multi-GPU systems connected by Infinity Fabric enjoy identical bandwidth and latency; in MI250x, heterogeneity is substantial due to the physical topology. Another is that communication libraries automatically exploit all available link bandwidth—in reality, explicit DMA engines and software interfaces frequently fail to saturate the fastest links, requiring explicit kernel-level or managed memory strategies to do so.
No NUMA effects on host-GPU bandwidth are observed in current MI250x, but contention and scale can introduce bottlenecks. In MI300A, the uniform mesh greatly reduces complexity but careful selection of allocators and APIs remains essential for saturated bandwidth, as shown in application benchmarks.
7. Summary and Research Directions
Infinity Fabric establishes a high-performance, configurable, and cache-coherent interconnect in heterogeneous AMD systems. Its realized application-level throughput and latency depend on link topology, API exposure, memory allocation, and programming practices. MI250x systems exhibit non-uniformity and complex performance mappings, while MI300A nodes deliver uniform, scalable bandwidth. The ongoing optimization of collective communication, explicit data movement, and real-world code allocation reveals substantial advantages for applications tuned to device-aware, topology-conscious usage of Infinity Fabric. Further investigation into extended abstraction, auto-tuning, and compiler- or runtime-guided communication mapping emerges as a plausible direction to approach hardware performance limits.