AMD Instinct MI300A APU: Unified HPC Processor

Updated 2 June 2026

AMD Instinct MI300A APU is a high-performance heterogeneous processor integrating multi-core Zen4 CPUs and CDNA3 GPUs with unified high-bandwidth memory for exascale computing.
Its innovative design eliminates traditional CPU–GPU data movement barriers by providing a physically unified, hardware-coherent memory pool accessible by both processing units.
Application studies report workload-dependent speedups of roughly 4× to 22× over conventional CPU and discrete-GPU baselines, making it well suited to HPC, AI, and advanced data analytics workloads.

The AMD Instinct MI300A Accelerated Processing Unit (APU) is a high-performance heterogeneous microprocessor designed for data center and exascale high-performance computing (HPC) systems. It integrates multi-core Zen4 CPU complexes and CDNA3-based GPU compute dies into a single chiplet package, unified by high-bandwidth on-package memory. The MI300A establishes a new architectural design point in the HPC ecosystem by providing a physically unified, hardware-coherent memory system in which CPUs and GPUs share a single address space. It underpins leadership-class machines such as El Capitan and the Hunter supercomputer at HLRS, among other large HPC installations.

1. Die Composition and Unified Memory Architecture

The MI300A APU comprises three 8-core Zen4 CPU core complex dies (CCDs) and six CDNA3 GPU compute dies (XCDs) connected via AMD’s Infinity Fabric, together with four I/O dies (IODs) hosting a 256 MB Infinity Cache and attaching eight HBM3 stacks (128 GB per APU) (Wahlgren et al., 18 Aug 2025, Schieffer et al., 15 Aug 2025). The entire 128 GB HBM3 subsystem is directly visible to all CCDs and XCDs, forming a single 48-bit physical address space. This design eliminates the traditional PCIe/NVLink separation between CPU and GPU memory, collapsing host–device data movement into a physically shared, hardware-coherent memory pool (Wahlgren et al., 18 Aug 2025, Tandon et al., 2024).

The cache hierarchy is multi-level on both the CPU and GPU sides. Each Zen4 core has private L1 and L2 caches and shares a 96 MB (or larger, depending on configuration) L3. CDNA3 XCDs provide per-SIMD L1 and per-XCD L2 caches, with the Infinity Cache (IC) acting as a non-coherent shared last-level cache between compute engines and HBM (Wahlgren et al., 18 Aug 2025, Schieffer et al., 15 Aug 2025). Coherence between CCDs and XCDs is carried over Infinity Fabric; cache lines below L3 are managed via a MESI-like protocol, while the IC serves as a large snoop filter.

Unified memory is realized via hardware-native unified physical memory (UPM): the CPU and GPU maintain separate page tables, but all valid physical memory is mapped identically and managed coherently by the Linux HMM subsystem. Page-fault handling and TLB management are handled in hardware to enable low-latency, high-throughput cross-domain memory access (Wahlgren et al., 18 Aug 2025).

2. Compute Subsystem: CPU–GPU Integration

Each MI300A exposes up to 24 Zen4 CPU cores and 228–240 CDNA3 GPU compute units (CU) per package (Schieffer et al., 15 Aug 2025, Jarmusch et al., 10 Feb 2026). The CPU clusters deliver multi-socket-class throughput, supporting two-way simultaneous multithreading (SMT), with typical L1 = 32 KiB, L2 = 1 MiB per core, and a shared L3 upwards of 96 MB (or 384 MB in some variants).

The GPU subsystem comprises up to 240 CUs, each executing 64-wide wavefronts (SIMT organization). CDNA3 adds matrix-core (MFMA) engines per CU, with native support for FP8, FP16, FP32, and FP64 operations. All GPU compute units share the on-package HBM3 pool; reported aggregate bandwidth figures range from 3.5 to 5.6 TB/s depending on allocation policy and workload. Measured peak sustainable bandwidth in microbenchmarks is 3.2–3.5 TB/s for hipMalloc-allocated device buffers (Wahlgren et al., 18 Aug 2025, Sfiligoi, 7 May 2025, Jarmusch et al., 10 Feb 2026).

Matrix-core performance is exposed via full support for mixed-precision MFMA, FP8 inference, and structured (2:4) sparsity (Jarmusch et al., 10 Feb 2026). Occupancy-aware scheduling is essential to reach peak arithmetic throughput: full FP8 performance requires at least 256 active wavefronts per chip (Jarmusch et al., 10 Feb 2026).

3. Unified Physical Memory (UPM) System Characterization

UPM exposes a single, hardware-coherent physical memory address space with direct, low-latency access by both CPUs and GPUs (Wahlgren et al., 18 Aug 2025). Measured average access latencies are as follows (32/64 byte lines):

Level	Latency (ns, avg) – GPU	Latency (ns, avg) – CPU
L1	57	N/A
L2	100–108	N/A
Infinity Cache (IC)	205–218	N/A
HBM	333–350	236–241

hipMalloc allocations maximize bandwidth and minimize TLB misses due to larger adaptive fragment usage; other allocators yield 1.8–2.2× lower bandwidth due to increased page faults and suboptimal channel interleaving (Wahlgren et al., 18 Aug 2025).

Software page-fault handling, with XNACK enabled, achieves up to ~9 million pages/s on the GPU and ~3.7 million pages/s on the CPU, with the first-touch agent (GPU or CPU) affecting the ratio of minor to major faults. For best utilization, applications should initialize data with the agent (CPU or GPU) that will predominantly access the region (Wahlgren et al., 18 Aug 2025, Iwabuchi et al., 26 May 2026).
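The first-touch advice can be sketched with OpenMP offload. This is a minimal illustration rather than code from the cited studies: the helper name `make_gpu_first_touched` is hypothetical, and the sketch assumes an OpenMP compiler with unified shared memory support and XNACK enabled so the GPU can fault pages in on first access. When built without offload support, the pragmas are ignored and the loop simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Request a single address space shared by host and device.
#pragma omp requires unified_shared_memory

// Hypothetical helper: allocate n doubles and let the GPU perform the first
// write to each element, so pages are faulted in by the agent that is
// expected to access them most often afterwards.
std::unique_ptr<double[]> make_gpu_first_touched(std::size_t n, double value) {
    assert(n > 0);
    // new double[n] default-initializes (no zero fill), so the host does not
    // write to the buffer before the offloaded loop runs.
    std::unique_ptr<double[]> buf(new double[n]);
    double* p = buf.get();

    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = value;  // first write happens inside the offloaded region
    }
    return buf;
}
```

For data that the CPU will mostly access, the same initialization loop would run on the host instead.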

UPM is fundamentally distinct from Unified Virtual Memory (UVM) on discrete GPU platforms: UPM incurs no migration penalties, shares data between CPU and GPU without copies, has a fixed HBM capacity, and places data deterministically in physical memory (Wahlgren et al., 18 Aug 2025, Tandon et al., 2024).

4. Programming Model: OpenMP, HIP/ROCm, and Memory Allocators

Applications ported to MI300A exploit the unified memory system via directive-based programming models or hardware-managed runtime allocation (Tandon et al., 2024, Wahlgren et al., 18 Aug 2025). OpenMP 5.2 features, especially “requires unified_shared_memory,” allow developers to offload compute kernels to the GPU with minimal data declaration overhead. Base pointers are transparently passed between host and device, eliminating the need for explicit mapping/unmapping or separate buffer management (Tandon et al., 2024).
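A minimal sketch of the directive-based style described above, assuming an OpenMP compiler that supports the `requires unified_shared_memory` directive. The function name `saxpy_usm` is illustrative and does not come from the cited porting work. Host pointers are used directly inside the target region, with no `map` clauses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ask the compiler and runtime for one address space shared by host and device.
#pragma omp requires unified_shared_memory

// Computes y[i] = a * x[i] + y[i] in an offloaded loop without map() clauses.
void saxpy_usm(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();  // host pointers, dereferenced on the device
    float* yp = y.data();
    const std::size_t n = x.size();

    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

On a discrete GPU without unified shared memory, the same loop would need explicit `map(to: ...)` and `map(tofrom: ...)` clauses for the two arrays.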

Porting strategies replace discrete host/device staging, explicit memcpy, and data mapping clauses with unified allocations and single-pool double buffering. Common pitfalls include lattice-based GPU TLB fragmentation, performance pathologies for std::vector or static host variables, and suboptimal allocator usage. Custom allocator patterns (e.g., wrapping std::vector with hipMalloc) are recommended for optimal bandwidth in C++ codes (Wahlgren et al., 18 Aug 2025).
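One way the std::vector wrapping pattern might look is sketched below. The names `HipAllocator`, `hip_vector`, and the `USE_HIP` build flag are hypothetical; only `hipMalloc` and `hipFree` are real HIP runtime calls. Without `USE_HIP` the allocator falls back to `std::malloc` so the sketch can be compiled and tested on any host. Note that std::vector writes its elements from the CPU on construction, which relies on hipMalloc memory being CPU-accessible as described for the MI300A; the pattern would not be valid on a discrete GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

#ifdef USE_HIP  // hypothetical build flag, defined by the user when HIP is available
#include <hip/hip_runtime.h>
#endif

// Minimal C++17 allocator: storage comes from hipMalloc when USE_HIP is
// defined, and from std::malloc otherwise (host-only fallback).
template <typename T>
struct HipAllocator {
    using value_type = T;

    HipAllocator() = default;
    template <typename U>
    HipAllocator(const HipAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in the byte count.
        assert(n <= static_cast<std::size_t>(-1) / sizeof(T));
        void* p = nullptr;
#ifdef USE_HIP
        if (hipMalloc(&p, n * sizeof(T)) != hipSuccess) throw std::bad_alloc();
#else
        p = std::malloc(n * sizeof(T));
        if (p == nullptr) throw std::bad_alloc();
#endif
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
#ifdef USE_HIP
        (void)hipFree(p);
#else
        std::free(p);
#endif
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const HipAllocator<T>&, const HipAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const HipAllocator<T>&, const HipAllocator<U>&) { return false; }

// Convenience alias: a std::vector whose buffer comes from the allocator above.
template <typename T>
using hip_vector = std::vector<T, HipAllocator<T>>;
```

Existing code can then switch from `std::vector<double>` to `hip_vector<double>` without changing how the container is used.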

HIP/ROCm is the canonical programming stack, with both hipMalloc and hipMallocManaged allocations supported. The Umpire memory pool allocator improves performance for workloads with large, frequent sub-allocations by reducing Fortran ALLOCATE/DEALLOCATE stalls (Dhar et al., 28 May 2026).

5. Inter-APU Communication and Infinity Fabric Design

Within a multi-node system, four MI300A APUs typically compose a compute node in a symmetric mesh. Each APU connects via six xGMI 3 links, achieving 128 GB/s per peer bidirectionally (Schieffer et al., 15 Aug 2025). Data movement between APUs travels over the Infinity Fabric interconnect, and each APU appears to the OS and to software as a NUMA domain.

For inter-APU and collective communication:

Best bandwidth is attained with hipMalloc buffers used in combination with hipMemcpy (up to 90 GB/s per peer).
MPI with CPU staging achieves the lowest latency (1.9 μs for small buffers); for large buffers and collectives, the GPU-resident RCCL library yields 88 GB/s and 2–10× better scaling for large collective operations (Schieffer et al., 15 Aug 2025).
Allocator and first-touch policies are critical; hipMalloc-allocated buffers are mandatory for high bandwidth, with XNACK=1 required for RCCL collectives.

Latency and bandwidth models:

$L_{\mathrm{GPU,local}} \approx 346\,\mathrm{ns},\quad L_{\mathrm{GPU,remote}} \approx 690\,\mathrm{ns}$

For message copy:

$L(N)\approx\alpha+\beta N,\quad \alpha\approx0.7\,\mu\mathrm{s},\,\beta\approx0.001\,\mu\mathrm{s}/\mathrm{byte}$
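The linear copy model can be evaluated directly. The sketch below takes the coefficients exactly as quoted and makes no claim about how they were fitted; the function names are illustrative.

```cpp
#include <cassert>

// Linear (alpha-beta) copy-time model: L(N) = alpha + beta * N.
// Defaults are the coefficients quoted in the text; times are in microseconds.
double copy_latency_us(double n_bytes, double alpha_us = 0.7,
                       double beta_us_per_byte = 0.001) {
    assert(n_bytes >= 0.0);
    return alpha_us + beta_us_per_byte * n_bytes;
}

// Message size at which the per-byte term equals the fixed startup cost.
double break_even_size_bytes(double alpha_us = 0.7,
                             double beta_us_per_byte = 0.001) {
    assert(beta_us_per_byte > 0.0);
    return alpha_us / beta_us_per_byte;
}
```

With the quoted coefficients, `copy_latency_us(1000.0)` evaluates to about 1.7 μs, and the startup and per-byte terms balance at roughly 700 bytes.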

6. Algorithmic Case Studies and Application Benchmarks

The MI300A’s unified memory system has been evaluated across diverse HPC workflows:

Monte Carlo neutron transport (MC/DC): 12× speedup over 112-core Sapphire Rapids for C5G7 benchmarks; 4× for pin-cell over same CPU baseline. JIT-compiled Numba kernels (via ROCm/LLVM) require no explicit copy or staging, and atomics are implemented via a C++ device runtime (Morgan et al., 9 Jan 2025).
OpenFOAM CFD: Minimal OpenMP directive augmentation yields ≈4× speedup versus NVIDIA H100; page migration, which accounts for more than 65% of runtime on the discrete GPU, is eliminated by the APU's unified memory (Tandon et al., 2024).
PERMANOVA (memory-bound statistics): 22.5× GPU speedup over brute-force CPU, 3× over tiled+SMT CPU; STREAM-triad achieves 3.16 TB/s on GPU, 0.21 TB/s on CPU; tiling is counterproductive for GPU (Sfiligoi, 7 May 2025).
Neighbor graph construction (SOLANET): Lock-free GPU NN-Descent, leveraging hipMalloc and MPI one-sided comms, achieves 8.3×–40× CPU–GPU strong scaling on up to 512 APUs for billion-point graphs. Unified memory simplifies distributed data exchange, avoiding host staging (Iwabuchi et al., 26 May 2026).
Direct numerical simulation (DNS, FS3D code): Umpire-pooled USM outperforms ALLOCATE-heavy strategies; 4× speedup for droplet simulation (4096³ cells), near linear weak scaling to 512 APUs (Dhar et al., 28 May 2026).
Kinetic plasma simulation (iPIC3D): on 32,768 APUs, sustained 22.4 PFLOP/s in FP64 (about 0.68 TFLOP/s per APU); unified HBM is exploited for 33 trillion particles, an implicit Maxwell solve, and hybrid CPU–GPU particle control (Markidis et al., 28 Jul 2025).

7. Advanced Accelerator Features: Matrix Cores and Sparsity

CDNA3’s MFMA engines offer native FP8, FP16, FP32, and FP64 support. Under the cited model, peak theoretical FP8 throughput approaches 38.4 TFLOP/s per APU (240 CUs × 4 MFMA/CU × 2.5 GHz × 16 FMA/tile ≈ 3.84 × 10¹³ per second) (Jarmusch et al., 10 Feb 2026). MFMA instructions achieve a chain latency of 2.46×10⁻⁵ ms (24.6 ns) for a 16×16×32 tile, but steady-state performance requires high occupancy (≥256 wavefronts).
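The parenthetical product can be checked with a few lines of arithmetic. This sketch assumes the factor of 16 is an FMA count per MFMA unit per clock cycle, which is one reading of "16 FMA/tile"; the function name is illustrative.

```cpp
#include <cassert>

// Evaluates the product quoted in the text:
// CUs x MFMA units per CU x clock (Hz) x FMAs per unit per cycle (assumed).
double peak_fma_per_second(int compute_units, int mfma_per_cu,
                           double clock_hz, int fma_per_cycle) {
    assert(compute_units > 0 && mfma_per_cu > 0 && fma_per_cycle > 0);
    assert(clock_hz > 0.0);
    return static_cast<double>(compute_units) * mfma_per_cu * clock_hz *
           fma_per_cycle;
}
```

`peak_fma_per_second(240, 4, 2.5e9, 16)` evaluates to 3.84 × 10¹³ FMA operations per second.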

Asynchronous Compute Engines (ACE) permit overlapping kernel streams. A 2–3× speedup is obtainable with 4–8 streams at the cost of fairness: OE efficiency reaches up to 0.36 at N=8, but per-stream latency can increase by up to 60× under heavy contention.

2:4 structured sparsity is supported in hardware, delivering up to 1.3× per-stream speedup and 7% fairness improvement in concurrent workloads; isolated (single-stream) gains are limited by software (rocSPARSE) overhead (Jarmusch et al., 10 Feb 2026).

References

(Wahlgren et al., 18 Aug 2025) Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
(Schieffer et al., 15 Aug 2025) Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive
(Tandon et al., 2024) Porting HPC Applications to AMD Instinct™ MI300A Using Unified Memory and OpenMP
(Jarmusch et al., 10 Feb 2026) Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
(Morgan et al., 9 Jan 2025) Enabling GPU Portability into the Numba-JITed Monte Carlo Particle Transport Code MC/DC
(Iwabuchi et al., 26 May 2026) SOLANET: Distributed Neighbor Graph Construction on GPU-Accelerated Systems
(Dhar et al., 28 May 2026) The Role of Interfacial Tension in Direct Numerical Simulations of Drop-Film Interaction for Immiscible Fluids
(Markidis et al., 28 Jul 2025) Exascale Implicit Kinetic Plasma Simulations on El Capitan for Solving the Micro-Macro Coupling in Magnetospheric Physics
(Sfiligoi, 7 May 2025) Comparing CPU and GPU compute of PERMANOVA on MI300A

Summary Table: Key MI300A Metrics

Metric	Value / Notes	Source
CPU Cores per APU	24 Zen4 (three 8-core CCDs)	(Wahlgren et al., 18 Aug 2025, Sfiligoi, 7 May 2025)
GPU Compute Units (CUs)	228–240 CDNA3	(Jarmusch et al., 10 Feb 2026, Wahlgren et al., 18 Aug 2025)
On-Package HBM3	128 GB (8×16 GB stacks)	(Wahlgren et al., 18 Aug 2025, Schieffer et al., 15 Aug 2025)
Peak HBM3 Bandwidth	3.5–5.6 TB/s	(Wahlgren et al., 18 Aug 2025, Schieffer et al., 15 Aug 2025, Sfiligoi, 7 May 2025)
GPU STREAM-triad BW (hipMalloc)	3.16–3.5 TB/s	(Wahlgren et al., 18 Aug 2025, Sfiligoi, 7 May 2025)
CPU STREAM BW	0.208 TB/s	(Wahlgren et al., 18 Aug 2025, Sfiligoi, 7 May 2025)
FP8 Matrix Peak Compute	~38.4 TFLOP/s (240 CUs × 4 MFMA/CU × 2.5 GHz × 16 FMA/tile)	(Jarmusch et al., 10 Feb 2026)
Node Interconnect	Infinity Fabric, 128 GB/s/peer	(Schieffer et al., 15 Aug 2025)
Application Speedups (vs CPU)	4–22× (app. specific)	(Morgan et al., 9 Jan 2025, Sfiligoi, 7 May 2025, Dhar et al., 28 May 2026, Tandon et al., 2024)

The AMD Instinct MI300A APU establishes a novel point in HPC node architecture by integrating multi-core Zen4 CPUs, large CDNA3 GPU arrays, and a physically unified, high-bandwidth memory subsystem with native hardware coherence. This combination simplifies programming, supports robust multi-node scaling, and lets bandwidth-constrained workloads approach peak device performance, which makes it a core enabler for exascale simulation, AI, and data analytics workloads.