AMD NPUs: Architecture and Innovation

Updated 17 October 2025
  • AMD's NPUs are domain-specific accelerators designed to optimize deep learning workloads through explicit data movement and spatial organization.
  • They employ a tiled architecture (XDNA) and a bare-metal programming toolchain (IRON/MLIR-AIR) to enhance throughput under strict energy and latency constraints.
  • Advanced memory management via NeuMMU and dynamic scheduling techniques improve both performance and energy efficiency across heterogeneous workloads.

AMD's Neural Processing Units (NPUs) are domain-specific accelerators, introduced as part of the AMD Ryzen AI platform and built to provide efficient computation and data movement for deep learning workloads on consumer and edge devices. AMD’s NPUs employ spatial architectures (XDNA) that organize compute and memory resources into two-dimensional grids, emphasize software-managed scratchpad memories over hardware caches, and support bare-metal programming to allow low-level control over data movement, compute scheduling, and parallelism. Their integration with host CPUs and GPUs facilitates heterogeneous workload execution, tailored to both AI inference and training, under stringent energy and latency constraints.

1. Architectural Features and Spatial Organization

AMD’s NPUs use a tiled spatial architecture, referred to as XDNA in production variants, where resources are organized as compute (AI engine) cores, memory cores, and shim cores interfacing with shared host memory (Rösti et al., 3 Apr 2025). Each AI engine core is a VLIW processor supporting vectorized fused-multiply-add operations, able to execute up to 128 FMA ops per cycle on bfloat16 inputs (reaching 256 GFLOP/s at 1 GHz per core). Memory cores (e.g., with 512 KB local storage) facilitate data reuse, explicit distribution, and spatial tiling. Data movement between DRAM (L3), intermediate buffers (L2, L1), and scratchpads is managed explicitly by software via DMA. The architecture eschews traditional hardware caching, prioritizing software-managed locality for maximum throughput and energy efficiency (Rösti et al., 3 Apr 2025, Deshmukh et al., 25 Aug 2025).
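
To make these headline figures concrete, the short Python sketch below reproduces the per-core peak-throughput arithmetic and derives a simple square-tile bound from the 512 KB memory-core buffer. The 1 GHz clock and the assumption that A, B, and C tiles share a single buffer are illustrative choices, not vendor specifications.

```python
# Back-of-the-envelope sizing for one XDNA AI engine core and its memory core,
# using the figures quoted above (128 bf16 FMA ops/cycle, 1 GHz, 512 KB buffer).
# This is a conceptual calculation, not vendor-published tooling.
import math

FMA_PER_CYCLE = 128           # vector FMA ops per cycle on bfloat16 inputs
FLOP_PER_FMA = 2              # one multiply + one add
CLOCK_HZ = 1e9                # assumed nominal 1 GHz clock
LOCAL_BUF_BYTES = 512 * 1024  # memory-core scratchpad capacity
BF16_BYTES = 2

peak_gflops = FMA_PER_CYCLE * FLOP_PER_FMA * CLOCK_HZ / 1e9
print(f"Peak per-core throughput: {peak_gflops:.0f} GFLOP/s")  # 256 GFLOP/s

# How large a square bf16 GEMM tile fits if A, B, and C tiles share the buffer?
# 3 * n^2 * 2 bytes <= 512 KB
n = math.isqrt(LOCAL_BUF_BYTES // (3 * BF16_BYTES))
print(f"Max square tile edge with A, B, C resident: ~{n} elements")
```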

2. Software and Compilation Stack

Bare-metal programming is supported via AMD’s IRON toolchain: users describe data movements and per-core compute kernels in Python, which are compiled to MLIR and further lowered to device binaries (xclbin) (Rösti et al., 3 Apr 2025). To ease programming complexity and expose spatial, asynchronous structure, the MLIR-AIR compiler framework was introduced (Wang et al., 16 Oct 2025). MLIR-AIR provides the AIR dialect, supporting operations like air.launch (task offload), air.herd (spatial grouping of work units), air.segment (resource locality), and explicit memory channels via air.channel.put/get. This structure enables the compiler to tile computations, distribute workloads spatially, overlap communication and compute, and manage dependencies, resulting in efficient mapping of AI workloads (e.g., tiled GEMM, fused attention) with up to 78.7% compute efficiency and near parity to hand-optimized MLIR-AIE kernels (Wang et al., 16 Oct 2025).
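
As a rough illustration of the spatial partitioning that an air.herd-style launch expresses, the NumPy sketch below splits a GEMM across a hypothetical 4x4 grid of compute tiles. It models only the tiling structure; it is not the AIR dialect, the IRON API, or the generated data-movement code.

```python
# Conceptual model of how an air.herd-style launch partitions a GEMM across a
# 2-D grid of compute tiles. Plain NumPy; grid shape and tile sizes are illustrative.
import numpy as np

M, K, N = 256, 256, 256
GRID_ROWS, GRID_COLS = 4, 4          # hypothetical 4x4 herd of AI engine cores
TM, TN = M // GRID_ROWS, N // GRID_COLS

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# Each (r, c) "core" owns one C tile; its A-row panel and B-column panel would be
# streamed into local memory via explicit channels in a real AIR program.
for r in range(GRID_ROWS):
    for c in range(GRID_COLS):
        a_panel = A[r * TM:(r + 1) * TM, :]      # shared along the core row
        b_panel = B[:, c * TN:(c + 1) * TN]      # shared along the core column
        C[r * TM:(r + 1) * TM, c * TN:(c + 1) * TN] = a_panel @ b_panel

assert np.allclose(C, A @ B, atol=1e-3)
```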

| Layer/Component | Architectural Trait | Programming/Control |
| --- | --- | --- |
| AI Engine Core | VLIW, vector FMA, up to 256 GFLOP/s per core | MLIR-AIR, IRON, C++ kernels, MLIR-AIE |
| Memory Core | Local buffer (e.g., 512 KB) | Explicit tiling/scheduling |
| Shim Core | DRAM interface | Software-managed DMA/channel operations |

Software-managed scratchpads and explicit tiling, together with fused kernels and explicit scheduling, make these NPUs suitable for both inference and training, since compute and data movement can be overlapped for high throughput.
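
The following Python sketch models that overlap in miniature: a single-worker thread pool stands in for an asynchronous DMA engine that fetches the next tile while the current one is processed. Tile counts, sizes, and the threading mechanism are illustrative; on the NPU this overlap is expressed through DMA descriptors and buffer ping-ponging, not host threads.

```python
# Minimal double-buffering sketch: while the "core" computes on one tile, the
# next tile is fetched concurrently, emulating DMA/compute overlap.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

NUM_TILES, TILE = 8, 1024
dram = [np.random.rand(TILE).astype(np.float32) for _ in range(NUM_TILES)]

def dma_load(i):              # stand-in for an asynchronous DMA transfer
    return dram[i].copy()

def compute(tile):            # stand-in for a vectorized kernel on the tile
    return float(np.sum(tile * tile))

results = []
with ThreadPoolExecutor(max_workers=1) as dma:
    pending = dma.submit(dma_load, 0)              # prefetch the first tile
    for i in range(NUM_TILES):
        tile = pending.result()                    # wait for the in-flight transfer
        if i + 1 < NUM_TILES:
            pending = dma.submit(dma_load, i + 1)  # start the next transfer early
        results.append(compute(tile))              # overlaps with the next load

print(f"processed {len(results)} tiles")
```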

3. Memory Management and Address Translation

NPUs exhibit bursty, tile-based DRAM accesses, which traditional GPU MMUs cannot serve efficiently. To match the bursty nature of SPM-to-DRAM transfers, NeuMMU was proposed: it includes a Pending Request Merging Buffer (PRMB), scales to as many as 128 parallel page-table walkers (PTWs), and uses Translation Path Registers (TPreg) to cache hierarchical page-table indices (Hyun et al., 2019). This architecture incurs only 0.06% overhead versus an oracular MMU, reduces energy by over 16×, and enables memory oversubscription and direct remote access (NUMA-style), improving both performance and energy efficiency for dense and sparse workload scenarios. For a tile of size $T$ and page size $P$, the number of page translations required is $n_{\text{pages}} = T/P$; for example, a 5 MB tile with 4 KB pages requires $n_{\text{pages}} \approx \frac{5 \times 10^6}{4 \times 10^3} = 1250$ page walks.
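
The toy calculation below works through that translation burst and shows why scaling the number of page-table walkers matters; the per-walk latency is an assumed figure for illustration, not a measured NeuMMU parameter.

```python
# Rough model of the translation burst a single tile transfer generates.
# WALK_LATENCY_NS is an assumption for illustration only.
TILE_BYTES = 5 * 10**6       # 5 MB tile, as in the example above
PAGE_BYTES = 4 * 10**3       # 4 KB pages
WALK_LATENCY_NS = 200        # assumed DRAM-bound latency of one page walk
N_WALKERS = 128              # parallel page-table walkers (NeuMMU-style)

n_pages = TILE_BYTES // PAGE_BYTES                        # 1250 translations per tile
serial_ns = n_pages * WALK_LATENCY_NS
parallel_ns = -(-n_pages // N_WALKERS) * WALK_LATENCY_NS  # ceil division

print(f"{n_pages} page walks per tile")
print(f"1 walker:    ~{serial_ns / 1e3:.0f} us of translation latency")
print(f"{N_WALKERS} walkers: ~{parallel_ns / 1e3:.1f} us")
```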

4. Performance and Reliability Techniques

Reliability-aware quantization combats transistor aging by dynamically reducing bit-widths of activations, weights, and biases fed to the MAC units, eliminating the need for performance-sapping guardbands. This method achieves a 23% performance gain with only a modest 3% average accuracy loss over a decade of aging, and energy reductions up to 67% (Salamin et al., 2021). The adaptive quantization mechanism selects optimal compression parameters that guarantee timing constraints post-aging, can be applied at the MAC input level, and integrates with standard quantization libraries (e.g., ACIQ, LAPQ).
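
A minimal sketch of the underlying mechanism, reducing the bit-width of values before they reach the MAC array, is shown below using generic symmetric min-max quantization; the scheme and the bit-widths are illustrative stand-ins, not the aging-aware selection procedure of the cited work.

```python
# Sketch of narrowing the bit-width of a tensor before it reaches the MAC array.
# Generic symmetric min-max quantization; not the cited method, which chooses
# widths that meet post-aging timing constraints.
import numpy as np

def quantize_symmetric(x, bits):
    """Quantize to signed integers of the given width; return codes, scale, and
    the dequantized approximation."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale, q * scale

weights = np.random.randn(256, 256).astype(np.float32)
for bits in (8, 6, 4):      # narrower widths shorten the MAC critical path
    _, _, deq = quantize_symmetric(weights, bits)
    err = np.mean(np.abs(weights - deq)) / np.mean(np.abs(weights))
    print(f"{bits}-bit: mean relative error {err:.3%}")
```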

5. Dynamic Scheduling, Kernel Optimization, and Automation

In heterogeneous AMD SoCs (CPU+GPU+NPU), performance depends critically on workload-aware, dynamic scheduling. Real-time generative AI applications exhibit diverse model phases (e.g., LLM prefill—compute bound; decode—bandwidth bound) and require schedulers sensitive to time-to-first-token (TTFT) and deadlines (Karami et al., 19 Jul 2025). NPUs excel at compute-bound phases and at short-sequence CNNs but lose to GPUs when bandwidth dominates. The First Token First (FTF) policy introduced in (Karami et al., 19 Jul 2025) gives TTFT deadline priority and dynamically adapts resource allocation, reducing deadline violation rate by up to 41.7%.
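
The sketch below captures the intent of an FTF-style policy in a few lines of Python: requests that have not yet emitted their first token are dispatched ahead of decode-phase requests and ordered by TTFT deadline. The data fields and tie-breaking rule are assumptions for illustration, not the scheduler implementation from the cited work.

```python
# Conceptual First-Token-First ordering: prefill-phase requests sort ahead of
# decode-phase requests, ties broken by earliest TTFT deadline.
from dataclasses import dataclass, field
import heapq
import time

@dataclass(order=True)
class Request:
    sort_key: tuple = field(init=False)
    ttft_deadline: float = 0.0       # absolute deadline for the first token
    in_prefill: bool = True          # prefill (compute-bound) vs decode phase
    rid: int = 0

    def __post_init__(self):
        self.sort_key = (0 if self.in_prefill else 1, self.ttft_deadline)

now = time.monotonic()
queue = [
    Request(ttft_deadline=now + 0.30, in_prefill=True,  rid=1),
    Request(ttft_deadline=now + 0.10, in_prefill=True,  rid=2),
    Request(ttft_deadline=now + 0.05, in_prefill=False, rid=3),  # already decoding
]
heapq.heapify(queue)
while queue:
    r = heapq.heappop(queue)
    print(f"dispatch request {r.rid} (prefill={r.in_prefill})")
# Dispatch order: 2, 1, 3 -- prefill requests with the tightest deadlines first.
```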

Compiler optimizations are critical: Tensor Slicing Optimization (TSO) models DRAM burst cost and tiling at both processor-cluster and per-core levels, yielding up to 21.7% speedup in CNN workloads by aligning slices to DRAM burst boundaries (Sousa et al., 2023). For attention workloads, frameworks like Zen-Attention fuse matrix multiplications, bias/mask addition, and SoftMax to minimize DRAM accesses, compute optimal tiling based on buffer constraints, and apply hybrid transpose/padding strategies—leading to up to 4× attention latency reduction and 32% network latency improvement (Deshmukh et al., 25 Aug 2025). FastAttention adapts FlashAttention2 to NPUs via two-level tiling, tiling mask (for memory savings), and blockwise AllReduce (for communication reduction in multi-NPU), achieving up to 10.7× operator speedups and 5.16× LLM throughput gains (Lin et al., 22 Oct 2024).
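
To show the kind of tiling these frameworks rely on, the NumPy sketch below computes attention blockwise with an online softmax so that the full score matrix is never materialized, the same principle FlashAttention-style kernels use to keep working sets in local memory. Block size and shapes are illustrative; this is not the Zen-Attention or FastAttention implementation.

```python
# Blockwise (online-softmax) attention: only one Q tile and one K/V tile are
# "resident" at a time, avoiding a full S = QK^T matrix in DRAM.
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    n, d = Q.shape
    out = np.zeros_like(Q)
    for qs in range(0, n, block):
        q = Q[qs:qs + block]                          # query tile kept resident
        m = np.full(q.shape[0], -np.inf)              # running row maxima
        l = np.zeros(q.shape[0])                      # running softmax denominators
        acc = np.zeros((q.shape[0], d))               # running weighted sums
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T / np.sqrt(d)   # scores for one K/V tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            correction = np.exp(m - m_new)            # rescale previous partials
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

n, d = 256, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
s = Q @ K.T / np.sqrt(d)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)
```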

Automated kernel optimization is emerging: NPUEval benchmarks LLM-generated NPU kernels for both correctness and vectorization performance. State-of-the-art LLMs currently achieve only ~10% average vectorization, though select kernels reach >50% (Kalade et al., 18 Jul 2025). These results underscore how fragmented NPU programming toolchains remain and point to the importance of retrieval-augmented generation and iterative compiler feedback in future NPU kernel workflows.

6. Sparse Workload Memory Access

Sparse DNN workloads yield irregular memory accesses, leading to cache-stall bottlenecks on SIMD NPUs. NVR (NPU Vector Runahead) is a speculative prefetcher, operating as a lightweight sub-thread that issues vectorized load instructions based on snooped state and dependency chain reconstruction (Wang et al., 19 Feb 2025). With minimal hardware overhead (<5%), it reduces cache misses by 90% and yields 4× speedup on sparse workloads. Adding a non-blocking speculative buffer (16 KB) to the NPU amplifies effects—delivering performance benefits up to 5× greater than increasing L2 cache by the same size.
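
A toy software model of the idea is sketched below: in a CSR sparse matrix-vector product, a "runahead" pass walks the column-index stream a fixed distance ahead of the compute loop and touches the corresponding entries of the dense vector, standing in for the vectorized prefetches NVR issues from snooped hardware state. The prefetch distance and the whole software framing are illustrative assumptions.

```python
# Toy model of vector-runahead prefetching for CSR SpMV. Touching x[ahead] is a
# stand-in for issuing vectorized prefetch requests; NVR does this in hardware.
import numpy as np

def spmv_with_runahead(indptr, col_idx, vals, x, distance=256):
    y = np.zeros(len(indptr) - 1)
    prefetched = 0
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        # Runahead sub-thread: reconstruct the col_idx -> x dependence chain
        # ahead of time so x[col] is resident when the MACs need it.
        ahead = col_idx[end:end + distance]
        _ = x[ahead]
        prefetched += len(ahead)
        y[row] = np.dot(vals[start:end], x[col_idx[start:end]])
    return y, prefetched

# Tiny CSR example: y should equal the dense product.
indptr  = np.array([0, 2, 5, 6])
col_idx = np.array([1, 3, 0, 2, 3, 1])
vals    = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x       = np.array([1.0, 2.0, 3.0, 4.0])
y, touched = spmv_with_runahead(indptr, col_idx, vals, x, distance=4)
assert np.allclose(y, [10.0, 35.0, 12.0])
```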

7. Application Domains and Future Directions

AMD NPUs are leveraged for both inference and client-side training. Bare-metal programming unlocks custom fine-tuning (e.g., GPT-2, via LLM.c offloaded to NPU), accelerating matrix multiplications by over 2.8× and achieving FLOPS/W improvements of 1.4× (Rösti et al., 3 Apr 2025). Object localization frameworks optimized for NPUs (OCDet) utilize bounding box-aware heatmaps and NPU-friendly backbones, reducing latency by 64% over YOLO-type detectors while increasing alignment scores (Xin et al., 23 Nov 2024).

Future work identified includes extending compiler stacks and quantization schemes for further aging/thermal resilience, runtime adaptation of attention folding and tiling, fully automated kernel generation (pass@k iterative refinement, agentic workflows), and broader hybrid frameworks capable of exploiting the strengths of CPUs, GPUs, and NPUs in concert (Kalade et al., 18 Jul 2025, Karami et al., 19 Jul 2025, Wang et al., 16 Oct 2025). This suggests that as model complexity and hardware heterogeneity increase, coordinated advances in scheduling, compiler optimization, and hardware-aware programming will be foundational to further unlocking AMD NPUs’ latency, throughput, and perf/watt advantages.


In summary, AMD’s NPUs embody a distinct class of spatial accelerators whose performance is determined by explicit data movement, fine-grained scheduling, fused kernel execution, and architectural support for dynamic AI workloads and memory management. Compiler frameworks such as MLIR-AIR, reliability-aware design, and emerging automation via LLM-driven kernel generation are critical enablers that continue to shape the evolving landscape of neural processing on AMD hardware.
