
Ryzen AI XDNA NPUs: Architecture & Performance

Updated 20 December 2025
  • Ryzen AI XDNA NPUs are specialized processing units in AMD's Ryzen AI SoCs, featuring a highly parallel, software-managed dataflow architecture for both inference and training.
  • They employ a 2D grid design of compute and memory tiles with advanced DMA and tiling strategies, achieving up to 50 TOPS (INT8) and 25 TFLOPS (BF16) performance.
  • Their robust programming toolchains, including AMD IRON and MLIR-AIR, enable precise spatial scheduling and real-time workload management for diverse AI applications.

Ryzen AI XDNA NPUs are specialized neural processing units integrated into AMD's Ryzen AI system-on-chips (SoCs), supporting both inference and training workloads for modern deep learning models. These NPUs are designed for the dense, matrix-intensive operations common in real-time generative AI, transformer models, and convolutional neural networks (CNNs), and expose a highly parallel, software-managed dataflow microarchitecture with fine-grained tiling and scheduling control. Their architecture has evolved across multiple generations, notably XDNA ("Phoenix") and XDNA2, with improvements in throughput, memory hierarchy, and heterogeneous integration with CPUs and GPUs. This entry presents the microarchitectural principles, programming toolchains, optimization methodologies, real-time scheduling frameworks, and empirical performance characteristics of XDNA NPUs on Ryzen AI platforms.

1. Microarchitecture and Memory Hierarchy

Ryzen AI XDNA NPUs are implemented as 2D grids of compute tiles ("AI Engines"), memory tiles (L2 SRAM), and shim tiles interfacing with off-chip DRAM. The XDNA2 in the AMD Ryzen AI 9 365, for example, comprises 12 independent compute clusters, each with four 256-bit-wide vector ALU lanes; each SIMD lane can issue 16 MACs at 16-bit precision or 64 MACs at 4-bit precision per cycle. The internal interconnect is a 512-bit ring network, providing fast coupling between cores and the on-chip 4 MB shared weight SRAM. Each compute core includes a 128 KB scratchpad for activations, and the system provides a dedicated 8 MB L2-style buffer for caching and buffering tile I/O. The off-chip path is dual-channel LPDDR5 with an aggregate bandwidth of up to 120 GB/s, while on-chip SRAM achieves 800 GB/s. The MAC arrays support INT4, INT8, and BF16 precisions, with hardware primitives for softmax, layer normalization, and key-value-cache streaming engines for transformers. Peak sustained throughput reaches 50 TOPS (INT8) or approximately 25 TFLOPS (BF16) on XDNA2 (Karami et al., 19 Jul 2025, Taka et al., 15 Dec 2025).

First-generation XDNA ("Phoenix") implements a 4×5 (20-core) or 4×4 (16-core) grid, each compute tile with 64 KB L1 local memory, 128-wide vector units for bfloat16 FMA, and independent DMA engines. Shim and memory tiles (L2: 512 KB per tile) coordinate on-chip and off-chip data movement (Rösti et al., 3 Apr 2025).
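
As a back-of-envelope illustration of these figures, the sketch below computes the roofline break-even arithmetic intensities implied by the quoted XDNA2 numbers (50 TOPS INT8, 120 GB/s LPDDR5, 800 GB/s on-chip SRAM). Treating these headline figures as idealized roofline parameters is a simplifying assumption, not a calibrated model of the hardware.

```python
# Back-of-envelope roofline estimate for XDNA2, using the headline figures
# quoted above (50 TOPS INT8, 120 GB/s LPDDR5, 800 GB/s on-chip SRAM).
# The break-even arithmetic intensity is the ops/byte a kernel must reach
# before it stops being bandwidth bound at the corresponding memory level.

PEAK_INT8_OPS = 50e12      # 50 TOPS (INT8), XDNA2 headline figure
DRAM_BW       = 120e9      # dual-channel LPDDR5, ~120 GB/s aggregate
SRAM_BW       = 800e9      # on-chip SRAM bandwidth, ~800 GB/s

def attainable_ops(arith_intensity_ops_per_byte: float, bw_bytes_per_s: float) -> float:
    """Classic roofline: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(PEAK_INT8_OPS, bw_bytes_per_s * arith_intensity_ops_per_byte)

# Arithmetic intensity needed to become compute bound at each memory level.
print("break-even AI vs DRAM:", PEAK_INT8_OPS / DRAM_BW, "ops/byte")   # ~417
print("break-even AI vs SRAM:", PEAK_INT8_OPS / SRAM_BW, "ops/byte")   # ~62.5

# Example: a kernel streamed from DRAM at 100 ops/byte is bandwidth bound,
# sustaining only ~12 TOPS; the same kernel served from SRAM is compute bound.
print(attainable_ops(100, DRAM_BW) / 1e12, "TOPS from DRAM")
print(attainable_ops(100, SRAM_BW) / 1e12, "TOPS from SRAM")
```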

2. Programming Model and Toolchains

NPUs in Ryzen AI XDNA platforms expose direct control over spatial scheduling, data movement, and buffering via a multilayered software stack:

  • Bare-Metal Toolchains: The AMD IRON toolchain, MLIR-AIE/MLIR-AIR compilers, and close-to-metal C++ APIs allow custom kernel development. The toolchain emits static tile programs (top.xclbin), runtime instruction streams (insts.txt), and supports explicit tile/program binding, switch-box configuration, and DMA descriptor programming (Rösti et al., 3 Apr 2025, Wang et al., 16 Oct 2025).
  • MLIR-AIR: The MLIR-AIR framework introduces the AIR dialect, a hierarchical IR enabling asynchronous, spatial, and hierarchical mapping of operations. Primitives such as air.launch, air.herd, and air.channel.{put/get} orchestrate kernel launches, spatial work division, and double-buffered DMA across tile grids. Compiler passes progressively lower high-level loop nests into efficient spatial kernels mapped to the NPU grid (Wang et al., 16 Oct 2025).
  • High-Level Framework Integration: ONNX-GenAI (for LLMs), ONNX Runtime (for CNNs), and the Vitis AI Execution Provider backend facilitate deployment of quantized networks (AWQ INT4, INT8) on the NPU and, where suitable, partition workloads between the NPU and GPU (Karami et al., 19 Jul 2025); a minimal session-setup sketch follows this list.
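
As referenced above, the following is a minimal sketch of the high-level deployment path using ONNX Runtime with the Vitis AI Execution Provider and a CPU fallback. The model filename, input shape, and any provider options (omitted here) are placeholders; the exact configuration depends on the model and the installed Ryzen AI SDK release.

```python
# Minimal sketch: run a quantized ONNX model through ONNX Runtime with the
# Vitis AI Execution Provider (NPU), falling back to CPU for unsupported ops.
# "model_int8.onnx" and the input shape are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int8.onnx",                         # hypothetical quantized model
    providers=["VitisAIExecutionProvider",     # NPU-backed execution
               "CPUExecutionProvider"],        # fallback for unsupported ops
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder CNN input
outputs = session.run(None, {session.get_inputs()[0].name: x})
print([o.shape for o in outputs])
```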

3. Algorithmic Optimization and Performance Modeling

Performance-critical operations such as general matrix multiplication (GEMM) and transformer attention blocks are optimized via explicit multi-level tiling, microkernel design, and dataflow-aware scheduling:

  • GEMM Optimization: Core performance tuning uses a four-level tiling strategy—(1) microtiles for per-core L1, (2) core-local tiles, (3) multicore tiles, (4) DRAM-level outer tiles. The balanced-point methodology finds tile sizes that simultaneously saturate compute and hide DMA latencies, using analytic models of compute cycles, DMA cycles, and L1/DRAM constraints. Layouts maintain A and C in row-major, B in column-major in main memory for contiguous accesses, leveraging multi-D DMA for in-flight transpositions (Taka et al., 15 Dec 2025).
  • Transformer Attention (Zen-Attention): The Zen-Attention framework implements "dynamic attention folding," mapping the Q·Kᵀ→Add→SoftMax→(SoftMax·V) fusion as a single spatially tiled sweep resident in L1/L2. Only the inputs and the final Z tensor traverse DRAM, reducing bandwidth by up to 4×. Tiling parameters $(M_t, N_t, K_t)$ are chosen to maximize L1 reuse and minimize DRAM access, subject to the buffer constraint $S_{L1} = b_p(M_t K_t + N_t K_t + M_t N_t) \leq 64\text{ KB}$ (Deshmukh et al., 25 Aug 2025). An illustrative tile-selection sketch follows this list.
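
As referenced above, the sketch below illustrates the shared reasoning behind balanced-point tiling and the L1 buffer constraint: enumerate candidate tile shapes, discard those exceeding the 64 KB scratchpad, and prefer tiles whose DMA traffic is hidden behind compute. The per-cycle MAC and DMA rates are placeholder constants, not the calibrated analytic models from the cited papers.

```python
# Illustrative tile-selection sketch combining the balanced-point idea with the
# L1 constraint S_L1 = b_p*(Mt*Kt + Nt*Kt + Mt*Nt) <= 64 KB. The per-cycle MAC
# and DMA rates below are placeholders, not measured hardware parameters.

L1_BYTES = 64 * 1024          # per-core L1 scratchpad capacity
BYTES_PER_ELEM = 2            # b_p for BF16 operands (assumption)
MACS_PER_CYCLE = 128          # per-core MAC throughput (placeholder)
DMA_BYTES_PER_CYCLE = 32      # per-core DMA bandwidth (placeholder)

def fits_l1(mt: int, nt: int, kt: int) -> bool:
    return BYTES_PER_ELEM * (mt * kt + nt * kt + mt * nt) <= L1_BYTES

def compute_cycles(mt: int, nt: int, kt: int) -> float:
    return (mt * nt * kt) / MACS_PER_CYCLE

def dma_cycles(mt: int, nt: int, kt: int) -> float:
    # Move the two input tiles in and the output tile out each iteration.
    return BYTES_PER_ELEM * (mt * kt + kt * nt + mt * nt) / DMA_BYTES_PER_CYCLE

# "Balanced" candidates: fit in L1 and stay compute-bound so that
# double-buffered DMA transfers are fully overlapped with the MAC work.
candidates = [
    (mt, nt, kt)
    for mt in (16, 32, 64, 128)
    for nt in (16, 32, 64, 128)
    for kt in (16, 32, 64, 128)
    if fits_l1(mt, nt, kt) and compute_cycles(mt, nt, kt) >= dma_cycles(mt, nt, kt)
]
# Among balanced tiles, pick the one with the most data reuse (largest volume).
print("balanced tile (Mt, Nt, Kt):", max(candidates, key=lambda t: t[0] * t[1] * t[2]))
```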

4. Scheduling for Real-Time Generative AI

The deployment of real-time, multi-model AI workloads (RTGen)—combining LLMs, RAG pipelines, and real-time CNNs—on heterogeneous SoCs exposes complex scheduling challenges. Five dynamic scheduling policies have been evaluated:

| Policy   | LLM-Latency Aware? | Real-Time Aware? | Dynamic Backend? | GenAI-Aware? |
|----------|--------------------|------------------|------------------|--------------|
| FCFS-AOT | Yes                | No               | No               | No           |
| FCFS-DYN | Yes                | No               | Yes              | No           |
| EDF-AOT  | No                 | Yes              | No               | No           |
| EDF-DYN  | No                 | Yes              | Yes              | No           |
| FTF      | Partial            | Yes              | Yes              | Yes          |

The central trade-off is between achieving low LLM time-to-first-token (TTFT) and a low deadline violation rate (DVR) for CNNs. For example, under scenario D ("video conference"), FTF scheduling achieves a 41.7% absolute improvement in DVR over FCFS (DVR: 20.8% vs 66.5%) while preserving TTFT (1578 ms); however, decode TPS declines due to shared NPU/GPU contention. LLM prefill stages are consistently about 3× faster on the NPU than on the GPU; CNN super-resolution and object detection are also fastest on the NPU (up to 84% speedup), whereas segmentation often prefers the GPU (Karami et al., 19 Jul 2025).

Best practices include assigning LLM prefill to the NPU and decode to the GPU, pipelining CNNs on the NPU, and applying a two-phase dynamic scheduler that gives GenAI stages special treatment via urgent sub-deadlines, followed by true deadline-driven ordering of the remaining workloads.
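
A schematic sketch of this two-phase policy follows: GenAI stages whose urgent sub-deadline has passed are served first, and everything else is ordered earliest-deadline-first. Task fields, backend labels, and timing values are illustrative assumptions, not the RTGen implementation.

```python
# Two-phase, GenAI-aware ordering (schematic): phase 1 promotes GenAI stages
# whose urgent sub-deadline has passed; phase 2 applies earliest-deadline-first.
import time
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline: float                      # absolute deadline (seconds)
    is_genai: bool = False               # GenAI stage (e.g. LLM prefill/decode)
    sub_deadline: float = float("inf")   # urgent sub-deadline for GenAI stages
    backend: str = "NPU"                 # illustrative backend label: "NPU" or "GPU"

def pick_next(ready: list[Task], now: float) -> Task:
    # Phase 1: GenAI stages whose urgent sub-deadline has passed jump the queue.
    urgent = [t for t in ready if t.is_genai and now >= t.sub_deadline]
    pool = urgent if urgent else ready
    # Phase 2: earliest-deadline-first ordering within the chosen pool.
    return min(pool, key=lambda t: t.deadline)

now = time.monotonic()
ready = [
    Task("super_resolution", deadline=now + 0.30),
    Task("object_detection", deadline=now + 0.50),
    Task("llm_prefill", deadline=now + 2.00, is_genai=True, sub_deadline=now),
]
# llm_prefill wins despite its later deadline: its urgent sub-deadline has passed.
print(pick_next(ready, time.monotonic()).name)
```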

5. Compiler Support and Spatial Mapping

The MLIR-AIR compiler stack enables efficient spatial mapping by transforming high-level loop nests to explicit tile-scheduled programs. Essential features include:

  • Hierarchical Resource Launch: Kernels dispatched onto arbitrary subgrids using air.launch and air.herd.
  • Overlapped Communication and Compute: Double-buffered DMA and asynchronous streaming primitives (air.channel.{put,get}) enable overlapping data movement with matmul computation, hiding round-trip memory latency.
  • Auto-generated Synchronization: Async tokens and air.wait_all maintain explicit control over data and execution dependencies.
  • Tiling and Pipelining Efficiency: Compute-to-communication efficiency is modeled by

$\mathrm{Efficiency} = \dfrac{2T_M T_N T_K}{2T_M T_N T_K + \alpha\,(T_M T_K + T_K T_N)}$

with tile shapes optimized to maximize L1 usage and effective channel depth (Wang et al., 16 Oct 2025). A numerical illustration of this efficiency model is sketched below.
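
The sketch below evaluates the efficiency model above for a few tile shapes. The value of α, which folds per-element operand-transfer cost into MAC-cycle units, is an arbitrary placeholder rather than a measured hardware constant.

```python
# Numerical illustration of the compute-to-communication efficiency model above:
#   Efficiency = 2*TM*TN*TK / (2*TM*TN*TK + alpha*(TM*TK + TK*TN)).
# alpha is an arbitrary placeholder, not a measured hardware constant.

def efficiency(tm: int, tn: int, tk: int, alpha: float) -> float:
    compute = 2 * tm * tn * tk                 # MAC work performed on the tile
    traffic = alpha * (tm * tk + tk * tn)      # A and B operand movement cost
    return compute / (compute + traffic)

# T_K cancels algebraically, so efficiency is governed by the output-tile shape
# (T_M, T_N) and alpha; larger output tiles amortize operand traffic over more
# MACs, subject to the L1-capacity and channel-depth limits noted above.
for t in (16, 32, 64, 128):
    print(f"TM=TN={t:4d}, TK=64 -> efficiency = {efficiency(t, t, 64, alpha=4.0):.3f}")
```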

For representative workloads, MLIR-AIR achieves 48.6–78.7% of silicon peak, with less than a 3-percentage-point gap to hand-optimized MLIR-AIE code.

6. Empirical Performance and Benchmarking

  • GEMM Throughput: On XDNA, up to 6.76 TOPS (INT8) and 3.14 TFLOPS (BF16); on XDNA2, up to 38.05 TOPS (INT8) and 14.71 TFLOPS (BF16). Balanced-point tiling achieves 95–98% L1 utilization (single-core), is DRAM-bandwidth bound at small arithmetic intensity, and yields an overall 5–6× TOPS increase for XDNA2 over XDNA at similar utilization (Taka et al., 15 Dec 2025).
  • Attention Block Latency: Zen-Attention yields 4× reduction in attention-block DRAM bandwidth, up to 4× speedup in attention latency, and up to 32% reduction in end-to-end transformer latency (Deshmukh et al., 25 Aug 2025).
  • End-to-End Model Performance: On-client fine-tuning of GPT-2 achieves a 1.7× throughput gain on mains power, 1.2× on battery, and a 1.4× improvement in energy efficiency (FLOPs per watt-second). Per-GEMM speedups reach 4.2× for large matrix sizes (Rösti et al., 3 Apr 2025).
  • Scheduling Impact: FTF scheduler on RTGen workloads reduces deadline violations by 41.7% compared to FCFS but incurs a trade-off with post-prefill LLM TPS (Karami et al., 19 Jul 2025).

7. Scheduling and Design Insights

Empirical results and analyses indicate that maximal XDNA NPU utilization and on-device AI performance require:

  • Fine-grained, per-layer latency modeling for both NPU and GPU.
  • Deadline-driven, GenAI-aware work queue prioritization.
  • Dynamically enforced NPU utilization caps (<80%) to prevent excessive queuing.
  • Adaptive reassignment of backend mapping in response to observed performance violations (a minimal sketch follows this list).
  • Compiler- and framework-level integration that leverages the explicit hardware model (via MLIR-AIR, IRON) for auto-scheduling.
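
As referenced above, the sketch below combines the utilization-cap and backend-reassignment guidelines: cap NPU occupancy and remap a workload to the other backend when its observed latency repeatedly violates its budget. Thresholds, counters, and backend labels are illustrative policy assumptions, not measured parameters.

```python
# Illustrative sketch: NPU utilization cap plus adaptive backend reassignment.
# If the NPU is near its cap, or a workload repeatedly misses its latency
# budget on its current backend, remap it to the other backend.

NPU_UTILIZATION_CAP = 0.80     # cap from the guideline above (<80%)
VIOLATION_LIMIT = 3            # consecutive misses before remapping (assumption)

class BackendMapper:
    def __init__(self) -> None:
        self.backend: dict[str, str] = {}      # workload -> "NPU" or "GPU"
        self.violations: dict[str, int] = {}   # consecutive budget misses

    def assign(self, workload: str, preferred: str, npu_utilization: float) -> str:
        # Respect the utilization cap when the preferred backend is the NPU.
        if preferred == "NPU" and npu_utilization >= NPU_UTILIZATION_CAP:
            preferred = "GPU"
        self.backend.setdefault(workload, preferred)
        return self.backend[workload]

    def report_latency(self, workload: str, observed_ms: float, budget_ms: float) -> None:
        # Track consecutive violations and flip the backend once the limit is hit.
        if observed_ms > budget_ms:
            self.violations[workload] = self.violations.get(workload, 0) + 1
            if self.violations[workload] >= VIOLATION_LIMIT:
                current = self.backend.get(workload, "NPU")
                self.backend[workload] = "GPU" if current == "NPU" else "NPU"
                self.violations[workload] = 0
        else:
            self.violations[workload] = 0

mapper = BackendMapper()
print(mapper.assign("segmentation", preferred="NPU", npu_utilization=0.85))  # -> GPU
```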

A plausible implication is that, as LLM context lengths increase, more aggressive dynamic throttling and resource rebalancing between CNN and GenAI streams will be necessary to balance TTFT and real-time deadline adherence.


Ryzen AI XDNA NPUs represent a software-managed, spatially programmed accelerator fabric with scalable, domain-tailored support for both inference and (select) training workloads. Their empirical performance and software programmability are defined by the close interaction between hardware tiling, DMA/dataflow orchestration, and dynamic, workload-aware scheduling policy (Karami et al., 19 Jul 2025, Deshmukh et al., 25 Aug 2025, Taka et al., 15 Dec 2025, Wang et al., 16 Oct 2025, Rösti et al., 3 Apr 2025).
