
Neural Processing Units (NPUs)

Updated 26 August 2025
  • Neural Processing Units (NPUs) are specialized hardware accelerators optimized for deep learning tasks, leveraging dedicated compute fabrics like systolic arrays.
  • They employ advanced techniques such as hardware-assisted virtualization and optimized memory management to enhance performance and enable efficient multi-tenancy.
  • Recent research focuses on improving scheduling strategies, power gating, and simulation frameworks to increase NPUs’ real-time responsiveness and scalability.

Neural Processing Units (NPUs) are specialized hardware accelerators designed to execute deep learning workloads with high throughput and energy efficiency. NPUs exploit the inherent data-parallelism of modern neural network models, offering dedicated compute fabrics—most notably matrix engines such as systolic arrays—and on-chip memory architectures optimized for tensor operations, thus distinguishing themselves from traditional CPUs, GPUs, and digital signal processors. Their evolution has been shaped by the demands of large-scale inference in data center, cloud, edge, and embedded device settings, resulting in a diverse spectrum of design and system-level challenges centered around resource sharing, memory management, performance isolation, real-time responsiveness, and power management.

1. Architectural Principles and Execution Model

NPUs typically feature tightly coupled compute resources such as matrix engines (e.g., systolic arrays), vector units, and software-managed on-chip scratchpads or buffers. The execution model is largely deterministic and dataflow-oriented: neural network layers (e.g., convolution, matrix multiplication, activation) are cast as parallelizable operator kernels, maximally exploiting fine-grained parallelism across multidimensional data tiles.

Matrix computation is often mapped to large 2D systolic arrays, where the performance is bounded by both MAC (multiply–accumulate) utilization and the efficiency of feeding operands from memory, as summarized by tile-level formulas such as:

$$\mathrm{Time}_{\text{inner tile}} = \max(C_1, M_1)$$

where $C_1$ is the compute phase time and $M_1$ is the memory access phase time, both parameterized in terms of accumulator size, buffer dimensions, and DRAM bandwidth (Choi et al., 2019).
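
A minimal sketch of this tile-level bound in Python; the parameter names and the double-buffering assumption are illustrative, not the exact model of Choi et al. (2019):

```python
# Illustrative sketch of Time_innertile = max(C1, M1) under double buffering,
# where compute and memory phases of consecutive tiles overlap.

def compute_phase_cycles(tile_rows: int, tile_cols: int, array_dim: int) -> float:
    """Cycles to push one inner tile through an array_dim x array_dim systolic array."""
    return tile_rows * tile_cols / array_dim

def memory_phase_cycles(tile_bytes: int, dram_bw_bytes_per_cycle: float) -> float:
    """Cycles to stream the tile's operands from DRAM into the on-chip buffer."""
    return tile_bytes / dram_bw_bytes_per_cycle

def inner_tile_time(tile_rows, tile_cols, tile_bytes, array_dim, dram_bw):
    c1 = compute_phase_cycles(tile_rows, tile_cols, array_dim)
    m1 = memory_phase_cycles(tile_bytes, dram_bw)
    # With overlap, the slower of the two phases bounds the per-tile time.
    return max(c1, m1)

if __name__ == "__main__":
    # 128x128 INT16 tile on a 128x128 array fed at 64 bytes/cycle (assumed numbers).
    print(inner_tile_time(128, 128, 128 * 128 * 2, array_dim=128, dram_bw=64.0))
```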

Modern NPUs support hardware–software co-designed preemption, enabling fine-grained checkpointing (saving partial activations via DMA), immediate context “KILL” operations, and draining mechanisms that let a running network finish before switching workloads. Power gating, another architectural differentiator, is implemented with per-component strategies: cycle-level gating for PEs in systolic arrays, hardware-managed gating for memory controllers, and ISA-extended, software-managed gating for vector units and SRAM (Xue et al., 4 Aug 2025).
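
A hedged sketch of how a runtime might choose among these three mechanisms; the policy and its thresholds are illustrative assumptions, not a documented vendor interface:

```python
from enum import Enum

class Preemption(Enum):
    CHECKPOINT = "checkpoint"  # save partial activations to DRAM via DMA, resume later
    KILL = "kill"              # drop the context now, recompute the lost work later
    DRAIN = "drain"            # let the running network finish before switching

def choose_preemption(remaining_cycles: int, completed_cycles: int,
                      checkpoint_cycles: int, latency_budget_cycles: int) -> Preemption:
    """Illustrative policy: drain if the running job finishes within the incoming
    job's latency budget; otherwise checkpoint when saving state is cheaper than
    redoing the completed work; otherwise kill."""
    if remaining_cycles <= latency_budget_cycles:
        return Preemption.DRAIN
    if checkpoint_cycles < completed_cycles:
        return Preemption.CHECKPOINT
    return Preemption.KILL
```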

2. Scheduling, Virtualization, and Multi-Tenancy

To maximize resource utilization while guaranteeing low-latency response for prioritized workloads, NPUs now support multiple consolidation and virtualization methods for multi-tenant inference:

Predictive Multi-Task Scheduling (PREMA):

  • Predicts task execution lengths using either analytically derived or empirical models (from profile data) and dynamically assigns token-based priorities. Candidate tasks are scheduled based on accumulated tokens (reflecting slowdown from sharing) and predicted remaining time. In addition, the scheduler switches dynamically between checkpointed preemption and drain by comparing the remaining execution time of the current and candidate tasks (Choi et al., 2019).
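
A hedged sketch of this token-based policy; the field names, token-update rule, and threshold are simplified assumptions for illustration, not PREMA's exact algorithm:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int               # static priority class
    predicted_remaining: float  # predicted remaining execution time (profiled or analytical)
    tokens: float = 0.0         # accumulated credit reflecting slowdown from sharing

def accumulate_tokens(waiting, elapsed: float) -> None:
    """Waiting tasks earn tokens in proportion to priority while another task runs."""
    for t in waiting:
        t.tokens += t.priority * elapsed

def pick_next(ready, threshold: float) -> Task:
    """Among tasks whose tokens exceed the threshold, prefer the shortest predicted
    remaining time; if none qualify, fall back to the highest-token task."""
    eligible = [t for t in ready if t.tokens >= threshold]
    if eligible:
        return min(eligible, key=lambda t: t.predicted_remaining)
    return max(ready, key=lambda t: t.tokens)
```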

Hardware-Assisted Virtualization:

  • Fine-grained resource slicing exposes each physical NPU as several virtual NPUs (vNPUs), with each tenant receiving a subset of Systolic Arrays (SAs) and vector engines (vEs). An analytical utilization model determines the optimal allocation ratio:

$$T = \frac{1 - v}{n_m} + \frac{1 - m}{n_v} + \frac{m + v - 1}{\min(n_m, n_v)}$$

where $m$ and $v$ are the workload's SA and vE active ratios, respectively, and $n_m$, $n_v$ are the per-tenant SA and vE allocations (Xue et al., 7 Aug 2024); a sketch of evaluating this model appears after the list below.

  • Extensions to the NPU ISA (e.g., NeuISA) allow decomposition of operators into micro-Tensor Operators (uTops), each independently scheduled by hardware for maximum resource harvesting across tenants.
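
A hedged sketch of evaluating the utilization model above; the enumeration loop, slowdown budget, and packing objective are illustrative assumptions rather than the paper's allocator:

```python
def normalized_time(m: float, v: float, n_m: int, n_v: int) -> float:
    """T = (1 - v)/n_m + (1 - m)/n_v + (m + v - 1)/min(n_m, n_v), with m, v the
    workload's SA and vE active ratios and n_m, n_v the per-tenant allocations."""
    return (1 - v) / n_m + (1 - m) / n_v + (m + v - 1) / min(n_m, n_v)

def smallest_allocation_within(m: float, v: float, max_sa: int, max_ve: int,
                               t_budget: float):
    """Return the smallest (n_m + n_v) split whose predicted T stays under the
    slowdown budget, leaving the remaining SAs/vEs free for other tenants."""
    feasible = [(n_m + n_v, n_m, n_v)
                for n_m in range(1, max_sa + 1)
                for n_v in range(1, max_ve + 1)
                if normalized_time(m, v, n_m, n_v) <= t_budget]
    return min(feasible) if feasible else None

if __name__ == "__main__":
    # Example workload: SAs active 70% of the time, vector engines 40% (assumed values).
    print(smallest_allocation_within(m=0.7, v=0.4, max_sa=4, max_ve=4, t_budget=0.5))
```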

Topology-Aware Virtualization and Inter-Core Routing:

  • For inter-core connected NPUs, such as Graphcore IPUs, vRouter modules virtualize both instruction and data routes, mapping virtual core IDs to physical ones. Range-based memory translation (vChunk) is used in lieu of page tables to minimize translation stalls by exploiting the monotonic access patterns of DNN layer execution (Feng et al., 13 Jun 2025). Optimal mapping of virtual to physical core topologies uses a graph-edit-distance cost metric under resource-fragmentation constraints.
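
A hedged sketch of range-based translation in the spirit of vChunk; the data layout and the last-chunk cache are illustrative assumptions, not the actual hardware structure:

```python
from bisect import bisect_right

class RangeTranslator:
    """Translate virtual addresses through contiguous (virtual_base, length,
    physical_base) chunks instead of fixed-size page-table entries. Because DNN
    layers walk tensors monotonically, the last matched chunk is cached so the
    common case skips the search entirely."""

    def __init__(self, chunks):
        self.chunks = sorted(chunks)          # [(virtual_base, length, physical_base), ...]
        self.vbases = [c[0] for c in self.chunks]
        self._last = 0                        # index of the most recently used chunk

    def translate(self, vaddr: int) -> int:
        vbase, length, pbase = self.chunks[self._last]
        if not (vbase <= vaddr < vbase + length):         # fast path missed
            i = bisect_right(self.vbases, vaddr) - 1      # fall back to binary search
            vbase, length, pbase = self.chunks[i]
            assert vbase <= vaddr < vbase + length, "unmapped address"
            self._last = i
        return pbase + (vaddr - vbase)

if __name__ == "__main__":
    tr = RangeTranslator([(0x0000, 0x4000, 0x80000), (0x4000, 0x4000, 0xA0000)])
    print(hex(tr.translate(0x0100)), hex(tr.translate(0x4100)))
```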

3. Memory Management and Data Movement

NPUs face unique memory-system design constraints. Unlike GPUs, whose translation caches are optimized for instruction-level locality, NPUs access large multidimensional tensor tiles in memory bursts, producing large numbers of simultaneous TLB misses:

  • NeuMMU: Adopts a throughput-centric page-translation architecture. A Pending Request Merging Buffer (PRMB) merges translation requests for the same page, while a scalable fleet of hardware page-table walkers (up to 128) handles simultaneous translations. Per-walker path registers cache upper page-table levels, avoiding redundant hierarchy accesses (Hyun et al., 2019). This approach maintains near-ideal performance (0.06% overhead) and reduces energy by $16\times$ versus GPU-inspired IOMMUs.
  • Tensor Slicing Optimization (TSO): Compiler optimizations partition CNN tensors hierarchically (tile-level and sub-tile-level) and select among input-stationary, output-stationary, and weight-stationary dataflows to maximize memory reuse and parallelism. DRAM burst behavior is explicitly modeled; for example, transfer time is minimized by aligning tiles with the burst size, $T_{\text{transfer}} = \frac{\text{tile\_size}}{\text{BW}} + \text{bursts} \times \text{CAS}$ (Sousa et al., 2023); a sketch of this model appears after the list below.
  • Zen-Attention Framework: Dynamic attention folding fuses the entire attention operator sequence (e.g., $A = QK^T$, bias/mask addition, softmax, context projection) into a single hardware mapping, reducing DRAM round trips. Tiling and folding are orchestrated to align tensor subvolumes with scratchpad and cache constraints, with transposes fused using hybrid DMA L2 block and register-level shuffles (Deshmukh et al., 25 Aug 2025).
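
A hedged sketch of the TSO-style transfer-time model referenced above; the parameter units and the tile-selection heuristic are illustrative assumptions:

```python
import math

def transfer_time(tile_bytes: int, bw_bytes_per_cycle: float,
                  burst_bytes: int, cas_cycles: float) -> float:
    """T_transfer = tile_size / BW + bursts * CAS, where every DRAM burst moves
    burst_bytes and pays a CAS latency."""
    bursts = math.ceil(tile_bytes / burst_bytes)
    return tile_bytes / bw_bytes_per_cycle + bursts * cas_cycles

def best_tile(candidate_tile_sizes, bw, burst_bytes, cas_cycles):
    """Prefer tile sizes aligned with the burst size: partial bursts waste
    bandwidth, so cost is compared per byte actually moved."""
    return min(candidate_tile_sizes,
               key=lambda t: transfer_time(t, bw, burst_bytes, cas_cycles) / t)

if __name__ == "__main__":
    # Assumed parameters: 32 B/cycle bandwidth, 64 B bursts, 20-cycle CAS latency.
    print(best_tile([3000, 4096, 8192], bw=32.0, burst_bytes=64, cas_cycles=20))
```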

4. Power Efficiency and Reliability

Owing to the high static power in large NPU arrays, fine-grained, adaptive power gating is critical:

  • ReGate: Applies per-PE cycle-level gating in systolic arrays, hardware-controlled gating for ICI/HBM during long idle intervals, and software-ISA-directed gating for vector units and SRAM blocks (with sleep versus off modes). Each transition is taken only when it respects the break-even time $\mathrm{BET} = E_{\text{wakeup}} / P_{\text{saved}}$ (Xue et al., 4 Aug 2025); a sketch of this check appears after the list below. Empirical results demonstrate up to 32.8% energy reduction with negligible (<0.5%) throughput loss.
  • Reliability-Aware Quantization: Instead of clocking with a pessimistic guardband to counteract transistor aging (threshold-voltage shift $\Delta V_{th}$), adaptive quantization compresses MAC-unit inputs via stepwise bit-width reductions (e.g., from 8 bits to $8-\alpha$/$8-\beta$ bits). Formally, lower bit-widths are selected at each epoch to keep the critical-path delay $\leq$ the fresh-chip delay, with an average accuracy loss of only 3% over a ten-year lifetime and a 23% performance gain from eliminating fixed guardbands (Salamin et al., 2021).
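
A hedged sketch of the break-even check; the decision function and example numbers are illustrative assumptions:

```python
def break_even_time(wakeup_energy_j: float, saved_power_w: float) -> float:
    """BET = E_wakeup / P_saved: the minimum idle time for which gating saves
    more energy than waking the component back up will cost."""
    return wakeup_energy_j / saved_power_w

def should_gate(predicted_idle_s: float, wakeup_energy_j: float,
                saved_power_w: float) -> bool:
    """Gate a component only if its predicted idle interval exceeds the BET."""
    return predicted_idle_s > break_even_time(wakeup_energy_j, saved_power_w)

if __name__ == "__main__":
    # e.g. a unit predicted idle for 50 us, 1 uJ wake-up cost, 0.1 W saved while gated.
    print(should_gate(predicted_idle_s=50e-6, wakeup_energy_j=1e-6, saved_power_w=0.1))
```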

5. Benchmarking, Simulation, and Software Ecosystem

NPUs require distinct benchmarking efforts owing to their heterogeneous, rapidly evolving hardware:

  • Edge and Microcontroller NPUs (μNPUs): Open-source frameworks standardize benchmarking by converting universal NN models to INT8-quantized, operator-core subsets compatible across μNPU platforms. Performance is evaluated on end-to-end metrics: inference latency, power (measured with equipment such as the Monsoon HVPM), energy per inference ($E = P \times t$; see the sketch after this list), and memory footprint parsed from linker map files. Unexpected scaling trends and memory-bound bottlenecks are revealed when comparing devices such as the MAX78000, HX-WE2, MILK-V, and STM32 microcontrollers (Millar et al., 28 Mar 2025).
  • Large-Scale Simulation: Simulators such as ONNXim support high-speed, cycle-level, multi-core, and multi-tenant modeling, combining deterministic tile-level computation with cycle-level DRAM/NoC backends via Ramulator/Booksim. By accepting ONNX graphs (with preserved computational dependencies), ONNXim achieves a $384\times$ speedup over detailed accelerator simulators and supports multi-tenant policy evaluation (Ham et al., 12 Jun 2024).
  • LLM-Optimized Kernel Development: NPUEval offers a suite of 102 standard kernel tasks for AMD NPUs, serving as a benchmark for LLM-generated low-level C++ code with vector intrinsics. Empirical results show LLMs achieve 10–20% average vectorization factors, improving to >50% on select kernels with system prompts, retrieval-augmented generation, and iterative compiler-feedback loops (Kalade et al., 18 Jul 2025).
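
A minimal sketch of the energy-per-inference metric mentioned above; the sampling interface is an illustrative assumption, not the Monsoon HVPM API:

```python
def energy_per_inference(power_samples_w, sample_period_s: float,
                         num_inferences: int) -> float:
    """Integrate measured power over the run (E = P * t) and divide by the
    number of inferences executed during the capture window."""
    total_energy_j = sum(power_samples_w) * sample_period_s
    return total_energy_j / num_inferences

if __name__ == "__main__":
    samples_w = [0.21, 0.24, 0.23, 0.22]   # watts, sampled every 1 ms (assumed values)
    print(energy_per_inference(samples_w, sample_period_s=1e-3, num_inferences=2))
```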

6. Application Domains and Performance Implications

NPUs excel in memory-bound, latency-sensitive inference tasks: matrix-vector multiplications (58.6% latency reduction vs. GPU), transformers/LLMs (up to $3.2\times$ speedup for decode stages), and real-time video analytics ($3\times$ throughput vs. GPU at single-frame batch sizes). GPU superiority is observed on large-batch, compute-bound operations due to the scaling of execution units and cache utilization (Jayanth et al., 23 Sep 2024).

Specialized frameworks, such as GraNNite for GNNs (Das et al., 10 Feb 2025) and XAMBA for state-space models (Das et al., 10 Feb 2025), adapt irregular or sequential computation to NPU-friendly matrix-based operators via techniques like precomputed masks, shape padding, mask-driven runtime updates, and look-up-table (C-LUT) approximation of activation functions. INT8 quantization and operator-specific approximations enable up to $10.8\times$ speedup and $8.6\times$ improvement in energy efficiency on edge AI PCs.
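
A hedged sketch of a look-up-table activation approximation in the spirit of the C-LUT technique above; the table size, input range, and choice of GELU are illustrative assumptions:

```python
import math

def build_lut(fn, lo: float, hi: float, entries: int):
    """Precompute fn at evenly spaced points so the transcendental can be
    replaced at runtime by a table read plus linear interpolation."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)]

def lut_eval(lut, lo: float, hi: float, x: float) -> float:
    """Clamp to the table range, then linearly interpolate between neighbors."""
    x = min(max(x, lo), hi)
    pos = (x - lo) / (hi - lo) * (len(lut) - 1)
    i = min(int(pos), len(lut) - 2)
    frac = pos - i
    return lut[i] * (1 - frac) + lut[i + 1] * frac

if __name__ == "__main__":
    gelu = lambda x: 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
    table = build_lut(gelu, -6.0, 6.0, entries=256)
    print(lut_eval(table, -6.0, 6.0, 1.5), gelu(1.5))   # table vs. exact
```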

7. Future Directions and Research Outlook

Anticipated advancements include:

  • Further automation of software-to-hardware mapping (e.g., folding of dynamic neural blocks, automated code synthesis for vectorized kernels).
  • Integration of advanced memory hierarchies, unified address translation across accelerators, and hardware-managed resource sharing.
  • Expansion of frameworks for neuromorphic and event-driven sensory processing, leveraging energy-efficient spiking neural architectures.
  • Enhanced benchmarking and simulation infrastructure to improve reproducibility and cross-platform comparability.
  • Adoption and abstraction of NPU-specific capabilities (e.g., power gating, dataflow scheduling, topology-aware virtualization) in mainstream ML compiler toolchains and programming models.

NPUs are poised for continued growth in diverse deployment environments, bridging the gap between efficiency and compute demand in AI inference workloads across cloud and edge settings, while research continues to refine their scheduling, energy management, memory systems, and programmability.