NPUKernelBench: NPU Benchmarking Framework
- NPUKernelBench is an open-source, extensible benchmarking framework for evaluating NPU kernels’ functional correctness, compilation success, and performance efficiency.
- It employs a layered architecture with front-end model conversion, calibration, and device-specific backend optimization to support varied NPU platforms.
- The framework reports detailed metrics such as latency, power, memory footprint, and vectorization scores to enable cross-platform performance analysis.
NPUKernelBench is an open-source, extensible benchmarking framework designed to facilitate rigorous, platform-agnostic evaluation of neural processing unit (NPU) kernels. Its principal focus is determining functional correctness, compilation success, and performance efficiency for both manually coded and automatically generated kernels across a wide range of NPU architectures and deployment paradigms. NPUKernelBench underpins multiple landmark studies, supporting model compilation, hardware- and software-based inference, and quantitative hardware analysis for ultra-low-power μNPU, LLM-powered code synthesis, and edge inference environments (Millar et al., 28 Mar 2025, Cao et al., 12 Jan 2026, Kalade et al., 18 Jul 2025, Wen et al., 20 Jul 2025, Jayanth et al., 2024).
1. Layered Architecture and Compilation Workflow
NPUKernelBench consists of distinct front-end, calibration, and backend layers, orchestrating model translation, quantization, and platform-specific kernel generation. The front-end ingests models in common representations (PyTorch, ONNX, TFLite-Micro), converts them to a quantized universal intermediate representation (IR), annotates the IR with bit-width and hardware tags, and performs fully post-training INT8 per-tensor calibration (Millar et al., 28 Mar 2025). The backend then delegates to platform toolchains, including ARM Vela (Ethos-U55), the Maxim SDK (MAX78000), NXP eIQ (MCXN947), GreenWaves GAP SDK (GAP8), CVITEK CVIMODEL (MILK-V Duo), and CMSIS-NN (STM32, ESP32). Each path yields (a) a hardware-optimized binary kernel image and (b) a C inference harness.
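The layered flow above can be sketched as a minimal pipeline: a front-end that lowers a model to an annotated, quantized IR, and a registry that dispatches the IR to a platform toolchain. All class, function, and backend names here are illustrative stand-ins, not the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class QuantizedIR:
    """Hypothetical universal IR: operator list plus quantization metadata."""
    ops: list
    bit_width: int = 8                      # post-training INT8 per-tensor
    hw_tags: dict = field(default_factory=dict)

BACKENDS = {}                               # platform name -> toolchain callable

def register_backend(name):
    def deco(fn):
        BACKENDS[name] = fn
        return fn
    return deco

@register_backend("ethos-u55")
def compile_vela(ir):
    # A real backend would shell out to ARM Vela; here we emit a stub
    # artifact with the two outputs the text describes.
    return {"binary": b"\x00", "harness": "main.c", "target": "ethos-u55"}

def frontend(model_ops):
    ir = QuantizedIR(ops=list(model_ops))
    ir.hw_tags = {op: "npu" for op in ir.ops}   # annotate offloadable ops
    return ir

def build(model_ops, target):
    ir = frontend(model_ops)
    return BACKENDS[target](ir)

artifact = build(["conv2d", "relu", "softmax"], "ethos-u55")
```

The registry pattern mirrors how new platform toolchains can be added without touching the front-end.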
At runtime, the driver initializes relevant NPU resources—processing-element arrays, on-chip SRAM, DMA channels—streams model weights (DMA/SRAM), launches kernel execution, and records cycles/power. Minimal CPU-side pre-/post-processing (softmax, NMS) is performed only for unsupported operators. This modular infrastructure enables rapid benchmarking across heterogeneous and resource-constrained systems (Millar et al., 28 Mar 2025).
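The runtime behaviour described above (initialize, launch, time, and fall back to the CPU only for unsupported operators) can be illustrated with a toy driver loop. The operator set and function names are hypothetical; timing stands in for the cycle/power counters a real driver would read.

```python
import time

# Illustrative set of NPU-supported operators; softmax/NMS fall back to CPU.
NPU_SUPPORTED = {"conv2d", "relu", "maxpool"}

def run_inference(ops, weights):
    """Stubbed driver loop: times the run and records CPU fallbacks."""
    # (Real driver: init PE array / SRAM / DMA, stream weights via DMA.)
    t0 = time.perf_counter()
    for op in ops:
        if op in NPU_SUPPORTED:
            pass   # dispatched to the NPU
        else:
            pass   # minimal CPU-side pre-/post-processing (softmax, NMS)
    latency_s = time.perf_counter() - t0
    cpu_fallbacks = [op for op in ops if op not in NPU_SUPPORTED]
    return latency_s, cpu_fallbacks

lat, fb = run_inference(["conv2d", "relu", "softmax"], weights=None)
```

Tracking which operators fell back to the CPU is what lets the framework attribute latency between NPU kernels and host-side processing.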
2. Quantitative and Mathematical Performance Metrics
NPUKernelBench employs precise formulas for latency, power, memory, and efficiency reporting, enabling direct hardware- vs. model-level comparison. For end-to-end inference spanning $S$ stages, total latency is $T_{\text{total}} = \sum_{i=1}^{S} t_i$. Power metrics include average and peak power ($\bar{P}$, $P_{\max}$), with average power approximated by the energy-to-duration ratio ($\bar{P} \approx E/\Delta t$). Memory footprint decomposes into code, data, and bss regions, constrained by platform SRAM limits (Millar et al., 28 Mar 2025).
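The latency, power, and memory formulas above reduce to a few lines of code; the helper names and the SRAM-limit check are illustrative, not the framework's reporting API.

```python
def total_latency(stage_latencies):
    """End-to-end latency: sum of per-stage latencies (seconds)."""
    return sum(stage_latencies)

def average_power(energy_j, duration_s):
    """Average power approximated as energy / duration (watts)."""
    return energy_j / duration_s

def memory_footprint(code_b, data_b, bss_b, sram_limit_b):
    """Total footprint across code/data/bss, checked against the SRAM cap."""
    total = code_b + data_b + bss_b
    return total, total <= sram_limit_b

# Example: three pipeline stages, 0.26 J consumed over the run.
lat_s = total_latency([0.002, 0.010, 0.001])
p_avg = average_power(0.26, lat_s)
footprint, fits = memory_footprint(48_000, 96_000, 12_000, 256_000)
```
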
Energy efficiency is measured as inferences per millijoule:

$$\eta = \frac{N_{\text{inferences}}}{E_{\text{total}}\,[\text{mJ}]}$$

Task-level and overall aggregation scores are provided. For ML-generated kernels, NPUKernelBench adopts additional metrics such as a vectorization score ($S_{\text{vec}}$), compilation and functional pass rates ($r_{\text{compile}}$, $r_{\text{pass}}$), and speedup relative to canonical baselines (Kalade et al., 18 Jul 2025, Cao et al., 12 Jan 2026, Wen et al., 20 Jul 2025).
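For the generated-kernel metrics, a common formulation of the pass rate is the unbiased pass@k estimator (generate n samples, c of which are correct); a minimal sketch under that assumption, with a speedup helper alongside:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n total (c correct) passes. Standard estimator from code-gen
    benchmarks; assumed here, not confirmed as this framework's exact metric."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def speedup(baseline_latency_s, kernel_latency_s):
    """Speedup of a generated kernel relative to the canonical baseline."""
    return baseline_latency_s / kernel_latency_s

# 4 samples, 2 correct: a single draw passes half the time.
p1 = pass_at_k(4, 2, 1)
sp = speedup(0.020, 0.008)
```
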
3. Benchmark Kernel Sets, Workloads, and Input Characterization
NPUKernelBench’s canonical dataset comprises 102 ML operators (NPUEval) and extends to 285 kernel tasks (MultiKernelBench), systematically spanning element-wise, linear algebra, convolutional, activation, reduction, broadcast, normalization, full-architecture, pooling, fusion, optimizer, and loss primitives (Kalade et al., 18 Jul 2025, Wen et al., 20 Jul 2025). Model coverage includes both synthetic (CIFAR10-NAS, SimpleNet, ResidualNet, YOLOv1_small, Autoencoder) and real-world tasks (MobileNetV2, LSTM, Transformer attention) (Millar et al., 28 Mar 2025, Jayanth et al., 2024).
Inputs are statically or dynamically shaped, with floating-point and integer types validated under hardware-representative numerical tolerances (tighter thresholds for full-precision float outputs, looser ones for NPUs computing in bfloat16). All calibration is performed post-training to INT8 per-tensor to ensure reproducibility and platform-neutral accuracy.
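Dtype-dependent tolerance checking of this kind can be sketched as follows; the threshold values below are illustrative placeholders, not the framework's actual settings.

```python
import numpy as np

# Illustrative per-dtype tolerances: exact match for integer outputs,
# looser absolute tolerance for reduced-precision bfloat16 results.
TOLERANCES = {"float32": 1e-4, "bfloat16": 1e-2, "int8": 0}

def verify(reference, candidate, dtype):
    """Compare candidate kernel output against the reference under the
    tolerance appropriate to the compute dtype."""
    tol = TOLERANCES[dtype]
    if tol == 0:
        return np.array_equal(reference, candidate)
    return np.allclose(reference, candidate, atol=tol)

ref = np.array([1.0, 2.0, 3.0])
ok_small = verify(ref, ref + 1e-5, "float32")   # within tolerance
ok_large = verify(ref, ref + 0.1, "float32")    # outside tolerance
```
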
4. Experimental Findings: Latency, Power, and Platform Scaling
Empirical results reveal pronounced divergence between vendor-claimed NPU throughput (GOPS/TFLOPS) and real-world benchmarks, attributable to memory-I/O bottlenecks, DMA setup overheads, and partial operator support (Millar et al., 28 Mar 2025). For instance, MAX78000’s stated 30 GOPS peak is rarely realized due to memory bandwidth constraints (<5 GOPS on some layers), while Ethos-U55 (HX-WE2) achieves raw inference superiority but suffers reduced energy efficiency from higher idle/max power and initialization cost. MCXN947 outpaces general MCUs in CNN inference for moderately sized inputs, but not in power draw, whereas MILK-V Duo provides optimal steady-state performance only if initialization is amortized.
Across client NPUs (Intel AIPC), NPUKernelBench establishes that matrix-vector tasks are 2.6× faster and 3× more energy-efficient on NPUs versus GPUs; converse findings hold for matrix-matrix and LSTM workloads, where GPU SIMD/fused pipelines dominate (Jayanth et al., 2024). Batch size, model complexity, and data precision are decisive for NPU advantage.
For LLM-driven kernel synthesis (AscendKernelGen, NPUEval, MultiKernelBench), compilation and pass rates are highly sensitive to prompt strategy, retrieval augmentation, and feedback integration. Baseline general-purpose LLMs yield ≤20% compilation success on complex NPU kernels versus 95.5% for domain-adaptive RL-tuned models (level-2) (Cao et al., 12 Jan 2026). Only 10% average vectorization is attained by most models, but carefully engineered prompting triples correctness and speedup (from 0% to >35% in category-aware setups) (Wen et al., 20 Jul 2025).
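The compiler-feedback integration discussed above follows a generate-compile-repair loop; a toy sketch, in which `generate` and `compile_kernel` are stand-ins for the LLM and the vendor toolchain:

```python
def generate(prompt):
    """Stand-in LLM: returns a repaired kernel once diagnostics appear
    in the prompt (simulating feedback-driven correction)."""
    return "kernel_v2" if "error" in prompt else "kernel_v1"

def compile_kernel(src):
    """Stand-in toolchain: the first attempt fails, the repair compiles."""
    if src == "kernel_v2":
        return True, ""
    return False, "error: unknown vector intrinsic"

def synthesize(task, max_rounds=3):
    """Iterate generation, feeding compiler diagnostics back into the prompt."""
    prompt = task
    for _ in range(max_rounds):
        src = generate(prompt)
        ok, diag = compile_kernel(src)
        if ok:
            return src
        prompt = task + "\n" + diag   # retrieval/feedback augmentation point
    return None

result = synthesize("write a vector add kernel")
```

The loop structure, rather than the stubbed model, is the point: appending diagnostics to the prompt is what lifts compilation rates in the feedback-integrated setups the text describes.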
5. Design Insights, Challenges, and Manufacturer Comparison
Critical architectural findings show the importance of (a) memory hierarchy and DMA scaling, (b) flexible operator fusion (eliminating CPU fallback), (c) aggressive weight-stationary dataflows, and (d) tailored initialization/wake paths for low-power duty-cycling (Millar et al., 28 Mar 2025). Under practical benchmarking, discrepancies emerge: advertised peak compute (GOPS) masks substantial inefficiency due to limited I/O, on-chip contention, and non-uniform operator support. Roofline analysis bounds task performance by arithmetic intensity: NPU gains manifest for memory-bound kernels, while GPUs excel when compute-bound.
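The roofline bound reduces to taking the minimum of the compute ceiling and the bandwidth-scaled arithmetic intensity; the peak/bandwidth numbers below are synthetic, chosen only to illustrate the memory-bound vs. compute-bound regimes.

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, arithmetic_intensity):
    """Attainable performance = min(peak compute, bandwidth * intensity),
    where intensity is ops per byte moved."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * arithmetic_intensity)

PEAK = 30e9     # synthetic 30 GOPS compute ceiling
BW = 1e9        # synthetic 1 GB/s memory bandwidth

# Low intensity (memory-bound): bandwidth, not peak GOPS, sets performance.
mem_bound = roofline(PEAK, BW, 2)
# High intensity (compute-bound): performance saturates at the peak.
compute_bound = roofline(PEAK, BW, 64)
```

This is exactly why advertised peak GOPS can mask inefficiency: a low-intensity kernel never leaves the bandwidth-limited slope of the roofline.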
In ML kernel generation, scarcity of NPU-DSL code in pretraining data produces high rates of hallucinated APIs, type mismatches, and poor interface compliance, affecting compilation rates (Kalade et al., 18 Jul 2025, Wen et al., 20 Jul 2025). Only prompt engineering (retrieval-augmented, category-aware, iterative compiler feedback) restores partial coverage on underexposed platforms, indicating future research must target data enrichment and DSL-specific fine-tuning.
6. Practical Usage Guidelines and Extensibility
NPUKernelBench is released under a permissive open-source license, with detailed reproducibility steps published for AscendKernelGen and NPUEval (Cao et al., 12 Jan 2026, Kalade et al., 18 Jul 2025). Core interfaces support modular backend registration, Python-driven evaluation, and task–shape parameterization. For new platforms, backend abstraction supports seamless device initialization, host-device code assembly, tiling, compilation, execution, output verification, and metric reporting.
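The backend abstraction described above can be approximated with an abstract base class whose concrete stages (initialize, compile, execute, verify, report) each platform overrides; method and class names are illustrative, not the framework's real interface.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Hypothetical per-platform interface: subclasses supply device init,
    compilation, and execution; the base class drives evaluation."""

    @abstractmethod
    def init_device(self): ...
    @abstractmethod
    def compile(self, kernel_src, shapes): ...
    @abstractmethod
    def execute(self, binary, inputs): ...

    def evaluate(self, kernel_src, shapes, inputs, reference):
        """Shared pipeline: init -> compile -> execute -> verify -> report."""
        self.init_device()
        binary = self.compile(kernel_src, shapes)
        out = self.execute(binary, inputs)
        return {"correct": out == reference, "output": out}

class EchoBackend(Backend):
    """Trivial backend for testing the harness wiring itself."""
    def init_device(self): pass
    def compile(self, kernel_src, shapes): return kernel_src
    def execute(self, binary, inputs): return inputs

report = EchoBackend().evaluate("kernel", (4,), [1, 2], [1, 2])
```

Registering a new platform then means implementing the three abstract methods, with tiling and metric reporting slotted into the shared pipeline.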
Researchers are advised to (a) profile CPU pre-/post-processing (e.g., small NPUs may bottleneck on softmax, NMS), (b) distinguish continuous vs. intermittent inference workloads, (c) balance batch size and model complexity for maximal energy efficiency (inferences per millijoule), and (d) adopt roofline and memory-trace profiling to guide offload strategies (Millar et al., 28 Mar 2025, Jayanth et al., 2024). Prompt engineering for LLMs should employ in-category exemplars and compiler feedback loops to boost compilation success and correctness (Cao et al., 12 Jan 2026, Wen et al., 20 Jul 2025).
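Guideline (c) amounts to a sweep over batch sizes, selecting the one maximizing inferences per millijoule; the energy profile below is synthetic, purely to show the selection logic.

```python
# Synthetic profile: batch size -> (energy_mJ, inferences completed).
profiles = {1: (0.40, 1), 4: (1.00, 4), 16: (5.00, 16)}

def efficiency(energy_mj, n_inferences):
    """Energy efficiency in inferences per millijoule."""
    return n_inferences / energy_mj

best_batch = max(profiles, key=lambda b: efficiency(*profiles[b]))
```

Here batch 4 wins (4 inf/mJ) over batch 1 (2.5) and batch 16 (3.2), illustrating that the optimum need not be the largest batch the device accepts.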
7. Research Outlook and Future Directions
Ongoing efforts prioritize (a) extending NPUKernelBench to additional architectures (Graphcore, Qualcomm Hexagon, Apple NPU), (b) expanding calibration/quantization flows for >8-bit, mixed-precision, and sparsity-aware kernels, (c) dataset augmentation with vendor DSL instances and API documentation, (d) integrating chain-of-thought reasoning and RL signals for adaptive kernel code synthesis, and (e) advancing agentic multi-turn workflows toward human-expert code quality (Kalade et al., 18 Jul 2025, Cao et al., 12 Jan 2026, Wen et al., 20 Jul 2025).
A plausible implication is that advances in NPU benchmarking and LLM-aided design depend as much on curated domain examples, DSL-centric fine-tuning, and prompt methodology as on hardware itself. Systematic evaluation frameworks such as NPUKernelBench are vital to the reproducible optimization and cross-platform benchmarking needed for next-generation edge AI and low-power inference.