NPUEval Benchmark: Evaluating NPUs & PU Learning
- NPUEval Benchmark is a standardized evaluation framework that rigorously assesses NPU kernel generation, ultra-low-power inference, and positive-unlabeled learning with controlled methodologies.
- It employs strict protocols including open-source implementations, iterative LLM feedback, and precise metrics such as vectorization scores, latency, and proxy accuracy to ensure fair comparisons.
- The benchmark fosters future research by providing transparent evaluation criteria, robust calibration techniques, and standardized metrics for both hardware performance and algorithmic effectiveness.
NPUEval Benchmark is a standardized framework for evaluating the performance and reliability of algorithms and systems within domains such as NPU kernel generation, ultra-low-power neural inference, and positive-unlabeled (PU) learning. The NPUEval designation has been used for multiple specialized benchmarks, each tailored to provide rigorous, accessible, and fair comparisons by controlling for experimental inconsistencies and ensuring realistic evaluation conditions.
1. Purpose and Scope
NPUEval Benchmark encompasses several implementations targeting key areas:
- NPU Kernel Optimization: A dataset and protocol for assessing LLM-generated NPU kernels, focusing on vectorization and functional correctness (“NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers” (Kalade et al., 18 Jul 2025)).
- Ultra-Low-Power Inference: A platform to compare commercially available NPUs, dissecting both hardware and software properties for on-device AI workloads (“Benchmarking Ultra-Low-Power NPUs” (Millar et al., 28 Mar 2025)).
- Positive-Unlabeled Learning: A unified benchmark for systematic comparison of PU learning algorithms, with a particular emphasis on accessible, realistic, and fair evaluation across problem settings (“Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms” (Wang et al., 29 Sep 2025)).
NPUEval Benchmarks are characterized by rigorous protocol designs, explicit evaluation metrics, transparent calibration procedures, and the commitment to open-source code releases for reproducibility and extensibility.
2. Methodological Framework
NPU Kernel Generation
The kernel generation benchmark comprises 102 common ML operator kernels. LLMs are prompted to generate self-contained C++ code for AIE (AI Engine) tiles, using strict system prompts that enforce canonical code style and correct function naming. Evaluation is performed using an open-source LLVM-AIE compiler targeting AMD NPUs.
Metrics:
- Functional Correctness: Kernel output on hardware is compared to reference implementations, with absolute error thresholds set (e.g., 1e-2 for common ops).
- Vectorization Score: Fraction of cycles spent in vector (VPU) instructions vs. total cycles, computed as VPU cycles / total cycles (a minimal computation sketch follows below).
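A minimal sketch of how these two metrics might be computed from a single hardware run, assuming NumPy arrays for outputs and a profiler report that exposes per-class cycle counts; the field names and the example numbers are illustrative, not the benchmark's actual API:

```python
import numpy as np

def functional_correctness(kernel_out, reference_out, atol=1e-2):
    """Pass/fail check: kernel output vs. reference within an absolute error threshold."""
    return np.allclose(kernel_out, reference_out, atol=atol)

def vectorization_score(profile):
    """Fraction of total execution cycles spent in vector (VPU) instructions.

    `profile` is assumed to look like {"vpu_cycles": int, "total_cycles": int},
    extracted from a cycle-accurate trace; the schema is illustrative.
    """
    return profile["vpu_cycles"] / profile["total_cycles"]

# Example: a kernel that is numerically correct but poorly vectorized.
ref = np.linspace(0.0, 1.0, 256, dtype=np.float32)
out = ref + 5e-3  # within the 1e-2 tolerance
print(functional_correctness(out, ref))                                   # True
print(vectorization_score({"vpu_cycles": 1200, "total_cycles": 11500}))   # ~0.10
```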
Iterative compilation with compiler feedback and retrieval-augmented generation from open-source examples augment the LLM responses, aiming to improve both correctness and hardware-oriented code optimization (the feedback loop is sketched below).
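The compile-feedback loop can be sketched as follows; `llm_complete` and `compile_kernel` are hypothetical stand-ins for the model API and the LLVM-AIE toolchain invocation, and the message format is an assumption:

```python
MAX_ATTEMPTS = 10  # the benchmark allows up to 10 recompilation attempts

def generate_kernel(prompt, llm_complete, compile_kernel):
    """Iterative refinement: feed compiler error logs back to the LLM.

    `llm_complete(messages) -> str` and `compile_kernel(src) -> (ok, log)` are
    hypothetical placeholders for the model call and the compiler invocation.
    """
    messages = [{"role": "user", "content": prompt}]
    src = llm_complete(messages)
    for _ in range(MAX_ATTEMPTS):
        ok, log = compile_kernel(src)
        if ok:
            return src            # compiles; functional/vectorization checks follow
        messages += [
            {"role": "assistant", "content": src},
            {"role": "user", "content": f"Compilation failed:\n{log}\nPlease fix the kernel."},
        ]
        src = llm_complete(messages)
    return None                   # give up once the attempt budget is exhausted
```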
Ultra-Low-Power NPU Evaluation
For NPUs, the NPUEval Benchmark standardizes multi-stage evaluation:
- Stages: NPU initialization, memory input/output, core inference, CPU post-processing, and idle background power are measured separately.
- Metrics: Latency, power consumption, energy efficiency (“inferences per mJ”), and memory footprint (from linker .map files).
Model compilation is facilitated by an open-source workflow that translates canonical Torch/ONNX/TFLite models to platform-specific binaries with uniform optimization and quantization procedures, eliminating confounding variables between hardware platforms.
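A simplified view of how stage-wise measurements might be folded into the headline efficiency metric, assuming per-stage latency and average power are already available; the stage names, unit bookkeeping, and numbers are illustrative, and the handling of the separately measured idle baseline is omitted here:

```python
from dataclasses import dataclass

@dataclass
class StageMeasurement:
    name: str          # e.g. "npu_init", "mem_io", "inference", "cpu_post"
    latency_ms: float  # wall-clock time for the stage
    power_mw: float    # average power draw during the stage

def inferences_per_mj(stages, include_init=True):
    """Energy efficiency in inferences per millijoule for a single inference.

    Energy per stage (mJ) = power (mW) * latency (ms) / 1000.
    Excluding init models continuous operation, where setup cost is amortized.
    """
    total_mj = sum(
        s.power_mw * s.latency_ms / 1000.0
        for s in stages
        if include_init or s.name != "npu_init"
    )
    return 1.0 / total_mj

# Illustrative (not measured) numbers for one hypothetical platform:
stages = [
    StageMeasurement("npu_init", 40.0, 30.0),
    StageMeasurement("mem_io", 5.0, 25.0),
    StageMeasurement("inference", 2.0, 60.0),
    StageMeasurement("cpu_post", 0.5, 20.0),
]
print(inferences_per_mj(stages))                      # includes one-off init cost
print(inferences_per_mj(stages, include_init=False))  # continuous-operation view
```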
Positive-Unlabeled Learning
The PU learning benchmark provides a unified PyTorch codebase containing implementations of 17 representative PU methods, with rigorous control of warm-ups, data augmentation, and algorithmic variations.
Problem Settings:
- Two-Sample (TS): Positive examples are drawn from the positive class-conditional distribution p(x | y = +1); unlabeled examples are drawn independently from the marginal p(x).
- One-Sample (OS): The unlabeled set is sampled first from the marginal p(x); each positive in it is then “revealed” (labeled) independently with a fixed label probability c (see the sampling sketch below).
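A small NumPy sketch of the two sampling schemes, using a 1-D Gaussian mixture as a stand-in dataset; the class prior pi and label probability c follow the (assumed) notation in the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_marginal(n, pi):
    """Draw n examples from the marginal p(x): positives at N(+2,1), negatives at N(-2,1)."""
    y = rng.random(n) < pi                       # latent true labels
    x = np.where(y, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))
    return x, y

def two_sample(n_p, n_u, pi):
    """TS: positives drawn from p(x | y = +1); unlabeled drawn independently from p(x)."""
    x_p = rng.normal(2.0, 1.0, n_p)              # positive-conditional draws
    x_u, _ = sample_marginal(n_u, pi)            # independent marginal draws
    return x_p, x_u

def one_sample(n, pi, c):
    """OS: sample from p(x) first, then reveal each positive with probability c."""
    x, y = sample_marginal(n, pi)
    revealed = y & (rng.random(n) < c)           # labeled positives
    return x[revealed], x[~revealed]             # positive set, remaining unlabeled set

x_p, x_u = one_sample(10_000, pi=0.4, c=0.5)
# The remaining unlabeled set now has a lower positive proportion than pi:
# the internal label shift that the calibration procedure below corrects for.
print(len(x_p), len(x_u))
```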
Evaluation Metrics:
- Proxy Accuracy (PA): For TS, an accuracy surrogate computed from positive and unlabeled validation data only, without access to labeled negatives.
- Proxy AUC (PAUC): Area under the ROC curve using only positive and unlabeled validation data.
- Oracle Accuracy (OA): Used only for retrospective analysis (true labels available).
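A sketch of how such proxy metrics can be computed from positive and unlabeled validation splits alone; the paper's exact definitions may differ, and here both proxies simply treat unlabeled samples as negatives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def proxy_accuracy(pred_p, pred_u):
    """PA sketch: accuracy against a proxy labeling that treats unlabeled as negative.

    pred_p / pred_u are hard predictions (+1 / -1) on the positive and unlabeled
    validation sets; this is one common surrogate, not necessarily the paper's PA.
    """
    pred_p, pred_u = np.asarray(pred_p), np.asarray(pred_u)
    correct = np.sum(pred_p == 1) + np.sum(pred_u == -1)
    return correct / (pred_p.size + pred_u.size)

def proxy_auc(score_p, score_u):
    """PAUC: ROC AUC with positives ranked against unlabeled samples as proxy negatives."""
    scores = np.concatenate([np.asarray(score_p), np.asarray(score_u)])
    proxy_labels = np.concatenate([np.ones(len(score_p)), np.zeros(len(score_u))])
    return roc_auc_score(proxy_labels, scores)

# Example with illustrative classifier scores (higher = more positive).
print(proxy_auc([2.1, 0.8, 1.5], [0.4, -1.2, 1.9, -0.3]))
```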
A calibration procedure for internal label shift (ILS) is introduced for the OS setting, reweighting loss terms according to the effective positive proportion remaining in the unlabeled data (derived below).
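Under the OS sampling story above, the effective positive proportion left in the unlabeled data can be derived directly; this is a sketch with assumed notation (class prior \pi, label probability c), not a formula quoted from the paper:

```latex
% Assumed notation: \pi = P(y = +1) is the class prior, c the per-positive
% "reveal" probability from the OS setting above.
% Out of n draws from p(x), about c\pi n positives are revealed and removed;
% (1 - c\pi) n samples remain unlabeled, of which (1 - c)\pi n are still
% positive. The effective positive proportion of the unlabeled data is thus
\[
  \tilde{\pi} \;=\; \frac{(1 - c)\,\pi}{1 - c\,\pi} \;<\; \pi
  \qquad \text{for } c \in (0, 1],\; \pi \in (0, 1),
\]
% so TS-style risk estimators that assume positive proportion \pi in the
% unlabeled data are biased on OS data unless their loss terms are reweighted
% toward \tilde{\pi}.
```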
3. Implementation Protocols and Calibration
Each NPUEval instantiation provides open-source implementations with strict reproducibility guarantees:
- Kernel Evaluation: Automated post-processing removes extraneous boilerplate from LLM outputs. Up to 10 recompilation attempts feed compiler error logs back to the model for iterative refinement.
- NPU Benchmark: Template code and standardized quantization avoid platform-specific optimizations. All measurements are averaged over multiple runs, with full breakdowns for stage-wise resource usage.
- PU Learning Calibration: During each mini-batch, the positive batch is augmented into the unlabeled batch when calculating estimated negative risk, counterbalancing the internal label shift (see Algorithm 1 in (Wang et al., 29 Sep 2025)).
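The batch-augmentation step in the last bullet can be illustrated with a PyTorch-style risk estimator in the spirit of nnPU; this is a sketch of the idea, not a transcription of the paper's Algorithm 1, and the loss shape and `prior` argument are assumptions:

```python
import torch

def calibrated_pu_risk(model, x_p, x_u, prior):
    """Non-negative PU-style risk with the calibration step sketched above:
    the positive mini-batch is appended to the unlabeled mini-batch when the
    U-side (treated-as-negative) risk is estimated, so that pool approximates
    a draw from the marginal p(x) rather than the positive-depleted OS
    unlabeled data.

    Illustration in the spirit of nnPU, not the paper's Algorithm 1;
    `prior` is the (assumed known) class prior pi.
    """
    def sigmoid_loss(logits, target_sign):
        # sigmoid surrogate loss: l(z) = sigmoid(-target_sign * z)
        return torch.sigmoid(-target_sign * logits).mean()

    logits_p = model(x_p).squeeze(-1)
    logits_aug = model(torch.cat([x_u, x_p], dim=0)).squeeze(-1)  # augmented U-pool

    risk_p_pos = sigmoid_loss(logits_p, +1.0)      # positives scored as positive
    risk_u_neg = sigmoid_loss(logits_aug, -1.0)    # augmented pool scored as negative
    risk_p_neg = sigmoid_loss(logits_p, -1.0)      # correction for positives in the pool
    neg_risk = risk_u_neg - prior * risk_p_neg
    return prior * risk_p_pos + torch.clamp(neg_risk, min=0.0)  # non-negative correction

# Usage sketch (hypothetical tensors/model):
#   loss = calibrated_pu_risk(model, x_p_batch, x_u_batch, prior=0.4)
#   loss.backward()
```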
The emphasis on protocol fidelity ensures that experimentally observed differences reflect real model or hardware properties, not artifacts of inconsistent methodology.
4. Comparative Insights and Benchmarking Outcomes
Kernel Optimization
Experiments reveal that LLMs frequently pass functional correctness checks but struggle to achieve high vectorization scores. For example, vectorization rates average only ~10% across the dataset, with occasional outliers (e.g., DeepSeek R1 achieving 50%+ on select kernels, but not consistently). Compilers are sensitive to API usage: hallucinated pragmas or vector interfaces often yield compilation failures or suboptimal hardware utilization.
Ultra-Low-Power Inference
The NPU benchmark demonstrates both expected and surprising scaling trends:
- MAX78000 exhibits high energy efficiency but is bottlenecked by memory I/O overhead.
- HX-WE2 offers lower latency but at a higher power draw.
- MILK-V is optimal for continuous operation (when initialization cost is excluded).
- General-purpose MCUs lag the dedicated NPUs by 1–2 orders of magnitude in energy efficiency.
Crucially, theoretical compute capacity (GOPs) does not reliably predict practical performance, highlighting the need for empirical benchmarks.
PU Learning Algorithms
The benchmark clarifies that many PU methods perform comparably after calibration, with no single method dominating in all settings. The internal label shift problem—if uncorrected—can considerably bias TS algorithms on OS data, but calibration via proxy risk reweighting restores fairness. PA and PAUC provide accessible selection criteria in the absence of labeled negatives.
5. Challenges, Calibration Pitfalls, and Future Directions
- Fragmented NPU Ecosystem: Limited domain-specific kernel data hinders LLM code generation; backend-specific RAG and compiler feedback are avenues for improvement.
- Memory I/O Bottlenecks: For NPUs, memory throughput and efficient weight-loading mechanisms must be prioritized in future hardware and compilation research.
- Evaluation Pitfalls: Previous PU learning studies often used negative-labeled validation sets, which NPUEval demonstrates are unrealistic—proxy metrics based only on positives and unlabeled samples should be adopted.
Planned extensions include:
- Expanding NPUEval to additional NPU architectures, toolchains, and operator classes (e.g., transformers).
- Formalizing simulation tools for ultra-low-power inference latency and energy prediction.
- Extending calibration theory and proxy validation metrics to large-scale real-world weakly supervised datasets.
6. Open Source Release and Reproducibility
NPUEval Benchmarks—across all domains—publish datasets, evaluation codebases, and hardware configurations under permissive open-source licenses (e.g., LLVM-AIE compiler fork, MLIR-AIE integration, Python bindings for NPUs). This democratizes benchmarking, makes results reproducible on commodity hardware, and fosters collaborative development.
Researchers can replicate experiments, contribute new kernels or operators, adapt evaluation flows to new platforms, and benchmark emerging models against standardized, calibrated protocols—advancing both NPU and PU algorithm research.
7. Significance and Impact
NPUEval Benchmarks provide a unified framework to assess algorithmic and hardware progress in specialized domains traditionally lacking standard evaluation environments. The meticulous control of protocol details, calibrated risk estimation, and open-source implementations address systemic weaknesses in prior benchmarking, setting a new standard for accessible, realistic, and fair evaluation. These benchmarks inform hardware design, software development, and statistical learning advancements, and are positioned as foundational resources for future research in kernel optimization, resource-constrained inference, and weakly supervised learning.