NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers (2507.14403v1)

Published 18 Jul 2025 in cs.PL

Abstract: Neural processing units (NPUs) are gaining prominence in power-sensitive devices like client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. Unlike GPU programming, which has had years to mature, NPU programming is new, with smaller and more fragmented developer communities across hardware platforms. This fragmentation poses a challenge when utilizing LLMs to assist in writing NPU kernels, as domain-specific optimized code examples are underrepresented in LLM pre-training data. In this paper we introduce NPUEval -- a benchmark for writing and evaluating NPU kernels, consisting of 102 common operators for machine learning workloads. We evaluate LLM generated code on actual hardware based on both functional correctness and vectorization efficiency using open source compiler tools targeting the AMD NPU. We evaluate a range of state-of-the-art LLMs with a mix of proprietary and open-weight models. The latest reasoning models, like DeepSeek R1, show promising results, achieving out-of-the-box 50%+ vectorization on select kernels. However, the average score across the entire dataset remains roughly 10% even with compiler feedback and vectorized kernel examples -- showing that this is a challenging dataset even for frontier models. The dataset and evaluation code will be released with a permissive open source license, providing an essential benchmark for advancing research in code generation and NPU kernel optimization.

Summary

  • The paper introduces NPUEval, a benchmark that evaluates LLM-generated NPU kernel code based on performance, vectorization, and correctness.
  • It demonstrates that advanced LLMs can generate vectorized code, though compiler feedback and RAG techniques are essential to improve efficiency.
  • The evaluation framework uses open-source compilers and detailed performance metrics, highlighting the challenges in optimizing kernels for NPUs.

"NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers"

Introduction

The paper, "NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers," addresses the challenge of generating optimized neural processing unit (NPU) kernels using LLMs and explores the efficacy of these models in code generation specific to NPUs. With NPUs becoming critical components in power-sensitive devices, efficient kernel codes are essential to leverage their full potential. However, unlike GPUs, NPU programming lacks extensive developer resources and mature software ecosystems, posing unique challenges.

NPUEval Benchmarking

NPUEval is introduced as a benchmark designed to evaluate LLM-generated kernel code for AMD NPUs, focusing on functional correctness and vectorization efficiency. The dataset consists of 102 common operators used in machine learning workloads. Evaluation is performed using open-source compilers targeting AMD NPUs. The primary objective is to assist in advancing research in code generation and NPU kernel optimization by providing a public dataset and a comprehensive evaluation framework.
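
To make the task concrete, each benchmark entry pairs a kernel prompt with reference behavior and test vectors (see the evaluation framework below). The following is a hypothetical illustration of what an operator prompt might look like; the function name, signature, and comment are invented for this sketch, not taken from the dataset. The model must supply the body, ideally vectorized:

```cpp
#include <cstdint>

// Hypothetical NPUEval-style prompt: apply ReLU (max(x, 0)) elementwise to
// `in` and write the result to `out`; `n` is the element count. A naive
// scalar body like the one below compiles and passes functional tests but
// earns a poor vectorization score.
void relu_int16(const int16_t *in, int16_t *out, uint32_t n) {
    for (uint32_t i = 0; i < n; ++i)
        out[i] = in[i] > 0 ? in[i] : 0;
}
```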

Code Generation with LLMs

The paper evaluates a range of state-of-the-art LLMs, both proprietary and open-weight, on kernel generation. Initial results indicate that advanced reasoning models like DeepSeek R1 can achieve strong vectorization on select kernels out of the box, yet the average vectorization score across the dataset remains around 10%, underscoring how challenging the benchmark is even for frontier models.

Vectorization and Performance Metrics

Vectorization is a critical component of optimizing kernel performance on NPUs. The paper provides a detailed example comparing scalar and vectorized kernel implementations, showcasing the significant performance benefits of vectorized code. The evaluation criteria for the generated kernels include compilation success, functional correctness, and performance metrics, particularly vectorization scores (Figure 1).

Figure 1: Vectorization results
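
The contrast the paper draws can be sketched as follows, assuming the AMD AIE API (aie_api) available in the open-source toolchain; the exact vector width, element types, and function names depend on the AIE generation and toolchain version, so treat this as illustrative rather than as code from the paper. The scalar version processes one element per iteration, while the vectorized version uses the 512-bit vector unit to process 32 int16 lanes at a time:

```cpp
#include <cstdint>
#include <aie_api/aie.hpp>  // AMD AIE API; assumed available in the toolchain

// Scalar elementwise add: functionally correct, but leaves the VPU idle.
void add_scalar(const int16_t *a, const int16_t *b, int16_t *out, uint32_t n) {
    for (uint32_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Vectorized elementwise add: 32 int16 lanes (512 bits) per iteration.
// Assumes n is a multiple of 32 and the buffers are suitably aligned.
void add_vector(const int16_t *a, const int16_t *b, int16_t *out, uint32_t n) {
    for (uint32_t i = 0; i < n; i += 32) {
        aie::vector<int16_t, 32> va = aie::load_v<32>(a + i);
        aie::vector<int16_t, 32> vb = aie::load_v<32>(b + i);
        aie::store_v(out + i, aie::add(va, vb));
    }
}
```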

Evaluation Framework

The evaluation harness integrates behavioral models and test vectors, and uses the LLVM-AIE compiler, which is adapted specifically for AIE kernel programming. The framework runs the generated kernels, compares outputs against expected results, and measures cycle-accurate performance metrics to scrutinize VPU utilization (Figure 2).

Figure 2: Overview of NPUEval evaluation pipeline.
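
The functional-correctness step amounts to an elementwise comparison between the kernel's hardware output and the behavioral model's reference output. A minimal host-side sketch follows, assuming a float output buffer and an illustrative tolerance; neither the function name nor the tolerance is taken from NPUEval's actual harness:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative correctness check: true if every output element is within
// `atol` of the behavioral model's reference. The tolerance value is an
// assumption for this sketch, not NPUEval's setting.
bool outputs_match(const std::vector<float> &out,
                   const std::vector<float> &ref, float atol = 1e-4f) {
    if (out.size() != ref.size()) return false;
    for (std::size_t i = 0; i < out.size(); ++i)
        if (std::fabs(out[i] - ref[i]) > atol) return false;
    return true;
}
```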

Enhancements through RAG and Compiler Feedback

The paper employs compiler feedback and retrieval-augmented generation (RAG) to improve LLM outputs. RAG injects vectorized kernel examples into the prompt to guide models toward better-optimized code, while compiler feedback reprompts the model with diagnostics from failed builds. The paper demonstrates that incorporating these methods substantially enhances the functional correctness of the generated kernels across various models (Figure 3).

Figure 3: Test pass rate as number of recompilations increases.
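
Structurally, the feedback loop can be pictured as below. The callables standing in for the LLM call and the LLVM-AIE invocation are hypothetical, as is the retry budget; this is a sketch of the general technique, not NPUEval's actual code:

```cpp
#include <functional>
#include <string>

// Result of one compile attempt (illustrative type, not NPUEval's API).
struct CompileResult { bool ok; std::string diagnostics; };

// Hedged sketch of a compiler-feedback loop: `llm` stands in for the model
// call (with any RAG-retrieved kernel examples already prepended to the
// prompt) and `compile` for the LLVM-AIE invocation. Each failed build
// feeds its diagnostics back into the next prompt.
std::string generate_with_feedback(
        std::string prompt, int max_retries,
        const std::function<std::string(const std::string &)> &llm,
        const std::function<CompileResult(const std::string &)> &compile) {
    std::string src = llm(prompt);
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        CompileResult r = compile(src);
        if (r.ok) break;  // builds cleanly; hand off to hardware tests
        prompt += "\nThe previous attempt failed to compile:\n" + r.diagnostics;
        src = llm(prompt);
    }
    return src;
}
```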

Results and Observations

The LLM evaluation indicates that smaller models often default to scalar solutions that are syntactically correct but inefficient. In contrast, stronger models attempt more sophisticated vectorized implementations but face challenges with API knowledge and hallucinations. Correctness improved with successive compiler feedback, highlighting its importance in the LLM-driven code generation pipeline (Figure 4).

Figure 4: Functional correctness results.

Conclusion and Future Directions

"NPUEval" establishes a vital benchmark for assessing LLM capabilities in generating NPU kernel code and highlights the need for more refined techniques to handle specialized hardware programming challenges. Future work involves extending the benchmark to other NPU architectures and refining the RAG process to accommodate compiler-specific nuances, which would further enhance the applicability and effectiveness of LLMs in this domain.
