
Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface (2504.18430v1)

Published 25 Apr 2025 in cs.SE

Abstract: Accelerators such as neural processing units (NPUs) deliver an enticing balance of performance and efficiency compared to general purpose compute architectures. However, effectively leveraging accelerator capabilities is not always simple: low-level programming toolkits may require substantial developer effort while high-level programming toolkits may abstract critical optimization features. This work aims to increase efficiency of designers using IRON, a toolkit for close-to-metal NPU performance engineers. We provide an updated programmer interface to IRON containing new and refined programming constructs. The new interface includes extensible features for placement and data transformation. These contributions are evaluated in terms of 1) efficiency, with analysis showing ~26% average reduction in lines of code and decreases in Halstead metrics for a variety of designs; 2) expressivity, demonstrating the new interface supports the wide range of features and patterns already supported by IRON; and 3) extensibility, illustrating the new tooling for placement and tiling can be extended to accommodate common use-cases.

Summary

  • The paper introduces a new Python API for IRON that defers MLIR generation, reducing code duplication and cutting design complexity by an average of 26% in SLOC.
  • The paper refines core data movement and compute constructs—such as ObjectFifo, Worker, Runtime, and Program—to enhance readability and maintain full expressivity.
  • The paper develops extensible interfaces, including a custom placement tool and the taplib library, enabling intuitive DMA transformations and easier automated design space exploration.

Programming modern neural processing units (NPUs) like the AMD XDNA™ presents a challenge for developers. While high-level frameworks abstract hardware details, they may hide critical optimization opportunities. Conversely, low-level toolkits, while offering fine-grained control necessary for performance tuning, can be complex and require significant developer effort. This paper introduces contributions to IRON, an open-source, close-to-metal toolkit for AMD XDNA™ NPUs, aiming to improve programmer efficiency, maintain expressivity, and enhance extensibility. (2504.18430)

The core problem addressed is the inherent tension in low-level NPU programming interfaces between ease of use (designer efficiency) and the necessity to expose nuanced hardware capabilities (expressivity). The paper proposes an updated programming interface for IRON implemented as a new Python API layered above the existing MLIR-based interface (mlir-aie).

Key contributions and their practical implications include:

  1. New Top-Level Python API: This API creates a layer of abstraction, deferring the generation of the underlying MLIR operations until a resolve function is called. This allows for Python objects in the API to be constructed without immediate constraints imposed by MLIR's structure, reducing information duplication (e.g., specifying placement multiple times) and simplifying design representation.
  2. Refined ObjectFifo API: The new API simplifies the declaration and use of ObjectFifos, which are critical for managing explicit data movement on NPUs. Default values (like depth for ping-pong buffering) are introduced, endpoint inference is supported, and the need to explicitly specify ObjectFifoPort (Consume/Produce) during acquire and release is eliminated. New methods like forward, split, and join are provided on ObjectFifoHandles to simplify expressing complex data movement patterns like L2 buffering or splitting/joining data streams across memory tiles.
  3. New Constructs: Worker, Runtime, and Program:
    • The Worker construct separates the definition of compute logic (core_fn) from its configuration and arguments (fn_args), improving code structure and facilitating metaprogramming (adapting kernels based on data types, dimensions, etc.).
    • The Runtime construct provides a clearer interface for defining the sequence of operations executed by the host processor (e.g., start for workers, fill and drain for L3-to-ObjectFifo data transfers using DMAs). An inline_ops method allows experts to insert custom MLIR operations when needed.
    • The Program construct composes a Runtime sequence with a specific Device (NPU type) and, optionally, a Placer to generate the final MLIR design.
  4. Extensible Placement Interface: The new API introduces a Placeable interface for design components and a Placer argument for the Program.resolve_program method. This allows designers to either manually specify placement or use a custom Placer implementation to automatically assign design constructs to physical tiles (AIE, Memory, Shim). This addresses the tedious nature of manual placement and enables the creation of algorithmic placement tools without requiring deep compiler engineering knowledge.
  5. taplib for Data Transformations: A new library, taplib, is introduced to provide a more intuitive way to express complex on-the-fly data transformations performed by DMAs. It defines TensorAccessPattern (tap) and TensorAccessSequence (tas) objects constructed from tensor dimensions, sizes, strides, and offsets. taplib includes tools for reasoning about these patterns, including visualizations (heat maps showing access order and count) and programmatic analysis (e.g., checking access count sums or order maximums). It introduces the concept of "access equivalence" for patterns that generate identical access maps, which is crucial for NPU DMAs with varying constraints. The Runtime's fill and drain methods can now accept a tap object instead of raw sizes and strides.
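The deferred-resolution style described in items 1 and 3 can be mocked in a few lines of plain Python. This is an illustrative sketch only: the class and method names (Worker, Runtime, Program, resolve) mirror the paper's constructs, but the bodies are stand-ins, not the actual IRON API.

```python
# Plain-Python mock of the deferred-lowering pattern: objects are built
# freely, and nothing analogous to MLIR is produced until resolve().
class Worker:
    """Pairs compute logic (core_fn) with its arguments; emits nothing yet."""
    def __init__(self, core_fn, fn_args):
        self.core_fn = core_fn
        self.fn_args = fn_args

class Runtime:
    """Records host-side operations (e.g., start) for later resolution."""
    def __init__(self):
        self.ops = []
    def start(self, worker):
        self.ops.append(("start", worker))

class Program:
    """Composes a runtime with a device; resolve() is where lowering happens."""
    def __init__(self, device, runtime):
        self.device = device
        self.runtime = runtime
    def resolve(self):
        # Real IRON would generate MLIR here; this mock just describes the ops.
        return [f"{op} {w.core_fn.__name__} on {self.device}"
                for op, w in self.runtime.ops]

def add_kernel(a, b):
    return a + b

rt = Runtime()
rt.start(Worker(add_kernel, fn_args=(1, 2)))
prog = Program("npu1", rt)
print(prog.resolve())  # lowering is deferred until this call
```

Because construction and lowering are decoupled, placement and other decisions can be supplied once at resolve time instead of being repeated at every declaration, which is the duplication reduction the paper describes.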
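The taplib idea of reasoning about (sizes, strides, offset) patterns, including "access equivalence", can be sketched in pure Python. The function names below are illustrative assumptions, not the taplib API.

```python
# Hypothetical sketch: derive the flat-index access order implied by a
# (sizes, strides, offset) tuple, then compare two patterns for access
# equivalence (same elements touched in the same order).
from itertools import product

def access_order(sizes, strides, offset=0):
    """Return the flat indices touched, in nested-loop traversal order."""
    return [offset + sum(i * st for i, st in zip(idx, strides))
            for idx in product(*(range(s) for s in sizes))]

def access_equivalent(a, b):
    """True if two (sizes, strides, offset) patterns generate the same
    access order, even when their dimensions are expressed differently."""
    return access_order(*a) == access_order(*b)

# A 2x4 row-major walk over an 8-element buffer...
p1 = ((2, 4), (4, 1), 0)
# ...is access-equivalent to a flat 1-D walk with unit stride.
p2 = ((8,), (1,), 0)
print(access_equivalent(p1, p2))  # True
```

This is why access equivalence matters for NPU DMAs: different DMA engines impose different limits on dimension counts and stride values, so a pattern may need to be re-expressed in an equivalent form that fits a particular engine's constraints.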

Implementation and Evaluation:

The contributions are implemented in Python (~1,400 LOC for the API, ~560 LOC for taplib). Evaluation was performed on a suite of 27 diverse IRON designs (ranging from simple data copies to complex GEMM, ResNet blocks, and vision pipelines).

  • Efficiency: The new API measurably improved designer efficiency: designs rewritten with it showed an average reduction of ~26% in Source Lines of Code (SLOC) and decreases in Halstead metrics (vocabulary and effort), indicating both less code and lower complexity for the programmer.
  • Expressivity: The new API maintained the full expressivity of the previous IRON interface. All 27 example designs could be expressed with the new API, exhibiting consistent performance compared to their original implementations (average latency difference ~3.36%). Static analysis of generated MLIR, controlling for declaration order, showed identical MLIR for 20 designs, and functionally equivalent (access equivalent patterns or reordered broadcasts) MLIR for the remaining 7.
  • Extensibility: The placement interface and taplib were demonstrated to be extensible. A simple SequentialPlacer was implemented (64 SLOC) and successfully applied to fully or partially place 24 of 27 designs, showing the ease of creating custom placers. A TensorTiler2D generator was implemented using taplib (277 SLOC) and applied to 5 designs, significantly simplifying the expression of DMA tiling patterns.
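The simplicity of a sequential placer like the one evaluated above can be conveyed with a toy round-robin assignment. The grid shape and tile naming here are assumptions for illustration, not IRON's actual placement interface.

```python
# Toy placer in the spirit of a SequentialPlacer: assign each worker to
# the next free (col, row) compute tile in column-major order.
def sequential_place(workers, cols, rows):
    """Map each worker name to a (col, row) compute tile."""
    tiles = [(c, r) for c in range(cols) for r in range(rows)]
    if len(workers) > len(tiles):
        raise ValueError("not enough compute tiles for all workers")
    return {w: tiles[i] for i, w in enumerate(workers)}

placement = sequential_place(["copy", "scale", "add"], cols=4, rows=4)
print(placement)  # {'copy': (0, 0), 'scale': (0, 1), 'add': (0, 2)}
```

The point of the extensible interface is that a strategy of roughly this complexity can be swapped in behind a common Placer abstraction, without the author needing to touch the compiler internals.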

Practical Applications:

This work directly benefits performance engineers and compiler developers targeting AMD XDNA™ NPUs using IRON.

  • Reduced Development Effort: The simplified API reduces the boilerplate code and mental overhead associated with expressing NPU designs, allowing engineers to focus on the core compute and dataflow logic.
  • Improved Readability and Maintainability: By abstracting MLIR specifics and providing higher-level constructs like Worker and Runtime, designs become easier to read, understand, and maintain.
  • Easier Exploration of Design Space: The extensible placement interface allows engineers to quickly iterate on different placement strategies, potentially using custom or algorithmic placers, without manual tile assignment for every component.
  • Intuitive Data Transformation Definition: taplib provides a structured way to define complex DMA transformations, reducing reliance on error-prone manual calculation of sizes and strides. Visualizations and analysis tools help verify correctness.
  • Foundation for Automation: The extensible interfaces serve as well-defined integration points for future automation tools, such as advanced placement algorithms or domain-specific tiling generators, which can be built on top of the IRON API.

In summary, this paper demonstrates that carefully designed programming interfaces, even at a low level, can significantly improve programmer efficiency and facilitate extensibility for hardware-specific optimizations without sacrificing the necessary expressivity for performance tuning on complex accelerators like NPUs.
