- The paper introduces a new Python API for IRON that defers MLIR generation, reducing code duplication and shrinking designs by an average of ~26% in source lines of code (SLOC).
- The paper refines core data movement and compute constructs—such as ObjectFifo, Worker, Runtime, and Program—to enhance readability and maintain full expressivity.
- The paper develops extensible interfaces, including a custom placement tool and the taplib library, enabling intuitive DMA transformations and easier automated design space exploration.
Programming modern neural processing units (NPUs) like the AMD XDNA™ presents a challenge for developers. While high-level frameworks abstract hardware details, they may hide critical optimization opportunities. Conversely, low-level toolkits, while offering fine-grained control necessary for performance tuning, can be complex and require significant developer effort. This paper introduces contributions to IRON, an open-source, close-to-metal toolkit for AMD XDNA™ NPUs, aiming to improve programmer efficiency, maintain expressivity, and enhance extensibility. (2504.18430)
The core problem addressed is the inherent tension in low-level NPU programming interfaces between ease of use (designer efficiency) and the necessity to expose nuanced hardware capabilities (expressivity). The paper proposes an updated programming interface for IRON, implemented as a new Python API layered above the existing MLIR-based interface (`mlir-aie`).
Key contributions and their practical implications include:
- New Top-Level Python API: This API creates a layer of abstraction, deferring generation of the underlying MLIR operations until a `resolve` function is called. Python objects in the API can therefore be constructed without the immediate constraints imposed by MLIR's structure, reducing information duplication (e.g., specifying placement multiple times) and simplifying the design representation.
- Refined ObjectFifo API: The new API simplifies the declaration and use of `ObjectFifo`s, which are critical for managing explicit data movement on NPUs. Default values (like depth for ping-pong buffering) are introduced, endpoint inference is supported, and the need to explicitly specify an `ObjectFifoPort` (Consume/Produce) during `acquire` and `release` is eliminated. New methods such as `forward`, `split`, and `join` are provided on `ObjectFifoHandle`s to simplify expressing complex data movement patterns like L2 buffering or splitting/joining data streams across memory tiles.
- New Constructs: `Worker`, `Runtime`, and `Program` (a combined code sketch follows this list):
  - The `Worker` construct separates the definition of compute logic (`core_fn`) from its configuration and arguments (`fn_args`), improving code structure and facilitating metaprogramming (adapting kernels based on data types, dimensions, etc.).
  - The `Runtime` construct provides a clearer interface for defining the sequence of operations executed by the host processor (e.g., `start` for workers, `fill` and `drain` for L3-to-ObjectFifo data transfers using DMAs). An `inline_ops` method allows experts to insert custom MLIR operations when needed.
  - The `Program` construct composes a `Runtime` sequence with a specific `Device` (NPU type) and, optionally, a `Placer` to generate the final MLIR design.
- Extensible Placement Interface: The new API introduces a `Placeable` interface for design components and a `Placer` argument for the `Program.resolve_program` method. Designers can either manually specify placement or use a custom `Placer` implementation to automatically assign design constructs to physical tiles (AIE, Memory, Shim). This addresses the tedium of manual placement and enables algorithmic placement tools to be built without deep compiler-engineering knowledge.
- `taplib` for Data Transformations: A new library, `taplib`, provides a more intuitive way to express the complex on-the-fly data transformations performed by DMAs. It defines `TensorAccessPattern` (`tap`) and `TensorAccessSequence` (`tas`) objects constructed from tensor dimensions, sizes, strides, and offsets. `taplib` includes tools for reasoning about these patterns, including visualizations (heat maps showing access order and access count) and programmatic analysis (e.g., checking access-count sums or order maxima). It also introduces the concept of "access equivalence" for patterns that generate identical access maps, which is crucial for NPU DMAs with varying constraints. The Runtime's `fill` and `drain` methods can now accept a `tap` object instead of raw sizes and strides (see the `taplib` sketch after this list).
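To make the programming model concrete, below is a minimal copy-style design assembled from the constructs described above. It is a sketch, not the paper's code: the import paths, the device class name (`NPU1Col1`), the producer/consumer handle accessors (`prod()`/`cons()`), and the runtime-sequence context manager are assumptions that may not match the current `mlir-aie` release.

```python
import numpy as np

# Assumed import paths; the actual IRON package layout in mlir-aie may differ.
from aie.iron import ObjectFifo, Worker, Runtime, Program
from aie.iron.placers import SequentialPlacer
from aie.iron.device import NPU1Col1  # assumed device class name

# A 1-D tile of 1024 int32 elements moved through the array.
tile_ty = np.ndarray[(1024,), np.dtype[np.int32]]

# ObjectFifos are declared without any placement information; depth defaults
# to ping-pong buffering, so only the element type and a name are given.
of_in = ObjectFifo(tile_ty, name="in")
of_out = ObjectFifo(tile_ty, name="out")

# Compute logic is a plain Python function; its ObjectFifo endpoints are
# supplied later through fn_args, which keeps the kernel reusable.
def core_fn(of_in, of_out):
    elem_in = of_in.acquire(1)    # no ObjectFifoPort needed anymore
    elem_out = of_out.acquire(1)
    for i in range(1024):         # simplified elementwise copy for illustration
        elem_out[i] = elem_in[i]
    of_in.release(1)
    of_out.release(1)

worker = Worker(core_fn, fn_args=[of_in.cons(), of_out.prod()])

# The Runtime sequence describes what the host does: start the worker,
# push data from L3 into of_in, and pull results back out of of_out.
rt = Runtime()
with rt.sequence(tile_ty, tile_ty) as (a_in, c_out):  # assumed helper
    rt.start(worker)
    rt.fill(of_in.prod(), a_in)
    rt.drain(of_out.cons(), c_out, wait=True)

# Only here is MLIR generated; the Placer assigns unplaced constructs to tiles.
module = Program(NPU1Col1(), rt).resolve_program(SequentialPlacer())
print(module)
```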
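A similarly hedged sketch of `taplib`: a `TensorAccessPattern` that reads an 8×8 tensor column-by-column, visualized and then handed to the runtime in place of raw sizes and strides. The module path, the constructor keyword names, and the `visualize` method are assumptions.

```python
from aie.helpers.taplib import TensorAccessPattern  # assumed module path

# Read a row-major 8x8 tensor column-by-column by giving the DMA explicit
# sizes and strides (listed outermost to innermost); values are illustrative.
transpose_tap = TensorAccessPattern(
    tensor_dims=(8, 8),
    offset=0,
    sizes=[1, 1, 8, 8],
    strides=[0, 0, 1, 8],  # innermost: step a whole row (8) -> walk a column
)

# taplib can render heat maps of access order and access count so a pattern
# can be checked by eye before use (method name and arguments assumed).
transpose_tap.visualize(show_arrows=True, file_path="transpose_tap.png")

# The same object can replace raw sizes/strides in the runtime sequence, e.g.:
#   rt.fill(of_in.prod(), input_buffer, transpose_tap)
```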
Implementation and Evaluation:
The contributions are implemented in Python (~1,400 LOC for the API, ~560 LOC for `taplib`). Evaluation was performed on a suite of 27 diverse IRON designs, ranging from simple data copies to complex GEMM kernels, ResNet blocks, and vision pipelines.
- Efficiency: The new API significantly increased designer efficiency. Designs written with the new API showed an average reduction of ~26% in source lines of code (SLOC) and reductions in Halstead metrics (vocabulary and effort), indicating less code and lower complexity for the programmer.
- Expressivity: The new API maintained the full expressivity of the previous IRON interface. All 27 example designs could be expressed with the new API, exhibiting consistent performance compared to their original implementations (average latency difference ~3.36%). Static analysis of generated MLIR, controlling for declaration order, showed identical MLIR for 20 designs, and functionally equivalent (access equivalent patterns or reordered broadcasts) MLIR for the remaining 7.
- Extensibility: The placement interface and `taplib` were demonstrated to be extensible. A simple `SequentialPlacer` (64 SLOC) was implemented and successfully applied to fully or partially place 24 of the 27 designs, showing the ease of creating custom placers (a toy placer sketch follows this list). A `TensorTiler2D` generator built on `taplib` (277 SLOC) was applied to 5 designs, significantly simplifying the expression of DMA tiling patterns.
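To give a flavor of how little code a custom placer needs, here is a toy column-filling placer written against an assumed interface; the real `Placer` base class, its abstract method name, and the arguments IRON passes it may differ, so treat this only as the shape of the idea.

```python
from aie.iron.placers import Placer  # assumed base-class location


class ColumnFillPlacer(Placer):
    """Toy placer: drop each unplaced Worker onto the next compute tile,
    walking column by column. The method name and arguments below are
    assumed, not taken from the IRON source."""

    def make_placement(self, device, rt, workers, object_fifos):
        compute_tiles = device.get_compute_tiles()  # assumed device helper
        next_idx = 0
        for worker in workers:
            if worker.tile is None:          # respect manual placements
                worker.place(compute_tiles[next_idx])
                next_idx = (next_idx + 1) % len(compute_tiles)
        # Shim/mem-tile assignment for ObjectFifo endpoints omitted for brevity.
```

Such a placer would then be passed to `resolve_program` in place of `SequentialPlacer`, e.g. `Program(dev, rt).resolve_program(ColumnFillPlacer())`.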
Practical Applications:
This work directly benefits performance engineers and compiler developers targeting AMD XDNA™ NPUs using IRON.
- Reduced Development Effort: The simplified API reduces the boilerplate code and mental overhead associated with expressing NPU designs, allowing engineers to focus on the core compute and dataflow logic.
- Improved Readability and Maintainability: By abstracting MLIR specifics and providing higher-level constructs like `Worker` and `Runtime`, designs become easier to read, understand, and maintain.
- Easier Exploration of Design Space: The extensible placement interface allows engineers to quickly iterate on different placement strategies, potentially using custom or algorithmic placers, without manual tile assignment for every component.
- Intuitive Data Transformation Definition: `taplib` provides a structured way to define complex DMA transformations, reducing reliance on error-prone manual calculation of sizes and strides; its visualizations and analysis tools help verify correctness.
- Foundation for Automation: The extensible interfaces serve as well-defined integration points for future automation tools, such as advanced placement algorithms or domain-specific tiling generators, which can be built on top of the IRON API.
In summary, this paper demonstrates that carefully designed programming interfaces, even at a low level, can significantly improve programmer efficiency and facilitate extensibility for hardware-specific optimizations without sacrificing the necessary expressivity for performance tuning on complex accelerators like NPUs.