
MLIR-AIR: Spatial AI Compiler Stack

Updated 2 February 2026
  • MLIR-AIR Compiler Stack is an open-source IR framework that bridges high-level AI workloads with spatial accelerator architectures using explicit asynchronous constructs.
  • It employs a systematic lowering pipeline that transforms structured MLIR into statically scheduled, fine-grained compute and memory operations.
  • Performance evaluations show that MLIR-AIR achieves high throughput and reduced latency, closely rivaling hand-optimized implementations.

MLIR-AIR (Accelerator Intermediate Representation) is an open-source compiler stack built atop MLIR that bridges the gap between high-level AI workloads—typically expressed as structured loop nests—and the explicit, fine-grained control required by spatial AI hardware such as AMD’s NPUs. It introduces the AIR dialect, a set of SSA-based intermediate representations for capturing asynchronous, hierarchical, and statically scheduled compute and communication operations across heterogeneous compute and memory resources. MLIR-AIR is architected to transform high-level loop- and tensor-centric IR into programs that efficiently orchestrate computation, data movement, and synchronization to maximize utilization on spatially distributed fabrics (Wang et al., 16 Oct 2025). The stack demonstrates competitive performance and programmability, approaching that of hand-optimized, low-level flows.

1. Architectural Foundations and Design Principles

The motivation for MLIR-AIR arises from the need to express explicit placement, communication, and synchronization on architectures consisting of tiled compute elements (cores), private SRAMs, shared L2 tiles, shim tiles for host I/O, and per-tile DMA engines linked by spatial interconnects. General-purpose compilers that target CPU/GPU models abstract away locality and parallelism, precluding effective mapping to modern spatial accelerators. MLIR-AIR targets these requirements by exposing spatial compute hierarchies, memory, and asynchronous execution engines as first-class constructs in its IR. The stack aims for:

  • Progressive lowering from high-level structured MLIR (e.g., Linalg/SCF) to static, analyzable schedules for spatial architectures.
  • Decoupling of compute placement, data movement, and synchronization within the IR.
  • Cross-generation and backend portability via a target-agnostic AIR dialect that can be lowered to hardware-specific dialects (e.g., MLIR-AIE for AMD NPUs, LLVM for general-purpose architectures).
  • Fine-grained management of task dispatch and concurrency, avoiding ad hoc runtime strategies or manual scheduling (Wang et al., 16 Oct 2025).

The overall pipeline integrates with frontends such as Torch-MLIR, TOSA, Triton, or IREE that emit standard MLIR IR with structured control flow (SCF) and Linalg dialects. Common MLIR passes normalize this input, followed by lowering to AIR, where loops are tiled, parallel loops become static launches/herds, tensor allocations become explicit memory operations, and asynchronous dependencies are extracted. Backend-specific lowerings then target the appropriate hardware abstraction (Wang et al., 16 Oct 2025).

2. AIR Dialect Primitives and Syntax

The AIR dialect is rooted in the MLIR SSA framework and introduces three main classes of primitives:

2.1 Scheduling Constructs

  • air.launch: Delineates a region to be dispatched to the accelerator with an optional iteration_space for chunking iterations.
  • air.segment: Reserves regions of compute resources (e.g., chiplets) that can host further nested operations.
  • air.herd: Defines multidimensional grids of workers (corresponding to cores plus local memory) running identical code in parallel, semantically analogous to thread blocks in CUDA/OpenMP, but statically scheduled and non-preemptive.
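The three constructs nest hierarchically. The following is an illustrative sketch of that nesting; operand lists, terminators, and exact spellings are approximations and may differ from the released dialect:

```mlir
// Hypothetical sketch of the AIR scheduling hierarchy (syntax approximate).
air.launch (%lx, %ly) in (%sx = %c2, %sy = %c2) {
  // Reserve a region of compute resources (e.g., a chiplet).
  air.segment {
    // A 2x2 grid of workers, each running the same body in parallel,
    // statically scheduled and non-preemptive.
    air.herd tile (%tx, %ty) in (%hx = %c2, %hy = %c2) {
      // per-core compute and local-memory operations go here
    }
  }
}
```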

2.2 Data Locality and Movement Primitives

  • air.memcpy: Explicit source/destination memory copies with support for offsets, stride, and size; can be lowered to channel transfers for DMA.
  • air.channel.put and air.channel.get: Decoupled, one-way FIFO transfers between regions, with coalescing and backpressure for synchronization. Explicit channel symbols enable deterministic mapping to streams.
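A named channel decouples the producer and consumer sides of a transfer. The sketch below is illustrative only; the offset/size/stride operand syntax is approximate:

```mlir
// Declare a channel symbol; explicit symbols enable deterministic
// mapping to hardware streams.
air.channel @ch0 [1, 1]

// Producer side: push a 64x64 tile of %src into the channel.
air.channel.put @ch0[] (%src[%c0, %c0] [%c64, %c64] [%c64, %c1])
    : (memref<64x64xi32>)

// Consumer side (elsewhere in the IR): pop the tile into a local buffer.
// The FIFO provides backpressure, so no extra synchronization is needed.
air.channel.get @ch0[] (%dst[%c0, %c0] [%c64, %c64] [%c64, %c1])
    : (memref<64x64xi32, 2>)
```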

2.3 Synchronization and Asynchrony

  • !air.async.token: SSA value representing completion of an asynchronous operation.
  • air.await([%token,...]): Blocks until one or more provided tokens have completed.
  • Dependencies can be managed explicitly—enabling in-IR expression of data, control, and resource-affinity relations; these propagate to eventual schedule extraction and static analysis stages.
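A minimal sketch of the token discipline, with illustrative (not normative) operand syntax:

```mlir
// An asynchronous copy returns a token instead of blocking.
%t0 = air.memcpy async (%local, %global) : (memref<64x64xi32, 2>, memref<64x64xi32>)

// The consumer blocks only at the point where the data is actually needed,
// leaving the intervening region free for overlapped work.
air.await([%t0])
// ... compute on %local here ...
```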

This explicit, analyzable IR enables the compiler to extract, transform, and schedule concurrent DMA transfers, computation, and fine-grained synchronization (Wang et al., 16 Oct 2025).

3. Lowering Pipeline and Static Scheduling

The AIR compilation pipeline consists of five major phases:

  1. Tiling and Parallel Mapping: Structured loops are subdivided into tiles compatible with the spatial memory hierarchy (e.g., tile sizes match on-chip SRAM capacity). Parallel/tiled loops are mapped to air.launch and air.herd constructs, with herd shapes chosen to balance memory reuse against task granularity.
  2. Broadcast Analysis and Lowering: Detection of reuse patterns (affine broadcasts) among tiled cores; annotations for broadcast enable the backend to implement multicast on the spatial streaming network.
  3. Asynchronous Control and Dataflow Graph Construction: Extraction of RAW, WAR, and WAW dependencies, loop-carried tokenization for pipelined stages (ping-pong buffers), and insertion of air.async.start/air.await to enable static scheduling of overlapping DMA and compute kernels.
  4. Channel-Based Dataflow Inference: Each air.memcpy is replaced by a matched pair of air.channel.put/get operations, with user-visible channel symbols that synchronize producer/consumer and allow multi-channel, overlapped executions.
  5. Lowering to MLIR-AIE (AMD NPU Backend): AIR spatial constructs are mapped to per-tile code, DMA block descriptors, hardware locks, and streamed memory configuration, forming a static, analyzable spatial schedule ready for hardware execution. Tracing hooks for runtime performance analysis are optionally included.

The pipeline separates spatial scheduling, communication, and computation concerns, supporting overlapping execution, broadcast/multicast, channel fusion, and efficient synchronization—all mapped statically, not left to dynamic runtime (Wang et al., 16 Oct 2025).

4. Spatial Scheduling, Overlap, and Data Movement Strategies

Spatial scheduling in MLIR-AIR is governed by the explicit resource-scoping (air.launch/air.segment/air.herd) and the precise bookkeeping of asynchrony through tokens and channels. By default, independent herds execute sequentially; explicit data and execution dependencies modulate concurrent execution and temporal overlap.

  • Asynchronous DMA: Each DMA launch is asynchronous and returns a token; synchronization of compute with data readiness is explicit via air.await.
  • Overlapping Compute and Data Transfer: By pipelining DMA transfers and employing ping-pong buffering (where separate dependency tokens are passed for alternating data blocks), AIR enables high utilization and latency hiding.
  • Channel Fusion: Multiple logical streams can be coalesced onto a hardware channel, minimizing descriptor count and arbitration overhead.
  • Memory Splitting: When the architecture supports parallel memory ports (e.g., shims), MemRef splitting exposes additional independent data streams.
  • Broadcast Lowering: Explicit annotation of multicast communication enables the efficient realization of one-to-many data movement over the streaming fabric.
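The ping-pong pattern above can be sketched as a loop-carried token, so that the DMA filling one buffer overlaps compute on the other. The pipeline generates this form automatically; the syntax below is an illustrative approximation, with buffer alternation by iteration parity elided:

```mlir
// Prefetch the first data block asynchronously.
%t0 = air.memcpy async (%buf0, %in) : (memref<64x64xi32, 2>, memref<64x64xi32>)

// Each iteration starts the next transfer before waiting on the
// previous one, hiding DMA latency behind compute.
%last = scf.for %i = %c1 to %cN step %c1
    iter_args(%prev = %t0) -> (!air.async.token) {
  %next = air.memcpy async (%buf1, %in) : (memref<64x64xi32, 2>, memref<64x64xi32>)
  air.await([%prev])          // current block is now resident
  // ... compute on the buffer filled by %prev ...
  scf.yield %next : !air.async.token
}
```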

These features provide architectural leverage to statically orchestrate complex dataflows and compute-communication overlaps crucial for high-performance spatial execution (Wang et al., 16 Oct 2025).

5. Mapping AI Workloads: Case Studies

Two principal workloads demonstrate MLIR-AIR's expressiveness and performance parity with hand-optimized flows:

5.1 Matrix Multiplication (Output-Stationary Algorithm)

Tiling and mapping transforms a high-level loop nest:

for_all iOut in 0..M/t_i, jOut in 0..N/t_j :   <-- air.herd sizes=[M/t_i, N/t_j]
  for kOut in 0..K/t_k :
    for ii in 0..t_i, jj in 0..t_j, kk in 0..t_k :
      C[ii,jj] += A[ii,kk] * B[kk,jj]

This loop nest is lowered to AIR IR in which each tile computation becomes a combination of local buffer allocations, tiled DMA transfers via channel put/get, explicit awaits for data readiness, and an on-core matmul (Wang et al., 16 Oct 2025). Broadcasts are inserted automatically where needed, and channel synchronization is inferred statically.
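The lowered form can be sketched as follows. Buffer shapes, channel indexing, and op spellings are illustrative approximations of what the pipeline emits:

```mlir
// Hypothetical sketch of the output-stationary matmul after lowering.
air.herd tile (%tx, %ty) in (%hx = %c4, %hy = %c4) {
  // Per-core accumulator stays resident in local memory ("output-stationary").
  %bufC = memref.alloc() : memref<64x64xi32, 2>
  scf.for %k = %c0 to %cK step %c1 {
    // Pull the next A and B tiles from their channels; row/column indexing
    // allows broadcast of a tile to every core sharing it.
    air.channel.get @chA[%tx] (%bufA[%c0, %c0] [%c64, %c64] [%c64, %c1])
        : (memref<64x64xi32, 2>)
    air.channel.get @chB[%ty] (%bufB[%c0, %c0] [%c64, %c64] [%c64, %c1])
        : (memref<64x64xi32, 2>)
    // Accumulate into the stationary C tile.
    linalg.matmul ins(%bufA, %bufB : memref<64x64xi32, 2>, memref<64x64xi32, 2>)
                  outs(%bufC : memref<64x64xi32, 2>)
  }
  // Push the finished C tile back toward host-visible memory.
  air.channel.put @chC[%tx, %ty] (%bufC[%c0, %c0] [%c64, %c64] [%c64, %c1])
      : (memref<64x64xi32, 2>)
}
```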

5.2 Fused Multi-Head Attention (LLaMA 2 Block)

Complex kernel fusion is tractable: a fused attention block (including projections, rotary embeddings, tiled matmul, softmax, and KV-cache updates) is mapped with ~150 lines of the AIR Python DSL, leveraging four DMA channels, channel fusion, and static scheduling of tiled and fused sub-kernels. The result is a 2.24× end-to-end latency improvement over sequential launches of sub-kernels, demonstrating both expressivity and scheduling efficiency (Wang et al., 16 Oct 2025).

6. Performance Evaluation

Empirical benchmarks indicate that MLIR-AIR delivers high-performance code generation:

  • Matrix Multiplication Throughput: Achieves up to 78.7% of peak hardware throughput for I16, and 48.6%/59.1% for BF16/I8 datatypes, with generated code consistently tracking within 5 percentage points of hand-written MLIR-AIE implementations and comparing favorably to state-of-the-art contemporaries for large tile shapes (4×4, 2×4, 2×2 herds, 64×64 tile dimensions) (Wang et al., 16 Oct 2025).
  • Throughput/Scalability with Problem Size: Giga-operations per second (GOPS) scale as problem size and tile mapping are varied, with larger reduction dimensions (K) amortizing pipeline startup overhead and boosting utilization.
  • Kernel Fusion (Multi-Head Attention): Fusion enables reduction of per-head latency from 834 µs (standalone kernels) to 373 µs in the end-to-end fused program—a clear demonstration of the stack's capacity for tractable, high-level fusion and static schedule extraction.

A plausible implication is that AIR-style IR and static scheduling generalize across similar spatial architectures, given sufficient backend support for each dialect's primitives and schedule constructs.

MLIR-AIR complements other MLIR-based compiler projects targeting high-performance AI and hardware specialization. For instance, Linalg-on-Tensor pipelines with cache-aware packing and tile-level fusion for CPUs can achieve near-ninja performance for linear algebra by fusing and mapping iterative tensor operations to micro-kernels, but rely on a dynamic runtime for parallel dispatch and locality management (Golin et al., 2024). Flows for reconfigurable hardware employ similar multi-level lowering, starting from MLIR dialects and emitting hardware IR or even SystemVerilog via CIRCT and Calyx (Zang et al., 2023). In the quantum domain, similar multi-level MLIR architectures support quantum-classical IR flows with progressive lowering to machine-level representation (Nguyen et al., 2021).

The distinctive contribution of MLIR-AIR is the explicit, analyzable IR for spatially scheduled, asynchronous compute and communication—an essential advance for extracting performance from contemporary neural processing units and novel spatial fabrics. Its architectural separation of scheduling, dataflow, and synchronization at the IR level permits tractable automatic mapping, code generation, and cross-generation portability (Wang et al., 16 Oct 2025).
