MLIR-Based Lowering Pipeline
- An MLIR-based lowering pipeline is a multi-stage compilation process that systematically transforms high-level program representations into hardware-targeted code.
- It employs a structured sequence of transformation passes—including canonicalization, affine control flow generation, and target-aware lowering—to optimize performance on CPUs, GPUs, NPUs, and FPGAs.
- The pipeline integrates domain-specific dialects and rigorous mathematical optimizations to ensure end-to-end performance, portability, and extensibility in diverse computing environments.
The MLIR-based lowering pipeline refers to a structured, multi-stage compilation process in which high-level program representations—often from domains such as machine learning, scientific computing, or domain-specific languages—are systematically transformed (“lowered”) through successively more hardware-proximate dialects of the Multi-Level Intermediate Representation (MLIR). This progression leverages MLIR’s modular dialect/pluggable pass system to apply optimizations, canonicalizations, and target-aware mappings, ultimately emitting code that is highly tuned for CPUs, GPUs, NPUs, FPGAs, or other specialized architectures. MLIR-based lowering pipelines are foundational in modern compiler and accelerator toolchains, enabling domain experts to express semantics at an abstract level without sacrificing end-to-end performance or portability.
1. Dialect Design and High-Level Entry Points
An MLIR-based lowering pipeline begins with the ingestion of high-level IR, which may originate from general-purpose languages (C/C++, Fortran), machine learning frameworks (PyTorch, TensorFlow, Triton), or domain-specific languages (DSLs for tensor algebra, stencils, quantum assembly). Each workflow introduces custom dialects to semantically encode operations relevant to the input domain:
- Machine Learning & Linear Algebra: Linalg-on-Tensor, torch, triton, and custom compiler dialects such as scanweaver.scan for selective scan recurrences (Wu et al., 30 May 2026).
- Scientific Computing: Stencil and DMP dialects for structured PDE-discretization operators (Stawinoga et al., 25 Jan 2026).
- Domain-Specific Language Compilers: ekl dialect for NumPy-like DSLs, sdfg dialect for data-centric task graphs (Friebel et al., 21 Apr 2026, Ben-Nun et al., 2023).
- Hardware Abstraction: TOP and TPU dialects for NNs targeting application-specific integrated circuits (Hu et al., 2022); custom quantum dialects for OpenQASM/QIR (McCaskey et al., 2021).
- Hardware-Specific Intrinsics: rvv for RISC-V vector, hls for FPGA HLS, gpu/nvvm for CUDA (Lei et al., 18 Mar 2026, Rodriguez-Canal et al., 11 Nov 2025, Katel et al., 2021, Absar et al., 23 Feb 2026).
At this stage, the IR encodes the intent of the computation with maximal semantic information, including structural parallelism, functional dependencies, and, when available, type and shape information for subtyping or algebraic rewrites.
2. Transformation Passes and Dialect Lowering Sequence
MLIR-based lowering is orchestrated as an explicit pipeline: a directed sequence of transformation passes, each converting between one or more dialects, rewriting operations, or specializing the representation.
Major stages typically include:
- High-Level Normalization and Enrichment: Domain-specific transformations (e.g., algebraic fusion, normalization, explicit broadcast insertion for tensor DSLs) enhance the IR for parallelism or data reuse (Friebel et al., 21 Apr 2026).
- Canonicalization and Common Subexpression Elimination: Optimization passes clean up redundant IR structure and prepare for aggressive rewrites across dialect boundaries—see scanweaver.affine_scan and linalg fusion (Wu et al., 30 May 2026, Absar et al., 23 Feb 2026).
- Affine & Structured Control Flow Generation: High-level control constructs are expressed in MLIR’s affine/scf dialects, enabling precise tiling, loop transformations, and algebraic mapping. This includes affine tiling for matmuls, bufferization for memory management, and explicit prefix-scan operators for recurrences (Katel et al., 2021, Wu et al., 30 May 2026).
- Target-Aware Lowering: Critical lowering stages introduce target-specific dialects/ops, such as gpu (CUDA/NVPTX), hls (FPGA pipelines), rvv (vector intrinsics), or device (explicit host/device compute partitioning). In Hexagon-MLIR, this involves mapping linalg.generic to scf.for, inserting TCM-specific bufferization and DMA, and lowering vector ops to HVX intrinsics (Absar et al., 23 Feb 2026, Rodriguez-Canal et al., 11 Nov 2025, Lei et al., 18 Mar 2026).
- Hardware-Specific Code Generation: The penultimate steps include lowering to IRs such as LLVM (for CPUs, CUDA, or vendor HLS tools), CIRCT/calyx for RTL/FPGA, or Wasm dialects for WebAssembly export. Toolchains like ScanWeaver generate GPU PTX via MLIR’s gpu/NVVM dialect, passing through buffers, shared memory, and synchronization (Wu et al., 30 May 2026, Zang et al., 2023, Kang et al., 19 Jun 2025).
Pass ordering and dialect transitions are strictly controlled to preserve invariants (e.g., no affine loops post-unrolling, no memrefs at HW/Verif miter construction). This precise sequencing allows analyses such as tile-size selection, symbolic parameter propagation, and optimal resource partitioning.
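The ordering discipline described above can be illustrated with a toy model. The sketch below is not MLIR's pass manager: `Module`, `convert`, `run_pipeline`, and the stage names are hypothetical stand-ins, and an "op" is reduced to a `(dialect, name)` pair. It only shows the idea that each stage declares which dialects must no longer appear once it has run, so a mis-ordered pipeline fails early instead of silently producing mixed IR.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Op = Tuple[str, str]  # (dialect, op name), a deliberately minimal stand-in for an MLIR op


@dataclass
class Module:
    """Toy IR module: a flat list of ops (hypothetical, not an MLIR class)."""
    ops: List[Op]


def convert(src_dialect: str, replacement: Op) -> Callable[[Module], Module]:
    """Build a pass that replaces every op of `src_dialect` with `replacement`."""
    def run(module: Module) -> Module:
        return Module([replacement if d == src_dialect else (d, n) for d, n in module.ops])
    return run


# Each stage: (name, pass, dialects that must be absent after the pass has run).
Stage = Tuple[str, Callable[[Module], Module], Set[str]]
STAGES: List[Stage] = [
    ("linalg-to-affine", convert("linalg", ("affine", "for")), {"linalg"}),
    ("affine-to-scf", convert("affine", ("scf", "for")), {"linalg", "affine"}),
    ("scf-to-llvm", convert("scf", ("llvm", "br")), {"linalg", "affine", "scf"}),
]


def run_pipeline(module: Module, stages: List[Stage]) -> Module:
    """Run the stages in order and check each stage's post-condition."""
    for name, run, forbidden in stages:
        module = run(module)
        leftover = {d for d, _ in module.ops if d in forbidden}
        if leftover:
            raise RuntimeError(f"after {name}: dialects {sorted(leftover)} must be gone")
    return module


lowered = run_pipeline(Module([("linalg", "matmul"), ("arith", "addf")]), STAGES)
print(lowered.ops)
```

Running the same stages in reverse order raises on the first stage, because `linalg` ops are still present when the final-stage invariant is checked.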
3. Mathematical and Algorithmic Foundations
MLIR-based lowering pipelines are often grounded in explicit mathematical formalisms:
- Affine Recurrence Decomposition: For selective scans, recurrences of the form $h_t = a_t h_{t-1} + b_t x_t$ are decomposed into associative operator pairs $(a_t,\, b_t x_t)$, enabling parallel Blelloch scan schedules via structured prefix-scan rewrite (Wu et al., 30 May 2026).
- Tiling and Packing Heuristics: Cache and register tiling leverage integer-programmed models with analytical constraints for packing routines, optimal fusion, and scheduling (Ferrari et al., 22 Nov 2025).
- Memory Hierarchy Mapping: Explicit tile sizing and bufferization respect per-target memory hierarchies, as in explicit TCM/DDR buffer assignment for Qualcomm Hexagon NPU; tiling dimensions are algorithmically selected to maximize locality (Absar et al., 23 Feb 2026).
- Formal Control Flow Transforms: Conversion of for-loops to hardware-friendly constructs (e.g., scf.for → hls.pipeline + hls.unroll) models pipelining intervals and resource occupancy based on mathematical latency and area models (Rodriguez-Canal et al., 11 Nov 2025).
- Type System Extensions: Dialect-agnostic type inference, subtyping, and broadcasting rules, encoded as fixpoint iterations over Horn-logic typing, enable sophisticated DSL lowering (see TypeCheckOpInterface and subtyping join operations; Friebel et al., 21 Apr 2026).
These foundations help ensure that the IR transformations are both correct and suitable for aggressive optimization.
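To make the affine recurrence decomposition concrete, the following sketch assumes the recurrence has the first-order form `h[t] = a[t] * h[t-1] + b[t] * x[t]` with `h[-1] = 0`. Each step is then an affine map `h -> a*h + u` with `u = b*x`, and composing two such maps yields another pair `(a, u)`, which is what makes a parallel prefix scan applicable. For brevity the code uses the simpler Hillis-Steele log-step schedule rather than the Blelloch upsweep/downsweep named above; the function names are illustrative and not taken from any cited toolchain.

```python
def sequential_scan(a, b, x):
    """Reference: evaluate h[t] = a[t] * h[t-1] + b[t] * x[t] one step at a time."""
    h, out = 0, []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + u, applying `left` first, then `right`."""
    a1, u1 = left
    a2, u2 = right
    return (a2 * a1, a2 * u1 + u2)


def log_step_scan(a, b, x):
    """Inclusive scan over (a, u) pairs in ceil(log2(L)) rounds.

    Within a round, every element update is independent of the others,
    so on parallel hardware a whole round could run concurrently.
    """
    pairs = [(a_t, b_t * x_t) for a_t, b_t, x_t in zip(a, b, x)]
    step = 1
    while step < len(pairs):
        pairs = [
            combine(pairs[i - step], pairs[i]) if i >= step else pairs[i]
            for i in range(len(pairs))
        ]
        step *= 2
    # With h[-1] = 0, the hidden state is the offset component of each prefix.
    return [u for _, u in pairs]


a = [2, 3, 1, 2, 5]
b = [1, 2, 3, 1, 1]
x = [1, 1, 2, 3, 1]
assert log_step_scan(a, b, x) == sequential_scan(a, b, x)
```

The operator is associative but not commutative, so the order of arguments to `combine` (earlier segment on the left) matters.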
4. Implementation Patterns: Pseudocode and IR Skeletons
Typical lowering pipelines are realized through pattern rewrites, each articulated as MLIR RewritePatterns or pass subclasses:
- Explicit Operator Conversion: Each high-level operation (e.g., scanweaver.scan, linalg.conv_2d_nchw_fchw) is matched and replaced by lower-level composite ops (affine_scan, linalg.generic, or scf.for), with explicit handling of SSA values, memory locations, and region rewrites.
- Associative Scan Lowering Example (Wu et al., 30 May 2026):
```mlir
// Schematic IR: shapes and types are abbreviated.
// High-level op before lowering.
%y = scanweaver.scan(%x, %a, %b, %c) : (memref<LxD>, ...) -> memref<LxD>
...
// After lowering: materialize the (A_t, U_t) pairs in workgroup memory.
%pairs = memref.alloc() : memref<Lx2xD, workgroup>
scf.for %i = 0 to L step 1 {
  %At = memref.load %a[%i]
  %Bt = memref.load %b[%i]
  %Xt = memref.load %x[%i]
  %Ut = arith.mulf %Bt, %Xt
  memref.store %At, %pairs[%i, 0]
  memref.store %Ut, %pairs[%i, 1]
}
...
%scan_out = scanweaver.assoc_scan(%pairs)
// Blelloch upsweep/downsweep inserted
...
```
- Hardware Mapping Example: For FPGA, hls.pipeline wraps scf.for bodies; hls.unroll unrolls loops by a specified factor. On NPUs, DMA and vector intrinsics are introduced just before the final LLVM lowering (Rodriguez-Canal et al., 11 Nov 2025, Absar et al., 23 Feb 2026).
- Type and Shape Propagation: Transformation passes propagate symbolic and concrete type/shape data, enabling automated dimension checking, packing, and scheduling.
These rewriting and canonicalization stages are composable and statically analyzable, central to MLIR’s model.
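The match-and-replace structure of such rewrites can be sketched with a toy greedy driver. This is a simplified model, not MLIR's `RewritePattern` API: ops are plain strings, a pattern is a function returning replacement ops or `None`, and the replacement sequences below are illustrative only (they are not the actual ScanWeaver lowering).

```python
from typing import Callable, List, Optional

Op = str
Pattern = Callable[[Op], Optional[List[Op]]]


def lower_scan(op: Op) -> Optional[List[Op]]:
    """Hypothetical pattern: expand the high-level scan into lower-level ops."""
    if op == "scanweaver.scan":
        return ["memref.alloc", "scf.for", "scanweaver.assoc_scan"]
    return None


def lower_assoc_scan(op: Op) -> Optional[List[Op]]:
    """Hypothetical pattern: map the associative scan onto a GPU launch."""
    if op == "scanweaver.assoc_scan":
        return ["gpu.launch"]
    return None


def apply_patterns_greedily(ops: List[Op], patterns: List[Pattern], max_iters: int = 10) -> List[Op]:
    """Re-run all patterns over the op list until no pattern matches (a fixed point)."""
    for _ in range(max_iters):
        changed = False
        new_ops: List[Op] = []
        for op in ops:
            for pattern in patterns:
                replacement = pattern(op)
                if replacement is not None:
                    new_ops.extend(replacement)
                    changed = True
                    break
            else:
                new_ops.append(op)  # no pattern matched; keep the op unchanged
        ops = new_ops
        if not changed:
            return ops
    raise RuntimeError("patterns did not converge")


result = apply_patterns_greedily(
    ["arith.constant", "scanweaver.scan"], [lower_scan, lower_assoc_scan]
)
print(result)
```

Because each pattern inspects a single op and returns its replacement, patterns can be added or removed independently, which is the composability property the paragraph above refers to.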
5. Target Code Generation and Runtime Integration
Terminal stages of MLIR-based lowering produce IR consumable by hardware-specific toolchains or runtime environments:
- CPUs & General-Purpose Backends: Lowering to LLVM IR with optional OpenMP threading, leveraging custom dialects (emitc, xsmm) and dispatcher/invoke calls for micro-kernel registration (Golin et al., 2024, Lei et al., 18 Mar 2026).
- GPUs: gpu.launch and gpu.func IR regions map computational blocks/threads and allocate shared memory for operations such as Blelloch scans. GPU-specific dialects (NVVM) introduce tensor-core (WMMA) intrinsics and explicit barriers (Katel et al., 2021, Wu et al., 30 May 2026).
- FPGA/RTL: CIRCT passes lower SCF/arith to FIRRTL, then Calyx IR, enabling SystemVerilog emission; memory banking, register unrolling, and control FSMs are generated to match hardware resource constraints (Zang et al., 2023).
- NPUs & AI Accelerators: Bufferization to memory_space-annotated memrefs (e.g., TCM/DDR), double-buffered DMA, vectorized code via custom matcher passes, and insertion of library calls to vendor-specific math routines (QHL) (Absar et al., 23 Feb 2026).
- Quantum/Other Models: Target-specific lowering to QIR via LLVM dialect ops and runtime interface generation for hybrid quantum-classical execution (McCaskey et al., 2021).
- WebAssembly: High-level SsaWasm/wasm dialects preserve GC, stack-switching, and continuation features until final emission, avoiding information loss found in LLVM-based wasm toolchains (Kang et al., 19 Jun 2025).
This approach supports vendor tool integration (e.g., Vitis HLS for FPGA, CUDA for GPU, Hexagon assembler for NPU) using well-formed IR bridges.
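The double-buffered DMA scheme mentioned for NPUs can be sketched as a ping-pong schedule over two staging buffers. The code below is a sequential simulation under simplifying assumptions: `ddr` is a Python list standing in for off-chip memory, the two `buffers` stand in for fast local memory such as TCM, and slicing stands in for a DMA transfer. It shows only the buffer rotation; on real hardware the prefetch of the next tile would be issued asynchronously and overlap with the compute on the current tile.

```python
from typing import Callable, List, Optional


def double_buffered_map(
    ddr: List[int], tile: int, kernel: Callable[[List[int]], List[int]]
) -> List[int]:
    """Apply `kernel` tile by tile, staging data through two ping-pong buffers."""
    n_tiles = (len(ddr) + tile - 1) // tile
    buffers: List[Optional[List[int]]] = [None, None]
    out: List[int] = []
    if n_tiles == 0:
        return out
    buffers[0] = ddr[0:tile]  # prologue: fetch the first tile before any compute
    for t in range(n_tiles):
        cur, nxt = t % 2, (t + 1) % 2
        if t + 1 < n_tiles:
            # Prefetch the next tile into the idle buffer
            # (simulated here; this is where an async DMA would be started).
            buffers[nxt] = ddr[(t + 1) * tile:(t + 2) * tile]
        out.extend(kernel(buffers[cur]))  # compute on the tile already resident
    return out


doubled = double_buffered_map(list(range(10)), 4, lambda xs: [2 * v for v in xs])
print(doubled)
```

Two buffers suffice because at any time one tile is being consumed while at most one other is in flight.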
6. Evaluation, Performance, and Extensibility
MLIR-based lowering pipelines have been empirically validated across diverse workloads and hardware targets:
- Performance: Across various pipelines, compiler-generated code consistently achieves 90–105% of hand-tuned or vendor baseline throughput in GEMM, tensor contraction, and stencil benchmarks (Golin et al., 2024, Absar et al., 23 Feb 2026, Stawinoga et al., 25 Jan 2026, Hu et al., 2022).
- Parallelization: Transforming recurrences using associative scans reduces computation depth to O(log L) in the sequence length L, producing a speedup of roughly 15× for sequence models in the quoted measurement (Blelloch scan vs. sequential recurrence, 0.032 ms vs. 0.469 ms at the reported sequence length) (Wu et al., 30 May 2026). MLIR pipelines also realize near-linear scaling for OpenMP, scf.parallel, and async dialects.
- Portability: Modular dialects and lowering passes facilitate rapid adoption for novel hardware targets (RVV, SVE, AVX, FPGAs, NPUs). Custom dialects enable precise exploitation of hardware features (e.g., device, hls, rvv, tpu), and passes can be adapted or rapidly reimplemented in Python/xDSL or C++ (Lei et al., 18 Mar 2026, Rodriguez-Canal et al., 11 Nov 2025).
- Correctness and Verification: Many toolflows embed lightweight IR-level simulation (InferenceInterface), runtime type/shape checks, and conformance verification (e.g., QIR for quantum) to guarantee correctness at each lowering stage (Hu et al., 2022, McCaskey et al., 2021).
- Extensibility: Domain specialists can introduce new dialects and passes with minimal engineering effort, promote domain-specific scheduling (e.g., actor-based asynchronous mapping for WSE stencils), and preserve nontrivial program semantics throughout the lowering chain (Stawinoga et al., 25 Jan 2026, Friebel et al., 21 Apr 2026).
- Limitations: Some pipelines remain dependent on lower-level IR printers (e.g., MLIR emitc for C), lack core packing/macro-kernel lowerings, or require further refinement to achieve fully threaded parallelism in certain DSLs (Lei et al., 18 Mar 2026, Friebel et al., 21 Apr 2026).
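The depth reduction cited in the parallelization bullet can be sanity-checked by counting dependent steps. The sketch below uses the textbook round counts for a sequential recurrence, a log-step scan, and a Blelloch scan on an input padded to a power of two; these counts are idealized (they ignore memory traffic and synchronization cost), and the function names are illustrative.

```python
def sequential_steps(length: int) -> int:
    """A first-order recurrence has one dependent step per element."""
    return length


def log_step_rounds(length: int) -> int:
    """ceil(log2(length)) rounds for a log-step inclusive scan (0 for length <= 1)."""
    return max(length - 1, 0).bit_length()


def blelloch_rounds(length: int) -> int:
    """Upsweep plus downsweep levels after padding the input to a power of two."""
    return 2 * log_step_rounds(length)


for length in (16, 1024, 65536):
    print(length, sequential_steps(length), log_step_rounds(length), blelloch_rounds(length))
```

Measured speedups are typically much smaller than the ratio of these counts, since each parallel round still pays for memory access and barriers.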
7. Schematic Summary and Core Examples
The following table summarizes several canonical MLIR-based lowering flows (extracted directly from the referenced works):
| Input Type | Key Dialects/Passes | Final Target / Kernel Generation |
|---|---|---|
| PyTorch/Triton (AI) | torch/triton → linalg → vector/scf/memref → HVX/async | Hexagon NPU binary, TCM-optimized mega-kernels (Absar et al., 23 Feb 2026) |
| Selective Scan Recurrence | scanweaver.scan → affine_scan → assoc_scan → blelloch → gpu → NVVM | CUDA/PTX kernel, O(log L) depth (Wu et al., 30 May 2026) |
| Fortran + OpenMP | fir/hlfir → memref/scf/openmp → device/hls → LLVM | FPGA HLS, Vitis IP, <1% perf gap w/hand HLS (Rodriguez-Canal et al., 11 Nov 2025) |
| Quantum Assembly (OpenQASM) | quantum → LLVM → QIR | Hybrid quantum-classical executable (McCaskey et al., 2021) |
| NumPy-like DSL | ekl → linalg → dfg-mlir → HLS/Etna/cpu | MLIR-native pipeline, plug-and-play FPGA/CPU (Friebel et al., 21 Apr 2026) |
| RTL for FPGA from SYCL | affine/scf/arith → firrtl/calyx → SystemVerilog | AXI-integrated bitstream, vendor-agnostic (Zang et al., 2023) |
Each pipeline is tailored, but all share a commitment to maximal semantic retention, analyzability, and target-directed, pass-oriented lowering.
MLIR-based lowering pipelines provide a principled, modular, and extensible infrastructure for transforming high-level, domain-rich programs into efficient, hardware-targeted executables. They enable translation of semantic abstractions into hardware-proximate forms through controlled dialect transitions, mathematically grounded optimization, and integration with vendor backends and runtime systems, driving contemporary advancements in AI, scientific computing, and domain-specific systems (Wu et al., 30 May 2026, Absar et al., 23 Feb 2026, Rodriguez-Canal et al., 11 Nov 2025, Golin et al., 2024, Hu et al., 2022, McCaskey et al., 2021).