ONNX-MLIR Compiler Framework

Updated 24 March 2026

ONNX-MLIR is an open-source, modular compiler infrastructure that leverages MLIR and LLVM to optimize deep neural network inference from ONNX models.
Its multi-level IR design decomposes the compilation process into ONNX, krnl, affine, and LLVM layers, enabling both graph and loop-level optimizations.
Empirical evaluations indicate competitive compile and inference times on platforms like POWER9, demonstrating its practical potential in high-performance deployments.

ONNX-MLIR is an open-source compiler infrastructure designed to generate native code for the inference of deep neural network models described by the Open Neural Network Exchange (ONNX) standard. It leverages the Multi-Level Intermediate Representation (MLIR) framework—integrated with the LLVM project—to build a modular, extensible pipeline that translates high-level ONNX graphs into highly optimized machine code suitable for various deployment environments, including both high-availability servers and edge devices. ONNX-MLIR introduces two novel MLIR dialects: an ONNX-specific dialect encoding ONNX semantics and a loop-based dialect that abstracts kernel scheduling, each facilitating graph-level and loop-level optimizations, respectively (Jin et al., 2020).

1. Architecture and Compilation Workflow

ONNX-MLIR ingests models in the ONNX protocol-buffer format and outputs a native-code library that exposes an API mirroring the ONNX graph’s inputs and outputs. Internally, this transformation traverses four abstraction levels—each defined by a set of MLIR dialects:

MLIR Level	Dialects	Purpose
High-level ONNX IR	onnx, std	Encapsulates ONNX operators and model metadata
Loop-kernel IR	krnl, affine, std	Abstracts iteration spaces, kernel scheduling, and arithmetic
Affine IR	affine, std	Contains only affine loops and standard MLIR operations
LLVM IR	llvm	Ready for LLVM code generation targeting any supported CPU

The main compilation passes are:

onnx-frontend: Python-based importer constructing the ONNX dialect.
shape-inference: Fills unknown tensor ranks/shapes within ONNX dialect.
convert-onnx-to-krnl: Lowers most ONNX ops to krnl.iterate constructs with arithmetic in affine and std.
convert-krnl-to-affine: Converts krnl scheduling to affine.for loops (materializing schedules).
convert-krnl-to-llvm and convert-std-to-llvm: Complete the lowering to the LLVM dialect.

Standard MLIR optimization passes (e.g., canonicalization, CSE, affine-loop-unroll-jam) are interleaved with these stages to improve IR quality.
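To make the shape-inference step listed above more concrete, here is a minimal Python sketch of how result shapes might be propagated through a MatMul followed by an Add. It is an illustration only, not ONNX-MLIR's implementation: the function names infer_matmul_shape and infer_elementwise_shape are hypothetical, unknown dimensions are modeled as None, and only rank-2 MatMul and equal-rank broadcasting are handled.

```python
from typing import Optional, Tuple

Dim = Optional[int]        # None stands for a dimension unknown at compile time
Shape = Tuple[Dim, ...]


def infer_matmul_shape(a: Shape, b: Shape) -> Shape:
    """Result shape of a rank-2 MatMul: (M, K) x (K, N) -> (M, N)."""
    if len(a) != 2 or len(b) != 2:
        raise ValueError("this sketch only handles rank-2 operands")
    k_a, k_b = a[1], b[0]
    if k_a is not None and k_b is not None and k_a != k_b:
        raise ValueError(f"inner dimensions disagree: {k_a} vs {k_b}")
    return (a[0], b[1])


def infer_elementwise_shape(a: Shape, b: Shape) -> Shape:
    """Result shape of an equal-rank elementwise op with size-1 broadcasting."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles operands of equal rank")
    out = []
    for da, db in zip(a, b):
        if da == 1:
            out.append(db)            # a is broadcast along this axis
        elif db == 1:
            out.append(da)            # b is broadcast along this axis
        elif da is None or db is None:
            out.append(da if db is None else db)  # keep whichever size is known
        elif da == db:
            out.append(da)
        else:
            raise ValueError(f"incompatible dimensions: {da} vs {db}")
    return tuple(out)


# Propagate shapes through y = MatMul(x, W) + b with an unknown batch size.
x_shape: Shape = (None, 784)
hidden = infer_matmul_shape(x_shape, (784, 10))
y_shape = infer_elementwise_shape(hidden, (1, 10))
```

The batch dimension stays unknown (None) in y_shape while the feature dimension is resolved, which is the kind of partial information a later lowering stage would need to handle with dynamic bounds.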

2. MLIR Dialect Design

2.1 ONNX Dialect

Each ONNX operator (e.g., Conv, Gemm, Relu, LSTM) is automatically imported as a corresponding MLIR operation in the “onnx” dialect using TableGen. Operator signatures directly mirror the ONNX specification, with tensor inputs/outputs (as tensor<...×dtype>) and ONNX attributes (e.g., strides, pads, α, β) mapped to MLIR operation attributes. A dedicated onnx.EntryPoint operation records the model’s input/output specification, directly impacting the generated shared-library entrypoint. For example, the LeakyRelu op is defined such that its α parameter is imported as a default-valued F32 attribute.
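The idea of an operator attribute with a default value can be sketched in Python as follows. This is not the TableGen definition used by ONNX-MLIR; the class name LeakyReluOp and its evaluate method are hypothetical stand-ins. The default of 0.01 for α is the value given in the ONNX operator specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LeakyReluOp:
    """Hypothetical stand-in for an onnx-dialect operation with one attribute."""
    x: List[float]
    alpha: float = 0.01  # ONNX default for LeakyRelu; used when the model omits it

    def evaluate(self) -> List[float]:
        # LeakyRelu: f(v) = v for v >= 0, alpha * v otherwise.
        return [v if v >= 0 else self.alpha * v for v in self.x]


# An op imported without an explicit alpha falls back to the default.
default_op = LeakyReluOp(x=[-2.0, 3.0])
# An op whose model specifies alpha overrides it.
custom_op = LeakyReluOp(x=[-2.0, 3.0], alpha=0.5)
```

Modeling the attribute as a defaulted field mirrors the point made above: the importer does not need the model to spell out every attribute for the operation to be well defined.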

2.2 Loop-based (krnl) Dialect

The krnl dialect cleanly separates kernel computation ("what") from scheduling details ("how"). Its core construct, krnl.iterate, provides explicit induction variable bounds and optional scheduling metadata. Scheduling transformations, such as tiling (krnl.block), skewing, and permutation, are described as pure data operations. Arithmetic operations are delegated to affine or std dialects, enabling independent scalar optimizations.
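The separation of "what" from "how" can be illustrated with a small Python sketch in which the loop body is a plain function and the schedule is passed as data. The helper iterate and its permutation parameter are hypothetical and only loosely modeled on krnl.iterate; tiling and skewing are left out for brevity.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence


def iterate(bounds: Sequence[int],
            body: Callable[..., None],
            permutation: Optional[Sequence[int]] = None) -> None:
    """Run body over the box [0, bounds[d]) for every dimension d.

    permutation lists dimensions from the outermost loop to the innermost;
    by default loops are nested in dimension order.
    """
    order = list(permutation) if permutation is not None else list(range(len(bounds)))
    if sorted(order) != list(range(len(bounds))):
        raise ValueError("permutation must mention every dimension exactly once")
    for point in product(*(range(bounds[d]) for d in order)):
        idx: List[int] = [0] * len(bounds)
        for pos, d in enumerate(order):
            idx[d] = point[pos]       # map loop position back to its dimension
        body(*idx)                    # the body always sees (i, j, ...) in dimension order


# The same body under two schedules: only the visiting order changes.
visited_default = []
iterate((2, 3), lambda i, j: visited_default.append((i, j)))

visited_swapped = []
iterate((2, 3), lambda i, j: visited_swapped.append((i, j)), permutation=(1, 0))
```

Because the body never refers to the schedule, changing the permutation cannot alter which points are computed, only the order in which they are reached.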

3. Lowering and Transformation Strategy

The lowering process consists of three principal stages:

3.1 ONNX to krnl

The convert-onnx-to-krnl pass matches ONNX operations and emits loop nests in the krnl dialect. For example, an onnx.Add operation over rank-3 tensors of shape 3×4×5 generates a three-deep loop nest in which affine.load, std.addf, and affine.store handle memory access and computation:

krnl.iterate(%i, %j, %k)
  with (%i -> %i0 = 0 to 3, %j -> %i1 = 0 to 4, %k -> %i2 = 0 to 5) {
    // %i0, %i1, %i2 are the induction variables bound to loops %i, %j, %k.
    %a = affine.load %A[%i0, %i1, %i2]
    %b = affine.load %B[%i0, %i1, %i2]
    %c = std.addf %a, %b
    affine.store %c, %R[%i0, %i1, %i2]
}
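For readers less familiar with MLIR syntax, the loop nest above corresponds roughly to the following Python. This is only an analogy for the computation the IR describes, assuming nested lists as tensors; the function name elementwise_add_3d is hypothetical.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def elementwise_add_3d(a: Tensor3, b: Tensor3) -> Tensor3:
    """Add two rank-3 tensors with one load, one add, and one store per point."""
    d0, d1, d2 = len(a), len(a[0]), len(a[0][0])
    r = [[[0.0] * d2 for _ in range(d1)] for _ in range(d0)]
    for i in range(d0):                 # 0 to 3 in the example above
        for j in range(d1):             # 0 to 4
            for k in range(d2):         # 0 to 5
                r[i][j][k] = a[i][j][k] + b[i][j][k]
    return r


# A 3x4x5 example matching the bounds in the IR snippet.
A = [[[float(100 * i + 10 * j + k) for k in range(5)] for j in range(4)] for i in range(3)]
B = [[[1.0 for _ in range(5)] for _ in range(4)] for _ in range(3)]
R = elementwise_add_3d(A, B)
```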

3.2 krnl to affine

Scheduling metadata in the krnl dialect (e.g., from krnl.block for tiling) guides the transformation to affine.for loops, so that scheduled loop nests are materialized using affine constructs and affine_maps.

3.3 Lowering to LLVM

MLIR’s built-in passes for converting the std and affine dialects to the LLVM dialect finalize the lowering, after which LLVM bitcode can be emitted. This output is passed to the LLVM code generator, supporting targets such as x86, POWER9, AArch64, and s390x.

4. Key Graph and Loop Transformations

Common neural network operations and computation patterns are transformed as follows:

Matrix Multiplication (ONNX MatMul): Lowered from the mathematical form

$C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} B_{k,j}$

to krnl and then affine loop nests executing explicit loads, multiplications, additions, and stores as required.
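The formula above maps directly onto a triple loop. A minimal Python sketch of that loop nest follows, assuming row-major nested lists; the function name matmul is illustrative and the code says nothing about how ONNX-MLIR orders or optimizes these loops.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """C[i][j] = sum over k of A[i][k] * B[k][j], written as explicit loops."""
    m, k_dim, n = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("columns of A must match rows of B")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k in range(k_dim):
                acc += a[i][k] * b[k][j]   # load, multiply, add
            c[i][j] = acc                   # store
    return c


C = matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
```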

Tiling and Scheduling: Loop-level optimizations are introduced at the krnl level (e.g., 32×32 tiling with krnl.block), and the resulting schedule is lowered to affine.for loops with step and range arguments implemented via affine_maps.
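As a rough picture of what a tiled schedule looks like once materialized, the sketch below tiles all three matrix-multiplication loops in Python. The outer loops advance by the tile size and the inner loops are clamped with min, which plays the role that step and bound maps play in affine.for. The function name tiled_matmul and the choice to tile every dimension are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]


def tiled_matmul(a: Matrix, b: Matrix, tile: int = 32) -> Matrix:
    """Matrix multiply with every loop split into a tile loop and a point loop."""
    if tile <= 0:
        raise ValueError("tile size must be positive")
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):                 # tile loops: step = tile
        for j0 in range(0, n, tile):
            for k0 in range(0, k_dim, tile):
                # Point loops: upper bounds clamped so partial tiles stay in range.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, k_dim)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


# Compare against a straightforward reference on sizes that are not tile multiples.
A = [[i + k for k in range(7)] for i in range(5)]
B = [[k * j + 1 for j in range(3)] for k in range(7)]
reference = [[sum(A[i][k] * B[k][j] for k in range(7)) for j in range(3)] for i in range(5)]
tiled = tiled_matmul(A, B, tile=2)
```

Integer inputs are used so the comparison with the reference is exact; with floating-point data, a different tiling could change rounding if it changed the order of accumulation.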

5. Optimization Techniques

ONNX-MLIR employs both graph-level (within the ONNX dialect) and loop-level (krnl and affine dialects) optimizations:

Graph-Level:
- Operation Decomposition: E.g., ReduceL1 decomposed into ReduceSum and Abs operations.
- Operator Fusion: E.g., MatMul + Add fused into Gemm by declarative rewrite patterns.
- Identity Elimination and Constant Folding: Values computed at compile time if all op inputs are known constants.
- Algebraic Normalization: Standard identities (commutativity, associativity) are encoded via declarative rewrite rules.
Loop-Level:
- Polyhedral Scheduling: Insertion of scheduling transformations for tiling, skewing, and permutation.
- Loop Fusion: Merging of adjacent krnl.iterate constructs.
- MLIR Affine Passes: Reuse of existing affine-loop optimizations (unroll-jam, tiling, vectorization).
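Two of the graph-level rewrites listed above, MatMul + Add fusion and ReduceL1 decomposition, can be sketched over a toy graph representation in Python. ONNX-MLIR expresses such rules declaratively in MLIR; the Node class and the functions fuse_matmul_add and decompose_reduce_l1 below are hypothetical, and the sketch omits the rank and attribute checks a real pattern would need.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Toy graph node: an op name, its input value names, and one output name."""
    op: str
    inputs: Tuple[str, ...]
    output: str


def use_count(graph: List[Node], value: str) -> int:
    return sum(n.inputs.count(value) for n in graph)


def fuse_matmul_add(graph: List[Node]) -> List[Node]:
    """Rewrite Add(MatMul(A, B), C) into Gemm(A, B, C) when the MatMul has one use."""
    producers = {n.output: n for n in graph}
    fused, dead = {}, set()
    for n in graph:
        if n.op != "Add" or len(n.inputs) != 2:
            continue
        for pos, name in enumerate(n.inputs):
            p = producers.get(name)
            if p is not None and p.op == "MatMul" and use_count(graph, name) == 1:
                bias = n.inputs[1 - pos]
                fused[n.output] = Node("Gemm", p.inputs + (bias,), n.output)
                dead.add(p.output)      # the MatMul result is no longer needed
                break
    return [fused.get(n.output, n) for n in graph if n.output not in dead]


def decompose_reduce_l1(graph: List[Node]) -> List[Node]:
    """Rewrite ReduceL1(x) into ReduceSum(Abs(x)); assumes a single data input."""
    out: List[Node] = []
    for n in graph:
        if n.op == "ReduceL1":
            tmp = n.output + "_abs"     # hypothetical name for the new intermediate
            out.append(Node("Abs", n.inputs, tmp))
            out.append(Node("ReduceSum", (tmp,), n.output))
        else:
            out.append(n)
    return out


graph = [
    Node("MatMul", ("A", "B"), "t"),
    Node("Add", ("t", "C"), "y"),
    Node("ReduceL1", ("y",), "z"),
]
rewritten = decompose_reduce_l1(fuse_matmul_add(graph))
```

The single-use check in fuse_matmul_add matters: if the MatMul result fed another consumer, removing it would break that consumer, so the sketch leaves such graphs unchanged.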

6. Code Generation and Backend Support

After lowering to the LLVM dialect, ONNX-MLIR produces LLVM bitcode. All LLVM-supported CPU architectures are available “for free” for code generation. Planned future extensions include GPU code generation via MLIR’s NVVM or ROCDL dialects and potentially direct CUDA kernel generation.

7. Empirical Evaluation

Preliminary experiments on a POWER9 (2.3 GHz) platform using reference ONNX-MLIR (without advanced loop or SIMD optimizations) show promising performance:

Model	Compile Time (s)	Inference Time (s)
MNIST	0.237	0.001
ResNet50	7.661	7.540

For MNIST, compilation completes in under 250 ms and inference takes approximately 1 ms. ResNet50, with 100 MB of weights and 50 layers, compiles in approximately 7.7 s and runs inference in approximately 7.5 s. No direct apples-to-apples comparison is available, but the results are in the range of optimized library performance. Ongoing development targets additional polyhedral and vectorization passes to further improve efficiency.

Summary and Context

ONNX-MLIR utilizes MLIR’s extensibility via dialects to modularize ONNX operator semantics and loop-kernel scheduling, mirroring modern compiler architecture. By leveraging declarative rewrite rules and MLIR/LLVM optimizer infrastructure, it provides a systematic means to compile ONNX models to high-performance native code. Future developments in backend support and further optimization passes are anticipated to enhance both portability and efficiency (Jin et al., 2020).

References

Jin et al. (2020). Compiling ONNX Neural Network Models Using MLIR.