ONNX-MLIR Compiler Framework
- ONNX-MLIR is an open-source, modular compiler infrastructure that leverages MLIR and LLVM to optimize deep neural network inference from ONNX models.
- Its multi-level IR design decomposes the compilation process into ONNX, krnl, affine, and LLVM layers, enabling both graph and loop-level optimizations.
- Empirical evaluations indicate competitive compile and inference times on platforms like POWER9, demonstrating its practical potential in high-performance deployments.
ONNX-MLIR is an open-source compiler infrastructure designed to generate native code for the inference of deep neural network models described by the Open Neural Network Exchange (ONNX) standard. It leverages the Multi-Level Intermediate Representation (MLIR) framework—integrated with the LLVM project—to build a modular, extensible pipeline that translates high-level ONNX graphs into highly optimized machine code suitable for various deployment environments, including both high-availability servers and edge devices. ONNX-MLIR introduces two novel MLIR dialects: an ONNX-specific dialect encoding ONNX semantics and a loop-based dialect that abstracts kernel scheduling, each facilitating graph-level and loop-level optimizations, respectively (Jin et al., 2020).
1. Architecture and Compilation Workflow
ONNX-MLIR ingests models in the ONNX protocol-buffer format and outputs a native-code library that exposes an API mirroring the ONNX graph’s inputs and outputs. Internally, this transformation traverses four abstraction levels—each defined by a set of MLIR dialects:
| MLIR Level | Dialects | Purpose |
|---|---|---|
| High-level ONNX IR | onnx, std | Encapsulates ONNX operators and model metadata |
| Loop-kernel IR | krnl, affine, std | Abstracts iteration spaces, kernel scheduling, and arithmetic |
| Affine IR | affine, std | Contains only affine loops and standard MLIR operations |
| LLVM IR | llvm | Ready for LLVM code generation targeting any supported CPU |
The main compilation passes are:
- onnx-frontend: Python-based importer constructing the ONNX dialect.
- shape-inference: Fills unknown tensor ranks/shapes within ONNX dialect.
- convert-onnx-to-krnl: Lowers most ONNX ops to krnl.iterate constructs with arithmetic in affine and std.
- convert-krnl-to-affine: Converts krnl scheduling to affine.for loops (materializing schedules).
- convert-krnl-to-llvm and convert-std-to-llvm: Complete the lowering to the LLVM dialect.
Standard MLIR optimization passes (e.g., canonicalization, CSE, affine-loop-unroll-jam) interleave these stages for improved IR quality.
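The staged lowering can be pictured as a chain of level-to-level steps. The following Python sketch is a toy model, not ONNX-MLIR's pass manager: the `LEVELS` and `LOWERINGS` tables only restate the dialect table and pass list above, and the helpers `lowering_path` and `eliminated_dialects` are hypothetical names introduced for illustration.

```python
# Toy model of the staged lowering; this is not ONNX-MLIR code.
# Dialects allowed at each abstraction level, restated from the table above.
LEVELS = {
    "onnx": {"onnx", "std"},
    "krnl": {"krnl", "affine", "std"},
    "affine": {"affine", "std"},
    "llvm": {"llvm"},
}

# Each lowering step: (pass names, source level, target level).
LOWERINGS = [
    (("convert-onnx-to-krnl",), "onnx", "krnl"),
    (("convert-krnl-to-affine",), "krnl", "affine"),
    (("convert-krnl-to-llvm", "convert-std-to-llvm"), "affine", "llvm"),
]


def lowering_path(start, goal):
    """Return the pass names needed to go from level `start` to level `goal`."""
    passes, level = [], start
    for names, src, dst in LOWERINGS:
        if level == goal:
            break
        if src == level:
            passes.extend(names)
            level = dst
    if level != goal:
        raise ValueError(f"no lowering path from {start!r} to {goal!r}")
    return passes


def eliminated_dialects(src, dst):
    """Dialects that must be fully lowered away when moving from src to dst."""
    return LEVELS[src] - LEVELS[dst]
```

For instance, `eliminated_dialects("krnl", "affine")` yields `{"krnl"}`, which reflects that the krnl-to-affine step removes only the scheduling abstraction and leaves the affine and std operations in place.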
2. MLIR Dialect Design
2.1 ONNX Dialect
Each ONNX operator (e.g., Conv, Gemm, Relu, LSTM) is automatically imported as a corresponding MLIR operation in the “onnx” dialect using TableGen. Operator signatures directly mirror the ONNX specification, with tensor inputs/outputs (as tensor<...xdtype>) and ONNX attributes (e.g., strides, pads, α, β) mapped to MLIR operation attributes. A dedicated onnx.EntryPoint operation records the model’s input/output specification, which determines the entry point of the generated shared library. For example, the LeakyRelu op is defined such that its α parameter is imported as a default-valued F32 attribute.
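To make the attribute mapping concrete, the sketch below models an operation with one default-valued attribute in Python. It is an illustration only, not the TableGen definition: `LeakyReluOp` and `evaluate` are hypothetical names, and the default of 0.01 is the value the ONNX specification assigns to LeakyRelu's alpha when a model omits it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakyReluOp:
    """Hypothetical stand-in for an op with one default-valued F32 attribute."""

    alpha: float = 0.01  # ONNX LeakyRelu default when the model omits alpha

    def evaluate(self, xs):
        # Reference semantics: x if x >= 0 else alpha * x, applied elementwise.
        return [x if x >= 0 else self.alpha * x for x in xs]
```

Constructing `LeakyReluOp()` picks up the default, while `LeakyReluOp(alpha=0.5)` mirrors a model that sets the attribute explicitly.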
2.2 Loop-based (krnl) Dialect
The krnl dialect cleanly separates kernel computation ("what") from scheduling details ("how"). Its core construct, krnl.iterate, provides explicit induction variable bounds and optional scheduling metadata. Scheduling transformations, such as tiling (krnl.block), skewing, and permutation, are described as pure data operations. Arithmetic operations are delegated to affine or std dialects, enabling independent scalar optimizations.
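The separation of "what" from "how" can be illustrated outside MLIR. In the Python sketch below, the loop body is a plain function and the schedule is a loop permutation passed as data. The names `iterate` and `transpose` are hypothetical, and the sketch only imitates the idea behind krnl.iterate; it does not reproduce the dialect's semantics.

```python
from itertools import product


def iterate(bounds, body, order=None):
    """Run `body` over the iteration space `bounds`.

    `order` is the schedule: a permutation of loop positions, outermost first.
    The body never sees the schedule, only the induction variable values.
    """
    order = list(order) if order is not None else list(range(len(bounds)))
    ranges = [range(bounds[d]) for d in order]
    for point in product(*ranges):
        # Map the scheduled loop order back to the original index positions.
        idx = [0] * len(bounds)
        for pos, d in enumerate(order):
            idx[d] = point[pos]
        body(*idx)


def transpose(a, order=None):
    """Transpose a rank-2 nested list using the given loop order."""
    rows, cols = len(a), len(a[0])
    out = [[0] * rows for _ in range(cols)]

    def body(i, j):  # the "what": one scalar copy per iteration
        out[j][i] = a[i][j]

    iterate((rows, cols), body, order)  # the "how": the loop order
    return out
```

Calling `transpose(a)` and `transpose(a, order=(1, 0))` visits the same points in a different order and produces the same result, which is the property that lets a schedule change without touching the computation.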
3. Lowering and Transformation Strategy
The lowering process consists of three principal stages:
3.1 ONNX to krnl
The convert-onnx-to-krnl pass matches ONNX operations and emits loop nests in krnl dialect. For example, an onnx.Add operation over rank-3 tensors generates a 3-nest with induction variables, wherein affine.load, std.addf, and affine.store are used for memory access and computation:
    krnl.iterate(%i, %j, %k)
        with (%i -> %i0 = 0 to 3, %j -> %i1 = 0 to 4, %k -> %i2 = 0 to 5) {
      // %i0, %i1, %i2 are the induction variables bound to the loops %i, %j, %k.
      %a = affine.load %A[%i0, %i1, %i2]
      %b = affine.load %B[%i0, %i1, %i2]
      %c = std.addf %a, %b
      affine.store %c, %R[%i0, %i1, %i2]
    }
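For readers less familiar with MLIR syntax, the same loop nest can be written as ordinary nested loops. The Python sketch below assumes rank-3 nested lists of identical shape; the function name `add_rank3` is hypothetical, and the comments map each statement back to the corresponding operation in the listing.

```python
def add_rank3(a, b):
    """Elementwise add of two rank-3 nested lists with identical shapes."""
    d0, d1, d2 = len(a), len(a[0]), len(a[0][0])
    r = [[[0.0] * d2 for _ in range(d1)] for _ in range(d0)]
    for i in range(d0):          # %i0 = 0 to 3
        for j in range(d1):      # %i1 = 0 to 4
            for k in range(d2):  # %i2 = 0 to 5
                x = a[i][j][k]        # affine.load from A
                y = b[i][j][k]        # affine.load from B
                r[i][j][k] = x + y    # std.addf, then affine.store to R
    return r
```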
3.2 krnl to affine
Scheduling metadata attached at the krnl level (e.g., from krnl.block for tiling) guides the transformation to affine.for loops: scheduled loop nests are materialized as affine constructs whose bounds are expressed with affine maps.
3.3 Lowering to LLVM
MLIR’s built-in passes for converting std and affine dialects to LLVM IR finalize the process, enabling emission of LLVM bitcode. This output is passed to the LLVM code generator, supporting targets such as x86, POWER9, AArch64, and s390x.
4. Key Graph and Loop Transformations
Common neural network operations and computation patterns are transformed as follows:
- Matrix Multiplication (ONNX MatMul): Lowered from the mathematical form $C_{ij} = \sum_{k} A_{ik} B_{kj}$ to krnl and then affine loop nests executing explicit loads, multiplications, additions, and stores as required.
- Tiling and Scheduling: Loop-level optimizations are introduced at the krnl level (e.g., 32×32 tiling with krnl.block), and the resulting schedule is lowered to affine.for loops with step and range arguments implemented via affine_maps.
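The effect of tiling on a matrix-multiplication loop nest can be sketched in Python. Each output element is the sum over p of `a[i][p] * b[p][j]`; the tiled variant computes the same values but walks the i and j loops in blocks. This is a minimal illustration under stated assumptions, not the code ONNX-MLIR emits: the function names are hypothetical, and the `min()` clamps play the role that affine maps play for the bounds of partial tiles.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=32):
    """Same computation with the i and j loops tiled by `tile`."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):      # outer loops step by the tile size
        for jj in range(0, m, tile):
            # min() clamps the inner bounds when a tile is only partially full.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    for p in range(k):
                        c[i][j] += a[i][p] * b[p][j]
    return c
```

The default `tile=32` matches the 32×32 example above. Because the accumulation order over `p` is unchanged for every output element, both functions return identical results even in floating point.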
5. Optimization Techniques
ONNX-MLIR employs both graph-level (within the ONNX dialect) and loop-level (krnl and affine dialects) optimizations:
- Graph-Level:
- Operation Decomposition: E.g., ReduceL1 decomposed into ReduceSum and Abs operations.
- Operator Fusion: E.g., MatMul + Add fused into Gemm by declarative rewrite patterns.
- Identity Elimination and Constant Folding: Values computed at compile time if all op inputs are known constants.
- Algebraic Normalization: Standard identities (commutativity, associativity) are encoded via declarative rewrite rules.
- Loop-Level:
- Polyhedral Scheduling: Insertion of scheduling transformations for tiling, skewing, and permutation.
- Loop Fusion: Merging of adjacent krnl.iterate constructs.
- MLIR Affine Passes: Reuse of existing affine-loop optimizations (unroll-jam, tiling, vectorization).
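The graph-level rewrites listed above can be imitated on a tiny expression tree. The Python sketch below encodes operations as tuples such as `("Add", lhs, rhs)` and applies four of the rules bottom-up. It is only an analogy for declarative rewrite patterns: the encoding and the `rewrite` function are hypothetical, and the sketch ignores the shape and type constraints that a real pattern would check (for example, Gemm operates on rank-2 tensors).

```python
def rewrite(node):
    """Apply the rewrite rules bottom-up to a tuple-encoded expression."""
    if not isinstance(node, tuple):
        return node  # leaf: a tensor name (str) or a constant (number)
    op, *args = node
    args = [rewrite(a) for a in args]

    if op == "Identity":                          # identity elimination
        return args[0]
    if op == "ReduceL1":                          # operation decomposition
        return ("ReduceSum", ("Abs", args[0]))
    if op == "Add":
        lhs, rhs = args
        if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
            return lhs + rhs                      # constant folding
        if isinstance(lhs, tuple) and lhs[0] == "MatMul":
            return ("Gemm", lhs[1], lhs[2], rhs)  # MatMul + Add -> Gemm
    return (op, *args)
```

As a usage example, `rewrite(("Add", ("MatMul", "A", "B"), "C"))` returns `("Gemm", "A", "B", "C")`, and an `Add` of two numeric constants collapses to their sum.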
6. Code Generation and Backend Support
After lowering to the LLVM dialect, ONNX-MLIR produces LLVM bitcode. All LLVM-supported CPU architectures are available “for free” for code generation. Planned future extensions include GPU code generation via MLIR’s NVVM or ROCDL dialects and potentially direct CUDA kernel generation.
7. Empirical Evaluation
Preliminary experiments on a POWER9 (2.3 GHz) platform using reference ONNX-MLIR (without advanced loop or SIMD optimizations) show promising performance:
| Model | Compile Time (s) | Inference Time (s) |
|---|---|---|
| MNIST | 0.237 | 0.001 |
| ResNet50 | 7.661 | 7.540 |
For MNIST, compilation completes in under 250 ms and inference in approximately 1 ms. ResNet50, with 100 MB of weights and 50 layers, compiles in approximately 7.7 s and runs inference in approximately 7.5 s. No direct apples-to-apples comparison is available, but the results are in the range of optimized library performance. Ongoing development addresses additional polyhedral and vectorization passes to further improve efficiency.
Summary and Context
ONNX-MLIR utilizes MLIR’s extensibility via dialects to modularize ONNX operator semantics and loop-kernel scheduling, mirroring modern compiler architecture. By leveraging declarative rewrite rules and MLIR/LLVM optimizer infrastructure, it provides a systematic means to compile ONNX models to high-performance native code. Future developments in backend support and further optimization passes are anticipated to enhance both portability and efficiency (Jin et al., 2020).