MLIR-Smith: Random Program Generator
- MLIR-Smith is a grammar-guided random program generator for MLIR that creates valid, diverse modules to test MLIR-based compiler optimizations.
- It features a modular architecture with a trait-based GeneratableOpInterface to easily support user-defined, extensible dialects.
- MLIR-Smith supports differential testing across multiple compiler pipelines and has detected critical functional bugs and optimization gaps.
MLIR-Smith is a grammar-guided random program generator tailored for the Multi-Level Intermediate Representation (MLIR) ecosystem, designed to enable rigorous testing and evaluation of MLIR-based compiler optimizations. Unlike prior random program generation tools such as Csmith, MLIR-Smith explicitly addresses the challenges posed by MLIR's user-extensible dialects and the lack of a fixed grammar, providing a dialect-agnostic mechanism to generate valid, diverse MLIR modules. Its introduction fills a critical gap in the testing infrastructure for compiler pipelines that leverage MLIR, LLVM, and related frameworks (Ates et al., 5 Jan 2026).
1. Motivation and Context
The utility of random-program generation in compiler validation has been prominently demonstrated by tools such as Csmith, which discovered numerous bugs and missed optimizations in C compilers through the automatic synthesis of safe, terminating C programs. However, MLIR presents unique difficulties: dialects are user-defined and extensible, their operations are governed by heterogeneous semantic constraints, and no pre-existing Csmith-style tools are applicable. MLIR-Smith was developed to address these needs, providing a platform-independent generator capable of targeting arbitrary dialect sets and supporting deep configurability over module structure, control flow, and operation density (Ates et al., 5 Jan 2026).
2. Core Architecture and Components
MLIR-Smith comprises three principal components:
- Configuration Core: Parses user-supplied configuration, initializes the MLIR module, and sets up the func @main region.
- GeneratorOpBuilder: An extension of MLIR's native OpBuilder, this helper class manages the sampling and emission of operations under user-defined constraints on block length and nesting depth.
- GeneratableOpInterface: A trait and interface that dialect authors attach to each operation. This enables MLIR-Smith to discover, categorize, and invoke per-operation generation routines at runtime via the MLIR dialect registry.
MLIR-Smith does not attempt to parse the full MLIR grammar, which is distributed between C++ and TableGen specifications. Instead, each GeneratableOpInterface instance exposes two functions: getGeneratableTypes(), returning result types valid in context, and generate(), responsible for producing the operation by synthesizing operands (recursively sampling operations or using pre-existing values) and invoking the builder. This design ensures MLIR-Smith remains fully dialect-agnostic; support for new dialects requires only trait attachment and concise C++ implementations describing operand selection and region semantics (Ates et al., 5 Jan 2026).
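To make the two-hook contract concrete, the following is a minimal Python sketch, not the actual C++ interface. The Builder class, its operand() policy (reuse a live value half the time, otherwise emit a constant), the REGISTRY dictionary, and the snake_case method names are assumptions made for illustration; only the roles of getGeneratableTypes() and generate() come from the description above.

```python
import random
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Value:
    """An SSA-like value: a name plus a type string such as 'i32'."""
    name: str
    type: str


@dataclass
class Builder:
    """Hypothetical stand-in for GeneratorOpBuilder: tracks emitted ops and live values."""
    rng: random.Random
    ops: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def emit(self, text: str, result_type: str) -> Value:
        value = Value(f"%{len(self.values)}", result_type)
        self.ops.append(f"{value.name} = {text} : {result_type}")
        self.values.append(value)
        return value

    def operand(self, type_: str) -> Value:
        """Reuse an existing value of the right type, or synthesize a constant."""
        candidates = [v for v in self.values if v.type == type_]
        if candidates and self.rng.random() < 0.5:
            return self.rng.choice(candidates)
        return self.emit(f"arith.constant {self.rng.randint(0, 9)}", type_)


class GeneratableOp(Protocol):
    """Python analogue of the two hooks an operation exposes."""
    def get_generatable_types(self) -> list: ...
    def generate(self, builder: Builder, result_type: str) -> Value: ...


class AddIOp:
    def get_generatable_types(self) -> list:
        return ["i32", "i64"]

    def generate(self, builder: Builder, result_type: str) -> Value:
        lhs = builder.operand(result_type)
        rhs = builder.operand(result_type)
        return builder.emit(f"arith.addi {lhs.name}, {rhs.name}", result_type)


# Hypothetical registry keyed by operation name.
REGISTRY = {"arith.addi": AddIOp()}

builder = Builder(random.Random(0))
op = REGISTRY["arith.addi"]
result = op.generate(builder, op.get_generatable_types()[0])
print("\n".join(builder.ops))
```

In this sketch, supporting another operation means adding one class and one registry entry, which mirrors the trait-attachment model in spirit.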
3. Random Program Generation Algorithm
MLIR-Smith constructs valid MLIR modules using a top-down "block-filling" strategy, analogous in philosophy to Csmith. The algorithm operates as follows:
- Block Termination Selection: The intended terminator type (e.g., return, yield, fall-through) is sampled for each block.
- Operation Sampling: Enabled generatable operations are assigned weights $w_i$, forming a discrete distribution:
$$P(\mathrm{op}_i) = \frac{w_i}{\sum_j w_j}$$
The chosen operation's generate() function is invoked.
- Operand and Type Constraints: If the operation can be instantiated given existing values or recursively sampled operands, it is appended; otherwise, the failed operation is removed from contention, and another is sampled.
- Termination Criteria: The process continues until reaching the maximum block length or until only terminators remain feasible.
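The block-filling steps can be sketched in a few lines of Python. This is an illustrative approximation, not MLIR-Smith's implementation: fill_block, the can_build callback (a stand-in for a generate() call that may fail), the operation names, and the choice to reset the candidate set for every new slot are all assumptions.

```python
import random


def fill_block(rng, weights, can_build, max_len, terminators=("func.return",)):
    """Fill one block by weighted sampling, dropping ops that cannot be built."""
    terminator = rng.choice(list(terminators))  # step 1: pick the terminator
    block = []
    while len(block) < max_len:
        feasible = dict(weights)  # assumption: candidates reset for every slot
        while feasible:
            names = list(feasible)
            # Weighted draw: P(op) = w_op / sum of the remaining weights.
            op = rng.choices(names, weights=[feasible[n] for n in names])[0]
            if can_build(op, block):
                block.append(op)
                break
            del feasible[op]  # failed op leaves contention, sample another
        else:
            break  # no candidate could be built: only the terminator remains
    block.append(terminator)
    return block


rng = random.Random(7)
weights = {"arith.addi": 1.0, "memref.load": 1.0, "scf.for": 3.0}
# Toy feasibility rule: a load needs an earlier alloc, which never appears here.
block = fill_block(rng, weights, lambda op, blk: op != "memref.load", max_len=6)
print(block)
```

The inner while/else mirrors the two exit conditions: the block either reaches its maximum length or runs out of buildable operations.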
Sampling is governed by user-defined or default distributions: uniform over enabled operations, and geometric over nesting levels,
$$P(d = k) \propto (1 - p)^{k}$$
for the loop-nest level $k$, where the expected depth is typically kept small (default depth limit is 4) (Ates et al., 5 Jan 2026).
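A capped geometric draw over nesting depth might look like the sketch below. The stopping probability p = 0.5 and the function name sample_depth are assumptions; only the geometric shape and the depth limit of 4 come from the text. Under these assumptions, depth k < 4 has probability (1 - p)^k p, and the remaining mass (1 - p)^4 falls on the cap.

```python
import random


def sample_depth(rng, p=0.5, depth_limit=4):
    """Draw a nesting depth: stop with probability p at each level, capped at depth_limit."""
    depth = 0
    while depth < depth_limit and rng.random() >= p:
        depth += 1
    return depth


rng = random.Random(1)
depths = [sample_depth(rng) for _ in range(20000)]
mean_depth = sum(depths) / len(depths)
print(max(depths), round(mean_depth, 2))
```

With p = 0.5 and a cap of 4, the exact expected depth is 0.9375, so most generated regions stay shallow while deep nests still occur occasionally.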
4. Soundness, Constraints, and Reproducibility
To ensure soundness of generated programs—excluding out-of-bounds accesses or missing terminators—MLIR-Smith:
- Enforces static dimension bounds (up to 100,000) on memrefs.
- Disallows strided affine maps.
- Aborts any branch attempting to generate dynamic shapes or unsupported operations, retrying alternative branches.
- Uses global parameters regionDepthLimit and blockLength to prevent unavoidable nontermination and deep recursion.
- Discards any randomly generated module exceeding a fixed execution timeout.
All randomization draws are performed by a single, user-seedable std::mt19937_64 instance, ensuring reproducibility. Users can adjust per-operation weights in JSON or YAML configuration, for example raising the weight of scf.for to promote frequent loop generation (Ates et al., 5 Jan 2026).
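The effect of a single seeded generator combined with configurable weights can be illustrated as follows. The JSON keys ("seed", "weights") are hypothetical, since the source does not give the configuration schema, and Python's random.Random (a 32-bit Mersenne Twister) stands in for std::mt19937_64, so the concrete draws differ from what the tool would produce.

```python
import json
import random

# Hypothetical configuration: key names are assumptions, not MLIR-Smith's schema.
CONFIG_JSON = """
{
  "seed": 42,
  "weights": {"arith.addi": 1.0, "arith.muli": 1.0, "scf.for": 4.0}
}
"""


def sample_ops(config, count):
    """Draw `count` operation names from one generator seeded by the config."""
    rng = random.Random(config["seed"])  # every draw goes through this one RNG
    names = list(config["weights"])
    weights = [config["weights"][n] for n in names]
    return rng.choices(names, weights=weights, k=count)


config = json.loads(CONFIG_JSON)
first = sample_ops(config, 1000)
second = sample_ops(config, 1000)
print(first == second, first.count("scf.for"))
```

Re-running with the same seed yields the same sequence, which is what lets a failing module be regenerated from its seed and configuration alone; raising the scf.for weight visibly skews the sample toward loops.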
5. Differential Testing Workflows
Upon generation of a random MLIR module, MLIR-Smith orchestrates differential testing across four major compilation pipelines:
| Pipeline Name | Stages | Distinctiveness |
|---|---|---|
| MLIR pipeline | mlir-opt passes → LLVM dialect → LLVM IR → clang -O0 | Full MLIR opt passes + LLVM backend |
| LLVM pipeline | MLIR-to-LLVM dialect → LLVM IR → opt -O3 → compile | Skips MLIR opt passes; relies on LLVM optimization |
| DaCe pipeline | MLIR → SDFG dialect (sdfg-opt) → SDFG IR → DaCe optimizer → compile | Uses SDFG as an intermediate, DaCe's auto-optimizer |
| DCIR pipeline | mlir-opt passes → SDFG dialect → DaCe optimizer → compile | Combines MLIR and DaCe pipelines |
A shell harness (diff_test.sh) automates large-scale test campaigns, monitors compiler errors, mismatches, segmentation faults, timeouts, and missed optimizations (e.g., failure to elide external markers), and records program details (each test typically <200 KB) (Ates et al., 5 Jan 2026).
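The comparison logic at the heart of such a harness can be sketched without invoking any compiler. In the Python sketch below, each pipeline is replaced by a plain callable, and a majority vote decides which result to treat as the reference; both choices are assumptions for illustration, as the source does not describe how diff_test.sh arbitrates disagreements.

```python
from collections import Counter


def differential_test(program, pipelines):
    """Run every pipeline on one program and list those that disagree or fail."""
    outcomes = {}
    for name, run in pipelines.items():
        try:
            outcomes[name] = ("ok", run(program))
        except TimeoutError:
            outcomes[name] = ("timeout", None)
        except Exception as exc:  # compiler error or crash in the stand-in
            outcomes[name] = ("error", type(exc).__name__)
    results = [value for status, value in outcomes.values() if status == "ok"]
    if not results:
        return outcomes, sorted(outcomes)
    # Assumption: with no ground truth, the most common result is the reference.
    majority, _ = Counter(results).most_common(1)[0]
    suspects = sorted(
        name for name, (status, value) in outcomes.items()
        if status != "ok" or value != majority
    )
    return outcomes, suspects


# Toy stand-ins for the four pipelines; "dcir" is deliberately wrong on negatives.
pipelines = {
    "mlir": lambda p: sum(p),
    "llvm": lambda p: sum(p),
    "dace": lambda p: sum(p),
    "dcir": lambda p: sum(abs(x) for x in p),
}
outcomes, suspects = differential_test([3, -1, 4], pipelines)
print(suspects)
```

A real harness would replace the callables with subprocess invocations of the actual toolchains and would also compare generated artifacts to catch missed optimizations, which this sketch does not attempt.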
6. Empirical Results: Bug Discovery and Analysis
Empirical campaigns using several hundred generated programs led to the identification and confirmation of significant defects in multiple pipelines, summarized as:
- DaCe DCE bug: Liveness analysis failed to remove an unused memref.alloc in the SDFG, inhibiting dead-code elimination.
- DCIR translation bug: Incorrect lowering of arith.extsi on a boolean resulted in mapping to 0 instead of $-1$ due to an unsigned move.
- MLIR missed optimization: Store-load pairs on large statically allocated memrefs were not eliminated, provoking a segmentation fault, whereas all other pipelines successfully removed the redundant accesses.
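As background for the arith.extsi case, the short Python sketch below shows why signedness matters when widening a one-bit value: sign extension replicates the top bit, so a set boolean becomes all ones. The helper names are hypothetical, and the sketch only illustrates the arithmetic; it does not reproduce the reported DCIR lowering.

```python
def sign_extend(value, from_bits):
    """Interpret the low from_bits bits of value as a two's-complement integer."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    return value - (1 << from_bits) if value & sign_bit else value


def zero_extend(value, from_bits):
    """Interpret the low from_bits bits of value as an unsigned integer."""
    return value & ((1 << from_bits) - 1)


print(sign_extend(1, 1))  # -1: the single bit is also the sign bit
print(zero_extend(1, 1))  # 1
print(sign_extend(0, 1))  # 0
```

Any lowering that treats the source as unsigned therefore changes the observable result for a set boolean, which is the kind of divergence differential testing surfaces.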
By comparing program behaviors and generated code artifacts, MLIR-Smith demonstrated the ability to detect both functional bugs and optimization coverage gaps across diverse compiler infrastructures, even in the absence of a formal ground truth (Ates et al., 5 Jan 2026).
7. Extensibility and Future Directions
MLIR-Smith adopts a trait-based, plug-and-play model, allowing immediate support for new dialects by annotating operations with GeneratableOpInterface methods. Future anticipated enhancements include:
- Expansion to additional dialects (affine, vector, GPU).
- Support for composite types, array-of-struct types, and unbounded type families.
- Integration of liveness-driven sampling (e.g., in the style of Barány 2017) and optimization markers to stress-test elimination capabilities.
- Statistical analysis of corpus properties (fail-rate curves, confidence intervals for bug detection) as scale increases.
These extensions aim to further harden the MLIR ecosystem and inspire analogous approaches in other multi-level IR infrastructures (Ates et al., 5 Jan 2026).