MLIR-Forge: Modular IR Fuzzing Engine
- MLIR-Forge is a modular, extensible framework that constructs random program generators to validate domain-specific IRs within the LLVM ecosystem.
- It separates language-specific definitions from a reusable generation engine, streamlining the creation of well-formed IR-level fuzzing and differential-testing tools.
- Empirical case studies demonstrate its effectiveness in bug detection across multiple dialects and its potential in enhancing compiler validation and hardware mapping.
MLIR-Forge is a modular, extensible framework for constructing random program generators (“smiths”) for new or existing dialects within the LLVM Multi-Level Intermediate Representation (MLIR) ecosystem. MLIR-Forge formalizes and simplifies the process of building IR-level fuzzers and differential-testing tools, allowing rapid testing and validation of domain-specific intermediate representations (IRs) across compilers. It achieves this by separating language-specific definitions (operations, types, insertion constraints) from a reusable, dialect-agnostic engine that synthesizes well-formed random programs, thus lowering the cost of robust compiler IR verification in research and industry (Ates et al., 14 Jan 2026).
1. Motivation and Prior Approaches
Compiler testing via randomly generated programs is established as an effective methodology for uncovering subtle bugs in parser, verifier, and optimization logic. Traditional solutions such as Csmith and DeepSmith successfully fuzz C/C++ and OpenCL compilers but require tens of thousands of hand-written lines of code for each language, do not generalize to new IRs, and lack modular composability. Grammar-based approaches like PolyGlot and Xsmith ease parsing and test generation but are weak in generating semantically valid IR for differential optimization testing. Pre-existing methods involving translation from C via tools such as Polygeist risk introducing translation artifacts irrelevant to the dialect under test (Ates et al., 14 Jan 2026).
MLIR-Forge addresses these limitations by:
- Directly leveraging MLIR’s dialect and operation abstraction for IR-centric random program generation.
- Decoupling IR assembly logic from dialect specification, reducing development to the definition of per-operation generation logic.
- Providing an extensible foundation for targeting both MLIR-native and external IRs (e.g., WebAssembly, DaCe).
2. Core Architecture: Separation of Language and Generation Logic
MLIR-Forge’s architecture is defined by a “jigsaw puzzle” model:
- Language Specification: Each dialect defines operations and types, typically via MLIR’s Operation Definition Specification (ODS) and TableGen.
- Operation and Type Generators (“Gens”): For each operation, a minimal C++ trait `GeneratableOp<ThisOp>` declares a `generate(IRBuilder &b)` method which attempts to insert the op at the current location, returning nullptr on failure (e.g., due to unsatisfied operand constraints).
- Reusable Generation Engine: The engine maintains an insertion point, tracks SSA values and scopes, and samples OpGens based on configurable weights. It recursively fills IR regions, enforces contextual correctness, and bounds operation count and region depth based on user parameters.
- Formal Probability Control: A per-op weight model ensures statistically meaningful distribution of operation frequencies; operations are sampled based on weights and can be further restricted by semantic/contextual guards.
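The weighted sampling loop described above can be sketched in a few lines. This is a hypothetical, self-contained illustration: `OpGen`, `runGeneration`, and the retry budget are stand-ins for the engine's internals, not the actual MLIR-Forge API.

```cpp
#include <functional>
#include <random>
#include <vector>

// Illustrative stand-in for an operation generator: returns false when its op
// cannot be inserted at the current point (e.g., no in-scope operand of the
// required type), mirroring the nullptr-on-failure convention.
using OpGen = std::function<bool()>;

int runGeneration(std::vector<OpGen> gens, std::vector<double> weights,
                  int maxOps, unsigned seed) {
  std::mt19937 rng(seed);  // fixed seed for reproducible output (cf. --seed)
  // Ops are drawn in proportion to their configured weights (cf. --prob-file).
  std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
  int inserted = 0, attempts = 0;
  // Re-sample until the op budget is met or the attempt budget is exhausted.
  while (inserted < maxOps && attempts < maxOps * 10) {
    ++attempts;
    if (gens[pick(rng)]())  // attempt insertion; may fail on constraints
      ++inserted;
  }
  return inserted;
}
```

Semantic or contextual guards would correspond to a generator simply declining to insert, after which the loop re-samples.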
This modular split enables rapid development of new “smiths” supporting a target dialect or combination of dialects, often in under one week and fewer than 3,000 LoC.
3. Implementation, Extensibility, and Usage Pattern
MLIR-Forge is realized as an LLVM MLIR extension and exposes its features primarily through:
- Interfaces & CLI: Users register dialects and top-level ops, implement `generate` methods for new operations, and configure generator flags such as `--seed`, `--max-ops`, and `--prob-file` for operation weights.
- IRBuilder: A wrapper over MLIR’s existing OpBuilder, tracking in-scope SSA values, enforcing region nesting limits, and ensuring legal insertion points.
- Dialect Definition and Registration: To support a new dialect, authors extend ODS specs, annotate ops with the `GeneratableOp` trait, and provide the minimal generator logic (typically 20–500 LoC per dialect). An example registration:

```cpp
static void registerMySmith() {
  Forge::registerDialect<MyDialect>();
  Forge::registerTopLevelOp<MyTopLevelOp>();
  Forge::installCLI();
}
static RegisterForgeSmith X("my-smith", "My IR smith", registerMySmith);
```
- Sample Generation Loop: OpGens with nonzero weight are selected in a weighted random loop, invoked to attempt operation insertion, and re-sampled until max-ops or exhaustion.
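The scope discipline the IRBuilder wrapper enforces can be mocked compactly. The sketch below is hypothetical: `ScopedValues` and its string “values” are illustrative stand-ins, whereas the actual wrapper tracks `mlir::Value`s through `mlir::OpBuilder`.

```cpp
#include <string>
#include <vector>

// Hypothetical mock of IRBuilder-style scope tracking; plain strings stand in
// for SSA values here.
struct ScopedValues {
  std::vector<std::vector<std::string>> scopes;  // one entry per open region
  size_t maxDepth = 4;                           // region nesting bound

  // Entering a region body; refuse once the nesting limit is reached.
  bool pushScope() {
    if (scopes.size() >= maxDepth)
      return false;
    scopes.emplace_back();
    return true;
  }

  // Leaving the region: its SSA values go out of scope.
  void popScope() { scopes.pop_back(); }

  // Record a value defined at the current insertion point.
  void define(const std::string &v) { scopes.back().push_back(v); }

  // An operand may only use values from the current or an enclosing scope.
  bool inScope(const std::string &v) const {
    for (const auto &scope : scopes)
      for (const auto &def : scope)
        if (def == v)
          return true;
    return false;
  }
};
```

A generator that needs an operand of some type queries such in-scope bookkeeping and, finding no candidate, fails the insertion attempt rather than emitting ill-formed IR.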
Effort and codebase size are minimal: case studies report lines of code per dialect on the order of hundreds for both ops and types.
| Dialect | #Types | LoC(types) | #Ops | LoC(ops) |
|---|---|---|---|---|
| arith, math | — | — | 20 | 2,329 |
| memref/types | — | 9 | 1 | 248 |
| scf | — | — | 3 | 179 |
| func | — | — | 1 | 56 |
| SDFG | 24 | 24 | 30 | 423 |
4. Case Studies and Empirical Evaluation
MLIR-Forge has been applied across diverse IR ecosystems:
- SDFG-Smith (DaCe): Generated MLIR for the SDFG dialect, translated to the DaCe graph IR. Over 20 hours, it produced 14,681 programs and uncovered 774 unique bug groups in DaCe’s auto-optimizer. Representative bugs included the incorrect removal of parallel scope outputs during nested graph unrolling (Ates et al., 14 Jan 2026).
- MLIR-Smith (Core Dialects): Covered SCF, Arith, Math, Memref, and Func dialects. Generated 23,065 programs in 10 hours, revealing 9 bug groups in the optimizer/lowering passes. Notably, dead code elimination removed infinite side-effect-free loops, causing semantic ambiguity in program termination (Ates et al., 14 Jan 2026).
- WASM-Smith (WebAssembly): Built on MLIR→LLVM→emcc codegen paths, it generated 12,014 programs and found 15 bug classes; for example, constant folding of `math.fpowi 0 0` triggered a crash at -O3.
- Performance: On commodity hardware (Intel i9, 3.0 GHz), median time to generate programs of ≤20 KiB is ≤40 ms, with memory footprints of ≤36 MiB; this enables scalable integration with CI and exhaustive fuzz campaigns.
Bug discovery converges after tens of thousands of test cases (DaCe: <20 h), enabling efficient regression detection and compiler validation.
5. Integration with Transform Dialects and Advanced Scheduling
MLIR-Forge’s design enables seamless integration with the strategies pioneered in the MLIR Transform dialect (Lücke et al., 2024), notably:
- Schedules as IR: Transformation and scheduling scripts may be emitted as first-class IR, separate from but referring to payload programs. SSA-based “handles” and parameters enable determinism and static analysis of transformation effects.
- Pre-/Post-Conditions and IRDL: By leveraging pre- and post-conditions on transformation ops—as in the Transform dialect paradigm—MLIR-Forge can formalize expected payload invariants, facilitating early detection of phase-ordering and correctness bugs.
- Extensible Plugin APIs: New transformations, passes, or target-specific optimizations can be registered through lightweight plugin interfaces without modifying core infrastructure.
- Empirical Validation: Integration with autotuning, pass pipelines, and externally defined optimization schedules (e.g., BACO or library replacements) is facilitated by scriptable transform IR that can be manipulated, inspected, and dynamically composed.
This enables developers not only to generate payload IR for random testing but also to compose and validate complex transformation pipelines (“meta-fuzzing”).
6. Hardware-Centric Applications and Reconfigurable Devices
MLIR-Forge underpins automatic compiler flows for reconfigurable hardware, as shown by its application in generating and lowering SYCL kernels to hardware implementations (via CIRCT, Calyx, and SystemVerilog) (Zang et al., 2023):
- SYCL Front-End Parsing: The DPC++ driver emits MLIR, which is legalized and bufferized.
- Custom Lowering Pipelines: Passes convert MLIR dialects (affine, scf, arith, etc.) into hardware dialects (CIRCT’s FIRRTL, Calyx), with strict cost models and resource tradeoffs.
- Back-End Synthesis: Final hardware modules are mapped as AXI-full IPs and integrated with crossbar interconnects.
- Performance Results: Resource usage and latency scale with program transformation choices (e.g., loop unrolling); empirical results for GEMM validate the practical feasibility and cost-model predictions of the approach.
These flows benefit from the modularity and testability fostered by MLIR-Forge, expediting verification and exploration cycles.
7. Benefits, Limitations, and Directions for Future Work
Benefits:
- Modular and rapid development: new dialect support requires only minimal generator logic per op/type, with reusable generic program logic.
- Reuse of features: binary ops, type generators, and structural patterns easily shared across dialects.
- Out-of-the-box support for differential and fuzz testing across MLIR and translated IRs (WebAssembly, DaCe).
- Scalable CI and local deployment with modest compute and memory requirements.
Limitations:
- Region result types currently require terminator-driven assignment, lacking first-class grammar integration.
- Type requirements in subregions (e.g., always needing a boolean) must be enforced manually by dialect authors.
- Does not guarantee semantic program safety (division by zero, aliasing, illegal memory access); these must be enforced at the generator level.
- Integration with IRDL, richer constraint grammars, and AST traversal caching remain targets for enhancement (Ates et al., 14 Jan 2026).
A plausible implication is that as IRDL and ODS become more tightly integrated, the effort for defining new dialect smiths will further decrease and deeper type-driven correctness enforcement may be realized automatically. This suggests the long-term viability of MLIR-Forge as a “fuzzing substrate” not only for DSL compilers but also for next-generation transform-scheduling and hardware mapping frameworks.