MLIR OpenMP Dialect
- MLIR OpenMP Dialect is a structured intermediate representation that encapsulates OpenMP's directive-based parallelism for compiler analysis, transformation, and lowering.
- It supports key constructs such as omp.target, omp.parallel, and omp.for to facilitate device offloading and integration with HLS dialects for efficient FPGA code generation.
- Its transformation pipeline leverages canonical MLIR passes to decompose parallel regions into device-specific operations, optimizing performance for heterogeneous systems.
The MLIR OpenMP dialect is a structured intermediate representation (IR) within the Multi-Level Intermediate Representation (MLIR) framework, which encodes OpenMP’s directive-based parallelism at the IR level for compiler analysis, transformation, and lowering. It encapsulates parallel regions, offloading constructs, and data mappings found in OpenMP, providing an abstraction layer that facilitates integration, composable transformations, and targeting of heterogeneous devices, including Field Programmable Gate Arrays (FPGAs) through further lowering passes. Recent research demonstrates comprehensive pipelines that utilize this dialect to enable efficient code offloading—most notably for Fortran via Flang—to FPGAs, with robust support for both standard OpenMP semantics and extension points for device-specific optimizations (Rodriguez-Canal et al., 11 Nov 2025). Additionally, the UPIR approach illustrates how OpenMP constructs are abstracted for unified parallel IR export to MLIR (Wang et al., 2022).
1. Core Operations and Semantics
The MLIR OpenMP dialect defines a set of canonical region-based operations, each corresponding to OpenMP constructs:
- omp.target: Delineates regions for offloading to devices via the target directive. It operates with device handles and supports mapping clauses: map(to:), map(from:), map(tofrom:), and related variants. The region body is earmarked for compilation to the accelerator.
```mlir
%dev = omp.get_device
omp.target device(%dev) map(to: %a[0:100], from: %b[0:100]) {
  ...
} // omp.end target
```
- omp.parallel / omp.parallel.do: Encapsulates parallel execution scopes (teams/threads). Supports attributes including thread count and binding policy; a nested omp.do allows specification of loop constructs, scheduling, reduction variables, and collapse levels.
```mlir
omp.parallel num_threads(%nt) proc_bind(master) {
  omp.do collapse(2) schedule(static) reduction(+:sum) {
    ...
  }
}
```
- omp.for / omp.do: Represents OpenMP work-sharing for/do loops, functionally mapped to MLIR's scf.for with attached OpenMP semantics (iteration-space partitioning via schedule clauses).
- omp.map_info and omp.bounds_info: Collect and provide mapping metadata (transfer direction, slicing bounds) for mapped objects, facilitating correct host-device data movement flows; a sketch of their composition follows.
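A minimal sketch of how bounds metadata composes with a mapping; the omp.bounds_info operand syntax here is an assumption, modeled on the map_info usage shown in Section 2:

```mlir
// Hypothetical syntax: describe the contiguous slice A[0:100] with unit stride.
%bA = omp.bounds_info lower(%c0) upper(%c100) stride(%c1)
// Attach the bounds to a to-direction mapping of the symbol @A.
%mA = omp.map_info @A { map_type = #omp.to, bounds = %bA }
```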
Each operation is structurally enforced via TableGen definitions and MLIR typing, supporting rigorous verification and transformation.
2. Lowering and Representation in MLIR IR
These operations are defined upstream in MLIR as region operations, typically with zero SSA results, enriched with attributes:
| MLIR Op | Attributes Example | Semantics |
|---|---|---|
| OMPTargetOp | device handle, mapping clauses | Accelerator offload region |
| OMPParallelOp | num_threads, proc_bind, nested loops | Spawning parallel teams/threads, binding mechanisms |
| OMPForOp | induction var, bounds, schedule, reduction | Partitioned loop across threads |
| OMPMapInfoOp | map_type, symbol, bounds_info | Host-device memory transfer intent |
Example IR prior to device/HLS lowering:
```mlir
module {
  %dev = omp.get_device
  %mA = omp.map_info @A { map_type = #omp.to, bounds = %bA }
  %mB = omp.map_info @B { map_type = #omp.from, bounds = %bB }
  omp.target device(%dev) map(%mA, %mB) {
    omp.parallel num_threads(%T) {
      omp.for %i = %c0 to %c100 step %c1 {
        ...
      }
    }
  }
}
```
3. Integration with Device and HLS Dialects
A sequence of dedicated lowering passes transforms MLIR OpenMP dialect regions to device-specific and high-level synthesis (HLS) dialects for FPGA code generation:
- lower-omp-mapped-data: Consumes omp.map_info, emitting operations such as device.alloc, memref.dma_start, and memref.wait. Supports reference counting for implicit transfers, handling nesting correctly (a host-side sketch follows this list).
- lower-omp-target-region: Decomposes omp.target into device.kernel_create (bundling the region as a kernel), device.kernel_launch, and device.kernel_wait; splits host/device IR and annotates device IR for the target environment (e.g., attributes { target = "fpga" }).
- lower-omp-loops-to-hls: Converts parallel+for nests into HLS dialect operations (hls.interface, hls.pipeline, hls.unroll) adhering to scheduling and reduction clauses.
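A hedged sketch of the host-side IR after the first two passes, using the device and memref operations named above; the operand spellings and the DMA tag buffer are illustrative assumptions:

```mlir
// Hypothetical host-side IR after lower-omp-mapped-data and
// lower-omp-target-region (operand syntax assumed).
%bufA = device.alloc %dev : memref<100xf32>            // device buffer for @A
memref.dma_start %A[%c0], %bufA[%c0], %c100, %tag[%c0] // host-to-device copy
memref.wait %tag[%c0], %c100                           // block until DMA completes
%k = device.kernel_create @my_kernel(%bufA, %bufB)     // bundle region as kernel
device.kernel_launch %dev, %k                          // enqueue on the device
device.kernel_wait %k                                  // synchronize with the kernel
```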
HLS dialect example, as produced by lower-omp-loops-to-hls:
```mlir
module @kernel attributes { target = "fpga" } {
  func @my_kernel(%A: memref<100xf32>, %B: memref<100xf32>) {
    %p0 = hls.axi_protocol
    hls.interface %A, %p0 { bundle = "gmem0" }
    hls.interface %B, %p0 { bundle = "gmem1" }
    %cf1 = arith.constant 1.0 : f32
    scf.for %i = %c0 to %c100 step %c1 {
      hls.pipeline(%c1)
      %v = memref.load %A[%i] : memref<100xf32>
      %r = arith.addf %v, %cf1 : f32
      memref.store %r, %B[%i] : memref<100xf32>
    }
  }
}
```
4. Transformation Pipeline
The transformation pipeline for OpenMP-to-FPGA typically involves:
- Fortran IR lowering via Flang (FIR → MLIR).
- Canonicalization/map_rewrite to clean up MLIR.
- lower-omp-mapped-data for explicit host-device movement.
- lower-omp-target-region for kernel bundling/launch.
- ModulePartitionPass separating host/device IR.
- lower-omp-loops-to-hls: Mapping OpenMP nesting to HLS operations with vectorization, pipelining, and reduction expansion.
- lower-hls-to-func-call: Conversion to standard function call representation.
- MLIR→LLVM-dialect conversion (multiple passes).
- AMD-HLS-specific IR rewriting (e.g., for Vitis flow).
- Vitis HLS backend invocation to generate RTL/bitstream.
This pipeline enables leveraging standard OpenMP pragmas to target FPGAs and similar devices without requiring vendor-specific source-level annotations.
5. Handling of OpenMP Directives: Scheduling, Mapping, and Reductions
Source-level OpenMP clauses map directly to MLIR dialect attributes, which are then consumed during lowering passes:
- Scheduling (static, dynamic, simdlen): Encoded as attributes on parallel or for regions; these guide transformation to pipelined/unrolled HLS loops or partitioned iteration spaces (see the scheduling sketch after the reduction example below).
- Mapping (map(to:), map(from:), map(tofrom:), alloc, etc.): Manifest in omp.map_info and influence the emitted device/memory ops. Implicit transfer management ensures minimal data movement for nested/overlapped mappings.
- Reduction (reduction(+:sum)): Triggers creation and partitioning of per-thread/per-lane accumulators; the reduction combine is expressed in IR as a loop over the accumulator array:
```mlir
%zero = arith.constant 0.0 : f32
// scf.for carries the running sum as an iteration argument (SSA-correct).
%acc = scf.for %t = %c0 to %threads step %c1
    iter_args(%partial = %zero) -> (f32) {
  %tmp = memref.load %sum_parts[%t] : memref<?xf32>
  %next = arith.addf %partial, %tmp : f32
  scf.yield %next : f32
}
memref.store %acc, %sum[] : memref<f32>
```
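As a counterpart for scheduling clauses, a minimal sketch of how schedule(static) with simdlen(4) might drive the hls lowering; the unroll and pipeline factors are illustrative assumptions, not prescribed by the passes above:

```mlir
// Hypothetical result of lowering schedule(static) simdlen(4):
// step widened to the SIMD width, body unrolled, pipelined with II = 1.
scf.for %i = %c0 to %c100 step %c4 {
  hls.unroll(%c4)
  hls.pipeline(%c1)
  // ... loop body replicated across four lanes ...
}
```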
6. Extensibility, Customization, and MLIR Ecosystem Integration
All lowering passes in the pipeline utilize MLIR’s PatternRewriter infrastructure, enabling registration of custom rewrite patterns for:
- New OpenMP clauses (e.g., collapse, ordered, or vendor-specific extensions).
- Device-specific HLS pragmas (hls.stream, hls.partition, etc.).
- Modifiable backend targets: swapping out runtime APIs for CUDA, ROCm, or oneAPI by adjusting the device dialect or MLIR-to-C++ printer properties.
- Extendable bus protocol attributes (hls.axi_protocol) to describe novel memory interconnects (e.g., AXI4-Lite, AXI4-Stream, PCIe); a sketch follows this list.
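A minimal sketch of the protocol extension point, assuming hls.axi_protocol accepts a kind attribute (a hypothetical spelling; only the op names appear in the text above):

```mlir
// Hypothetical: one AXI4-Stream data port and one AXI4-Lite control port.
%ps = hls.axi_protocol { kind = "axi4stream" }
%pl = hls.axi_protocol { kind = "axi4lite" }
hls.interface %C, %ps { bundle = "stream0" }
hls.interface %ctrl, %pl { bundle = "control" }
```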
The transformation system allows injection of auxiliary passes (vectorization, tiling, fusion) and hooks for backend data layout or target-specific optimization hints, illustrating the dialect’s composability within the MLIR ecosystem.
7. Comparison: UPIR MLIR Dialect Export and Unified Abstractions
The UPIR project demonstrates how OpenMP-style constructs are abstracted in the so-called "upir" dialect for unified parallel IR export to MLIR (Wang et al., 2022). It defines a set of region operations:
- upir.spmd: Forks teams/threads for a SPMD region.
- upir.loop / upir.loop_parallel: Describes a loop nest and, in the parallel form, its distribution with scheduling, vectorization, or task-splitting clauses.
- upir.task: Expresses asynchronous regions, including device offload (OpenMP target → upir.task offload(fpga:0)).
- upir.data, upir.data_movement, upir.sync: Encapsulate mapping, updates, barriers, reductions, and other collectives in parallel regions; a combined sketch follows this list.
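A hedged sketch composing these operations; the region nesting and operand syntax are assumptions inferred from the construct names above (Wang et al., 2022):

```mlir
// Hypothetical UPIR IR: an offloaded task wrapping an SPMD region
// with a statically scheduled parallel loop and a closing barrier.
upir.task offload(fpga:0) {
  upir.spmd num_units(%nt) {
    upir.loop_parallel %i = %c0 to %n step %c1 schedule(static) {
      // ... loop body ...
    }
    upir.sync barrier
  }
}
```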
Key attributes and constraints match MLIR’s strict typing, region structure, and verification. The ROSE front-end can emit UPIR, and any MLIR pass can consume and lower it to standard MLIR or OpenMP dialects, enabling uniform backend codegen. The UPIR approach emphasizes the dialect’s ability to capture parallel patterns—SPMD, data, and task parallelism—across multiple models, ensuring portability and composability.
8. Significance and Impact in Heterogeneous Compilation
The MLIR OpenMP dialect enables portable directive-based acceleration flows, particularly for FPGAs, by decomposing OpenMP-annotated source into precisely typed, semantically accurate IR. This allows:
- Separation of host and device codes at the IR level.
- Interoperation with standard MLIR passes and extensible lowering pipelines.
- Exploitation of explicit parallel and data mapping semantics for fine-grained optimization (data movement minimization, pipelining, partitioning).
- Consistent backend targeting for multiple device APIs and runtime conventions.
A notable result is that Fortran programmers, using familiar OpenMP pragmas, can generate FPGA-quality pipelines and bitstreams through standard compilation flows, leveraging both manual optimization through directives and sophisticated backend IR transformations. The dialect, as deployed in the reported pipeline, establishes a path for extensible, reusable compiler infrastructure for future heterogeneous HPC environments (Rodriguez-Canal et al., 11 Nov 2025), while UPIR illustrates the wider potential for unified parallel IR synthesis and lowering (Wang et al., 2022).