
MLIR OpenMP Dialect

Updated 13 November 2025
  • MLIR OpenMP Dialect is a structured intermediate representation that encapsulates OpenMP's directive-based parallelism for compiler analysis, transformation, and lowering.
  • It supports key constructs such as omp.target, omp.parallel, and omp.for to facilitate device offloading and integration with HLS dialects for efficient FPGA code generation.
  • Its transformation pipeline leverages canonical MLIR passes to decompose parallel regions into device-specific operations, optimizing performance for heterogeneous systems.

The MLIR OpenMP dialect is a structured intermediate representation (IR) within the Multi-Level Intermediate Representation (MLIR) framework, which encodes OpenMP’s directive-based parallelism at the IR level for compiler analysis, transformation, and lowering. It encapsulates parallel regions, offloading constructs, and data mappings found in OpenMP, providing an abstraction layer that facilitates integration, composable transformations, and targeting of heterogeneous devices, including Field Programmable Gate Arrays (FPGAs) through further lowering passes. Recent research demonstrates comprehensive pipelines that utilize this dialect to enable efficient code offloading—most notably for Fortran via Flang—to FPGAs, with robust support for both standard OpenMP semantics and extension points for device-specific optimizations (Rodriguez-Canal et al., 11 Nov 2025). Additionally, the UPIR approach illustrates how OpenMP constructs are abstracted for unified parallel IR export to MLIR (Wang et al., 2022).

1. Core Operations and Semantics

The MLIR OpenMP dialect defines a set of canonical region-based operations, each corresponding to OpenMP constructs:

  • omp.target: Delineates regions for offloading to devices via the target directive. Operates with device handles and supports mapping clauses: map(to:), map(from:), map(tofrom:), and related variants. The region body is compiled separately for the target accelerator.

%dev = omp.get_device
omp.target device(%dev) map(to: %a[0:100], from: %b[0:100]) {
  ...
} // omp.end target

  • omp.parallel / omp.parallel.do: Encapsulates parallel execution scopes (teams/threads). Supports attributes including thread count and binding policy; nested omp.do allows specification of loop constructs, scheduling, reduction variables, and collapse levels.

omp.parallel num_threads(%nt) proc_bind(master) {
  omp.do collapse(2) schedule(static) reduction(+:sum) {
    ...
  }
}

  • omp.for / omp.do: Represents OpenMP worksharing loops, functionally analogous to MLIR’s scf.for with attached OpenMP semantics (iteration partitioning via schedule clauses).
  • omp.map_info and omp.bounds_info: Collect and provide mapping metadata (transfer direction, slicing bounds) for mapped objects, facilitating correct host-device data movement flows.

Each operation is structurally enforced via TableGen definitions and MLIR typing, supporting rigorous verification and transformation.

2. Lowering and Representation in MLIR IR

All OpenMP dialect operations reside within the upstream MLIR OpenMP dialect, defined as region operations—typically with zero SSA results—and enriched with attributes:

MLIR Op       | Attributes                                 | Example Semantics
OMPTargetOp   | device handle, mapping clauses             | Accelerator offload region
OMPParallelOp | num_threads, proc_bind, nested loops       | Spawning parallel teams/threads, binding mechanisms
OMPForOp      | induction var, bounds, schedule, reduction | Partitioned loop across threads
OMPMapInfoOp  | map_type, symbol, bounds_info              | Host-device memory transfer intent

Example IR prior to device/HLS lowering:

module {
  %dev = omp.get_device
  %mA = omp.map_info @A { map_type = #omp.to, bounds = %bA }
  %mB = omp.map_info @B { map_type = #omp.from, bounds = %bB }
  omp.target device(%dev) map(%mA, %mB) {
    omp.parallel num_threads(%T) {
      omp.for %i = %c0 to %c100 step %c1 {
        ...
      }
    }
  }
}

3. Integration with Device and HLS Dialects

A sequence of dedicated lowering passes transforms MLIR OpenMP dialect regions to device-specific and high-level synthesis (HLS) dialects for FPGA code generation:

  • lower-omp-mapped-data: Consumes omp.map_info, emitting operations such as device.alloc, memref.dma_start, and memref.wait. Supports reference counting for implicit transfers, handling nesting correctly.
  • lower-omp-target-region: Decomposes omp.target into device.kernel_create (bundling as a kernel), device.kernel_launch, and device.kernel_wait; splits host/device IR and annotates device IR for the target environment (e.g., attributes { target = "fpga" }).
  • lower-omp-loops-to-hls: Converts parallel+for nests into HLS dialect operations (hls.interface, hls.pipeline, hls.unroll) adhering to scheduling and reduction clauses.

HLS dialect example:

module @kernel attributes { target = "fpga" } {
  func @my_kernel(%A: memref<100xf32>, %B: memref<100xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c100 = arith.constant 100 : index
    %cf1 = arith.constant 1.0 : f32
    %p0 = hls.axi_protocol
    hls.interface %A, %p0 { bundle = "gmem0" }
    hls.interface %B, %p0 { bundle = "gmem1" }
    scf.for %i = %c0 to %c100 step %c1 {
      hls.pipeline(%c1)
      %v = memref.load %A[%i] : memref<100xf32>
      %r = arith.addf %v, %cf1 : f32
      memref.store %r, %B[%i] : memref<100xf32>
    }
  }
}

4. Transformation Pipeline

The transformation pipeline for OpenMP-to-FPGA typically involves:

  1. Fortran IR lowering via Flang (FIR → MLIR).
  2. Canonicalization/map_rewrite to clean up MLIR.
  3. lower-omp-mapped-data for explicit host-device movement.
  4. lower-omp-target-region for kernel bundling/launch.
  5. ModulePartitionPass separating host/device IR.
  6. lower-omp-loops-to-hls: Mapping OpenMP nesting to HLS operations with vectorization, pipelining, and reduction expansion.
  7. lower-hls-to-func-call: Conversion to standard function call representation.
  8. MLIR→LLVM-dialect conversion (multiple passes).
  9. AMD-HLS-specific IR rewriting (e.g., for Vitis flow).
  10. Vitis HLS backend invocation to generate RTL/bitstream.

This pipeline enables leveraging standard OpenMP pragmas to target FPGAs and similar devices without requiring vendor-specific source-level annotations.
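Assembled end-to-end, the middle of this flow can be expressed as a single textual pass pipeline. The following is a sketch only: the custom pass names come from the paper, and their exact registration, nesting, and ordering relative to upstream MLIR conversion passes are assumptions:

```
builtin.module(
  canonicalize,
  lower-omp-mapped-data,
  lower-omp-target-region,
  module-partition,
  lower-omp-loops-to-hls,
  lower-hls-to-func-call,
  convert-scf-to-cf,
  convert-func-to-llvm,
  finalize-memref-to-llvm
)
```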

5. Handling of OpenMP Directives: Scheduling, Mapping, and Reductions

Source-level OpenMP clauses map directly to MLIR dialect attributes, which the lowering passes then consume:

  • Scheduling (static, dynamic) and SIMD clauses (simdlen): Encoded as attributes on parallel or for regions; guide transformation to pipelined/unrolled HLS loops or partitioned iteration spaces.
  • Mapping (map(to:), map(from:), map(tofrom:), alloc, etc.): Manifest in omp.map_info and influence emitted device/memory ops. Implicit transfer management ensures minimal data movement for nested/overlapped mappings.
  • Reduction (reduction(+:sum)): Triggers creation and partitioning of accumulators per thread/lane; reduction combine is expressed in IR via loop over accumulator array:

%cf0 = arith.constant 0.0 : f32
%acc = scf.for %t = %c0 to %threads step %c1
         iter_args(%partial = %cf0) -> (f32) {
  %tmp = memref.load %sum_parts[%t] : memref<?xf32>
  %next = arith.addf %partial, %tmp : f32
  scf.yield %next : f32
}
memref.store %acc, %sum[] : memref<f32>

6. Extensibility, Customization, and MLIR Ecosystem Integration

All lowering passes in the pipeline utilize MLIR’s PatternRewriter infrastructure, enabling registration of custom rewrite patterns for:

  • New OpenMP clauses (e.g., collapse, ordered, or vendor-specific extensions).
  • Device-specific HLS pragmas (hls.stream, hls.partition, etc.).
  • Modifiable backend targets: swapping out runtime APIs for CUDA, ROCm, oneAPI by adjusting the device dialect or MLIR-to-C++ printer properties.
  • Extendable bus protocol attributes (hls.axi_protocol) to describe novel memory interconnects (e.g., AXI4-Lite, AXI4-Stream, PCIe).

The transformation system allows injection of auxiliary passes (vectorization, tiling, fusion) and hooks for backend data layout or target-specific optimization hints, illustrating the dialect’s composability within the MLIR ecosystem.

7. Comparison: UPIR MLIR Dialect Export and Unified Abstractions

The UPIR project demonstrates how OpenMP-style constructs are abstracted in the so-called "upir" dialect for unified parallel IR export to MLIR (Wang et al., 2022). It defines a set of region operations:

  • upir.spmd: Forks teams/threads for a SPMD region.
  • upir.loop / upir.loop_parallel: Describes, then distributes loops with scheduling, vectorization, or task-splitting clauses.
  • upir.task: Expresses asynchronous regions, including device offload (OpenMP target → upir.task offload(fpga:0)).
  • upir.data, upir.data_movement, upir.sync: Encapsulate mapping, updates, barriers, reductions, and other collectives in parallel regions.

Key attributes and constraints match MLIR’s strict typing, region structure, and verification. The ROSE front-end can emit UPIR, and any MLIR pass can consume and lower it to standard MLIR or OpenMP dialects, enabling uniform backend codegen. The UPIR approach emphasizes the dialect’s ability to capture parallel patterns—SPMD, data, and task parallelism—across multiple models, ensuring portability and composability.

8. Significance and Impact in Heterogeneous Compilation

The MLIR OpenMP dialect enables portable directive-based acceleration flows, particularly for FPGAs, by decomposing OpenMP-annotated source into precisely typed, semantically accurate IR. This allows:

  • Separation of host and device codes at the IR level.
  • Interoperation with standard MLIR passes and extensible lowering pipelines.
  • Exploitation of explicit parallel and data mapping semantics for fine-grained optimization (data movement minimization, pipelining, partitioning).
  • Consistent backend targeting for multiple device APIs and runtime conventions.

A notable result is that Fortran programmers, using familiar OpenMP pragmas, can generate FPGA-quality pipelines and bitstreams through standard compilation flows, leveraging both manual optimization through directives and sophisticated backend IR transformations. The dialect, as deployed in the reported pipeline, establishes a path for extensible, reusable compiler infrastructure for future heterogeneous HPC environments (Rodriguez-Canal et al., 11 Nov 2025), while UPIR illustrates the wider potential for unified parallel IR synthesis and lowering (Wang et al., 2022).
