
OMPDataPerf: Profiling & Optimizing Data Mapping

Updated 26 January 2026
  • OMPDataPerf is a framework of principles, tools, and metrics aimed at profiling and optimizing data mapping in heterogeneous OpenMP offload applications.
  • It employs dynamic analysis using OMPT callbacks and static source-to-source transformations to detect inefficient data transfers and redundant memory allocations.
  • The methodology enables precise performance assessment and practical code optimizations, delivering significant speedups and reduced data movement overhead.

OMPDataPerf encompasses a suite of principles, tools, and metrics for profiling and optimizing data mapping in heterogeneous OpenMP offload applications. It is defined by dynamic analysis techniques capable of detecting inefficient data transfer and allocation patterns, and by static source-to-source transformation frameworks that minimize host-device communication. The OMPDataPerf methodology is realized in dynamic profilers such as OMPDataPerf itself (Marzen et al., 19 Jan 2026), and in static analyzers like OMPDart (Marzen et al., 2024). These approaches enable quantification, attribution, and reduction of data movement overhead in CPU–accelerator systems, supporting both high-level performance assessment and concrete code transformation.

1. Architectural Principles

OMPDataPerf's core architecture is characterized by low-overhead collection and post-mortem analysis of heterogeneous data-movement events. In dynamic profiling, OMPDataPerf is implemented as an LD-preloaded or linked shared library that registers for a minimal set of OpenMP Tools Interface (OMPT) callbacks within the OpenMP runtime. During execution, it logs every data transfer and allocation, as well as kernel launches and returns, without injecting instrumentation into user code (Marzen et al., 19 Jan 2026).

Internally, two principal data structures support scalability and analysis:

  • Chronological event log: Every OMPT "target data op" or "target EMI" event is recorded with timestamp, device ID, address range, byte count, and a content hash.
  • Multimap-style index: Operations are grouped by buffer identifiers and device, enabling rapid detection of repeated or redundant activity.
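These two structures can be sketched in simplified form as follows. This is illustrative C under stated assumptions; the names `data_event_t`, `log_event`, and `ops_for_buffer` are hypothetical and not the tool's actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified versions of the two structures described
 * above; field names are illustrative, not the profiler's real ones. */
typedef struct {
    double    timestamp;    /* high-precision OMPT timer value */
    int       device_id;
    uintptr_t host_addr;    /* start of the affected address range */
    size_t    bytes;
    uint64_t  content_hash; /* hash of the transferred contents */
} data_event_t;

#define MAX_EVENTS 1024

/* Chronological event log: append-only, ordered by arrival. */
static data_event_t event_log[MAX_EVENTS];
static size_t       event_count = 0;

void log_event(data_event_t ev) {
    if (event_count < MAX_EVENTS)
        event_log[event_count++] = ev;
}

/* Multimap-style lookup: count operations on the same
 * (buffer, device) pair -- the basis for redundancy detection. */
size_t ops_for_buffer(uintptr_t host_addr, int device_id) {
    size_t n = 0;
    for (size_t i = 0; i < event_count; i++)
        if (event_log[i].host_addr == host_addr &&
            event_log[i].device_id == device_id)
            n++;
    return n;
}
```

A real implementation would index the multimap by a hashed key rather than scanning linearly, but the grouping idea is the same.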

Following program completion, a post-mortem pass analyzes these logs for inefficient data mapping patterns. This two-phase architecture—event logging at runtime, analysis after completion—maintains modest runtime and space overhead (5% geometric-mean slowdown, event logs ≈43 KB/s) (Marzen et al., 19 Jan 2026).

In the static paradigm, OMPDart applies context-sensitive, interprocedural, flow-sensitive analysis across C/C++ OpenMP applications. It constructs code property graph–style hybrid AST-CFGs, propagates memory-access summaries through the call graph, and builds precise host-device dependency relationships for each target region (Marzen et al., 2024).

2. Data Movement Instrumentation and Profiling

Dynamic OMPDataPerf instrumentation exclusively leverages OMPT EMI callbacks as defined in OpenMP 5.1. These are:

  • ompt_callback_target_emi (kernel submit and completion)
  • ompt_callback_target_data_op_emi (data alloc/copy/free start and end)

Each EMI event records:

  • High-precision timestamps (OMPT timer)
  • Device identifiers and pointers
  • Byte counts
  • Content hashes (default: t1ha0_avx2, selected for performance)

Start–end pairing of these callbacks enables exact calculation of elapsed time for every data operation, allocation, or kernel event without further instrumentation.

Profile metrics are derived as follows:

  • transfer_time = (end_time − start_time) for each data transfer
  • alloc_overhead = (end_time − start_time) for each allocation
  • total_transfer_volume = Σ bytes over all host↔device copies
  • total_alloc_volume = Σ bytes for device allocations
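A minimal sketch of how these metrics fall out of paired begin/end events (illustrative C; `op_record_t` and the function names are hypothetical, not the tool's API):

```c
#include <stddef.h>

/* One paired begin/end record per data operation, as described above. */
typedef struct {
    double begin_time;
    double end_time;
    size_t bytes;
    int    is_alloc;   /* 1 = device allocation, 0 = host<->device copy */
} op_record_t;

/* transfer_time or alloc_overhead for a single paired event */
double op_elapsed(const op_record_t *op) {
    return op->end_time - op->begin_time;
}

/* total_transfer_volume: sum of bytes over all host<->device copies */
size_t total_transfer_volume(const op_record_t *ops, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (!ops[i].is_alloc)
            total += ops[i].bytes;
    return total;
}

/* total_alloc_volume: sum of bytes over all device allocations */
size_t total_alloc_volume(const op_record_t *ops, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (ops[i].is_alloc)
            total += ops[i].bytes;
    return total;
}
```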

Output formats include live summary tables, line-attributed lists of hot spots, and optional CSV traces for external visualization (Marzen et al., 19 Jan 2026). With libdw and debug builds, full source-line annotation is supported.

3. Detection, Analysis, and Optimization Algorithms

Detection of inefficient mapping patterns is central to OMPDataPerf. Using linear/near-linear algorithms over event logs, the profiler identifies:

  • Duplicate Transfers: Groups host↔device copies by (hash, device); multiple entries flag redundancy.
  • Round-Trip Transfers: Consecutive host→device and device→host copies of the same buffer are flagged as a wasted round trip if the data is neither used nor updated in between.
  • Repeated Allocations: Multiple alloc/dealloc cycles on an identical (host_ptr, device, size) tuple indicate a missed opportunity to keep the buffer persistent on the device.
  • Unused Mappings: Data allocations or transfers whose lifetimes do not overlap any kernel execution, or are overwritten before use.

Unused allocations are formally defined as memory that does not overlap any kernel execution on a given device. Unused host→device copies are those overwritten (by other copies to the same host address) before kernel use, or occurring after kernel completion (Marzen et al., 19 Jan 2026).
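The duplicate-transfer check above can be sketched as follows (illustrative C using a quadratic scan for brevity; the profiler's own pass would use its hash-indexed multimap to stay near-linear):

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal copy event: content hash plus target device. */
typedef struct {
    uint64_t content_hash;
    int      device_id;
} copy_event_t;

/* Count redundant copies: every event beyond the first occurrence of
 * each (hash, device) pair is a duplicate transfer of identical data
 * to the same device. */
size_t count_duplicate_transfers(const copy_event_t *ev, size_t n) {
    size_t dups = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (ev[j].content_hash == ev[i].content_hash &&
                ev[j].device_id == ev[i].device_id) {
                dups++;   /* an earlier identical copy already exists */
                break;
            }
        }
    }
    return dups;
}
```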

Static analysis in OMPDart infers the minimal movement sets $D_\text{to}$ and $D_\text{from}$ from true (RAW) dependencies:

$D_\text{to} = \bigcup_{r \in R} \text{Live}_\text{in}(r)$

$D_\text{from} = \bigcup_{r \in R} \text{Live}_\text{out}(r)$

Resulting mapping clauses are inserted at region boundaries, while updates are hoisted out of loops if index subranges are invariant, minimizing amortized communication cost:

$\text{Cost}_\text{comm}(v, p) = \alpha + \beta \cdot \text{Size}(v_\text{subrange})$

firstprivate(x) clauses are favored for read-only scalars, bypassing explicit memcpy overhead by exploiting kernel-argument transmission (Marzen et al., 2024).
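The clause choices described above look like this in a generic OpenMP offload kernel (an illustrative sketch, not OMPDart output):

```c
#include <stddef.h>

/* The live-in array `a` is mapped to:, the live-out array `b` from:,
 * and the read-only scalar `scale` travels as a firstprivate kernel
 * argument rather than via an explicit memcpy. */
void scale_array(const double *a, double *b, size_t n, double scale) {
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n]) map(from: b[0:n]) firstprivate(scale)
    for (size_t i = 0; i < n; i++)
        b[i] = scale * a[i];
}
```

Without an offload-capable compiler the pragma is simply ignored and the loop runs on the host, which makes the data-mapping intent cheap to express and portable.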

4. Mathematical Model for Optimization Potential

OMPDataPerf predicts optimization gains via a cost model based on profiled inefficiencies. Let $T_\text{total}$ denote baseline execution time and $T_\text{redund}$ the total time spent in flagged inefficient transfers and allocations:

$T_\text{opt} \simeq T_\text{total} - T_\text{redund}$

The predicted speedup is then

$S_\text{pred} = \dfrac{T_\text{total}}{T_\text{total} - T_\text{redund}}$

The model supports small corrections for asynchronous transfers or multiple devices, but the core formula holds for most scenarios. In practical evaluations, mean relative error in predicted speedups was 14% (MSE 0.17) (Marzen et al., 19 Jan 2026).
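Applied to profiler output, the core formula is a one-liner (illustrative C; the asynchronous and multi-device corrections mentioned above are omitted):

```c
/* Predicted speedup S_pred = T_total / (T_total - T_redund).
 * Returns 0.0 when the model is not applicable (T_redund >= T_total). */
double predicted_speedup(double t_total, double t_redund) {
    double t_opt = t_total - t_redund;   /* T_opt ~= T_total - T_redund */
    if (t_opt <= 0.0)
        return 0.0;
    return t_total / t_opt;
}
```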

5. Evaluation Results and Case Studies

Extensive evaluation of OMPDataPerf across heterogeneous benchmarks (Rodinia bfs, hotspot, lud, nw; BabelStream; Mantevo minife; Bristol minifmm; tealeaf; rsbench; xsbench) yielded the following results (Marzen et al., 19 Jan 2026):

  • Runtime overhead: Geometric mean 5%, maximum 33%
  • Space overhead: Typically a few MiB per benchmark; ≈ 43 KB/s trace logging
  • Patterns detected: Hundreds of duplicate transfers (BabelStream, Tealeaf), several round trips and repeated allocations (BFS, minife, bspline-vgh)

Corrective measures based on profiler output resulted in significant speedup:

  • Map clause relocation in Rodinia bfs: 2.1× improvement
  • Persistent data lifetimes in minife: 1.07× speedup
  • Copy hoisting in bspline-vgh: 1.14× speedup, 99% reduction in redundant copies

Static OMPDart yielded comparable or superior results on nine HPC kernels:

  • Data-transfer reduction: 2.1 GB geometric mean
  • Speedup over unoptimized: 2.8× geometric mean
  • Speedup over expert hand-tuned code: 1.05× geometric mean
  • LULESH: 85% less redundant traffic, 1.6× faster than human expert tuning (Marzen et al., 2024)

6. Best Practices and Limitations

Adoption of OMPDataPerf requires OMPT-enabled OpenMP runtimes and adherence to certain development practices:

Recommended guidelines:

  • Build with OMPT-enabled OpenMP runtime, link/preload OMPDataPerf early
  • Prioritize issues constituting >5% of total runtime at individual hotspots
  • Apply predicted speedup models as heuristics, acknowledging typical 5–20% error margins
  • Aggregate kernels in single target-data regions, favor explicit map(to:) clauses, hoist data copies outside loops, maximize buffer reuse across kernels
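The final guideline, keeping a buffer resident across kernel launches via a single enclosing target-data region, looks like this in generic OpenMP (an illustrative sketch, not tool output):

```c
#include <stddef.h>

/* `a` is mapped once for the whole iteration loop instead of being
 * re-transferred on every kernel launch; each kernel reuses the
 * resident device buffer. */
void iterate(double *a, size_t n, int steps) {
    #pragma omp target data map(tofrom: a[0:n])
    for (int s = 0; s < steps; s++) {
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++)
            a[i] += 1.0;
    }
}
```

Moving the `map` out of the loop is exactly the copy-hoisting transformation credited with the bspline-vgh speedup in Section 5.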

For static optimization:

  • Keep kernels and helpers in the same translation unit
  • Declare variables before their first target region
  • Use canonical for-loop constructs for multidimensional bound analysis
  • Prefer read-only scalars with firstprivate
  • Group kernels to enable region merging and reduced alloc/free cycles

Limitations:

  • Multi-TU programs force worst-case static analysis assumptions
  • Conservative aliasing may overapproximate the movement sets
  • While/do loops lack full symbolic bound analysis; updates default to whole arrays
  • Intentional data staleness may be undone by automatic updates, requiring manual intervention
  • Simple speedup models may overestimate gain in heavily asynchronous contexts or multi-GPU deployments (Marzen et al., 19 Jan 2026, Marzen et al., 2024)

7. Significance in Heterogeneous Computing

OMPDataPerf methodologies automate detection and remediation of low-efficiency data mapping, integrating dynamic profiling and static optimization for heterogeneous OpenMP applications. They minimize the need for manual intervention in profiling and code transformation, delivering line-attributed diagnostics, predicted runtimes, and direct edit suggestions. Evaluations confirm substantial performance gains, often surpassing expert-tuned manual optimizations. This suggests that OMPDataPerf constitutes a best-practice reference methodology for accelerator-bound HPC codes relying on OpenMP offload, especially when iterative data mapping refinement is required for production deployment (Marzen et al., 19 Jan 2026, Marzen et al., 2024).
