Towards an Accurate GPU Data Race Detector

Published 2 Apr 2026 in cs.SE | (2604.02106v2)

Abstract: Data races in GPU programs pose a threat to the reliability of GPU-accelerated software stacks. Prior works proposed various dynamic (runtime) and static (compile-time) techniques to detect races in GPU programs. However, dynamic techniques often miss critical races, as they require the races to manifest during testing. While static ones can catch such races, they often generate numerous false alarms by conservatively assuming values of variables/parameters that cannot ever occur during any execution of the program. We make a key observation that the host (CPU) code that launches GPU kernels contains crucial semantic information about the values that the GPU kernel's parameters can take during execution. Harnessing this hitherto overlooked information helps accurately detect data races in GPU kernel code. We create HGRD, a new state-of-the-art static analysis technique that performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques, such as iGUARD, miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.


Authors (3)

Summary

The paper introduces a holistic static analysis framework that integrates host and device code to reduce spurious GPU data race detections.
It employs host code invariants such as asserts, launch geometry, and allocation constraints to prune infeasible executions.
The framework detects all true races and raises no false alarms across 22 real-world CUDA programs.


Motivation: GPU Data Race Detection and Its Challenges

The reliability of GPU-accelerated software depends on the absence of low-level concurrency errors, among which data races are particularly insidious. A data race occurs when two or more threads access the same shared data, at least one of them writes, and synchronization between them is insufficient; the result can be unpredictable and erroneous program behavior. Both dynamic (runtime) and static (compile-time) approaches to race detection exist, but each carries significant trade-offs in practice. Dynamic analyses (e.g., iGUARD, HiRace) report few false alarms but have inherently incomplete coverage, since they detect only races that manifest during the observed executions, and they impose prohibitive runtime and memory overheads. Static tools (e.g., GPUVerify, FaialAA) offer soundness guarantees but are plagued by excessive false positives, because conservative program analysis drastically over-approximates the set of feasible executions.

Critically, these prior works largely ignore a rich source of semantic constraints: the host (CPU) code that orchestrates kernel launches through parameter setting, memory allocation, and grid configuration. The paper observes that host code typically encodes invariants (asserts, constrained loop bounds, parameter inter-dependencies, and allocation patterns) that logically restrict the input space of GPU kernels. A static analysis that ignores this information must treat every parameter assignment as feasible, including assignments that can never occur in any execution, and this is what produces many of its false alarms.

Host Code-Guided Static Analysis

The paper proposes a static analysis framework, HGRD, which it presents as the first to reason holistically over the joint host and device (GPU kernel) code. Analyzing the host code allows the extraction of five key classes of semantic constraints:

Host Asserts: Run-time assertions encode invariants over kernel parameters (e.g., that an input matrix is always square), preventing certain pathological values.
Launch Geometry: The host's selection of grid and block dimensions strictly limits the possible values of thread and block indices.
Parameter Relationships: Multiple kernel parameters are often derived from common host variables, thereby imposing non-trivial inter-variable relationships.
Loop-Bound Induced Ranges: Kernel parameters are often set inside host-side loops, so the loop bounds limit the range of values each parameter can take across launches.
Allocation-Based Constraints: Parameters determining allocation sizes (passed via cudaMalloc/cudaMallocPitch) must reflect physical allocation limits (e.g., strictly positive, bounded sizes).
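
As a rough illustration of how one of these classes (parameter relationships) can rule out a reported race, the sketch below models a hypothetical kernel in which thread `t` writes `a[t]` and reads `a[t + offset]`. The kernel, the names `num_threads` and `offset`, and the bounded brute-force search are assumptions made for illustration; the search merely stands in for a solver query and is not the paper's implementation.

```python
from itertools import product

def find_conflict(max_val, host_constraint):
    """Search small parameter values for a write and a read of the same element.

    Hypothetical kernel body, for each thread t in [0, num_threads):
        a[t] = a[t + offset]
    """
    for num_threads, offset in product(range(1, max_val + 1), repeat=2):
        if not host_constraint(num_threads, offset):
            continue  # this assignment cannot occur given the host code
        for writer, reader in product(range(num_threads), repeat=2):
            # Conflict: a different thread reads the element the writer writes.
            if writer != reader and writer == reader + offset:
                return {"num_threads": num_threads, "offset": offset,
                        "writer": writer, "reader": reader}
    return None

# Kernel-only view: num_threads and offset are unrelated free parameters.
kernel_only = find_conflict(4, lambda num_threads, offset: True)

# Host-aware view: assume the host derives both values from one variable,
# launching `half` threads and passing offset = half.
host_aware = find_conflict(4, lambda num_threads, offset: num_threads == offset)

print(kernel_only)  # a spurious report such as num_threads=2, offset=1
print(host_aware)   # None: no conflict once the relationship is known
```

The enumeration only covers small values, so it shows the pruning effect rather than proving race freedom; a solver-based encoding would reason over the full value range.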

Existing static analyzers, limited to kernel code, are oblivious to these relationships, so they explore infeasible paths and report spurious races. By systematically analyzing the host code, the proposed framework restricts the solver's search space to variable assignments that can actually occur.

For instance, consider a kernel parameterized by two matrix dimensions 'rows' and 'cols', where 'assert(rows == cols)' guards the kernel launch. Prior approaches must conservatively admit 'rows ≠ cols', while host code-guided analysis prunes such infeasible states, directly eliminating associated spurious races.
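
The effect of such an assert can be sketched with a small bounded search. The kernel below is hypothetical: thread `(i, j)` writes `out[i * rows + j]`, using `rows` as the row stride, which is only safe when `rows == cols`. The function names and the enumeration over small dimensions are illustrative assumptions, not the paper's encoding.

```python
from itertools import product

def colliding_threads(rows, cols):
    """Return two distinct threads that write the same element, or None.

    Hypothetical kernel: thread (i, j), 0 <= i < rows, 0 <= j < cols,
    writes out[i * rows + j], i.e. it uses `rows` as the row stride.
    """
    seen = {}
    for i, j in product(range(rows), range(cols)):
        index = i * rows + j
        if index in seen:
            return seen[index], (i, j)
        seen[index] = (i, j)
    return None

def first_race(max_dim, host_assert):
    """Scan small dimensions, skipping those the host assert rules out."""
    for rows, cols in product(range(1, max_dim + 1), repeat=2):
        if host_assert(rows, cols):
            pair = colliding_threads(rows, cols)
            if pair is not None:
                return rows, cols, pair
    return None

# Kernel-only analysis must admit rows != cols and finds a write-write conflict.
without_assert = first_race(4, lambda rows, cols: True)

# With assert(rows == cols) taken from the host, no conflict remains.
with_assert = first_race(4, lambda rows, cols: rows == cols)

print(without_assert)  # (2, 3, ((0, 2), (1, 0)))
print(with_assert)     # None
```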

(Figure 1)

Figure 1: High-level system architecture showing host and device code analysis; colored components illustrate new host code-driven features integrated by the proposed framework.

Design and Implementation

The analysis pipeline leverages CGeist and MLIR to process CUDA source code into a form amenable to structural examination and constraint extraction. It consists of the following core components:

Host Code Analyzers: Traverse the host code to construct expression trees that capture parameter dependencies and assertion constraints, and to infer allocation-induced bounds.
Kernel Access Pair Generator: Identifies all potentially conflicting pairs of memory operations in the GPU kernel, subject to traditional constraints (aliasing, thread block, thread identity).
Constraint Synthesis: Encodes both kernel-derived and host-derived constraints into a single satisfiability query, which sharply reduces the set of candidate races the solver must consider.
Advanced Synchronization Reasoner: Accurately models fine-grained synchronization, including acquire-release patterns and intra-warp, intra-block, inter-block barriers.
SAT-based Conflict Detection: Uses CP-SAT (or Z3) to efficiently propagate constraints and identify genuine conflicting memory instruction pairs.
Synchronization Validation: For each putative race, determines whether synchronization constructs trivially prevent the race, and, if not, issues precise diagnostic output.
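
The access-pair generation step can be pictured roughly as follows. This is a minimal sketch under simplifying assumptions: the `Access` record and its fields are hypothetical, and a matching array name stands in for real alias analysis.

```python
from dataclasses import dataclass
from itertools import combinations_with_replacement

@dataclass(frozen=True)
class Access:
    label: str      # hypothetical tag for the memory instruction
    array: str      # base array the access touches (stand-in for alias info)
    is_write: bool

def candidate_pairs(accesses):
    """Pairs that could conflict: same array, at least one write.

    An access is also paired with itself, because two different threads
    can execute the same instruction.
    """
    pairs = []
    for first, second in combinations_with_replacement(accesses, 2):
        if first.array == second.array and (first.is_write or second.is_write):
            pairs.append((first.label, second.label))
    return pairs

accesses = [
    Access("load_a", "a", is_write=False),
    Access("store_a", "a", is_write=True),
    Access("load_b", "b", is_write=False),
]
pairs = candidate_pairs(accesses)
print(pairs)  # [('load_a', 'store_a'), ('store_a', 'store_a')]
```

Each surviving pair would then be handed to the constraint and synchronization stages, which decide whether two distinct threads can actually reach the same address without ordering.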

The framework also supports scoped synchronization semantics and is capable of distinguishing between races at different levels of the GPU memory hierarchy (global, shared), including those introduced by modern warp scheduling and fine-grained atomic/fence primitives.
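
To make the role of scope concrete, the sketch below gives a deliberately simplified model of barrier and memory-space reasoning. It is an assumption-laden illustration, not the framework's algorithm: the `Access` fields are hypothetical, both accesses are assumed to hit the same address, barriers are assumed to be block-wide and reached by every thread of a block, and atomics, fences, and acquire-release patterns are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    block: int      # block index of the executing thread
    thread: int     # thread index within the block
    phase: int      # number of block-wide barriers executed before the access
    space: str      # "shared" or "global"
    is_write: bool

def may_race(a, b):
    """Conservative check for two accesses assumed to hit the same address."""
    if not (a.is_write or b.is_write):
        return False  # two reads never race
    same_block = a.block == b.block
    if same_block and a.thread == b.thread:
        return False  # same thread: program order applies
    if a.space == "shared" and not same_block:
        return False  # shared memory is private to each block
    if same_block and a.phase != b.phase:
        return False  # a block-wide barrier orders the two accesses
    return True

writer = Access(block=0, thread=0, phase=0, space="global", is_write=True)
same_block_reader = Access(block=0, thread=1, phase=1, space="global", is_write=False)
other_block_reader = Access(block=1, thread=1, phase=1, space="global", is_write=False)

print(may_race(writer, same_block_reader))   # False: separated by a barrier
print(may_race(writer, other_block_reader))  # True: the barrier has block scope only
```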

Key Results

Evaluation on 22 real-world CUDA programs shows that the new approach detects all true races and raises no false alarms. On the same set, dynamic approaches (iGUARD, GKLEE) missed numerous races (false negatives), because a race manifests only under specific thread schedules and input data. Static tools (GPUVerify, FaialAA), meanwhile, reported many false positives (up to ~79% of reported races), which severely limits their practical usability.

In contrast, the host code-guided analysis yielded zero false positives and zero false negatives, outperforming all tested competitors. Notably, it is the first static framework to reliably identify intra-warp races and to handle acquire-release synchronization without the need for aggressive program rewriting or manual annotation.

Analysis of Host Code Constraint Classes

A detailed breakdown of the program set indicates that each class of host-code constraint eliminates specific families of false positives. For example, asserts allow certain races that look feasible from the kernel code alone to be statically proven infeasible; grid dimension constraints eliminate inter-block races that cannot occur under the actual launch geometry; and loop-induced bounds and allocation constraints rule out further infeasible memory accesses across a range of benchmarks.
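
The launch-geometry case can be sketched in the same spirit. The kernel here is hypothetical: every thread writes `data[thread_idx]` and ignores its block index, so a conflict exists only if more than one block is launched. The function names and the small search bounds are assumptions for illustration.

```python
from itertools import product

def inter_block_conflict(grid_dim, block_dim):
    """Find two threads in different blocks that write the same element.

    Hypothetical kernel: each thread writes data[thread_idx], ignoring block_idx.
    """
    threads = list(product(range(grid_dim), range(block_dim)))
    for (block_a, thread_a), (block_b, thread_b) in product(threads, repeat=2):
        if block_a < block_b and thread_a == thread_b:
            return (block_a, thread_a), (block_b, thread_b)
    return None

def search(max_grid, max_block, launch_ok):
    """Scan small launch geometries, keeping only those the host can produce."""
    for grid_dim, block_dim in product(range(1, max_grid + 1),
                                       range(1, max_block + 1)):
        if launch_ok(grid_dim, block_dim):
            conflict = inter_block_conflict(grid_dim, block_dim)
            if conflict is not None:
                return grid_dim, block_dim, conflict
    return None

# Kernel-only view: any grid size is possible, so an inter-block race is reported.
any_geometry = search(3, 3, lambda grid_dim, block_dim: True)

# Host-aware view: assume the host always launches exactly one block.
single_block = search(3, 3, lambda grid_dim, block_dim: grid_dim == 1)

print(any_geometry)  # (2, 1, ((0, 0), (1, 0)))
print(single_block)  # None
```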

(Figure 2)

Figure 2: Expression tree automatically constructed for an assert condition, statically relating kernel parameters and enabling constraint-based pruning of infeasible executions.
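
A small sketch can show what an expression tree for an assert condition might look like and how it can filter parameter assignments. It uses Python's `ast` module purely as a stand-in for the MLIR-based representation mentioned earlier; the condition `rows * cols <= capacity` and the name `capacity` are hypothetical.

```python
import ast
import operator

_COMPARE = {ast.Eq: operator.eq, ast.NotEq: operator.ne, ast.Lt: operator.lt,
            ast.LtE: operator.le, ast.Gt: operator.gt, ast.GtE: operator.ge}
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def evaluate(node, params):
    """Evaluate a small arithmetic/comparison tree under a parameter binding."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body, params)
    if isinstance(node, ast.Name):
        return params[node.id]
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        return _BINARY[type(node.op)](evaluate(node.left, params),
                                      evaluate(node.right, params))
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        return _COMPARE[type(node.ops[0])](evaluate(node.left, params),
                                           evaluate(node.comparators[0], params))
    raise ValueError(f"unsupported node: {type(node).__name__}")

def leaves(node):
    """Names of the kernel parameters the condition relates."""
    return sorted({n.id for n in ast.walk(node) if isinstance(n, ast.Name)})

# Hypothetical host assert: assert(rows * cols <= capacity)
tree = ast.parse("rows * cols <= capacity", mode="eval")

print(leaves(tree))                                            # ['capacity', 'cols', 'rows']
print(evaluate(tree, {"rows": 2, "cols": 3, "capacity": 6}))   # True
print(evaluate(tree, {"rows": 2, "cols": 3, "capacity": 5}))   # False
```

In a solver-based setting the same tree would be translated into a constraint over symbolic parameters rather than evaluated on concrete values.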

Implications and Future Directions

This framework highlights the need for holistic, program-level static analysis for effective race detection in heterogeneous systems. The results indicate that most false positives in static analysis stem not from poor modeling of GPU execution, but from the inability to use the invariants embedded in host code.

For software reliability researchers and tool builders, this work argues that the scope of analysis should widen: static GPU race detectors become both sound and useful only when they take host-level information into account. The approach is extensible, and future directions include (1) deeper integration with mixed host-device pointer tracking for complex aliasing, (2) generalization to other heterogeneous models (e.g., OpenCL, HIP), and (3) refinement for path-sensitive analysis guided by interprocedural call graphs or value flow.

In practice, this framework offers a path toward deploying static race detectors in production: the analysis adds no runtime cost, and its reports contain few false alarms to triage. Building on existing compiler infrastructure such as LLVM/MLIR further eases adoption.

Conclusion

By unifying host and kernel code analysis, the proposed static approach advances the state of the art in GPU data race detection, eliminating spurious reports endemic to prior static approaches and uncovering subtle races missed by dynamic analysis. The results underscore that extracting and enforcing semantic constraints from host code is essential for scalable, accurate static analysis of heterogeneous programs. This has implications not only for debugging, but also for program synthesis, verification, and automated repair in the GPU computing domain.
