
GPU Kernel Equivalence Checker

Updated 23 November 2025
  • GPU kernel equivalence checking is a formal method to verify that two kernels produce identical outputs across all real-valued inputs.
  • VOLTA employs symbolic execution and a customized decision procedure to effectively detect data races, deadlocks, and semantic discrepancies in structured-CTAs.
  • The tool models thread-level and CTA-level operations rigorously, ensuring confluence, soundness, and completeness in verifying ML kernel implementations.

An equivalence checker for GPU kernels is a formal verification tool designed to determine whether two GPU kernels—often implementing the same mathematical operation via potentially optimized, transformed, or automatically-generated code—produce identical outputs for all possible real-valued inputs. With the proliferation of aggressive kernel optimization and the increasing generation of GPU code by LLMs and compilers, the need for formal guarantees of correctness has become acute. The equivalence checker VOLTA, introduced by Dubey et al., constitutes the first tool offering verification of semantic equivalence that is sound and, for a well-defined class of NVIDIA-style GPU kernels, complete, covering ML code such as convolutions, matrix multiplications, and attention mechanisms (Dubey et al., 16 Nov 2025).

1. Formal Model of GPU Kernels

The checker targets the thread-block or cooperative thread array (CTA), modeling it as an SPMD program with $N$ threads. Each thread possesses an individual register file $R(i)$, while all threads share a memory map $G$ accessible through shared-memory addresses. A CTA state is thus defined as

$$\Gamma = (G : \text{Addr}\rightarrow\text{Val},\ R : \text{TID}\rightarrow \text{Reg}\rightarrow\text{Val},\ P : \text{TID} \rightarrow \text{ThreadProg})$$

Thread programs are "structured-CTA" straight-line code, including register writes, shared-memory reads and writes, and explicit synchronizations (barrier instructions sync $I$, where $I$ is a subset of threads, followed by a continuation $T$). The checker’s semantics is formalized through:

  • Thread-level steps: Executing single instructions, subject to a no-race predicate that checks for safe concurrent access.
  • CTA-level scheduling: Interleaving thread steps or collective barrier firings, tracking synchronization and memory event contexts ($\mathcal{X}$) to capture pending readers, writers, and sync-sets.

Any illegal interleaving leading to data races or deadlocks is detected by returning a special state $\perp$ (undefined) (Dubey et al., 16 Nov 2025). This model differs radically from single-threaded scenarios by requiring explicit reasoning about concurrent memory access, communication, and barrier synchronization across thousands of threads.
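As a toy illustration of this model, the CTA state and the no-race predicate can be sketched in Python. The names and structures below are illustrative assumptions, not VOLTA's actual implementation:

```python
from dataclasses import dataclass, field

# Toy sketch of the CTA state Gamma = (G, R, P) plus the memory-event
# context used by the no-race predicate. A conflicting concurrent access
# returns the undefined state, modelling the semantics' special state
# for data races.
@dataclass
class CTAState:
    G: dict                                      # shared memory: Addr -> Val
    R: dict                                      # registers: TID -> Reg -> Val
    P: dict                                      # programs: TID -> [instr]
    pending: dict = field(default_factory=dict)  # Addr -> {(TID, "r" | "w")}

BOTTOM = object()  # the undefined state signalling a race or deadlock

def access(state, tid, addr, kind):
    """Record a read ("r") or write ("w") event; return BOTTOM if it
    conflicts with a pending access by another thread since the last barrier."""
    events = state.pending.setdefault(addr, set())
    if any(t != tid and (kind == "w" or k == "w") for (t, k) in events):
        return BOTTOM
    events.add((tid, kind))
    return state

def barrier(state):
    """Firing a barrier clears all pending memory events for the sync-set
    (here: all threads), re-enabling safe communication."""
    state.pending.clear()
    return state
```

Concurrent reads of the same address are race-free, while any write that overlaps another thread's pending access yields $\perp$; a barrier resets the event context so that post-synchronization accesses are legal again.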

2. Definition of Semantic Equivalence

Semantic equivalence of two kernels $P$ and $Q$ is defined operationally as follows: For every real-valued input (initial shared memory $G_0$, and registers $R_0$), both kernels must terminate under the semantics with identical final memory maps ($G_P = G_Q$) and register contents ($R_P = R_Q$). In formal notation,

$$P \equiv Q \iff \forall \theta:\ \text{semantics}(P)(\theta) = \text{semantics}(Q)(\theta)$$

where $\theta$ ranges over all initializations of inputs. The equivalence checker must guarantee this equality regardless of scheduling, provided both kernels are race-free and deadlock-free.
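To see what real-valued equivalence buys, consider a minimal SymPy sketch (illustrative only, not VOLTA's code): a sequential sum and a tree-shaped parallel reduction are semantically equal over $\mathbb{R}$, even though they reassociate additions and would round differently under IEEE floating point.

```python
import sympy as sp

# Two "kernels" computing the same reduction in different orders. Over the
# reals they are equivalent; under floating point the two association orders
# can differ, which is exactly why the model works over R.
x = sp.symbols("x0:4")                       # symbolic inputs x0..x3

seq_sum  = ((x[0] + x[1]) + x[2]) + x[3]     # kernel P: left-to-right sum
tree_sum = (x[0] + x[2]) + (x[1] + x[3])     # kernel Q: pairwise reduction

print(sp.simplify(seq_sum - tree_sum) == 0)  # True: P ≡ Q for all real inputs
```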

3. The Equivalence-Checking Algorithm: VOLTA

VOLTA implements equivalence checking in two main phases:

  • A. Symbolic Execution:
    • Initial values in shared and register memory are modeled as fresh symbolic variables ($x_1, x_2, \ldots$).
    • Non-deterministic round-robin thread scheduling is used. If the semantics detects a race or deadlock ($\perp$), checking halts and the issue is reported. Because the concurrency semantics ensures confluence, all race-free schedules converge to the same symbolic final state.
    • Barrier instructions clear pending events in the memory context and facilitate correct synchronization tracking.
    • The final state yields, for each output location and thread, a symbolic expression $y_{\text{tid}} = E_{\text{tid}}(x_1,\ldots,x_k)$.
  • B. Symbolic Equality Checking:
    • For each output, the checker constructs verification conditions (VCs) of the form $E^P_{\text{tid}}(x) = E^Q_{\text{tid}}(x)$.
    • These VCs are resolved into a canonical sum $\sum_i p_i(x)\,e^{h_i(x)}$ and checked for identity via a procedure that reduces validity to the vanishing of the polynomials $p_i(x)$. This is computationally efficient for ML kernels.
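The coefficient-vanishing idea can be sketched with SymPy. This is an assumed simplification of the paper's procedure; `vc_valid` and its term grouping are illustrative, not VOLTA's actual code:

```python
import sympy as sp

x, y = sp.symbols("x y")

def vc_valid(lhs, rhs):
    """Illustrative VC check: rewrite lhs - rhs into sum_i p_i(x) * e^{h_i(x)}
    and report validity iff every coefficient polynomial p_i vanishes."""
    diff = sp.expand(sp.powsimp(sp.expand(lhs - rhs), combine="exp"))
    groups = {}                                  # exp factor -> coefficient poly
    for term in sp.Add.make_args(diff):
        exp_part, poly_part = sp.Integer(1), sp.Integer(1)
        for factor in sp.Mul.make_args(term):
            if isinstance(factor, sp.exp):
                exp_part *= factor               # collect the e^{h_i} factor
            else:
                poly_part *= factor              # collect the polynomial part
        groups[exp_part] = groups.get(exp_part, 0) + poly_part
    return all(sp.expand(p) == 0 for p in groups.values())

# x*e^x*e^y + y  vs  x*e^{x+y} + y: equal after the e^a * e^b -> e^{a+b} rewrite
print(vc_valid(x*sp.exp(x)*sp.exp(y) + y, x*sp.exp(x + y) + y))  # True
print(vc_valid(sp.exp(x), sp.exp(y)))                            # False
```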

A simplified pseudocode is provided in (Dubey et al., 16 Nov 2025), exhibiting initialization of symbolic state, iterative execution by threads, detection of illegal states, and final collection of symbolic outputs. The symbolic execution phase has complexity $O(N \cdot |P|)$, where $|P|$ is the kernel length, and the VC decision procedure scales with symbolic expression size.
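A toy version of the round-robin symbolic execution phase can be sketched as follows. The instruction set, scheduler, and use of plain string names as symbolic values are simplifying assumptions; programs must be barrier-aligned and race-free for this sketch to terminate:

```python
# Toy round-robin symbolic executor (illustrative; VOLTA's IR and scheduler
# are far richer). Instructions: ("ld", reg, addr), ("st", addr, reg),
# ("sync",). Shared memory starts with fresh symbolic names "x0", "x1", ...
def run_cta(progs, n_addrs):
    G = {a: f"x{a}" for a in range(n_addrs)}   # symbolic initial memory
    R = {tid: {} for tid in progs}             # per-thread register files
    pc = {tid: 0 for tid in progs}             # per-thread program counters
    while any(pc[t] < len(progs[t]) for t in progs):
        at_barrier = set()
        for tid in progs:                      # one round-robin pass
            if pc[tid] >= len(progs[tid]):
                continue
            op = progs[tid][pc[tid]]
            if op[0] == "ld":                  # register <- shared memory
                R[tid][op[1]] = G[op[2]]
            elif op[0] == "st":                # shared memory <- register
                G[op[1]] = R[tid][op[2]]
            elif op[0] == "sync":              # wait for the collective barrier
                at_barrier.add(tid)
                continue
            pc[tid] += 1
        if at_barrier and len(at_barrier) == len(progs):
            for tid in at_barrier:             # barrier fires collectively
                pc[tid] += 1
    return G

# Two threads exchange values through shared memory, separated by a barrier:
swap = {
    0: [("ld", "r", 0), ("sync",), ("st", 1, "r")],
    1: [("ld", "r", 1), ("sync",), ("st", 0, "r")],
}
print(run_cta(swap, 2))   # final symbolic memory: address 0 holds x1, 1 holds x0
```

The returned map is the deterministic symbolic summary of the kernel: address 0 ends up holding the initial value of address 1 and vice versa, independent of the interleaving.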

4. Soundness, Completeness, and Supported Kernel Class

VOLTA’s semantics are equipped with confluence, soundness, and completeness properties:

  • Confluence: All schedules of race-free CTAs converge to a unique final symbolic state.
  • Soundness: If symbolic execution produces identical final states for $P$ and $Q$ and all VCs are valid, the kernels are guaranteed semantically equivalent for all real inputs.
  • Completeness: If $P \not\equiv Q$, symbolic execution either exhibits a race or the decision procedure finds a refuting input.

The kernel class supported (termed "structured-CTAs") reflects practical ML kernel structure: statically-known tensor and grid dimensions, straight-line per-thread code (loops/branches resolved at compile time), static memory access patterns, and synchronization using only block/warp-level barriers. Kernels violating these restrictions (e.g., dynamic control flow, non-deterministic accesses) are rejected (Dubey et al., 16 Nov 2025).
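A front-end admission check for this class can be sketched as a simple predicate over a toy instruction format. The opcode names and tuple layout are assumptions for illustration, not VOLTA's IR:

```python
def is_structured(prog):
    """Accept only straight-line thread programs with static addresses and
    barrier-only synchronization; anything else falls outside the class."""
    for op in prog:
        if op[0] == "bra":                         # dynamic control flow
            return False
        if op[0] in ("ld", "st"):
            addr = op[2] if op[0] == "ld" else op[1]
            if not isinstance(addr, int):          # data-dependent address
                return False
    return True

print(is_structured([("ld", "r0", 4), ("sync",), ("st", 8, "r0")]))  # True
print(is_structured([("ld", "r0", "r1")]))   # False: pointer-indirect access
print(is_structured([("bra", "loop")]))      # False: dynamic branch
```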

5. Implementation Details

VOLTA parses NVIDIA PTX (version 9.0) into a custom IR capturing explicit thread arithmetic and shared-memory operations. The simulator is prototyped in Python using SymPy for symbolic computation, with thread schedules and memory-event context precisely modeled. The decision procedure employs SymEngine and includes:

  • Combination and canonicalization of symbolic terms,
  • Exponential expression rewriting ($e^a \cdot e^b \rightarrow e^{a+b}$),
  • Extraction of polynomial coefficients and specialized rewrites for operations like $\max$ and $\exp$,
  • Memoization and canonical ordering to expedite checks.

VCs are solved independently in parallel, enabling practical runtime for large kernels (Dubey et al., 16 Nov 2025).
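Because the VCs are mutually independent, the driver can farm them out concurrently; a minimal sketch of this design point, with a hypothetical `check_vc` standing in for the real decision procedure:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of embarrassingly parallel VC checking. check_vc is a trivial
# stand-in for the actual polynomial-exponential decision procedure.
def check_vc(vc):
    lhs, rhs = vc
    return lhs == rhs

def check_all(vcs, workers=4):
    """All VCs must hold for the kernels to be equivalent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(check_vc, vcs))

print(check_all([(1, 1), (2, 2), (3, 3)]))  # True: every VC holds
print(check_all([(1, 1), (1, 2)]))          # False: one VC is refuted
```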

6. Validation and Applications

VOLTA has been validated on a diverse suite of ML kernels:

| Kernel | Size / Feature | Result |
|--------|----------------|--------|
| Harris's parallel reduction (Red-1–4) | Small, hand-tuned | Verified (≃0.007s) |
| Harris's Red-5–7 | Small, hand-tuned | Races detected |
| Boehm's SGEMM (MatMul-1–7) | Hand-optimized, varying techniques | All race-free, up to 540s |
| FlashAttention (softmax) | N=4, comparison to online variant | 4 VCs, ≃0.1s |
| Full attention (FA1, FA1-TC, etc.) | Large tensor heads, tensor-core usage | 83–140s |
| LLM-generated 2D convolution | 1,132 LOC PTX, 9,400 barriers | 29s, found OOB reads |
| Claude Code GEMMs | Three variants | ≃200s each |
| TileLang tensor-cores | 32×32×32–64×64×32 | 11–440s |
| Comparison vs. Faial | Pointer-indirect code | Avoids spurious races |

In LLM-generated and code-generator outputs, VOLTA has exposed subtle bugs that are invisible on current hardware but significant for future architectures (e.g., out-of-bounds reads in shared memory). In pointer-indirect code, it distinguishes actual from spurious races, outperforming prior approaches.

7. Insights, Limitations, and Future Work

The core insight is that symbolic confluence yields deterministic summaries, rendering exhaustive schedule enumeration unnecessary. Precise detection of data races and deadlocks emerges from rigorous memory-event tracking. Deciding polynomial-exponential identities using dedicated normalization achieves decidability for the target kernel class; floating-point operations are modeled as arithmetic over $\mathbb{R}$ (aligning with ML’s -ffast-math practice).

Current limitations include lack of support for asynchronous DMA/pipeline intrinsics (e.g., tma, pipeline), the absence of floating-point error modeling beyond $\mathbb{R}$, and scalability bottlenecks in the decision procedure for very large kernels with deep tiling hierarchies. Extending coverage to these areas and supporting richer kernel classes remain active directions (Dubey et al., 16 Nov 2025).
