HeCBench: Heterogeneous Compute Benchmark
- HeCBench is a community-curated benchmark suite comprising 749 GPU-accelerated programs that represent diverse HPC mini-applications and parallel workloads.
- It provides detailed profiling metrics including FLOP counts and roofline classifications to support empirical studies in performance modeling and compiler optimization.
- The suite underpins research into code generation, auto-parallelization, and LLM-based analysis, offering a canonical dataset for rigorous methodological validation.
HeCBench is a community-curated benchmark suite constructed to support research in heterogeneous and parallel computing. It supplies a diverse set of small-to-medium-sized programs and GPU kernels targeting empirical evaluation and methodological advancement in areas such as code generation, performance modeling, compiler optimization, and LLM-based reasoning about parallel program behavior. HeCBench is widely used as a source of realistic, representative computational kernels and as a canonical ground-truth dataset for algorithm and model validation.
1. Suite Composition and Coverage
HeCBench comprises 749 distinct GPU-accelerated programs spanning a comprehensive array of high-performance computing (HPC) mini-applications. Its domain coverage includes, but is not limited to:
- Stencil codes (e.g., Jacobi, heat diffusion)
- Dense and sparse linear algebra (e.g., vector addition, GEMM)
- Fast Fourier Transform (FFT) kernels
- N-body and particle simulation
- Sorting routines
- Graph algorithms and data-parallel primitives (reduction, scan)
- Image processing (2D/3D convolution, filtering)
- Physics proxy applications (e.g., LULESH subroutines)
Programming model variants provided include CUDA and OpenMP device-offload, with additional support for SYCL and HIP in a subset of kernels (Bolet et al., 6 May 2025, Bolet et al., 4 Dec 2025, Kadosh et al., 2024). The OpenMP subset is equipped with a uniform build system ("make DEVICE=cpu") and reference-check harnesses to verify correctness across different computational environments (Kadosh et al., 2024).
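For concreteness, a batch build-and-verify loop over the OpenMP subset might be scripted as below. This is a minimal sketch: the checkout path, the `*-omp` directory naming, the `make run` target, and the PASS marker printed by the reference check are assumptions for illustration, not details taken from the cited papers.

```python
#!/usr/bin/env python3
"""Sketch: build and verify each HeCBench OpenMP benchmark on the CPU device."""
import pathlib
import subprocess

SUITE_ROOT = pathlib.Path("HeCBench/src")  # hypothetical checkout location


def build_and_verify(bench_dir: pathlib.Path) -> str:
    """Return 'pass', 'fail', 'timeout', or 'build-error' for one benchmark."""
    build = subprocess.run(["make", "DEVICE=cpu"], cwd=bench_dir,
                           capture_output=True, text=True)
    if build.returncode != 0:
        return "build-error"
    try:
        run = subprocess.run(["make", "run"], cwd=bench_dir,
                             capture_output=True, text=True, timeout=600)
    except subprocess.TimeoutExpired:
        return "timeout"
    # The reference check harness is assumed to print PASS on success.
    return "pass" if "PASS" in run.stdout else "fail"


if __name__ == "__main__":
    for bench in sorted(SUITE_ROOT.glob("*-omp")):
        print(f"{bench.name}: {build_and_verify(bench)}")
```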
2. Methodological Role in Empirical Studies
HeCBench serves as both a source of realistic computational kernels and as a ground-truth corpus for rigorous evaluation in diverse research methodologies:
- Performance-portability studies: Used for roofline modeling and the classification of compute-bound vs. bandwidth-bound kernels (Bolet et al., 6 May 2025).
- Compiler and DSL validation: Supports evaluations involving auto-tuning frameworks (Halide, TVM), source-to-source translators, and parallelizing toolchains (Bolet et al., 4 Dec 2025).
- Auto-parallelization and LLM-based code analysis: Forms the core of benchmarking platforms for tools such as OMPar—the AI-driven OpenMP parallelizer (Kadosh et al., 2024).
Due to its diversity and coverage, HeCBench is repeatedly selected as the empirical backbone for studies concerned with performance prediction ("can LLMs predict GPU behavior?"), FLOP counting without execution, and automatic synthesis of concurrency constructs.
3. Profiling and Ground-Truth Labeling Protocols
To enable reliable benchmarking, kernels within HeCBench are empirically profiled under uniform hardware and software conditions. In evaluation contexts focusing on GPU performance (e.g., roofline analysis or flop-count estimation), the following protocols have been applied (Bolet et al., 6 May 2025, Bolet et al., 4 Dec 2025):
- Execution environment: NVIDIA RTX 3080 (10 GB GDDR6), CUDA 12.6, built with clang 18.1.3, typically using -O3 and with fast-math explicitly disabled.
- Metrics collected: Per-kernel single-precision and double-precision FLOP counts, integer operation counts, execution time, bytes read/written from global memory.
- Selection criteria: Only kernels that compile, execute successfully, and fit within LLM context limits (thresholds ranging roughly from 8,000 to 30,000 tokens, depending on the study) are retained.
- Roofline labeling: Arithmetic intensity (AI) is computed as AI = total FLOPs / total bytes transferred; a kernel is labeled compute-bound (CB) or bandwidth-bound (BB) according to whether its AI lies above or below the device's measured balance point for the relevant operation type, as sketched in the classifier below.
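The labeling rule amounts to a one-line comparison against the device balance point. In the sketch below, the RTX 3080-class ceilings are illustrative placeholders, not the measured values used in the profiling protocol.

```python
"""Sketch: roofline CB/BB labeling from profiled counters."""


def arithmetic_intensity(total_flops: float, total_bytes: float) -> float:
    """Arithmetic intensity in FLOPs per byte of global-memory traffic."""
    return total_flops / total_bytes


def classify(total_flops, total_bytes, peak_flops_per_s, peak_bytes_per_s):
    # Balance point: the AI at which the compute and bandwidth ceilings meet.
    balance = peak_flops_per_s / peak_bytes_per_s
    ai = arithmetic_intensity(total_flops, total_bytes)
    return "CB" if ai > balance else "BB"


# Placeholder RTX 3080-class FP32 ceilings: ~29.8e12 FLOP/s compute,
# ~7.6e11 B/s bandwidth, giving a balance point of ~39 FLOP/byte.
print(classify(total_flops=2.0e9, total_bytes=1.6e9,
               peak_flops_per_s=29.8e12, peak_bytes_per_s=7.6e11))  # -> BB
```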
Annotation pipelines also support finer-grained labeling, such as identifying hidden FLOP sources (divisions, math intrinsics, library calls, etc.), which facilitates splitting kernels into "easy" and "hard" subsets for static code analysis (Bolet et al., 4 Dec 2025).
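As a toy illustration of this kind of labeling, a purely lexical scan can flag suspect constructs in kernel source. The pattern list below is an assumption chosen for illustration; real annotation pipelines rely on profiler counters and expert review rather than regular expressions.

```python
"""Sketch: lexically flag hidden-FLOP sources in CUDA/C kernel source text."""
import re

HIDDEN_FLOP_PATTERNS = {
    "division":       r"[^/]/[^/=*]",  # a / b (not //, /=, or comment starts)
    "math_intrinsic": r"\b(expf?|logf?|sqrtf?|sinf?|cosf?|powf?|rsqrtf?)\s*\(",
    "fused_multiply": r"\bfmaf?\s*\(",
    "library_call":   r"\bcu(blas|fft|rand)\w*\s*\(",
}


def hidden_flop_sources(kernel_src: str) -> list[str]:
    """Return the names of hidden-FLOP patterns found in the source."""
    return [name for name, pat in HIDDEN_FLOP_PATTERNS.items()
            if re.search(pat, kernel_src)]


src = "out[i] = expf(a[i]) / (1.0f + b[i]);"
print(hidden_flop_sources(src))  # -> ['division', 'math_intrinsic']
```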
4. Role in LLM Evaluation
HeCBench is extensively leveraged in recent studies probing the capabilities of LLMs to reason about parallel code complexity and performance. Its kernels support several distinct evaluation protocols:
- Source-level roofline classification: Balanced datasets of 340 kernels (85 each in CUDA-BB, CUDA-CB, OMP-BB, OMP-CB) are used to benchmark LLM classification accuracy across scenarios: explicit profiling, zero-shot, few-shot, and fine-tuning. Reasoning-capable LLMs achieve 100% accuracy when given profiling output but only ≈64% when restricted to source code and hardware specifications, highlighting the difficulty of purely static analysis (Bolet et al., 6 May 2025); see the sampling sketch after this list.
- FLOP count estimation (“counting without running”): 577 CUDA kernels, covering the full breadth of HeCBench’s computational variety, are profiled to obtain exact single- and double-precision FLOP counts and annotated with eight execution attributes relating to warp divergence, division, intrinsic math, recursion, and other properties (Bolet et al., 4 Dec 2025). LLM performance is then evaluated on both direct (“easy”) and hidden (“hard”) workload-estimation challenges.
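A minimal sketch of the balanced sampling referenced in the first bullet, assuming hypothetical per-class kernel pools (the papers' actual selection procedure is not reproduced here):

```python
"""Sketch: assemble a balanced roofline dataset, 85 kernels per class cell."""
import random


def balanced_sample(pools: dict[str, list[str]], per_class: int = 85,
                    seed: int = 0) -> dict[str, list[str]]:
    """Draw the same number of kernels from each (model, label) pool."""
    rng = random.Random(seed)
    return {cell: rng.sample(kernels, per_class)
            for cell, kernels in pools.items()}


pools = {  # hypothetical kernel-name pools per class
    "CUDA-BB": [f"cuda_bb_{i}" for i in range(120)],
    "CUDA-CB": [f"cuda_cb_{i}" for i in range(100)],
    "OMP-BB":  [f"omp_bb_{i}" for i in range(110)],
    "OMP-CB":  [f"omp_cb_{i}" for i in range(95)],
}
dataset = balanced_sample(pools)
assert sum(len(v) for v in dataset.values()) == 340
```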
Failure cases cluster systematically in kernels with implicit FLOP contributions or ambiguous memory-compute characteristics, indicating current LLMs’ limitations in internalizing hardware-specific micro-operation semantics.
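To make the estimation protocol concrete, a plausible scoring harness might compare predicted and profiled counts by relative error and order-of-magnitude agreement. Both metrics are assumptions chosen for illustration; the papers' exact scoring is not reproduced here.

```python
"""Sketch: score a static FLOP estimate against profiled ground truth."""
import math


def relative_error(predicted: float, actual: float) -> float:
    return abs(predicted - actual) / actual


def within_order_of_magnitude(predicted: float, actual: float) -> bool:
    return abs(math.log10(predicted) - math.log10(actual)) <= 1.0


# Hypothetical example: a kernel profiled at 3.2e9 FP32 FLOPs.
pred, truth = 2.5e9, 3.2e9
print(f"rel. error = {relative_error(pred, truth):.2f}")         # 0.22
print(f"within 10x = {within_order_of_magnitude(pred, truth)}")  # True
```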
5. Benchmarking Automatic Parallelization Tools
HeCBench is used as a reference workload for the quantitative and qualitative assessment of automatic source-to-source parallelization systems:
- OMPar evaluation: 770 loops extracted from 175 distinct benchmarks (of which 223 pass compile-and-run verification) serve as the main dataset for evaluating OMPify (the parallelization-candidate detector) and MonoCoder-OMP (the automatic pragma generator). Ground truth is established using both manual labels and reference outputs, with extensive statistics reported (precision ≈0.93, recall ≈0.85, F₁ ≈0.89 in compile-and-run scenarios; see the consistency check after this list) (Kadosh et al., 2024).
- Comparative analysis: Legacy tools such as AutoPar and ICPC are benchmarked against OMPar using identical HeCBench samples, highlighting OMPar’s improved loop coverage, correctness, and speedup distributions.
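As a quick consistency check of the statistics quoted in the first bullet, the harmonic-mean definition of F₁ reproduces the reported value from the reported precision and recall:

```python
"""Sketch: verify that precision ~0.93 and recall ~0.85 yield F1 ~0.89."""


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f1(0.93, 0.85), 2))  # -> 0.89
```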
Key kernel types in this context include convolutions, stencils, reductions, sparse-matrix computations, and non-affine index patterns, reflecting HeCBench’s representative coverage of real-world CPU/GPU-parallel workloads.
6. Dataset Construction and Selection Criteria
HeCBench-based evaluation pipelines consistently impose strict selection and filtering criteria before kernels are admitted to benchmarking datasets:
| Criterion | Description | Applied in |
|---|---|---|
| Build and run verification | Only compiled kernels with successful execution retained | (Bolet et al., 6 May 2025, Bolet et al., 4 Dec 2025, Kadosh et al., 2024) |
| Context window limits | Kernels exceeding practical LLM prompt context excluded | (Bolet et al., 6 May 2025, Bolet et al., 4 Dec 2025) |
| First-invocation only | Multi-launch codes profiled on first kernel invocation | (Bolet et al., 4 Dec 2025) |
| Annotation | Manual/expert labeling for task-specific attributes | (Bolet et al., 4 Dec 2025) |
This selection ensures empirical consistency and enables fair benchmarking of model- and tool-driven analyses.
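A minimal sketch of this filtering stage, assuming a hypothetical Kernel record and a crude character-based token estimate in place of a real tokenizer:

```python
"""Sketch: retain only kernels that build, run correctly, and fit the context budget."""
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    source: str
    builds: bool          # outcome of build verification
    runs_correctly: bool  # outcome of the reference-check run


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (~4 characters per token).
    return len(text) // 4


def select(kernels: list[Kernel], max_tokens: int = 8000) -> list[Kernel]:
    return [k for k in kernels
            if k.builds and k.runs_correctly
            and approx_tokens(k.source) <= max_tokens]


pool = [Kernel("saxpy", "...", True, True),
        Kernel("huge_kernel", "x" * 100_000, True, True),
        Kernel("broken", "...", False, True)]
print([k.name for k in select(pool)])  # -> ['saxpy']
```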
7. Limitations and Future Directions
HeCBench’s influence on the HPC and AI benchmarking landscape is substantial, but certain limitations and avenues for extension are recognized:
- Dataset granularity/scope: LLM fine-tuning on HeCBench is bottlenecked by sample volume and diversity; results suggest that expansion to broader kernel types and multi-architecture profiling is required for robust learning (Bolet et al., 6 May 2025).
- Modeling hidden FLOPs: Hardware-specific micro-op expansions (e.g., floating-point division decompositions, intrinsic math implementations) and dynamic behaviors (data-dependent branches, recursion) challenge both LLMs and standard static analyses; richer annotations and integrated hybrid analysis approaches are a promising path forward (Bolet et al., 4 Dec 2025).
- Representation and automation: As HeCBench continues to expand in scope and programming-model coverage, its canonical role as a curated, validated source of heterogeneous kernels positions it as a critical resource for future advances in static analysis, code generation, and automated optimization.
HeCBench and its derivative datasets are publicly available, with associated code and experimental results supporting community-driven extension and cross-validation (Bolet et al., 6 May 2025, Bolet et al., 4 Dec 2025, Kadosh et al., 2024).