KernelBench: Evaluating Computational Kernels
- KernelBench is a suite of benchmark frameworks and methodologies that assess the speed and correctness of computational kernels across various domains.
- It integrates standardized benchmarks with innovative techniques such as iterative refinement and reinforcement learning to optimize kernel performance.
- The framework provides actionable insights by measuring speedup, portability, and accuracy, influencing practical improvements in GPU and CPU optimizations.
KernelBench is an umbrella term that refers to a constellation of benchmark suites, methodologies, and frameworks designed for evaluating, analyzing, and enhancing the performance and correctness of computational kernels across a range of domains, including lattice Boltzmann methods, GPU and CUDA optimization, automated kernel generation by LLMs, and automated kernel tuning. Recent work focuses on evaluating the ability of advanced LLMs and autotuning algorithms to generate, optimize, and reason about efficient kernels in real machine learning and scientific workloads (Ouyang et al., 14 Feb 2025, Tørring et al., 2023, Li et al., 8 Jul 2025).
1. Definition and Scope
KernelBench collectively denotes open-source frameworks and benchmark suites for assessing both the speed and functional correctness of computational kernels, where a “kernel” is a code segment implementing a core, high-performance operation on hardware such as CPUs or GPUs. Modern iterations of KernelBench target deep learning and scientific computing workloads, drawing their tasks from widely used libraries (e.g., PyTorch) and covering a range of computation patterns, from single matrix multiplications to end-to-end model blocks.
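For illustration, a task in the LLM-oriented KernelBench pairs a reference PyTorch module with functions that generate its inputs. The sketch below is a minimal Level 1 style example in that spirit; the `Model`/`get_inputs`/`get_init_inputs` layout follows the general pattern described here but should be read as an illustrative assumption rather than a verbatim benchmark entry.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Reference implementation of a single-operator (Level 1 style) task:
    a plain matrix multiplication that a generated kernel must reproduce."""
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.matmul(a, b)

def get_inputs():
    # Random problem instance used for correctness checks and timing.
    return [torch.randn(1024, 4096), torch.randn(4096, 2048)]

def get_init_inputs():
    # Constructor arguments for Model (none for this task).
    return []
```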
KernelBench’s objectives are to:
- Provide standardized, challenging benchmarks reflective of real-world performance bottlenecks.
- Enable the evaluation of optimization strategies, autotuners, and LLM-generated code across a spectrum of kernel difficulties and hardware architectures.
- Serve as a measurement tool for both correctness and speedup, with metrics that directly translate to practical engineering progress (Ouyang et al., 14 Feb 2025, Tørring et al., 2023).
Historically, the term has also applied to specialized suites for domains like lattice Boltzmann CFD solvers (Wittmann et al., 2017), and dynamic kernel detection in application traces (Uhrie et al., 2020), but recent focus is on GPU kernel optimization and machine learning workloads.
2. Benchmark Suite Construction
Recent incarnations of KernelBench define benchmarks as curated sets of computation tasks (kernels), each paired with a reference implementation (often PyTorch code):
- Levels of difficulty: Typically, tasks are divided into three tiers—Level 1 (single primitive operation, such as a convolution), Level 2 (operation sequences requiring fusion or joint optimization), and Level 3 (subgraphs or even entire deep learning models) (Ouyang et al., 14 Feb 2025).
- Realistic coverage: Kernels in these suites include GEMM (general matrix multiplication), N-body calculations, image convolutions, Hotspot (thermal simulation), expdist (microscopy), dedispersion (radio astronomy), and point-in-polygon tests for geospatial databases (Tørring et al., 2023). For LLM-based kernel generation, 250 tasks drawn from PyTorch workloads cover the operational spectrum of HPC and AI engineering (Ouyang et al., 14 Feb 2025).
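For contrast with the single-operator case above, a Level 2 style task chains several operations whose fusion a generated kernel is expected to exploit. The following sketch is illustrative only, not an actual benchmark entry:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Level 2 style reference: convolution followed by bias add and ReLU,
    a sequence that a hand-written or generated kernel would typically fuse."""
    def __init__(self, in_channels=3, out_channels=16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.conv(x) + 1.0)  # conv -> elementwise add -> ReLU

def get_inputs():
    return [torch.randn(8, 3, 224, 224)]

def get_init_inputs():
    return [3, 16]
```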
All benchmarks are structured to allow fine-grained performance and correctness evaluation—kernels are run, validated, and profiled in a standardized pipeline that reproduces the engineering workflow of optimizing computational hot spots.
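A minimal sketch of such a run-validate-profile step is given below, assuming a candidate kernel exposed as a callable with the same signature as the reference; the tolerances, warmup counts, and timing choices are illustrative, not the benchmark's exact settings.

```python
import time
import torch

def benchmark(fn, inputs, warmup=10, iters=100):
    """Median wall-clock time of fn(*inputs), synchronizing the GPU if present."""
    for _ in range(warmup):
        fn(*inputs)
    times = []
    for _ in range(iters):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        fn(*inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

def evaluate(reference_fn, candidate_fn, inputs, atol=1e-4, rtol=1e-4):
    """Return (is_correct, speedup) for a candidate kernel against the reference."""
    correct = torch.allclose(reference_fn(*inputs), candidate_fn(*inputs),
                             atol=atol, rtol=rtol)
    speedup = benchmark(reference_fn, inputs) / benchmark(candidate_fn, inputs)
    return correct, speedup
```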
3. Evaluation Metrics and Methods
Evaluation on KernelBench involves stringent, multi-dimensional metrics:
- Speedup and Correctness via fastₚ: The principal metric, fastₚ, is the fraction of tasks for which the generated kernel is both functionally correct and achieves a speedup greater than a threshold p over a baseline (often PyTorch Eager mode). The metric is
$$\mathrm{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{correct}_i \wedge \mathrm{speedup}_i > p\right],$$
where N is the total number of tasks and speedupᵢ is the baseline runtime divided by the generated kernel's runtime (Ouyang et al., 14 Feb 2025, Li et al., 8 Jul 2025, Li et al., 18 Jul 2025); a minimal sketch of computing this metric follows this list.
- Performance portability: Assesses how well tuned kernels carry over across hardware. Empirical studies have shown that the optimal tuning from one GPU yields between 58.5% and 99.9% of peak performance on another, indicating that per-architecture tuning is essential (Tørring et al., 2023).
- Other relevant metrics: Include compilation accuracy, execution accuracy, convergence rate for autotuners, optimal speedup over median configuration, performance locality/minima centrality, and permutation feature importance (PFI) to assess the influence of tuning parameters (Tørring et al., 2023, Li et al., 8 Jul 2025).
- Profiling and Feedback: The suite pipes compiler errors, execution errors, and performance profiles back into an iterative loop, mirroring how engineers troubleshoot and refine their code (Ouyang et al., 14 Feb 2025).
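Given per-task (correct, speedup) pairs such as those produced by an evaluation harness like the sketch in Section 2, fastₚ reduces to a single fraction. The helper below is a direct transcription of the formula above, not KernelBench's own implementation:

```python
def fast_p(results, p=1.0):
    """Fraction of tasks whose generated kernel is correct AND faster than
    the baseline by more than a factor of p.

    results: iterable of (is_correct: bool, speedup: float) per task.
    """
    results = list(results)
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Example: of 4 tasks, 3 kernels are correct and 2 of those beat the baseline.
print(fast_p([(True, 1.8), (True, 0.9), (False, 2.5), (True, 1.2)]))  # 0.5
```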
4. Methodological Innovations and Agentic Workflows
Recent research using KernelBench is distinguished by methodologies that leverage feedback-driven and reinforcement learning approaches:
- Iterative refinement: LLMs are exposed to test-time feedback (compilation errors, correctness checks, and profiling) and can revise their kernel code over multiple cycles, closely emulating expert development; a schematic version of this loop is sketched after this list. Iterative feedback proves crucial: on Level 2 (fused operation) tasks, iterative methods can boost fastₚ rates from 36% to as high as 72% for certain models (Ouyang et al., 14 Feb 2025).
- Reward design in RL for code optimization: Models such as AutoTriton and CUDA-L1 use group-normalized relative policy optimization (GRPO) and a composite reward that spans both rule-based (syntax adherence) and execution-based (functional and speedup) components (Li et al., 8 Jul 2025, Li et al., 18 Jul 2025). CUDA-L1 further employs contrastive RL, where the model is explicitly tasked to compare and reason about multiple candidate kernels and their measured speedups.
- Contrastive learning for optimization discovery: By studying and comparing code variants alongside their speedup metrics, models can learn fundamental CUDA optimization principles, such as the multiplicative benefit of stream management combined with CUDA graph usage, and can even reject so-called "harmful" optimizations discovered empirically (Li et al., 18 Jul 2025).
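The iterative-refinement workflow can be written schematically as a loop that feeds compiler errors, correctness failures, and measured speedups back into the next generation attempt. In the sketch below, `propose_kernel`, `compile_and_load`, and `evaluate` are hypothetical callables standing in for the LLM, the build step, and an evaluation harness like the one in Section 2:

```python
def refine_kernel(task, propose_kernel, compile_and_load, evaluate, max_rounds=5):
    """Iteratively request a kernel from a model, feeding back compiler errors,
    correctness failures, and measured speedups (hypothetical interfaces)."""
    feedback = None
    best = None
    for _ in range(max_rounds):
        source = propose_kernel(task, feedback)      # LLM generation step
        try:
            kernel = compile_and_load(source)        # e.g. build a GPU extension
        except Exception as err:                     # compiler error -> feedback
            feedback = f"compilation failed: {err}"
            continue
        correct, speedup = evaluate(task, kernel)    # correctness + profiling
        if not correct:
            feedback = "output mismatch against the reference implementation"
            continue
        feedback = f"correct, speedup {speedup:.2f}x; try to improve further"
        if best is None or speedup > best[1]:
            best = (source, speedup)
    return best
```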
5. Practical Implications and Observed Results
Empirical findings across several large studies reveal:
- Difficulty of the task: Even powerful models such as R1, o1, DeepSeek-R1, and Claude-4-Sonnet achieve functionally correct and high-speedup kernel generations in fewer than 20% of cases under naive (one-shot) decoding (Ouyang et al., 14 Feb 2025). Most generated kernels either fail to compile, fail functional checks, or fail to deliver speedup; however, notable outliers exist where LMs exceed baseline performance by over 10×.
- Reinforcement learning and iterative workflows: RL-trained and feedback-integrated models (e.g., AutoTriton, CUDA-L1) show marked improvements over vanilla LLMs, including substantially higher compilation and execution accuracy and sizable median speedups over the baseline on KernelBench, and they generalize well to unseen GPU architectures (Li et al., 8 Jul 2025, Li et al., 18 Jul 2025).
- Portability and hardware-specific tuning: Optimal parameter values differ significantly between GPUs (e.g., an RTX3060-configured kernel yields only 73% of the optimal on an RTX2080Ti) (Tørring et al., 2023). However, RL-discovered optimization strategies transfer well, with minimal efficiency loss across A100, H100, RTX3090, L40, H20, and others (Li et al., 18 Jul 2025).
- Insights into optimization: Contrastive RL agents independently (re)discover known and non-obvious optimization strategies. The multiplicative interaction between certain kernel optimizations (such as stream management as a "gatekeeper" for CUDA graph speedup) is empirically demonstrated, providing deeper insight into the structure of effective kernel programming (Li et al., 18 Jul 2025).
6. Integration with Autotuning, Synthesis, and Trace-Based Analysis
KernelBench supports versatile integration for research in:
- Autotuning research: The suite provides parameterized kernels, shared interfaces for new tuners (Python and C++), and standardized metrics, making it a reference testbed for comparing search algorithms (random search, Optuna, SMAC3, Kernel-Tuner, KTT, etc.) (Tørring et al., 2023); a minimal random-search sketch follows this list.
- Program synthesis and static analysis: KernelBench benchmarks the effectiveness of synthesis-based compilers and fault localization methods that must generate safe, high-speed kernels or accurately pinpoint kernel bugs in large codebases (Zhou et al., 26 May 2025, Xu et al., 2021).
- Dynamic kernel extraction: KernelBench can be used in conjunction with trace analysis tools (e.g., TraceAtlas) to identify or isolate kernels directly from application traces for further benchmarking and optimization (Uhrie et al., 2020).
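As a high-level illustration of the tuner-facing side, the sketch below runs a plain random search over a kernel's tuning parameters. The parameter names and the `time_kernel` callback are assumptions for illustration, not the suite's actual API:

```python
import random

def random_search(time_kernel, space, budget=50, seed=0):
    """Randomly sample tuning configurations and keep the fastest one.

    time_kernel: callable mapping a configuration dict to a runtime in ms
                 (assumed to raise on invalid configurations).
    space: dict of parameter name -> list of allowed values.
    """
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        try:
            t = time_kernel(cfg)
        except Exception:
            continue  # skip configurations that fail to compile or run
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Example search space for a tiled GEMM kernel (illustrative parameter names).
space = {"block_x": [16, 32, 64, 128], "block_y": [1, 2, 4, 8], "unroll": [1, 2, 4]}
```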
7. Open Challenges and Future Research Directions
Despite recent progress, several challenges persist:
- Error-proneness: The majority of automatically generated or tuned kernels still fail to meet correctness and performance criteria, particularly for high-speed or hardware-specific optimizations (Ouyang et al., 14 Feb 2025).
- Data scarcity and domain adaptation: CUDA and Triton remain low-resource languages in public training corpora, limiting the effectiveness of LLM-based methods (Ouyang et al., 14 Feb 2025, Li et al., 8 Jul 2025).
- Consistency and coverage: Certain kernel types (e.g., unconventional convolutions or complicated fusions) remain unaddressed even after repeated sampling. The performance of generated kernels can vary significantly with GPU microarchitecture (Ouyang et al., 14 Feb 2025).
- Advanced agentic workflows: Future pathways include deeper integration of execution and profiling feedback, richer open-source kernel datasets, and expansion of benchmarking beyond GPUs to other accelerators. The development of generalized agentic workflows for continuous kernel optimization is highlighted as a promising avenue for reducing manual engineering effort and improving deployment efficiency (Ouyang et al., 14 Feb 2025, Li et al., 8 Jul 2025, Li et al., 18 Jul 2025).
Taken together, KernelBench serves as a critical nexus for the evaluation and advancement of kernel optimization methodologies. It offers rigorous benchmarks that directly map to practical performance gains in AI systems and high-performance computing, and its ongoing evolution continues to shape the landscape of automated code generation, autotuning, and reinforcement learning-based software engineering.