
MultiKernelBench: Kernel Benchmark Suite

Updated 3 February 2026
  • MultiKernelBench is a comprehensive suite of benchmarks and frameworks for evaluating computational kernels across diverse hardware platforms and application domains.
  • It integrates power–performance co-optimization, runtime autotuning, and LLM-driven kernel generation to enhance energy efficiency and throughput in heterogeneous systems.
  • The framework’s modular design and detailed statistical modeling enable reproducible research and practical insights for both microarchitecture profiling and deep learning accelerator optimization.

MultiKernelBench constitutes a set of foundational resources, benchmarks, and frameworks designed for the systematic evaluation, characterization, and generation of computational kernels across diverse hardware platforms and application domains. Recent formulations target three main pillars: (1) benchmarking and power–performance co-optimization of concurrent GPU execution (Goswami et al., 2020), (2) autotuned kernel benchmarking for performance adaptability in heterogeneous systems (Petrovič et al., 2019), and (3) comprehensive, multi-platform LLM-based kernel generation evaluation across deep learning accelerators (Wen et al., 20 Jul 2025). These efforts seek to standardize and advance kernel research from low-level microarchitecture profiling to state-of-the-art machine learning driven code generation.

1. Foundational Scope and Historical Motivations

MultiKernelBench originated as a response to the growing need for systematized, reproducible workload evaluation and optimization as throughput accelerators (e.g., GPUs, NPUs, TPUs) proliferated in high-performance computing and machine learning infrastructures. Early efforts focused on empirical-cum-statistical approaches to power–performance characterization and co-optimization of concurrent kernel workloads on NVIDIA GPUs (Goswami et al., 2020). This included modeling spatial and temporal concurrency and its impact on throughput and energy efficiency.

As hardware complexity increased and manual kernel tuning became increasingly untenable, the scope expanded. MultiKernelBench emerged as a name for diverse efforts, including curated autotuned benchmarks for heterogeneous compute (Petrovič et al., 2019) and evaluation suites for LLM-powered kernel code generation across multiple vendor platforms (Wen et al., 20 Jul 2025). The evolving objective is to facilitate comparative, fine-grained, and extensible benchmarking, and to catalyze reproducible research in kernel design and optimization.

2. Architecture and Methodological Design

The MultiKernelBench paradigm encompasses three major classes of methodology:

A. Power–Performance and Concurrency-Oriented Suite (Goswami et al., 2020):

  • Selection starts with coverage of the thirteen "Berkeley dwarfs," ensuring representation across structured/unstructured grids, dense/sparse linear algebra, Monte Carlo, graph, dynamic programming, spectral, combinatorial, and sorting algorithms.
  • Intrinsic characterization is performed using microarchitecture-independent metrics and PCA, followed by multi-layer clustering to group kernels by both behavioral and power profiles.
  • Integration strategies include symmetric (contention-focused), asymmetric (complementary resource usage), and application-like (co-appearing) mixes. Automated code generation (Amalgam generator) produces workloads with controlled spatial/temporal overlap.
  • Measurement employs time-synchronized current-sensor instrumentation and fine-grained profiling.
  • Statistical affinity clusters guide the selection of representative pair and triple-kernel mixes for systematic evaluation.
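The mix-construction step can be illustrated with a small sketch: given per-kernel cluster labels from the behavioral/power clustering, same-cluster pairs form symmetric (contention-focused) mixes and cross-cluster pairs form asymmetric (complementary) mixes. The kernel names and labels below are hypothetical stand-ins, not the suite's actual clustering.

```python
from itertools import combinations

# Hypothetical cluster label per kernel: same-cluster kernels stress similar
# resources (symmetric mix); different clusters give complementary usage
# (asymmetric mix).
clusters = {
    "GEMM": 0,       # compute-bound
    "BFS": 1,        # irregular / latency-bound
    "Sort": 1,
    "Stencil2D": 2,  # bandwidth-bound
}

def pair_mixes(clusters):
    """Enumerate all kernel pairs, split into symmetric vs. asymmetric mixes."""
    symmetric, asymmetric = [], []
    for a, b in combinations(sorted(clusters), 2):
        (symmetric if clusters[a] == clusters[b] else asymmetric).append((a, b))
    return symmetric, asymmetric

sym, asym = pair_mixes(clusters)
```

The same split extends to triple-kernel mixes by iterating over `combinations(clusters, 3)`.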

B. Autotunable Benchmark Collection with KTT (Petrovič et al., 2019):

  • Features ten kernels from domains such as Krylov solvers, dense and batched linear algebra, stencil PDEs, molecular force fields, and single-particle cryo-EM analysis.
  • Exposes a tailored parameter space per kernel (e.g., work-group size, memory tiling, loop unrolling, atomicity, vectorization).
  • Tuning is orchestrated via KTT, supporting both exhaustive offline search and dynamic runtime autotuning, exploiting runtime JIT compilation and kernel invocation statistics. The parameter space is controlled to allow feasible in-situ search with analytically modeled amortization cost.
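The dynamic-autotuning loop can be sketched as follows; the parameter space and timing function are hypothetical stand-ins, and only the sampled-search structure mirrors the KTT-style approach described above (each trial still performs real work, which is what amortizes the search cost).

```python
import random

# Hypothetical tunable parameter space: (work-group size, unroll factor).
space = [(b, u) for b in (64, 128, 256) for u in (1, 2, 4)]

def run_kernel(config):
    # Stand-in for a timed kernel launch; returns a synthetic runtime in ms.
    block, unroll = config
    return 1.0 / block + 0.05 * unroll

def autotune(space, budget, runner=run_kernel):
    """Try up to `budget` randomly sampled configurations, keep the fastest."""
    best_cfg, best_t = None, float("inf")
    for config in random.sample(list(space), min(budget, len(space))):
        t = runner(config)  # real invocation: work is done while tuning
        if t < best_t:
            best_cfg, best_t = config, t
    return best_cfg, best_t

cfg, t = autotune(space, budget=len(space))
```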

C. LLM-Based Kernel Generation Benchmark (Wen et al., 20 Jul 2025):

  • 285 distinct tasks covering 14 kernel categories (activation, broadcast, convolution, full-architecture, fusion, loss, math, matrix-multiply, normalization, optimizer, pooling, indexing, resize, reduce) and three hardware platforms (NVIDIA CUDA, Huawei AscendC, Google TPU Pallas).
  • Employs a modular backend abstraction segregating compilation, execution, and profiling steps per platform, allowing rapid extensibility (e.g., addition of Triton, HIP) with minimal code changes.
  • Benchmarks LLM kernel-generation performance using controlled prompt templates with "category-aware one-shot exemplars," which demonstrably enhance generalization and correctness compared to category-naive approaches.
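The backend abstraction can be pictured as a small registry sketch in which each platform implements compile/run/profile behind one interface, so adding a backend touches no core logic. The class and method names here are assumptions for illustration, not MultiKernelBench's actual API.

```python
from abc import ABC, abstractmethod

class KernelBackend(ABC):
    @abstractmethod
    def compile(self, source):
        """Build the kernel; return an opaque handle or raise on failure."""

    @abstractmethod
    def run(self, binary, inputs):
        """Execute and return outputs for correctness checking."""

    @abstractmethod
    def profile(self, binary, inputs):
        """Return wall-clock latency in milliseconds."""

BACKENDS = {}

def register(name):
    """Decorator that plugs a backend class into the global registry."""
    def wrap(cls):
        BACKENDS[name] = cls()
        return cls
    return wrap

@register("echo")
class EchoBackend(KernelBackend):
    # Trivial stand-in "platform" used only to exercise the plumbing.
    def compile(self, source):
        return source.strip()
    def run(self, binary, inputs):
        return inputs
    def profile(self, binary, inputs):
        return 0.0
```

A new accelerator family would add one such class behind `@register("triton")` or similar, leaving the evaluation core untouched.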

3. Suite Composition and Kernel Diversity

MultiKernelBench’s efficacy in both traditional and LLM-generation contexts stems from carefully curated kernel sets and fine-grained task categorization.

Representative Kernel Categories

| Category | Example Kernels / Tasks | Architectural Focus |
|---|---|---|
| Linear Algebra | GEMM, BiCG, Batched GEMM | Compute- vs bandwidth-bound |
| Stencil/Convolution | 2D Conv, Hotspot (heat diffusion), FFT | Memory locality, cache tiling |
| Graph/Combinatorial | BFS, Raytracing, Sorting (HY) | Irregular access/divergence |
| Monte Carlo | BN, BS, LM | Random access, latency hiding |
| Reduction/Pooling | Tree-reduction, matrix pooling | Inter-warp/thread aggregation |
| Specialized | AES (logic), N-body, 3D Fourier | Bit ops, O(n²) scaling, 3D I/O |

This combined coverage ensures that the suites collectively span dense, sparse, regular, and highly irregular computational motifs encountered in modern data center, scientific, and ML workloads.

Principal component analysis (PCA) on both architecture-agnostic metrics and power–performance metrics confirms broad variance coverage (73% with 3 PCs, >93% with 5 PCs), supporting the diversity claim (Goswami et al., 2020).
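The variance-coverage check itself is straightforward to reproduce: standardize the per-kernel metric matrix and inspect the cumulative explained variance of the leading principal components. The data below is synthetic; the study uses real profiler metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
metrics = rng.normal(size=(30, 8))   # synthetic: 30 kernels x 8 metrics

# Standardize each metric column to zero mean, unit variance.
X = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# PCA via SVD of the standardized matrix: squared singular values give the
# variance captured by each principal component.
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / (s**2).sum()
cumulative = np.cumsum(explained)    # cumulative[k-1] = coverage of k PCs
```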

4. Performance, Energy Efficiency, and Portability

Rigorous analysis reveals how kernel combinations and automated parameter tuning impact key system metrics:

  • Concurrency and Savings (Goswami et al., 2020): Spatial and temporal concurrency with optimized kernel mixes yields energy savings of 32% (GTX470), 26% (M2050), and 33% (K20). Performance-per-watt (IPW) scales near-linearly with concurrency for complementary mixes. Multi-kernel occupancy increases by up to 91% over single-kernel runs, supporting more efficient resource utilization.
  • Tuning-Induced Efficiency (Petrovič et al., 2019): Post-autotuning, select kernels reach 80–92% of theoretical peak throughput on modern GPUs. Dynamic autotuning achieves 88–96% of offline-oracle performance for real applications (e.g., cryo-EM), typically amortizing tuning overhead within a few hundred kernel invocations.
  • Portability: Ported kernels retain up to 83% of their performance across NVIDIA GPU generations, but notably less (31–47%) when moving from GPU to CPU or Xeon Phi, underlining the need for hardware-specific optimization.
  • Co-optimization: Multi-kernel mixes improve energy-delay-product (EDP) by 36–37%, but some mixes with excessive symmetric contention can degrade overall efficiency—highlighting the importance of intelligent resource affinity planning.

5. LLM-Based Kernel Generation: Structure, Metrics, and Findings

The recent iteration of MultiKernelBench (Wen et al., 20 Jul 2025) serves as the first comprehensive benchmark for assessing LLM-driven kernel generation across multiple architectures.

  • Platform- and Category-Coverage: 285 tasks, three platforms, 14 categories, including previously missing optimizer and fine-grained tasks.
  • Backend Abstraction: Unified backend interface supports plug-in extension for new accelerator families without changes to core logic.
  • Category-Aware Prompting: Provision of in-category exemplars for one-shot prompting; this yields 100–160% gains in Pass@1 correctness on emerging platforms such as Huawei AscendC and Google TPU.
  • Evaluation Metrics:
    • Compilation@k: At least one valid binary in k LLM generations
    • Pass@k: At least one output among k generations satisfies $|output_{LLM} - output_{ref}| < atol + rtol \cdot |output_{ref}|$, with $N = 5$, $atol = 10^{-2}$, $rtol = 10^{-2}$.
    • SpeedUp$_\alpha$@k: Fraction of tasks where at least one generated kernel achieves speedup $\geq \alpha$ over the PyTorch baseline.
  • Key Findings: Even top models solve only ~12% (103/855) of all platform-task pairs at $k = 1$ (greedy decoding); CUDA is the most tractable, while AscendC and Pallas remain challenging due to data scarcity and toolchain divergence. Certain categories (Activation, Reduce, Broadcast) reach >80% Pass@1, whereas convolutions and large fused tasks remain unsolved.
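The correctness criterion and Pass@k can be sketched directly from the definitions above: `is_correct` mirrors the quoted elementwise tolerance test, and `pass_at_k` is the standard unbiased estimator 1 − C(n−c, k)/C(n, k) over n generations with c correct ones. The helper names are illustrative, not the benchmark's actual API.

```python
import math
import numpy as np

ATOL = RTOL = 1e-2  # tolerances quoted in the metric definition above

def is_correct(out, ref, atol=ATOL, rtol=RTOL):
    """Elementwise tolerance check of a generated kernel's output."""
    return bool(np.all(np.abs(out - ref) < atol + rtol * np.abs(ref)))

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```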

6. Statistical Modeling and Analytical Framework

Across the various incarnations of MultiKernelBench, empirical and statistical modeling play a central role:

  • Affinity Clustering (Goswami et al., 2020): Ensemble clustering using both behavioral (e.g., instruction mix, divergence) and power (e.g., average/peak wattage, EDP) vectors groups kernels for actionable scheduling.
  • Tuning and Sampling Cost (Petrovič et al., 2019): Random-sampling efficiency for autotuning is modeled via explicit formulas. For a fraction $r$ of "good" configurations, achieving probability $p$ of finding at least one in $S$ random draws requires $S \geq \log(1-p)/\log(1-r)$. The number of invocations $n$ needed to amortize the tuning cost, for relative average performance $rp$, is $n = rp \cdot S \cdot (t_{avg}/t_{best} - 1)/(1 - rp)$.
  • Performance/Energy Metrics (Goswami et al., 2020):
    • $P_{avg}$, $P_{peak}$: time-averaged and instantaneous power
    • $IPW = T / P_{avg}$: performance-per-watt
    • $EDP = E_{total} \times T_{exec}$: energy-delay product
    • $IPO$, $IEO$: percent changes in $P_{avg}$/$E_{total}$ due to concurrency
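The two sampling-cost formulas above can be exercised numerically; all inputs below are illustrative, not measurements from the paper.

```python
import math

def sample_size(p, r):
    # Draws S needed so that P(at least one "good" config) >= p when a
    # fraction r of the configuration space is good:
    # S >= log(1 - p) / log(1 - r).
    return math.ceil(math.log(1 - p) / math.log(1 - r))

def amortization_invocations(rp, S, t_avg, t_best):
    # Invocations n after which in-situ tuning pays for itself, given
    # relative average performance rp of the sampled configurations:
    # n = rp * S * (t_avg / t_best - 1) / (1 - rp).
    return rp * S * (t_avg / t_best - 1) / (1 - rp)

S = sample_size(p=0.95, r=0.05)   # 59 draws for 95% confidence
n = amortization_invocations(rp=0.9, S=S, t_avg=2.0, t_best=1.0)
```

With 5% good configurations, 59 random draws suffice for 95% confidence, and the tuning cost is amortized after a few hundred invocations, consistent with the "few hundred kernel invocations" figure reported above.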

7. Extensibility, Limitations, and Prospects

MultiKernelBench’s modular framework accommodates emerging hardware and new kernel classes:

  • Extensibility (Wen et al., 20 Jul 2025): Backend plugins can be added for new platforms (e.g., AMD ROCm, language variants) in under 20 lines of code.
  • Limitations: Limited platform training data affects LLM generalization, especially for underrepresented APIs; kernel category (e.g., normalization, multi-dim reductions) and shape awareness remain open challenges.
  • Future Directions: Enhancements under consideration include shape-aware prompting, further sub-category partitioning, integration with retrieval-augmented generation for API doc access, and continual expansion as real frameworks introduce new operators.

A plausible implication is that as platforms diversify and workloads converge, MultiKernelBench will continue as a reference point for both quantitative comparison and qualitative advances in kernel generation, tuning, and deployment for heterogeneous compute environments.

