
TritonBench: LLM GPU Kernel Benchmark

Updated 1 December 2025
  • TritonBench is a comprehensive, hardware-aware benchmark suite that evaluates both functional correctness and performance of LLM-generated Triton GPU kernels.
  • It combines real-world production kernels and synthesized operator tasks from GitHub and PyTorch-aligned channels to mirror diverse deep learning workloads.
  • It employs DSL-specific metrics and evaluation protocols, focusing on memory tiling, work-group scheduling, and hardware efficiency for optimized GPU programming.

TritonBench is a comprehensive, hardware-aware benchmark suite designed to evaluate the ability of LLMs to generate functionally correct and high-performance GPU kernels using the Triton domain-specific language (DSL). Triton, a Python-like DSL, has become a key enabler for efficient custom GPU operator development in deep learning frameworks by abstracting many of CUDA's low-level complexities. However, generating high-performance Triton kernels remains challenging and demands meticulous optimization of memory access, work-group scheduling, and hardware-specific features. Conventional code generation benchmarks insufficiently capture these requirements, focusing mainly on functional correctness. TritonBench remedies this by systematically measuring both correctness and hardware efficiency across a spectrum of operator synthesis tasks, establishing itself as the reference suite for research in LLM-driven GPU programming (Li et al., 20 Feb 2025, Wang et al., 31 Jul 2025, Li et al., 8 Jul 2025, Zhu et al., 25 Nov 2025).

1. Motivation and Problem Scope

The design of high-performance GPU kernels in Triton requires precise control over memory tiling, alignment, masking, grid/work-group configuration (tl.program_id, block sizes), and hardware-specific optimizations (shared memory, vectorization). Even for expert practitioners, peak hardware utilization typically necessitates extensive empirical tuning. While LLMs have demonstrated competence in general code generation, they lack the domain knowledge and performance sensitivity necessary for Triton kernel synthesis, resulting in subpar efficiency and correctness. This motivated the creation of TritonBench—a rigorous, targeted benchmark for evaluating both the syntactic/functional fidelity and the device-level efficiency of LLM-generated Triton code across real-world and synthetic workloads. Key challenges TritonBench exposes include DSL unfamiliarity, performance-unaware generation, and the combinatorial optimization inherent to practical GPU computing (Li et al., 20 Feb 2025).

2. Benchmark Suite Structure

TritonBench consists of two principal channels, each targeting a complementary operator synthesis regime:

  • TRITONBENCH-G (GitHub Channel): Harvests 184 unique, production-grade Triton kernels from 95 high-starred open-source repositories. Operators are manually curated, filtered by presence of @triton.jit, and rated for difficulty (d₁–d₅) per memory and scheduling complexity. This set covers an array of real-world primitives—Attention, MatMul, SoftMax, normalization, fused and pipelined kernels—mirroring open-source deep learning workloads (Li et al., 20 Feb 2025, Wang et al., 31 Jul 2025, Li et al., 8 Jul 2025, Zhu et al., 25 Nov 2025).
  • TRITONBENCH-T (PyTorch-Aligned Channel): Synthesizes 166 tasks by fusing combinations of 40 high-frequency and 40 low-frequency PyTorch operators, representing both common and under-represented operations. Each task is mapped to official PyTorch API calls and includes documentation for reproducibility and correctness assessment. A primary feature is the inclusion of naturally arising fusions found in deep learning pipelines (e.g., dropout→GELU→matrix multiply) (Li et al., 20 Feb 2025, Zhu et al., 25 Nov 2025).

Operator difficulty annotations, branch coverage (for correctness testing), and empirically determined input sizes/layouts are systematically provided. For AMD platforms, TritonBench is extended (as TritonBench-revised) with enhanced unit tests and output tolerance checks (Wang et al., 31 Jul 2025).
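The per-task metadata described above (channel, difficulty rating, empirically determined input sizes, output tolerance) can be pictured as a simple record. The following is a hypothetical sketch; the field names and paths are illustrative and are not the suite's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TritonBenchTask:
    """Illustrative record for one benchmark task (field names hypothetical)."""
    name: str                  # operator name, e.g. "softmax_fwd"
    channel: str               # "G" (GitHub) or "T" (PyTorch-aligned)
    difficulty: int            # 1..5, rated per memory/scheduling complexity
    reference: str             # reference kernel path or PyTorch API mapping
    input_shapes: list = field(default_factory=list)  # empirically chosen sizes
    tolerance: float = 1e-3    # output comparison tolerance

# Example entry (values invented for illustration)
task = TritonBenchTask(
    name="softmax_fwd", channel="G", difficulty=3,
    reference="repos/example/softmax.py",
    input_shapes=[(1024, 1024), (4096, 512)],
)
```

Grouping the annotations this way makes difficulty-stratified sampling and per-channel reporting straightforward.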

3. Evaluation Metrics and Protocol

The TritonBench evaluation protocol is characterized by a dual emphasis on functional and efficiency-oriented metrics, applied on supported hardware environments (NVIDIA A100, AMD MI250/MI300X). The suite enforces a standardized workflow: kernel synthesis → compilation → correctness check → performance benchmarking. Metrics adopted and formalized in the literature include (Li et al., 20 Feb 2025, Alessa et al., 3 Jul 2025, Li et al., 8 Jul 2025, Zhu et al., 25 Nov 2025):

  • Call Accuracy (CA):

$\text{CA} = \frac{\#\text{kernels that compile}}{N}$

Measures the fraction of generated kernels that successfully compile and are invokable.

  • Execution Accuracy (EA):

$\text{EA} = \frac{\#\text{kernels that compile and produce correct outputs}}{N}$

Denotes the fraction of synthesized kernels that are both compilable and pass output checks (exact or within a specified tolerance).

  • fastₚ (throughput threshold):

$\text{fast}_p = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\text{correct}_i \wedge (\text{speedup}_i > p)]$

Quantifies the frequency of kernels that, in addition to correctness, exceed p× speedup compared to a reference baseline (usually PyTorch Eager for T, repository reference for G).

  • Mean Speedup:

$\text{MeanSpeedup} = \frac{1}{N}\sum_{i=1}^{N} \text{speedup}_i$

where $\text{speedup}_i = T_{\text{baseline}}/T_{\text{generated}}$ for each task.
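The four definitions above can be computed directly from per-kernel evaluation results. A minimal sketch (the result-record keys are illustrative, not the harness's actual format):

```python
def benchmark_metrics(results, p=1.0):
    """Compute CA, EA, fast_p, and mean speedup from per-kernel results.

    `results` is a list of dicts with keys:
      compiled: bool, correct: bool, speedup: float (T_baseline / T_generated).
    """
    n = len(results)
    ca = sum(r["compiled"] for r in results) / n                       # Call Accuracy
    ea = sum(r["compiled"] and r["correct"] for r in results) / n      # Execution Accuracy
    fast_p = sum(r["correct"] and r["speedup"] > p for r in results) / n
    mean_speedup = sum(r["speedup"] for r in results) / n              # averaged over all N
    return {"CA": ca, "EA": ea, f"fast_{p}": fast_p, "mean_speedup": mean_speedup}

# Toy run with four synthetic results
m = benchmark_metrics([
    {"compiled": True,  "correct": True,  "speedup": 1.4},
    {"compiled": True,  "correct": False, "speedup": 0.0},
    {"compiled": False, "correct": False, "speedup": 0.0},
    {"compiled": True,  "correct": True,  "speedup": 0.8},
])
```

Note that, following the formulas as stated, the mean speedup averages over all $N$ tasks, so failed kernels (speedup 0) drag the mean down.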

Additional metrics in the original release include CODEBLEU similarity, kernel throughput (memory BW, FLOPs), and measured GPU efficiency (fraction of theoretical peak FLOPS) (Li et al., 20 Feb 2025). The harness supports warm-up and repeated runs for stable timing, test coverage for multiple input sizes, and compatibility with Triton’s official testing tools for both NVIDIA and ROCm stacks.
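The warm-up and repeated-run protocol can be sketched in plain Python. This is a CPU stand-in for illustration only: a real harness would use CUDA/HIP event timers and device synchronization rather than `time.perf_counter`:

```python
import time

def time_kernel(fn, *args, warmup=10, reps=100):
    """Time a callable with warm-up iterations; report the median of `reps` runs."""
    for _ in range(warmup):          # warm-up: trigger JIT compilation, prime caches
        fn(*args)
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to scheduling noise

def speedup(baseline_fn, generated_fn, *args, **kw):
    """speedup_i = T_baseline / T_generated, per the metric definitions above."""
    return time_kernel(baseline_fn, *args, **kw) / time_kernel(generated_fn, *args, **kw)
```

Taking the median rather than the mean keeps a single preempted run from skewing the timing, which matters when thresholding on fastₚ.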

4. Operator Coverage and Task Difficulties

TritonBench’s task corpus spans a comprehensive spectrum:

| Channel | No. Tasks | Kernel Types | Difficulty Distribution |
| --- | --- | --- | --- |
| TRITONBENCH-G | 184 | FlashAttn, BMM, fused/pipelined ops | d₁–d₅, biased to d₃–d₄ |
| TRITONBENCH-T | 166 | Pointwise, reductions, BatchedOps | d₁ (simple) to d₅ (complex) |

TRITONBENCH-G operators typically require mastery of shared-memory tiling, block swizzling, or multi-stage pipeline optimization. TRITONBENCH-T targets both beginner and advanced use cases, but is less dominated by combinatorial hardware-aware optimizations (Zhu et al., 25 Nov 2025). Comparison with other public kernel suites (e.g., KernelBench) highlights TritonBench’s uniquely high complexity and the prevalence of production-level, optimized primitives (Li et al., 20 Feb 2025, Wang et al., 31 Jul 2025).

5. Baseline Results and Empirical Observations

Empirical findings indicate a pronounced performance gap between LLM-driven synthesis and hand-optimized or reference implementations, especially for the real-world (G) channel:

| Model | Channel | Call/Exec (%) | fast₁ (%) | MeanSpeedup |
| --- | --- | --- | --- | --- |
| GPT-o1 | G | N/A / 23.9 | N/A | 1.14 |
| Qwen2.5-sft | G | N/A / 10.9 | N/A | 1.56 |
| DeepSeek-R1 | T (zero/one-shot) | 53.0 / 45.8 | up to 22.89 | up to 1.91 |
| AutoTriton | G | 15.76 / 15.76 | 7.61 | N/A |
| MTMC (A100) | T | 64.46 / 54.82 | 19.28 | 0.64 |

A notable finding is the pronounced difficulty gap: "G" kernels with dense hardware-aware optimizations defeat most LLMs, and naive prompting yields single-digit execution accuracy. Hierarchical and RL-based agents maintain a consistent, but still modest, edge (Zhu et al., 25 Nov 2025, Li et al., 8 Jul 2025).

6. Error Analysis, Methodological Insights, and Community Recommendations

The principal bottlenecks for LLMs on TritonBench are rooted in DSL unfamiliarity and failure to reproduce memory-scheduling, tiling, and synchronization idioms accurately. Both syntax/name/reference errors and DSL-concept failures (e.g., improper use of tl.program_id, unmasked loads/stores) are prevalent, especially in zero-shot settings. One-shot inference with instructive in-domain exemplars mitigates runtime and logic errors, increasing accuracy by ≈10% in many cases (Li et al., 20 Feb 2025). Performance pitfalls frequently trace to missing or poorly chosen memory tiling, default or ill-fitting thread-block sizes, or missing fusion, leading to reduced occupancy and bandwidth underutilization.
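The masked-load idiom behind many of these failures can be illustrated with a CPU emulation of the Triton pattern `offs = pid * BLOCK + tl.arange(0, BLOCK); mask = offs < n`. This is plain Python mimicking the semantics, not actual Triton code:

```python
def blocked_copy(src, n, block=128):
    """Emulate a grid of Triton programs, each copying one block with a bounds mask."""
    dst = [0.0] * n
    num_programs = (n + block - 1) // block              # grid size, like triton.cdiv(n, BLOCK)
    for pid in range(num_programs):                      # each pid is one kernel instance
        offs = [pid * block + i for i in range(block)]   # tl.arange(0, BLOCK) analogue
        for o in offs:
            if o < n:                                    # mask = offs < n guards the tail block
                dst[o] = src[o]                          # masked load + masked store
    return dst

data = [float(i) for i in range(300)]   # 300 is not a multiple of 128
assert blocked_copy(data, 300) == data  # tail block handled correctly by the mask
```

Omitting the `offs < n` guard, the common LLM mistake noted above, corresponds to out-of-bounds loads/stores on the final block whenever the problem size is not a multiple of the tile size.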

Recommended countermeasures—repeated across several works—all trace to TritonBench’s insights into DSL-specific learning:

  1. DSL-aware pretraining: Exposure to Triton compiler IR and kernel traces familiarizes models with necessary primitives and memory layout patterns (Li et al., 20 Feb 2025).
  2. Performance-guided decoding and RL: Direct throughput/GFLOP reward integration in decoding or policy optimization improves both fidelity and efficiency (Li et al., 8 Jul 2025).
  3. Hierarchical agentic designs: Decoupling search of optimization strategies from implementation (Macro/Micro) heightens coverage of the vast optimization space (Zhu et al., 25 Nov 2025).
  4. Curated, difficulty-stratified exemplars: Spanning typical and complex operators in few-shot pools aids generalization.
  5. Automated static analysis: Post-hoc kernel checks for vectorization, memory shape alignment, and work-group sizing to catch code that would otherwise pass superficial tests but fail performance targets.
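Item 5 could be approximated with a lightweight textual scan of generated kernel source. The checks below are heuristic and illustrative; a real implementation would inspect Triton's compiler IR rather than pattern-match source text:

```python
import re

def lint_triton_source(src: str) -> list:
    """Flag common performance-relevant omissions in Triton kernel source (heuristic)."""
    warnings = []
    if "tl.load" in src and "mask=" not in src:
        warnings.append("tl.load without mask= may read out of bounds on tail blocks")
    if "@triton.jit" in src and not re.search(r"BLOCK\w*\s*:\s*tl\.constexpr", src):
        warnings.append("no constexpr block-size parameter; tile size cannot be tuned")
    if "@triton.autotune" not in src:
        warnings.append("no autotune configs; default launch parameters may underutilize the GPU")
    return warnings

# A toy kernel: properly masked and parameterized, but not autotuned
kernel_src = """
@triton.jit
def add(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=offs < n), mask=offs < n)
"""
```

Such post-hoc checks catch kernels that pass functional tests on aligned sizes yet would miss performance (or correctness) targets on realistic shapes.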

7. Implementations, Platform Extensions, and Limitations

The official TritonBench suite is open-source and supports both NVIDIA (CUDA) and AMD (ROCm MI250/MI300X) GPU backends. Reference implementations provide setup scripts, benchmark configuration files (in YAML/JSON), and Python APIs for reproducible, cross-platform evaluation. Recent works provide expanded suites (e.g., TritonBench-revised for ROCm, with additional unit tests and tensor-based output checks) and improved test coverage, encouraging broad community contributions (Wang et al., 31 Jul 2025).

Known limitations include: maximum of six unit tests per kernel (potential for false positives), lack of integration with test-case generation/mutation frameworks (e.g., EvalPlus), and focus on pointwise, reduction, and linear algebra primitives to the partial exclusion of 3D convolutions, sparse, and multi-tensor fusion (Wang et al., 31 Jul 2025). Efforts to support further kernel classes, multi-GPU/data-parallel scenarios, and comprehensive profiling (BW, GFLOP, occupancy counters via vendor APIs) are ongoing or proposed (Li et al., 20 Feb 2025).
