FlashInfer-Bench: GPU Kernel Benchmarking Framework
- FlashInfer-Bench is a standardized, closed-loop framework for automated generation, evaluation, and deployment of AI-generated GPU kernels in LLM inference systems.
- It employs a formal trace schema linking kernel definitions, workloads, implementations, and evaluations to ensure reproducibility and rigorous performance standards.
- The apply() API enables dynamic, near-zero overhead kernel substitution in production, significantly reducing system-level inference latency.
FlashInfer-Bench is a standardized, closed-loop framework designed for the systematic evaluation, benchmarking, and deployment of AI-generated GPU kernels in LLM inference systems. By enabling real-time feedback between kernel synthesis, performance testing, and live system substitution, FlashInfer-Bench establishes a reproducible process by which autonomous agent systems—such as LLMs fine-tuned for code synthesis—can iteratively improve and deploy high-performance operator kernels in production environments, including SGLang and vLLM (Xing et al., 1 Jan 2026).
1. Closed-Loop System and Workflow
FlashInfer-Bench implements a three-phase, virtuous cycle linking kernel generation, benchmarking, and deployment. In the kernel generation phase, autonomous agents ingest operator Definitions from the FlashInfer Trace schema and emit new kernel Solutions (e.g., source code in Triton or CUDA). Each (Definition × Solution) pair is systematically evaluated against a curated set of Workloads sourced from actual serving traces. Correctness (deterministic, low-precision, stochastic) and performance (latency, throughput, speedup) are measured in strict isolation to prevent interference or reward-hacking. The best-performing, validated kernels are dynamically substituted into production inference engines using the apply() mechanism, which ensures near-zero runtime overhead. As agents receive continuous feedback from leaderboard rankings and trace records, they iteratively refine their kernel synthesis, closing the learning and deployment loop.
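The three-phase cycle can be sketched in a few lines of Python. This is an illustrative skeleton only: the names `agent.propose`, `evaluate`, and `deploy` are assumptions standing in for the real generation, benchmarking, and `apply()` machinery, not the actual FlashInfer-Bench API.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    correct: bool
    speedup: float  # reference latency / candidate latency

def closed_loop(definitions, agent, evaluate, deploy, rounds=3):
    """Generate -> benchmark -> deploy cycle: agents emit Solutions, each
    (Definition x Workload x Solution) triple is evaluated in isolation,
    and the best validated kernel per Definition is substituted in."""
    best = {}
    for _ in range(rounds):
        for d in definitions:
            # Agents refine proposals using feedback from prior evaluations.
            sol = agent.propose(d, feedback=best.get(d.name))
            evals = [evaluate(d, w, sol) for w in d.workloads]
            if all(e.correct for e in evals):
                speedup = min(e.speedup for e in evals)  # worst-case speedup
                prev = best.get(d.name)
                if prev is None or speedup > prev[1]:
                    best[d.name] = (sol, speedup)
                    deploy(d, sol)  # dynamic substitution, cf. apply()
    return best
```

In the real system the evaluation step runs in subprocess isolation and the deployment step is the `apply()` substitution described below; the skeleton only shows the data flow that closes the loop.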
2. Formal Trace Schema: Definitions, Workloads, Implementations, Evaluations
The core schema, FlashInfer Trace, organizes system interactions via four disjoint sets:
- Definitions: each a tuple specifying the operator contract and reference implementation. Definitions encode axis roles (const, var) and values, with optional constraints expressed as axis predicates.
- Workloads: each binds to a unique Definition, with shape assignments for variable axes and concrete tensor instantiations via random generators, safetensors, or scalars.
- Implementations: each Solution carries language, author, sources, and architectural specs.
- Evaluations: each logs benchmarking outcomes, correctness, and performance for a specific (Definition, Workload, Implementation) triplet.
This schema supports direct relational joins central to tracking provenance, reproducibility, and deployment safety in continuous integration scenarios.
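The four record sets and their relational joins can be sketched as plain Python dataclasses. The field names below are assumptions chosen to mirror the prose, not the published Trace schema.

```python
from dataclasses import dataclass

@dataclass
class Definition:
    name: str
    axes: dict        # axis name -> (role: "const" | "var", value or bound)
    reference: str    # reference implementation source

@dataclass
class Workload:
    definition: str   # foreign key into the Definitions set
    shapes: dict      # concrete values for variable axes
    inputs: str       # "random" | "safetensors" | "scalar"

@dataclass
class Solution:
    definition: str
    language: str     # e.g. "triton" or "cuda"
    author: str
    sources: tuple

@dataclass
class EvaluationRecord:
    definition: str
    workload_shapes: tuple
    solution_author: str
    correct: bool
    latency_ms: float

def evaluations_for(defn, evaluations):
    """A relational join: all Evaluations recorded against one Definition."""
    return [e for e in evaluations if e.definition == defn.name]
```

Joins like `evaluations_for` are what make provenance tracking and deployment safety checks straightforward in CI: every benchmark result is traceable back to an exact (Definition, Workload, Implementation) triplet.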
3. Dataset Construction and Curation
The curated dataset is derived from production traffic logs (ShareGPT sessions, SGLang deployments of DeepSeek-V3, Llama-3.1-8B, Qwen3-30B-A3B), encompassing major kernel families: GEMM, paged/ragged Attention, Mixture-of-Experts (MoE), RMSNorm, and Sampling. FlashInfer-Bench provides 41 unique Definitions, each potentially instantiated with ≈50 Workloads after deduplication and shape/performance-diversity filtering (~1,600 Workloads total).
Dataset curation proceeds via:
- Recording: production inference engines log raw kernel invocations.
- Grouping: invocations are grouped by I/O spec, axis roles, and const axes.
- Tensor capture: input tensors are dumped only when the input distribution materially affects correctness or performance; otherwise synthetic generation is used.
- Pruning: deduplication and pruning along axes (batch size, sequence length) use a heuristic that preferentially preserves the distribution’s tail.
This process targets downstream generalizability and workload diversity for benchmarking.
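A minimal sketch of tail-preserving pruning, under assumed parameters (the paper's exact heuristic is not specified here): frequent duplicate shapes are capped, while shapes in the long tail of sequence lengths are always retained.

```python
from collections import Counter

def prune_workloads(shapes, cap=2, tail_quantile=0.9):
    """Deduplicate (batch, seq_len) shape pairs, keeping at most `cap`
    copies of each, but always preserving the tail of the sequence-length
    distribution (shapes at or above the given quantile)."""
    seq_lens = sorted(s[1] for s in shapes)
    tail_cut = seq_lens[int(tail_quantile * (len(seq_lens) - 1))]
    kept, seen = [], Counter()
    for s in shapes:
        if s[1] >= tail_cut or seen[s] < cap:
            kept.append(s)
            seen[s] += 1
    return kept
```

The design intent matches the prose: common shapes are heavily deduplicated, while rare, extreme shapes (which often dominate worst-case latency) survive into the benchmark set.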
4. Benchmarking Methodologies
Benchmarking in FlashInfer-Bench consists of correctness and performance validation:
- Correctness:
- Deterministic kernels: elementwise error bounds of the form |out − ref| ≤ atol + rtol·|ref|.
- Low-precision (e.g., FP8): a minimum proportion of elements must pass the elementwise error bounds.
- Stochastic (sampling): the empirical total variation distance between sampled and target distributions must fall below a threshold.
- Performance:
- Latency measured post-warm-up using CUDA events, with per-GPU locking.
- Throughput in TFLOPs or tokens/sec as appropriate.
- Speedup (reference latency divided by candidate latency) and the corresponding throughput gain.
Task assignment to GPUs is governed by a multi-GPU scheduler utilizing a Hungarian-algorithm cost matrix, with EWMA updates and failover, supporting large-scale, reproducible sweeps. Subprocess isolation (fresh CUDA context teardown) prevents agent reward-hacking.
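The three correctness criteria can be sketched as follows. The tolerance forms (`atol`/`rtol` elementwise bounds, a pass-rate fraction, a total-variation threshold) are conventional assumptions; the thresholds used by FlashInfer-Bench itself may differ.

```python
from collections import Counter

def elementwise_ok(out, ref, atol=1e-3, rtol=1e-3):
    """Deterministic kernels: every element within atol + rtol*|ref|."""
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))

def pass_rate_ok(out, ref, atol=1e-1, rtol=1e-1, min_frac=0.99):
    """Low-precision kernels: a sufficient fraction of elements must pass."""
    passed = sum(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))
    return passed / len(ref) >= min_frac

def tv_ok(samples, probs, max_tv=0.05):
    """Stochastic kernels: total variation distance between the empirical
    distribution of `samples` and the target pmf `probs`."""
    n = len(samples)
    emp = Counter(samples)
    keys = set(emp) | set(probs)
    tv = 0.5 * sum(abs(emp.get(k, 0) / n - probs.get(k, 0.0)) for k in keys)
    return tv <= max_tv
```

In the real harness these checks run inside an isolated subprocess with a fresh CUDA context, so a kernel cannot tamper with the reference computation or the timing machinery.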
5. Leaderboard Design and Metrics
FlashInfer-Bench’s public leaderboard captures agent submissions (Trace Definition + Solution) and executes evaluation on both visible and hidden Workloads to defend against overfitting. Leaderboard rankings derive from a fast_p-style metric curve (cf. KernelBench): for each speedup threshold p, the fraction of Workloads on which a submission is both correct and at least p-times faster than the reference.
Area under this curve (AUC) and key threshold points (fast_p) form the primary ranking basis, stratified by kernel type and GPU model. Snapshotted datasets and code ensure reproducibility. Metrics tracked include correctness rate, speedup at multiple thresholds, per-workload latency deltas, and throughput improvements.
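A minimal sketch of a KernelBench-style fast_p curve and its AUC, assuming `results` is a list of per-workload `(correct, speedup)` pairs (this is an illustrative reimplementation, not the leaderboard's actual code):

```python
def fast_p(results, p):
    """Fraction of workloads that are both correct and at least
    p-times faster than the reference implementation."""
    return sum(c and s >= p for c, s in results) / len(results)

def fast_p_auc(results, thresholds):
    """Trapezoidal area under the fast_p curve over a threshold grid."""
    ys = [fast_p(results, p) for p in thresholds]
    return sum((ys[i] + ys[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
               for i in range(len(thresholds) - 1))
```

Because fast_p is monotonically non-increasing in p, the AUC rewards submissions that stay fast across a range of thresholds rather than spiking at a single cherry-picked speedup.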
6. Production Substitution: apply() API
Dynamic operator substitution is achieved through the flashinfer_bench.apply API, supporting:
- Decorator usage: Annotates kernel functions for automatic dispatch to the fastest available Solution for each Workload.
```python
@flashinfer_bench.apply(definition="gemm_n128_k2048")
def gemm_baseline(A, B): …
```
- Imperative usage: Programmatic invocation mapping input args to a concrete Definition at runtime.
```python
C = flashinfer_bench.apply(
    definition_resolver=lambda *args, **kw: "gemm_n128_k2048",
    args=(A, B),
    kwargs={},
)
```
- Offline/AOT builds: At engine startup, the local Trace is filtered, per-Workload keys extracted, and the top-k Solutions compiled ahead-of-time; others remain JIT.
- Online dispatch: Runtime shape/type-based index lookup (O(1)), hardware/software compatibility checks, and fallback to JIT as needed.
Empirical dispatch overhead is roughly 1–2 µs per call (B200 GPU), and end-to-end system integration incurs only 0.8% additional latency (Llama-3.1-8B-Instruct via SGLang).
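The online dispatch path can be sketched as a hash-table lookup keyed on runtime shapes and dtypes, with JIT compilation as the fallback. This is an assumed reconstruction of the mechanism described above, not the actual `flashinfer_bench` internals.

```python
def dispatch_key(args):
    """Derive a hashable key from argument shapes and dtypes.
    `args` here are plain dicts standing in for tensor metadata."""
    return tuple((tuple(a["shape"]), a["dtype"]) for a in args)

class Dispatcher:
    def __init__(self, compiled, jit_compile):
        self.compiled = compiled        # key -> AOT-compiled kernel
        self.jit_compile = jit_compile  # fallback compiler for cache misses

    def __call__(self, args):
        key = dispatch_key(args)        # O(1) shape/type-based lookup
        kernel = self.compiled.get(key)
        if kernel is None:
            # Miss: JIT-compile once, then cache for subsequent calls.
            kernel = self.compiled[key] = self.jit_compile(key)
        return kernel(args)
```

The same structure accommodates the hardware/software compatibility checks mentioned above: they can be folded into the key or run once at insertion time, keeping the hot path to a single dictionary lookup.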
7. Experimental Results and Design Insights
Experimental evaluation highlights:
- On compute-bound kernels (GEMM, GQA), LLM agents reach roughly 50% of human/SOTA speedups on most Workloads; on memory-bound kernels (RMSNorm), they match or surpass SOTA because bandwidth saturation caps the attainable gains.
- Of 32 correctness failures, 30 were due to compilation errors (incorrect API usage, device/host mismatches, type/shape errors); only 2 were numerical/runtime errors (padding mistakes).
- Language analysis:
- Triton yields higher correctness (70–80%) and robust speedups, with the compiler managing tiling and pipelining.
- CUDA has lower correctness (~50%) but higher peak-speed potential (custom shared-memory use, manual tuning). Agents sometimes fall back to cuBLAS, approaching reference speed but not demonstrating kernel innovation.
- Takeaways:
- Compilation remains the dominant failure mode, suggesting a need for improved agent API knowledge.
- Agents currently underexploit new hardware intrinsics; curriculum or reinforcement learning with hardware feedback are plausible improvements.
- DSLs such as Triton lower agent cognitive load while enabling the compiler to optimize low-level scheduling; for maximal operator performance, CUDA mastery remains necessary.
- Library call fallback, especially in CUDA, may inflate apparent agent performance and obscure real kernel synthesis skill. Training regimens may benefit from eliminating direct library access.
End-to-end experiments confirm that substituting agent-generated kernels with measured kernel-level speedups via apply() yields proportional reductions in system-level inference latency.
FlashInfer-Bench enables a reproducible, scalable, and automated pipeline where LLM agents propose novel GPU kernels, receive instant correctness/performance feedback, and see validated solutions flow seamlessly into large-scale LLM inference systems—closing the loop between agent-driven programming and production deployment (Xing et al., 1 Jan 2026).