FlashInfer Trace: Unified GPU Kernel Benchmarking
- FlashInfer Trace is a unified schema that standardizes GPU kernel definitions, workloads, implementations, and evaluations for systematic benchmarking.
- It integrates real production traffic data to ensure representative workloads and employs rigorous correctness and performance metrics for LLM inference.
- The system features a dynamic apply() mechanism and leaderboard ranking that optimize AI-generated kernel deployment and facilitate error analysis.
FlashInfer Trace is the foundational unified schema within FlashInfer-Bench, a full-stack system designed to close the loop between LLM-driven GPU kernel generation and real-world LLM inference deployment. FlashInfer Trace enables consistent, rigorous communication among kernel definitions, instance workloads, AI-generated solutions, and empirical evaluations, thereby supporting the reproducible benchmarking, leaderboard, and dynamic deployment of AI-generated GPU kernels for production inference workloads (Xing et al., 1 Jan 2026).
1. Formal Schema and Trace Composition
At the heart of FlashInfer-Bench is the concept of a Trace, the tuple

$$\tau = (d, w, s, e), \qquad d \in \mathcal{D},\ w \in \mathcal{W},\ s \in \mathcal{S},\ e \in \mathcal{E},$$

where
- $\mathcal{D}$: The set of Definitions characterizing the mathematical contract of an operator, including input/output tensor shapes, data types, constraints, and a PyTorch reference implementation.
- $\mathcal{W}$: The set of Workloads, each binding a Definition's variable axes to concrete values and associating either randomly sampled or real (dumped) tensors, typically encoded in the safetensors format.
- $\mathcal{S}$: The set of Implementations (Solutions) generated by LLM agents: code in Triton, CUDA, or similar DSLs, each referencing a Definition $d \in \mathcal{D}$.
- $\mathcal{E}$: The set of Evaluations: immutable records comprising pass/fail correctness, summary statistics, hardware/memory environment traces, and performance metrics.
A Trace is physically represented as a JSON object with four root fields (definition, workload, solution, and evaluation), rendering it transportable, auditable, and re-executable. The explicit evaluation function is

$$e = \mathrm{Eval}(d, w, s).$$
This composition underpins all subsequent system layers by standardizing data exchange and result validation (Xing et al., 1 Jan 2026).
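The four-root-field layout can be sketched as a minimal record; the field names follow the schema described above, but the nested contents here are illustrative placeholders, not the exact FlashInfer Trace schema:

```python
import json

# Illustrative sketch of a Trace record. The four top-level keys match the
# schema's root fields; all nested values below are hypothetical examples.
trace = {
    "definition": {
        "name": "rmsnorm",
        "axes": {"batch": "var", "hidden": 4096},   # static vs variable axes
        "dtypes": {"x": "bf16", "out": "bf16"},
        "reference": "def ref(x, w): ...",          # PyTorch reference (elided)
    },
    "workload": {
        "definition": "rmsnorm",
        "axis_values": {"batch": 64},
        "tensors": {"seed": 0},                     # or a safetensors path
    },
    "solution": {
        "definition": "rmsnorm",
        "language": "triton",
        "source": "...",                            # kernel source (elided)
    },
    "evaluation": {
        "passed": True,
        "speedup": 1.32,
        "environment": {"gpu": "B200"},
    },
}

# Serializing to JSON is what makes a Trace transportable and re-executable.
serialized = json.dumps(trace, indent=2)
```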
2. Dataset Construction from Real Serving Traffic
FlashInfer Trace curation is derived from live LLM serving traces. Traffic from DeepSeek-V3, Llama-3.1-8B, and Qwen3-30B-A3B under typical production configurations (e.g., FP8 quantization, tensor-parallel size 8) is processed on SGLang against ShareGPT prompts. Each GPU-kernel invocation is folded into an existing Definition only if its I/O signature, axes, and static-axis values match exactly.
Each Definition aggregates a set of approximately 50 representative workloads, sampled for diversity in shape and tensor statistics. This de-duplication and sampling process ensures comprehensive coverage of production-relevant operator usage and kernel shape-space.
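The exact-match folding rule can be sketched as a key function; the function name, the signature encoding, and the axis convention here are hypothetical illustrations of the rule, not the system's actual implementation:

```python
# Hypothetical sketch of the de-duplication rule: two kernel invocations map
# to the same Definition only when their I/O signature, axis names, and
# static-axis values match exactly (variable axes may differ per workload).
def definition_key(io_signature, axes):
    # axes: dict mapping axis name -> int (static value) or "var" (variable)
    static = tuple(sorted((k, v) for k, v in axes.items() if v != "var"))
    variable = tuple(sorted(k for k, v in axes.items() if v == "var"))
    return (io_signature, static, variable)

# Two calls with identical dtypes and the same static hidden size fold into
# one Definition even though the variable batch axis differs at runtime.
a = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 4096})
b = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 4096})
c = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 8192})
```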
Summary statistics:
| Category | Value/Type |
|---|---|
| Kernel Definitions ($\mathcal{D}$) | 41 (GEMM, Attention, MoE, Norm, Sampling, etc.) |
| Workloads ($\mathcal{W}$) | 1,600 (~50 per definition) |
| Solutions ($\mathcal{S}$) | 240 (CUDA, Triton) |
| Evaluations ($\mathcal{E}$) | 9,600 |
Workloads are serialized as small JSON blobs (referencing either a generation seed or a safetensors path). The curated dataset thus reflects realistic kernel/operator demand as experienced in production (Xing et al., 1 Jan 2026).
3. Correctness and Performance Benchmarking
FlashInfer Trace provides the input to FlashInfer-Bench’s correctness- and performance-aware benchmarking service. The framework enforces differentiated validation criteria based on operator category:
- (a) Deterministic Kernels (e.g., GEMM, RMSNorm): Pass if every output element satisfies an allclose-style bound, $|y_i - y_i^{\mathrm{ref}}| \le \mathrm{atol} + \mathrm{rtol}\,|y_i^{\mathrm{ref}}|$.
- (b) Low-Precision Kernels (e.g., FP8 GEMM): Allow a small fraction of outliers; require at least a fraction $1-\delta$ of elements to lie within the strict error bounds.
- (c) Stochastic Kernels (e.g., sampling): Total Variation Distance (TVD) between the empirical and reference output distributions is computed over repeated runs and required to satisfy $\mathrm{TVD} \le \tau_{\mathrm{TVD}}$, with zero mask violations.
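The three validator families can be sketched as follows; the tolerance constants (atol, rtol, the outlier fraction delta, the TVD bound) are placeholders, not the benchmark's exact values:

```python
# Hedged sketches of the three validator families. All tolerances below are
# illustrative defaults, not FlashInfer-Bench's actual constants.

def check_deterministic(out, ref, atol=1e-3, rtol=1e-3):
    # (a) elementwise allclose-style bound for deterministic kernels
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))

def check_low_precision(out, ref, atol=1e-1, rtol=1e-1, delta=0.01):
    # (b) tolerate a small fraction delta of outliers for low-precision kernels
    ok = sum(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))
    return ok >= (1 - delta) * len(ref)

def check_stochastic(sample_counts, ref_probs, tvd_bound=0.05):
    # (c) total variation distance between the empirical distribution of
    # sampled outcomes and the reference distribution (assumed to cover the
    # full support)
    n = sum(sample_counts.values())
    tvd = 0.5 * sum(abs(sample_counts.get(k, 0) / n - p)
                    for k, p in ref_probs.items())
    return tvd <= tvd_bound
```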
Performance is measured post-warmup across repeated timed iterations via CUDA events. The system captures both raw solution and reference latencies, computing the per-(definition, solution, workload) speedup

$$\mathrm{speedup}(d, s, w) = \frac{t_{\mathrm{ref}}(d, w)}{t_{s}(d, w)}.$$

The unified $\mathrm{fast}_p$ metric collapses correctness and speed into a single number:

$$\mathrm{fast}_p(s) = \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \mathbb{1}\!\left[\mathrm{correct}(s, w) \wedge \mathrm{speedup}(d, s, w) \ge p\right].$$
The $\mathrm{fast}_p$ curve (as a function of the threshold $p$) and its AUC jointly capture correctness and performance across all workloads for a solution (Xing et al., 1 Jan 2026).
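The metric and its AUC can be sketched directly from per-workload records; the record layout and threshold grid here are illustrative:

```python
# Minimal sketch of the fast_p metric and the AUC of its threshold curve.
def fast_p(records, p):
    # records: list of (passed: bool, speedup: float), one per workload
    return sum(ok and sp >= p for ok, sp in records) / len(records)

def fast_p_auc(records, p_grid):
    # trapezoidal AUC of the fast_p curve over an increasing threshold grid
    ys = [fast_p(records, p) for p in p_grid]
    return sum((p_grid[i + 1] - p_grid[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(p_grid) - 1))

# Illustrative per-workload results for one solution.
records = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
```

At threshold 0, the metric reduces to the plain correctness rate, since any passing solution has some positive speedup.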
4. Leaderboard Infrastructure and Evaluation Protocol
The public leaderboard ingests Trace submissions and reruns them against hidden workloads in isolated environments, publishing:
- Correctness rate ($\mathrm{fast}_0$)
- $\mathrm{fast}_p$ curves for multiple thresholds $p$
- Per-workload latency histograms
- Full-system end-to-end LLM latency changes post-apply()
Leaderboard ranking is by $\mathrm{fast}_p$ for each kernel, aggregated across kernels by macro-averaging. Current top agents by correctness rate ($\mathrm{fast}_0$) are gemini-2.5-pro, gpt-o3, and gpt-5-2025-08-07, at 83.9%, 71.3%, and 48.8%, respectively (Xing et al., 1 Jan 2026).
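The macro-averaging step can be sketched as an unweighted mean of per-kernel scores; the kernel names and scores below are illustrative:

```python
# Hypothetical sketch of leaderboard aggregation: score each agent per
# kernel (e.g., by fast_p), then macro-average with equal weight per kernel
# so that kernels with many workloads do not dominate the ranking.
def macro_average(per_kernel_scores):
    # per_kernel_scores: dict kernel_name -> score for one agent
    return sum(per_kernel_scores.values()) / len(per_kernel_scores)

scores = {"gemm": 0.8, "attention": 0.6, "rmsnorm": 1.0}
```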
5. Dynamic Deployment and apply() Mechanism
FlashInfer Trace enables production-grade zero-touch kernel substitution. The flashinfer_bench.apply() API leverages offline index construction: for each Definition $d$, the set of passing solutions is indexed by workload key (typically parameterized by tensor shape). For each key, the implementation $i^\star$ with maximal measured speedup is identified and compiled ahead of time as needed. At runtime, an input-dependent O(1) lookup selects and JIT-compiles (if required) the correct kernel:
```python
def apply(definition, inputs):
    key = extract_key_from(inputs)
    i_star = Index[definition][key]
    if not compiled(i_star):
        compile(i_star)
    return i_star.run(inputs)
```
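The offline index that this lookup consults can be sketched as an argmax over passing evaluations, grouped by (definition, workload key); the record layout and names below are illustrative, not the system's actual data model:

```python
from collections import defaultdict

# Hypothetical sketch of offline index construction: among solutions that
# passed correctness for a given (definition, workload-key) pair, keep the
# one with the highest measured speedup.
def build_index(evaluations):
    # evaluations: iterable of (definition, key, solution, passed, speedup)
    index = defaultdict(dict)
    for d, key, s, passed, speedup in evaluations:
        if not passed:
            continue  # failing solutions never enter the index
        best = index[d].get(key)
        if best is None or speedup > best[1]:
            index[d][key] = (s, speedup)
    return index

# Illustrative evaluations: the fastest solution fails correctness, so the
# index keeps the best *passing* one.
evals = [
    ("rmsnorm", (64, 4096), "triton_v1", True, 1.2),
    ("rmsnorm", (64, 4096), "cuda_v1", True, 1.5),
    ("rmsnorm", (64, 4096), "cuda_v2", False, 2.0),
]
index = build_index(evals)
```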
apply() injects only $1$–$2$ μs of call overhead, a negligible fraction of end-to-end latency (Xing et al., 1 Jan 2026).
6. Experimental Findings, Error Modes, and Trade-offs
FlashInfer Trace infrastructure corroborates that apply()-level kernel speedups translate directly to system-level LLM inference improvements. For example, with the fused Add RMSNorm kernel on SGLang (Llama-3.1-8B-Instruct, batch 64):
| Solution | Kernel Latency (ms) | End-to-End Latency (ms) |
|---|---|---|
| FlashInfer baseline | 0.0112 | 934 |
| Gemini-2.5-pro (Triton) | 0.0160 | 939 |
| GPT-5 (Triton, slower) | 0.0247 | 1055 |
Triton solutions exhibit roughly 2× higher correctness rates than CUDA solutions (≈60% vs ≈30%), with stable speedups owing to Triton's hardware abstraction; CUDA solutions achieve higher peak speed but suffer lower correctness and more compilation errors. LLM agents frequently offload CUDA tasks to cuBLAS, indicating competence in library invocation but general weakness in low-level kernel tuning.
Observed error taxonomy:
- 94% of correctness failures: compilation errors or API misuse.
- 6%: runtime or numerical errors.
- Agents commonly fail to invoke new intrinsics (e.g., tcgen05) or manage multi-stage pipelines.
7. Limitations and Future Directions
FlashInfer Trace currently does not cover multi-GPU or collective communication patterns; model, hardware, and DSL support is restricted to B200-class GPUs and mainstream operators. Stochastic validator robustness to "reward hacking" is limited for non-deterministic kernels. Extension goals include:
- Expanding Trace schema to collective/all-reduce operators and sparse workloads.
- Benchmarking across a broader set of devices (e.g., H100, Blackwell) and DSLs (e.g., CUTLASS, TVM MetaSchedule).
- Integrating reinforcement learning or learned cost models to improve agent hardware understanding.
- Building fine-tuned kernel-generation models using FlashInfer Trace feedback, thus enabling a virtuous cycle of agent improvement.
FlashInfer Trace underpins the reproducible, systematic, and production-oriented benchmarking and deployment of AI-generated GPU kernels for LLM inference, bridging the gap between autonomous kernel synthesis by LLM agents and their validated, measurable integration into large-scale inference infrastructure.