FlashInfer Trace: Unified GPU Kernel Benchmarking

Updated 7 January 2026
  • FlashInfer Trace is a unified schema that standardizes GPU kernel definitions, workloads, implementations, and evaluations for systematic benchmarking.
  • It integrates real production traffic data to ensure representative workloads and employs rigorous correctness and performance metrics for LLM inference.
  • The system features a dynamic apply() mechanism and leaderboard ranking that optimize AI-generated kernel deployment and facilitate error analysis.

FlashInfer Trace is the foundational unified schema within FlashInfer-Bench, a full-stack system designed to close the loop between LLM-driven GPU kernel generation and real-world LLM inference deployment. FlashInfer Trace enables consistent, rigorous communication among kernel definitions, instance workloads, AI-generated solutions, and empirical evaluations, thereby supporting the reproducible benchmarking, leaderboard, and dynamic deployment of AI-generated GPU kernels for production inference workloads (Xing et al., 1 Jan 2026).

1. Formal Schema and Trace Composition

At the heart of FlashInfer-Bench is the concept of a Trace:

T \in D \times W \times I \times E

where

  • D: The set of Definitions, each characterizing the mathematical contract of an operator: input/output tensor shapes, data types, constraints, and a PyTorch reference implementation.
  • W: The set of Workloads, each binding a Definition’s variable axes to concrete values and associating either randomly sampled or real (dumped) tensors, typically encoded with safetensors.
  • I: The set of Implementations (Solutions) generated by LLM agents—code in Triton, CUDA, or similar DSLs, each referencing a Definition d \in D.
  • E: The set of Evaluations: immutable records comprising pass/fail correctness, summary statistics, hardware/memory environment traces, and performance metrics.

A Trace T is physically represented as a JSON object with four root fields—definition, workload, solution, and evaluation—rendering it transportable, auditable, and re-executable. The explicit evaluation function is

\mathrm{Benchmark} : D \times I \times W \to E

This composition underpins all subsequent system layers by standardizing data exchange and result validation (Xing et al., 1 Jan 2026).
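As a concrete sketch, a Trace's four root fields can be assembled and round-tripped as JSON. Only the root keys (definition, workload, solution, evaluation) come from the schema description above; every nested field name below is an illustrative assumption, not the published layout:

```python
import json

# Hypothetical Trace: only the four root fields are schema-mandated;
# the nested keys are illustrative placeholders.
trace = {
    "definition": {
        "name": "rmsnorm",                        # operator contract
        "axes": {"batch": "var", "hidden": 4096},  # variable + static axes
        "reference": "torch_rmsnorm.py",           # PyTorch reference impl
    },
    "workload": {"axes": {"batch": 64}, "tensors": "wl_0001.safetensors"},
    "solution": {"language": "triton", "source": "rmsnorm_triton.py"},
    "evaluation": {"correct": True, "speedup": 1.31},
}

serialized = json.dumps(trace)     # transportable
restored = json.loads(serialized)  # auditable and re-executable
```

Because Evaluations are immutable, a stored Trace can be re-executed (Benchmark(d, i, w)) and the fresh result diffed against its recorded evaluation field.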

2. Dataset Construction from Real Serving Traffic

FlashInfer Trace curation is derived from live LLM serving traces. Traffic from DeepSeek-V3, Llama-3.1-8B, and Qwen3-30B-A3B under typical production configurations (e.g., FP8 quantization, tensor-parallel size 8) is processed on SGLang against ShareGPT prompts. Each GPU-kernel invocation is mapped to a Definition; two invocations share a Definition only if their I/O signatures, axes, and static-axis values match exactly.

Each Definition d \in D aggregates a set W_d of approximately 50 representative workloads, sampled for diversity in shape and tensor statistics. This de-duplication and sampling process ensures comprehensive coverage of production-relevant operator usage and kernel shape-space.

Summary statistics:

| Category | Value |
|---|---|
| Kernel Definitions (\|D\|) | 41 (GEMM, Attention, MoE, Norm, Sampling, etc.) |
| Workloads (\|W\|) | 1,600 (~50 per definition) |
| Solutions (\|I\|) | 240 (CUDA, Triton) |
| Evaluations (\|E\|) | 9,600 |

Workloads are serialized as small JSON blobs that reference either a generation seed or a safetensors path. The curated dataset thus reflects realistic kernel/operator demand as experienced in production (Xing et al., 1 Jan 2026).
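As a sketch of the two tensor encodings (generation seed vs. dumped safetensors path), a workload record might look like the following; the field names are assumptions, not the published format:

```python
import json

# Two hypothetical workload records for one definition; tensors live
# outside the JSON blob, which carries only metadata.
workloads = [
    {"definition": "gemm_fp8",
     "axes": {"m": 512, "n": 7168, "k": 2048},
     "tensors": {"kind": "random", "seed": 1234}},   # regenerated on demand
    {"definition": "gemm_fp8",
     "axes": {"m": 64, "n": 7168, "k": 2048},
     "tensors": {"kind": "dumped",
                 "path": "traces/gemm_fp8/wl_017.safetensors"}},  # real traffic
]

blob = json.dumps(workloads[0])  # the serialized blob stays small
```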

3. Correctness and Performance Benchmarking

FlashInfer Trace provides the input to FlashInfer-Bench’s correctness- and performance-aware benchmarking service. The framework enforces differentiated validation criteria based on operator category:

  • (a) Deterministic Kernels (e.g., GEMM, RMSNorm): pass if \forall j,\ |y_{\mathrm{sol}}[j] - y_{\mathrm{ref}}[j]| \leq \epsilon_{\mathrm{abs}} + \epsilon_{\mathrm{rel}}\,|y_{\mathrm{ref}}[j]|.
  • (b) Low-Precision Kernels (e.g., FP8 GEMM): allow a (1-\rho) fraction of outliers; require at least \rho \cdot N_{\mathrm{total}} elements within strict error bounds.
  • (c) Stochastic Kernels (e.g., sampling): the empirical distribution \hat{f} over m runs must satisfy \mathrm{TVD}(\hat{f}, q) = \frac{1}{2} \sum_x |\hat{f}(x) - q(x)| \leq \tau_{\mathrm{TVD}}, with zero mask violations.
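The three criteria can be sketched in NumPy as follows; the tolerance defaults below are illustrative, not FlashInfer-Bench's actual thresholds:

```python
import numpy as np

def check_deterministic(y_sol, y_ref, eps_abs=1e-3, eps_rel=1e-3):
    """(a) Mixed absolute/relative tolerance; every element must pass."""
    return bool(np.all(np.abs(y_sol - y_ref) <= eps_abs + eps_rel * np.abs(y_ref)))

def check_low_precision(y_sol, y_ref, rho=0.99, eps_abs=1e-2, eps_rel=1e-2):
    """(b) Allow a (1 - rho) outlier fraction: at least rho * N in-bounds."""
    in_bounds = np.abs(y_sol - y_ref) <= eps_abs + eps_rel * np.abs(y_ref)
    return bool(in_bounds.sum() >= rho * y_ref.size)

def check_stochastic(samples, q, tau_tvd=0.02):
    """(c) TVD between the empirical distribution over m runs and target q."""
    f_hat = np.bincount(samples, minlength=len(q)) / len(samples)
    return bool(0.5 * np.abs(f_hat - q).sum() <= tau_tvd)
```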

Performance is measured post-warmup across m iterations via CUDA events. The system captures both raw solution and reference latencies, computing the per-(definition, solution, workload) speedup:

\mathrm{Speedup}(d, i, w) = \frac{\mathrm{latency}_{\mathrm{ref}}(d,w)}{\mathrm{latency}_{\mathrm{sol}}(d,i,w)}

The unified \mathrm{fast}_p metric collapses correctness and speed:

\mathrm{fast}_p(i,d) = \frac{1}{|W_d|} \sum_{w \in W_d} \mathbf{1} \{\mathrm{Corr}(d,i,w) = 1 \wedge \mathrm{Speedup}(d, i, w) > p \}

A \mathrm{fast}_p curve and its AUC jointly capture correctness and performance across all workloads for a solution (Xing et al., 1 Jan 2026).
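A minimal sketch of the fast_p computation over one (solution, definition) pair, given per-workload correctness flags and speedups; the threshold grid used to approximate the AUC is an assumption:

```python
import numpy as np

def fast_p(correct, speedup, p):
    """Fraction of workloads that are both correct and faster than p x reference."""
    correct = np.asarray(correct, dtype=bool)
    speedup = np.asarray(speedup, dtype=float)
    return float(np.mean(correct & (speedup > p)))

def fast_p_auc(correct, speedup, grid=np.linspace(0.0, 2.0, 81)):
    """Approximate the area under the fast_p curve; the [0, 2] grid is assumed."""
    return float(np.mean([fast_p(correct, speedup, p) for p in grid]))
```

Note that fast_p with p = 0 reduces to the plain correctness rate, consistent with the leaderboard's use of fast_0.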

4. Leaderboard Infrastructure and Evaluation Protocol

The public leaderboard ingests Trace submissions and reruns them against hidden workloads in isolated environments, publishing:

  • Correctness rate (\mathrm{fast}_0)
  • \mathrm{fast}_p curves for multiple p \in \{0, 0.25, 0.5, 0.75, 0.95\}
  • Per-workload latency histograms
  • Full-system end-to-end LLM latency changes post-apply()

Leaderboard ranking is by \mathrm{fast}_{0.95} for each kernel, aggregated across kernels by macro-averaging. Current top agents at \mathrm{fast}_{0.95} are gemini-2.5-pro, gpt-o3, and gpt-5-2025-08-07, with correctness rates (\mathrm{fast}_0) of 83.9%, 71.3%, and 48.8%, respectively (Xing et al., 1 Jan 2026).

5. Dynamic Deployment and apply() Mechanism

FlashInfer Trace enables production-grade zero-touch kernel substitution. The flashinfer_bench.apply() API leverages offline index construction: for each Definition d, the set S_d of passing solutions is indexed by workload key k (typically parameterized by tensor shape). For each k, the implementation with maximal speedup is identified and compiled ahead of time as needed. At runtime, an input-dependent O(1) lookup selects and JIT-compiles (if required) the correct kernel:

```python
def apply(definition, inputs):
    key = extract_key_from(inputs)       # workload key, e.g. tensor shapes
    i_star = Index[definition][key]      # best passing solution for this key
    if not compiled(i_star):
        compile(i_star)                  # JIT-compile on first use
    return i_star.run(inputs)
```

Empirically, apply() injects only 1–2 μs of call overhead (<0.8% end-to-end) (Xing et al., 1 Jan 2026).
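The offline index that apply() consults can be built from Evaluations in one pass; the evaluation-record layout below is hypothetical, not the schema's field names:

```python
from collections import defaultdict

def build_index(evaluations):
    """For each (definition, workload key), keep the passing solution with
    maximal speedup. Fields ("definition", "key", "solution", "speedup",
    "correct") are illustrative assumptions."""
    best = {}  # (definition, key) -> (solution, speedup)
    for e in evaluations:
        if not e["correct"]:
            continue  # only passing solutions are eligible
        slot = (e["definition"], e["key"])
        if slot not in best or e["speedup"] > best[slot][1]:
            best[slot] = (e["solution"], e["speedup"])
    index = defaultdict(dict)
    for (d, k), (sol, _) in best.items():
        index[d][k] = sol  # runtime lookup is then O(1): index[d][key]
    return index
```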

6. Experimental Findings, Error Modes, and Trade-offs

FlashInfer Trace infrastructure corroborates that apply()-level kernel speedups translate directly to system-level LLM inference improvements. For example, with the fused Add RMSNorm kernel on SGLang (Llama-3.1-8B-Instruct, batch 64):

| Solution | Kernel Latency (ms) | End-to-End Latency (ms) |
|---|---|---|
| FlashInfer baseline | 0.0112 | 934 |
| Gemini-2.5-pro (Triton) | 0.0160 | 939 |
| GPT-5 (Triton, slower) | 0.0247 | 1055 |

Triton solutions exhibit roughly 2× higher correctness rates than CUDA solutions (60% vs. 30%), with stable speedups owing to Triton's hardware abstraction; CUDA solutions achieve higher peak speed but suffer lower correctness and more compilation errors. LLM agents frequently offload CUDA tasks to cuBLAS, indicating competence in library invocation but general weakness in low-level kernel tuning.

Observed error taxonomy:

  • ~94% of correctness failures: compilation errors or API misuse.
  • ~6%: runtime or numerical errors.
  • Agents commonly fail to use newer intrinsics (e.g., tcgen05) or to manage multi-stage pipelines.

(Xing et al., 1 Jan 2026)

7. Limitations and Future Directions

FlashInfer Trace currently does not cover multi-GPU or collective-communication patterns; model, hardware, and DSL support are restricted to B200-class GPUs and mainstream operators. The stochastic validator's robustness to "reward hacking" on non-deterministic kernels is also limited. Extension goals include:

  • Expanding Trace schema to collective/all-reduce operators and sparse workloads.
  • Benchmarking across a broader set of devices (e.g., H100, Blackwell) and DSLs (e.g., CUTLASS, TVM MetaSchedule).
  • Integrating reinforcement learning or learned cost models to improve agent hardware understanding.
  • Building fine-tuned kernel-generation models using FlashInfer Trace feedback, thus enabling a virtuous cycle of agent improvement.

(Xing et al., 1 Jan 2026)


FlashInfer Trace underpins the reproducible, systematic, and production-oriented benchmarking and deployment of AI-generated GPU kernels for LLM inference, bridging the gap between autonomous kernel synthesis by LLM agents and their validated, measurable integration into large-scale inference infrastructure.
