FlashInfer Trace: Unified GPU Kernel Benchmarking
- FlashInfer Trace is a unified schema that standardizes GPU kernel definitions, workloads, implementations, and evaluations for systematic benchmarking.
- It integrates real production traffic data to ensure representative workloads and employs rigorous correctness and performance metrics for LLM inference.
- The system features a dynamic apply() mechanism and leaderboard ranking that optimize AI-generated kernel deployment and facilitate error analysis.
FlashInfer Trace is the foundational unified schema within FlashInfer-Bench, a full-stack system designed to close the loop between LLM-driven GPU kernel generation and real-world LLM inference deployment. FlashInfer Trace enables consistent, rigorous communication among kernel definitions, instance workloads, AI-generated solutions, and empirical evaluations, thereby supporting the reproducible benchmarking, leaderboard, and dynamic deployment of AI-generated GPU kernels for production inference workloads (Xing et al., 1 Jan 2026).
1. Formal Schema and Trace Composition
At the heart of FlashInfer-Bench is the concept of a Trace, the tuple

$$\tau = (d, w, s, e), \qquad d \in \mathcal{D},\ w \in \mathcal{W},\ s \in \mathcal{S},\ e \in \mathcal{E},$$

where
- $\mathcal{D}$: The set of Definitions characterizing the mathematical contract of an operator, including input/output tensor shapes, data types, constraints, and a PyTorch reference implementation.
- $\mathcal{W}$: The set of Workloads, each binding a Definition's variable axes to concrete values and associating either randomly sampled or real (dumped) tensors, typically encoded in the safetensors format.
- $\mathcal{S}$: The set of Implementations (Solutions) generated by LLM agents: code in Triton, CUDA, or similar DSLs, each referencing a Definition $d \in \mathcal{D}$.
- $\mathcal{E}$: The set of Evaluations: immutable records comprising pass/fail correctness, summary statistics, hardware/memory environment traces, and performance metrics.
A Trace is physically represented as a JSON object with four root fields (definition, workload, solution, and evaluation), rendering it transportable, auditable, and re-executable. The explicit evaluation function is

$$e = \mathrm{Eval}(d, w, s).$$
This composition underpins all subsequent system layers by standardizing data exchange and result validation (Xing et al., 1 Jan 2026).
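The four-root-field layout can be sketched as a minimal record; the field names follow the schema described above, but the nested contents here are illustrative placeholders, not the exact FlashInfer Trace schema:

```python
import json

# Illustrative sketch of a Trace record. The four top-level keys match the
# schema's root fields; all nested values below are hypothetical examples.
trace = {
    "definition": {
        "name": "rmsnorm",
        "axes": {"batch": "var", "hidden": 4096},   # static vs variable axes
        "dtypes": {"x": "bf16", "out": "bf16"},
        "reference": "def ref(x, w): ...",          # PyTorch reference (elided)
    },
    "workload": {
        "definition": "rmsnorm",
        "axis_values": {"batch": 64},
        "tensors": {"seed": 0},                     # or a safetensors path
    },
    "solution": {
        "definition": "rmsnorm",
        "language": "triton",
        "source": "...",                            # kernel source (elided)
    },
    "evaluation": {
        "passed": True,
        "speedup": 1.32,
        "environment": {"gpu": "B200"},
    },
}

# Serializing to JSON is what makes a Trace transportable and re-executable.
serialized = json.dumps(trace, indent=2)
```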
2. Dataset Construction from Real Serving Traffic
FlashInfer Trace curation is derived from live LLM serving traces. Traffic from DeepSeek-V3, Llama-3.1-8B, and Qwen3-30B-A3B under typical production configurations (e.g., FP8 quantization, tensor-parallel size 8) is processed on SGLang against ShareGPT prompts. Each GPU-kernel invocation is folded into an existing Definition only if its I/O signature, axes, and static-axis values match exactly.
Each Definition aggregates a set of approximately 50 representative workloads, sampled for diversity in shape and tensor statistics. This de-duplication and sampling process ensures comprehensive coverage of production-relevant operator usage and kernel shape-space.
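The exact-match folding rule can be sketched as a key function; the function name, the signature encoding, and the axis convention here are hypothetical illustrations of the rule, not the system's actual implementation:

```python
# Hypothetical sketch of the de-duplication rule: two kernel invocations map
# to the same Definition only when their I/O signature, axis names, and
# static-axis values match exactly (variable axes may differ per workload).
def definition_key(io_signature, axes):
    # axes: dict mapping axis name -> int (static value) or "var" (variable)
    static = tuple(sorted((k, v) for k, v in axes.items() if v != "var"))
    variable = tuple(sorted(k for k, v in axes.items() if v == "var"))
    return (io_signature, static, variable)

# Two calls with identical dtypes and the same static hidden size fold into
# one Definition even though the variable batch axis differs at runtime.
a = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 4096})
b = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 4096})
c = definition_key("bf16,bf16->bf16", {"batch": "var", "hidden": 8192})
```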
Summary statistics:
| Category | Value/Type |
|---|---|
| Kernel Definitions ($\mathcal{D}$) | 41 (GEMM, Attention, MoE, Norm, Sampling, etc.) |
| Workloads ($\mathcal{W}$) | 1,600 (~50 per definition) |
| Solutions ($\mathcal{S}$) | 240 (CUDA, Triton) |
| Evaluations ($\mathcal{E}$) | 9,600 |
Workloads are serialized as small JSON blobs (referencing either a generation seed or a safetensors path). The curated dataset thus reflects realistic kernel/operator demand as experienced in production (Xing et al., 1 Jan 2026).
3. Correctness and Performance Benchmarking
FlashInfer Trace provides the input to FlashInfer-Bench’s correctness- and performance-aware benchmarking service. The framework enforces differentiated validation criteria based on operator category:
- (a) Deterministic Kernels (e.g., GEMM, RMSNorm): Pass if every output element satisfies an allclose-style bound, $|y_i - y_i^{\mathrm{ref}}| \le \mathrm{atol} + \mathrm{rtol}\,|y_i^{\mathrm{ref}}|$.
- (b) Low-Precision Kernels (e.g., FP8 GEMM): Allow a small fraction of outliers; require at least a fraction $1-\delta$ of elements to lie within the strict error bounds.
- (c) Stochastic Kernels (e.g., sampling): Total Variation Distance (TVD) between the empirical and reference output distributions is computed over repeated runs and required to satisfy $\mathrm{TVD} \le \tau_{\mathrm{TVD}}$, with zero mask violations.
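The three validator families can be sketched as follows; the tolerance constants (atol, rtol, the outlier fraction delta, the TVD bound) are placeholders, not the benchmark's exact values:

```python
# Hedged sketches of the three validator families. All tolerances below are
# illustrative defaults, not FlashInfer-Bench's actual constants.

def check_deterministic(out, ref, atol=1e-3, rtol=1e-3):
    # (a) elementwise allclose-style bound for deterministic kernels
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))

def check_low_precision(out, ref, atol=1e-1, rtol=1e-1, delta=0.01):
    # (b) tolerate a small fraction delta of outliers for low-precision kernels
    ok = sum(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))
    return ok >= (1 - delta) * len(ref)

def check_stochastic(sample_counts, ref_probs, tvd_bound=0.05):
    # (c) total variation distance between the empirical distribution of
    # sampled outcomes and the reference distribution (assumed to cover the
    # full support)
    n = sum(sample_counts.values())
    tvd = 0.5 * sum(abs(sample_counts.get(k, 0) / n - p)
                    for k, p in ref_probs.items())
    return tvd <= tvd_bound
```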
Performance is measured post-warmup across repeated timed iterations via CUDA events. The system captures both raw solution and reference latencies, computing the per-(definition, solution, workload) speedup

$$\mathrm{speedup}(d, s, w) = \frac{t_{\mathrm{ref}}(d, w)}{t_{s}(d, w)}.$$

The unified $\mathrm{fast}_p$ metric collapses correctness and speed into a single number:

$$\mathrm{fast}_p(s) = \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \mathbb{1}\!\left[\mathrm{correct}(s, w) \wedge \mathrm{speedup}(d, s, w) \ge p\right].$$
The $\mathrm{fast}_p$ curve (as a function of the threshold $p$) and its AUC jointly capture correctness and performance across all workloads for a solution (Xing et al., 1 Jan 2026).
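The metric and its AUC can be sketched directly from per-workload records; the record layout and threshold grid here are illustrative:

```python
# Minimal sketch of the fast_p metric and the AUC of its threshold curve.
def fast_p(records, p):
    # records: list of (passed: bool, speedup: float), one per workload
    return sum(ok and sp >= p for ok, sp in records) / len(records)

def fast_p_auc(records, p_grid):
    # trapezoidal AUC of the fast_p curve over an increasing threshold grid
    ys = [fast_p(records, p) for p in p_grid]
    return sum((p_grid[i + 1] - p_grid[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(p_grid) - 1))

# Illustrative per-workload results for one solution.
records = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
```

At threshold 0, the metric reduces to the plain correctness rate, since any passing solution has some positive speedup.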
4. Leaderboard Infrastructure and Evaluation Protocol
The public leaderboard ingests Trace submissions and reruns them against hidden workloads in isolated environments, publishing:
- Correctness rate ($\mathrm{fast}_0$)
- $\mathrm{fast}_p$ curves for multiple thresholds $p$
- Per-workload latency histograms
- Full-system end-to-end LLM latency changes post-apply()
Leaderboard ranking is by $\mathrm{fast}_p$ for each kernel, aggregated across kernels by macro-averaging. Current top agents by correctness rate ($\mathrm{fast}_0$) are gemini-2.5-pro, gpt-o3, and gpt-5-2025-08-07, at 83.9%, 71.3%, and 48.8%, respectively (Xing et al., 1 Jan 2026).
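The macro-averaging step can be sketched as an unweighted mean of per-kernel scores; the kernel names and scores below are illustrative:

```python
# Hypothetical sketch of leaderboard aggregation: score each agent per
# kernel (e.g., by fast_p), then macro-average with equal weight per kernel
# so that kernels with many workloads do not dominate the ranking.
def macro_average(per_kernel_scores):
    # per_kernel_scores: dict kernel_name -> score for one agent
    return sum(per_kernel_scores.values()) / len(per_kernel_scores)

scores = {"gemm": 0.8, "attention": 0.6, "rmsnorm": 1.0}
```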
5. Dynamic Deployment and apply() Mechanism
FlashInfer Trace enables production-grade zero-touch kernel substitution. The flashinfer_bench.apply() API leverages offline index construction: for each Definition $d$, the set of passing solutions is indexed by workload key (typically parameterized by tensor shape). For each key, the implementation $i^\star$ with maximal measured speedup is identified and compiled ahead of time as needed. At runtime, an input-dependent O(1) lookup selects and JIT-compiles (if required) the correct kernel:
```python
def apply(definition, inputs):
    key = extract_key_from(inputs)
    i_star = Index[definition][key]
    if not compiled(i_star):
        compile(i_star)
    return i_star.run(inputs)
```
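The offline index that this lookup consults can be sketched as an argmax over passing evaluations, grouped by (definition, workload key); the record layout and names below are illustrative, not the system's actual data model:

```python
from collections import defaultdict

# Hypothetical sketch of offline index construction: among solutions that
# passed correctness for a given (definition, workload-key) pair, keep the
# one with the highest measured speedup.
def build_index(evaluations):
    # evaluations: iterable of (definition, key, solution, passed, speedup)
    index = defaultdict(dict)
    for d, key, s, passed, speedup in evaluations:
        if not passed:
            continue  # failing solutions never enter the index
        best = index[d].get(key)
        if best is None or speedup > best[1]:
            index[d][key] = (s, speedup)
    return index

# Illustrative evaluations: the fastest solution fails correctness, so the
# index keeps the best *passing* one.
evals = [
    ("rmsnorm", (64, 4096), "triton_v1", True, 1.2),
    ("rmsnorm", (64, 4096), "cuda_v1", True, 1.5),
    ("rmsnorm", (64, 4096), "cuda_v2", False, 2.0),
]
index = build_index(evals)
```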
apply() injects only $1$–$2$ μs of call overhead, a negligible fraction of end-to-end latency (Xing et al., 1 Jan 2026).
6. Experimental Findings, Error Modes, and Trade-offs
FlashInfer Trace infrastructure corroborates that apply()-level kernel speedups translate directly to system-level LLM inference improvements. For example, with the fused Add RMSNorm kernel on SGLang (Llama-3.1-8B-Instruct, batch 64):
| Solution | Kernel Latency (ms) | End-to-End Latency (ms) |
|---|---|---|
| FlashInfer baseline | 0.0112 | 934 |
| Gemini-2.5-pro (Triton) | 0.0160 | 939 |
| GPT-5 (Triton, slower) | 0.0247 | 1055 |
Triton solutions exhibit roughly 2× higher correctness rates than CUDA solutions (≈60% vs ≈30%), with stable speedups owing to Triton's hardware abstraction; CUDA solutions achieve higher peak speed but suffer lower correctness and more compilation errors. LLM agents frequently offload CUDA tasks to cuBLAS, indicating competence in library invocation but general weakness in low-level kernel tuning.
Observed error taxonomy:
- 94% of correctness failures: compilation errors or API misuse.
- 6%: runtime or numerical errors.
- Agents commonly fail to invoke new intrinsics (e.g., tcgen05) or manage multi-stage pipelines.
7. Limitations and Future Directions
FlashInfer Trace currently does not cover multi-GPU or collective communication patterns; model, hardware, and DSL support is restricted to B200-class GPUs and mainstream operators. Stochastic validator robustness to "reward hacking" is limited for non-deterministic kernels. Extension goals include:
- Expanding Trace schema to collective/all-reduce operators and sparse workloads.
- Benchmarking across a broader set of devices (e.g., H100, Blackwell) and DSLs (e.g., CUTLASS, TVM MetaSchedule).
- Integrating reinforcement learning or learned cost models to improve agent hardware understanding.
- Building fine-tuned kernel-generation models using FlashInfer Trace feedback, thus enabling a virtuous cycle of agent improvement.
FlashInfer Trace underpins the reproducible, systematic, and production-oriented benchmarking and deployment of AI-generated GPU kernels for LLM inference, bridging the gap between autonomous kernel synthesis by LLM agents and their validated, measurable integration into large-scale inference infrastructure.