Generalization of KernelBench-Optimized CUDA Kernels to Real-World Production

Determine how CUDA kernels optimized on benchmarks such as KernelBench, particularly those produced by RL-augmented LLM systems like Sakana AI's AI CUDA Engineer and CUDA-L1, translate to real-world production environments in terms of performance and applicability, given that such benchmarks evaluate each operation on a single fixed configuration.

Background

The paper notes a recent surge of work using LLMs and reinforcement learning (RL) to automatically generate CUDA kernels, with many efforts evaluating on benchmarks like KernelBench. KernelBench includes diverse operations but evaluates each on a single, fixed configuration, leaving uncertainty about how such optimizations carry over to production workloads that vary widely in input sizes and conditions.

The authors explicitly state that it remains unclear whether kernels optimized under benchmark settings generalize to real-world deployment. This motivates their focus on matrix multiplication across 1,000 (M, N, K) configurations and comparisons with strong baselines (cuBLAS, cuBLASLt) to address practical applicability beyond benchmark scenarios.
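To make the evaluation setup concrete, below is a minimal, hypothetical sketch (not code from the paper) of how one might time cuBLAS SGEMM across a sweep of (M, N, K) shapes with CUDA events, the kind of multi-configuration measurement the paper contrasts with single-shape benchmark evaluation. The shape list, helper name, and timing protocol are illustrative assumptions; an actual comparison would run the candidate generated kernels and cuBLASLt on the same shapes.

```cpp
// Illustrative sketch: time cuBLAS SGEMM over several (M, N, K) shapes.
// The shapes below are hypothetical placeholders, not the paper's 1,000 configurations.
#include <array>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Time one SGEMM call (C = A * B, column-major) with CUDA events; returns milliseconds.
static float time_sgemm(cublasHandle_t handle, int M, int N, int K,
                        const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Warm-up call so the measured run excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Hypothetical subset of (M, N, K) shapes; a real sweep would cover many more.
    std::vector<std::array<int, 3>> shapes = {
        {128, 128, 128}, {512, 1024, 256}, {4096, 4096, 4096}};
    for (const auto& s : shapes) {
        int M = s[0], N = s[1], K = s[2];
        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(float) * M * K);
        cudaMalloc(&dB, sizeof(float) * K * N);
        cudaMalloc(&dC, sizeof(float) * M * N);
        // Zero-fill inputs; only timing matters for this sketch.
        cudaMemset(dA, 0, sizeof(float) * M * K);
        cudaMemset(dB, 0, sizeof(float) * K * N);
        float ms = time_sgemm(handle, M, N, K, dA, dB, dC);
        double gflops = 2.0 * M * N * K / (ms * 1e6);
        printf("M=%d N=%d K=%d: %.3f ms (%.1f GFLOP/s)\n", M, N, K, ms, gflops);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }
    cublasDestroy(handle);
    return 0;
}
```

Measuring each shape separately is what exposes generalization gaps: a kernel tuned to one fixed configuration can look strong there yet fall behind the vendor baseline elsewhere in the sweep.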

References

Existing work, such as Sakana AI's AI CUDA Engineer and CUDA-L1, primarily optimizes kernels from benchmarks like KernelBench, which covers a diverse range of CUDA operations, each evaluated on a single, fixed configuration (e.g., one specific input dimension). However, it remains unclear how these benchmark-optimized kernels translate to real-world production environments.

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning (2512.02551 - Su et al., 2 Dec 2025) in Section 1 (Introduction)