CUDA Agent: Teaching AI to Write Lightning-Fast GPU Code

CUDA Agent demonstrates how reinforcement learning can transform language models into expert GPU kernel optimizers. By combining curriculum-based task synthesis, skill-integrated agentic workflows, and stability-focused RL training, the system surpasses both traditional compilers and leading proprietary models on the KernelBench benchmark. It achieves near-perfect correctness and substantial speed improvements across all difficulty levels.
Script
Generating high-performance CUDA kernels demands expertise that even advanced language models struggle to match. While these models excel at general programming, they fall short when competing against optimizing compilers for GPU code—until now.
The problem stems from a fundamental limitation. Existing approaches rely on surface-level refinement without teaching the model to truly understand GPU optimization. CUDA Agent takes a radically different path, using reinforcement learning to embed performance expertise directly into the model's capabilities.
This required solving three interconnected challenges.
The scarcity of expert CUDA code creates a chicken-and-egg problem. The authors synthesized 6,000 diverse kernel tasks by fusing primitive operators, then embedded them in an agentic environment where the model iteratively analyzes bottlenecks and optimizes. But the real innovation lies in preventing training collapse—CUDA kernels are so rare in pretraining that standard RL fails catastrophically within 150 steps.
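The task-synthesis idea described above can be sketched in miniature: compose randomly chosen primitive operators into a single fused task specification. This is an illustrative sketch only; the primitive set and the `synthesize_task` helper are hypothetical stand-ins, not the paper's actual operator library.

```python
import random

# Hypothetical primitive operators standing in for the paper's (unspecified) set.
PRIMITIVES = {
    "relu":  lambda x: max(x, 0.0),
    "scale": lambda x: 2.0 * x,
    "bias":  lambda x: x + 1.0,
}

def synthesize_task(rng, depth=3):
    """Compose a random chain of primitives into one fused-kernel task.

    Returns a human-readable spec string and the fused reference function
    the generated CUDA kernel would have to match.
    """
    names = [rng.choice(sorted(PRIMITIVES)) for _ in range(depth)]

    def fused(x):
        for n in names:
            x = PRIMITIVES[n](x)
        return x

    return " -> ".join(names), fused

rng = random.Random(0)
spec, reference = synthesize_task(rng)
print(spec, reference(-3.0))
```

Varying the depth and the operator pool is one plausible way to get thousands of distinct, automatically checkable kernel tasks of graded difficulty.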
The results are striking. CUDA Agent achieves near-perfect correctness and outperforms the torch.compile compiler on 100% of simple and intermediate kernels, and 92% of the most challenging Level 3 tasks. Against Gemini 3 Pro and Claude Opus 4.5, it delivers roughly 40% higher speed-ups on complex kernels—the first time a trained model has decisively beaten both compilers and frontier proprietary systems.
Examining the optimization trajectories reveals systematic patterns. The agent doesn't just generate faster code; it rediscovers expert strategies. It learns to recognize when mathematical structure allows radical simplification, when sequential operations should be fused to eliminate memory round-trips, and when to exploit hardware-specific features like tensor cores or vendor-optimized libraries.
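The fusion strategy above has a simple intuition that can be shown even without GPU code: each separate pass materializes a full intermediate buffer (analogous to a kernel writing results to global memory), while the fused version applies every operation per element in one pass. This is a pure-Python illustration of the principle, not the agent's actual output.

```python
def unfused(xs):
    # Two separate passes: each materializes a full intermediate list,
    # analogous to a kernel round-tripping results through global memory.
    ys = [x * 2.0 for x in xs]       # pass 1: scale
    return [y + 1.0 for y in ys]     # pass 2: add bias

def fused(xs):
    # One pass: both operations applied per element, no intermediate
    # buffer, analogous to fusing two CUDA kernels.
    return [x * 2.0 + 1.0 for x in xs]

data = [0.0, 1.0, -2.5]
assert unfused(data) == fused(data)  # same result, one fewer round-trip
```

On a GPU the payoff is real: elementwise kernels are usually memory-bound, so halving the global-memory traffic roughly halves the runtime.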
None of this would work without the warm-up strategy.
Ablation studies confirm that removing actor initialization or critic pretraining causes immediate reward collapse. The domain mismatch is so severe that the model's initial policy produces wildly variable importance sampling ratios, destabilizing learning. The warm-up stages aren't optional optimizations—they're the difference between success and total failure.
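The instability mechanism is easy to see numerically. The importance sampling ratio is exp(logp_new - logp_old); when the initial policy is badly mismatched with the CUDA domain, that ratio explodes, and even PPO-style clipping (assumed here for illustration; the paper's exact objective may differ) only caps the damage per step rather than fixing the underlying mismatch.

```python
import math

def is_ratio(logp_new, logp_old):
    # Importance sampling ratio pi_new(a|s) / pi_old(a|s) from log-probs.
    return math.exp(logp_new - logp_old)

def clipped_objective(ratio, advantage, eps=0.2):
    # PPO-style clipping bounds each update when the ratio drifts too far.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# A warmed-up policy stays near ratio 1.0, so updates are well-scaled ...
print(is_ratio(-1.05, -1.0))   # ~0.95

# ... while a domain-mismatched policy yields extreme ratios: the gradient
# signal is dominated by a few samples and training collapses.
print(is_ratio(-1.0, -9.0))    # ~2981
```

Actor initialization and critic pretraining both serve to keep early ratios near 1.0, which is why removing either one causes the immediate reward collapse the ablations report.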
CUDA Agent does more than generate fast code—it demonstrates that agentic reinforcement learning can automate expertise traditionally locked in human specialists. This matters because GPU optimization is a bottleneck in AI infrastructure. By transforming passive code generators into active performance engineers, the system opens a path toward making expert-level optimization accessible at scale.
CUDA Agent proves that with the right curriculum, environment, and training stability, language models can learn to reason about hardware performance the way experts do. To explore more research like this and create your own video presentations, visit EmergentMind.com.