AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Published 22 Mar 2026 in cs.LG and cs.PF | (2603.21331v1)

Abstract: Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.

Summary

  • The paper presents an iterative agent-driven framework that automatically optimizes GPU kernels using a profile-prioritized search and Amdahl's law.
  • It employs a robust single-file edit-benchmark loop with Triton and CUDA backends to validate improvements through a five-stage correctness harness.
  • Empirical results on NVIDIA H100 demonstrate significant speedups in transformer-critical kernels, underscoring its efficiency for deep learning applications.

Problem Motivation and Context

High-performance GPU kernel development remains a major bottleneck in scaling the efficiency of modern deep learning systems, particularly those built on large transformer architectures. While vendor libraries such as cuBLAS and cuDNN set strong baselines for dense linear algebra and canonical tensor operations, rapid evolution in model architectures consistently produces new kernels that lack optimized library support, including grouped-query attention, SwiGLU activations, rotary positional encodings, and RMS normalization. Manually optimizing these kernels requires deep knowledge of GPU microarchitecture, the memory hierarchy, and software toolchains. The non-convex search landscape, parameterized by tile sizes, memory layouts, and precision choices, raises the expertise barrier further and slows the prototyping of competitive kernels.

System Architecture and Pipeline

AutoKernel is an autonomous optimization framework that systematizes GPU kernel tuning, implemented as an open-source Python stack of more than 9,200 lines (with agent instructions spanning 909 lines) supporting both Triton and CUDA C++ backends. The pipeline comprises four steps across three phases:

  1. Profiling and Bottleneck Identification (Phase A): Given an arbitrary PyTorch model, AutoKernel profiles kernel-wise GPU utilization using torch.profiler, classifying time spent across nine supported operation types. An operation's impact on end-to-end latency is bounded by Amdahl's law, prioritizing effort for the kernels that bottleneck global runtime (a minimal sketch of this prioritization follows the list).
  2. Kernel Extraction and Plan Synthesis: Profiled kernels are extracted and mapped to standalone files with starter implementations, including model-relevant shapes and tolerance attributes. An explicit optimization plan with what-if projections is assembled, ranked by possible performance gain.
  3. Iterative Agent Optimization (Phase B): An LLM-powered code agent iteratively edits a single kernel implementation. Each modification is evaluated, using a fixed five-stage correctness harness, for functional and numerical soundness before benchmarking throughput. The agent persists only those modifications yielding validated improvements. The loop terminates on plateauing returns, convergence to near-peak utilization, or exhausted runtime/iteration budgets.
  4. Post-Optimization Verification (Phase C): The best kernel variant, as chosen by the agent loop, enters full-model regression tests for correctness and speedup attribution.
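
As a concrete illustration of the Phase A prioritization, the Python sketch below ranks profiled operations by their Amdahl's-law speedup ceiling: accelerating a kernel that occupies fraction p of total runtime can improve end-to-end latency by at most 1/(1-p). The helper names are hypothetical; only the torch.profiler calls reflect the real API.

```python
from torch.profiler import profile, ProfilerActivity

def amdahl_ceiling(fraction, local_speedup=float("inf")):
    """Upper bound on end-to-end speedup when an op consuming `fraction`
    of total runtime is accelerated by `local_speedup`."""
    if local_speedup == float("inf"):
        return 1.0 / (1.0 - fraction)
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def rank_bottlenecks(model, example_input):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        model(example_input)
    events = prof.key_averages()
    # `device_time_total` is `cuda_time_total` on older PyTorch releases.
    total = sum(e.device_time_total for e in events) or 1.0
    fractions = sorted(((e.key, e.device_time_total / total) for e in events),
                       key=lambda kv: kv[1], reverse=True)
    # Even an infinitely fast replacement cannot beat the Amdahl ceiling.
    return [(name, frac, amdahl_ceiling(frac)) for name, frac in fractions]
```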

Correctness is enforced via a five-stage verification harness: an initial smoke test, a shape sweep across 10+ configurations, adversarial stability probing, a determinism check, and edge-case size coverage. Only after passing all five stages is a kernel considered for performance benchmarking, preventing regressions or invalid optimizations.
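
A minimal sketch of such a gate, assuming 2-D row-wise kernels; the stage names come from the paper, but the individual checks are illustrative rather than AutoKernel's exact code:

```python
import torch

def passes_harness(kernel, reference, shapes, rtol=1e-3, atol=1e-3):
    """Gate a candidate kernel through the five stages; only survivors
    proceed to benchmarking."""
    def agrees(x):
        return torch.allclose(kernel(x), reference(x), rtol=rtol, atol=atol)

    x0 = torch.randn(*shapes[0], device="cuda")
    if not agrees(x0):                                        # 1. smoke test
        return False
    if not all(agrees(torch.randn(*s, device="cuda")) for s in shapes):
        return False                                          # 2. shape sweep (10+)
    if not agrees(torch.randn(*shapes[0], device="cuda") * 1e4):
        return False                                          # 3. stability probe
    if not torch.equal(kernel(x0), kernel(x0)):               # 4. determinism
        return False
    edges = [(1, shapes[0][1]), (shapes[0][0], 1)]            # 5. edge-case sizes
    return all(agrees(torch.randn(*s, device="cuda")) for s in edges)
```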

Algorithmic Approach and Design Rationales

AutoKernel employs a deliberate "single-file edit-benchmark-keep/revert" agent loop, inspired by Karpathy's autoresearch paradigm. The agent operates on a strict invariant: exactly one kernel file is touched per iteration, with git-based experiment tracking for interpretable and reversible search trajectories (sketched after the playbook below). A six-tier optimization playbook structures the agent's search space across:

  • Block and tile size tuning
  • Memory access optimizations (e.g., vectorization, prefetching, cache utilization)
  • Compute pattern variants (e.g., accumulator precision, epilogue fusion)
  • Advanced strategies (e.g., split-K, persistent kernels)
  • Architecture-specific adjustments (e.g., Hopper TMA ops, Ampere cp.async)
  • Operation-specialized transformations (e.g., online softmax, Welford's normalization)
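
The keep/revert invariant can be sketched as follows; the callables stand in for the agent, the harness, and the timer, and are not AutoKernel's actual interfaces:

```python
import subprocess

def optimize(kernel_path, propose_edit, validate, benchmark, budget=200):
    """One file, one edit per iteration; git keeps every step reversible."""
    best_ms = benchmark(kernel_path)
    for i in range(budget):
        propose_edit(kernel_path)              # agent rewrites the single file
        if validate(kernel_path):              # five-stage harness gates timing
            ms = benchmark(kernel_path)
            if ms < best_ms:                   # keep: commit the improvement
                best_ms = ms
                subprocess.run(["git", "commit", "-am", f"iter {i}: {ms:.3f} ms"],
                               check=True)
                continue
        # revert: discard edits that fail validation or regress performance
        subprocess.run(["git", "checkout", "--", kernel_path], check=True)
    return best_ms
```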

This approach explicitly sidelines architectural complexity found in multi-agent systems, focusing instead on process transparency, single-agent determinism, and reproducibility.

Backend Duality

AutoKernel's dual-backend support targets both fast prototyping and fine-grained architectural optimization. Triton enables sub-5-second JIT compilation and rapid iteration, effective for high-level kernel abstractions. CUDA C++ unlocks tensor core primitives, warp shuffles, and bank-conflict-free shared memory patterns, offering full exposure to low-level hardware features when required. The system's backend-agnostic harness allows direct performance comparison and migration between abstraction levels within the same optimization process.
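
A generic elementwise Triton kernel of the kind that makes this iteration speed possible (an illustrative example, not one of the 18 starter kernels):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n, alpha, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                       # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * alpha, mask=mask)

def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)        # one program per 1024-element block
    scale_kernel[grid](x, out, n, alpha, BLOCK=1024)
    return out
```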

Quantitative Results

Empirical evaluation on NVIDIA H100 hardware demonstrates AutoKernel's Triton starter kernels outperform PyTorch eager and torch.compile (max-autotune) baselines on the majority of relevant transformer-critical kernels:

  • RMSNorm: 5.29× over PyTorch eager, 2.83× over torch.compile, attaining 2,788 GB/s (83% of H100 DRAM peak).
  • Softmax: 2.82× over eager, 3.44× over torch.compile, and exceeding 2,800 GB/s.
  • Cross-Entropy: 2.21× over eager, 2.94× over torch.compile, reaching 2,070 GB/s.
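
Bandwidth figures like these follow from bytes moved divided by measured time; a sketch, assuming a simple one-read-one-write elementwise kernel and CUDA-event timing (for reference, 2,788 GB/s against the H100 SXM's ~3.35 TB/s HBM3 peak gives the quoted 83%):

```python
import torch

def achieved_bandwidth_gbs(fn, x, iters=100):
    """Average achieved DRAM bandwidth of fn(x) in GB/s."""
    fn(x)                                            # warm-up / JIT compile
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    bytes_moved = 2 * x.numel() * x.element_size()   # one read + one write
    return bytes_moved / (ms * 1e-3) / 1e9
```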

All evaluated configurations (n=34) pass the full five-stage correctness harness, underscoring the system's robustness in code generation and validation.

While memory-bound kernels exhibit maximal speedups, matmul performance highlights the limitations of source-level search: AutoKernel's Triton matmul reaches 278 TFLOPS (28% of H100 peak), surpassing torch.compile but not cuBLAS, emphasizing the sustained advantage of hand-tuned vendor libraries for dense GEMM workloads. The agent loop is explicitly tasked to close this gap iteratively, with an optimization trajectory that prioritizes block-size and tiling sweeps before exploring advanced strategies.
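
That first stage maps naturally onto Triton's autotuner; a sketch of a block-size sweep over a tiled matmul (the configurations are illustrative, not the playbook's):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=8),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8),
    ],
    key=["M", "N", "K"],                  # re-run the sweep when the shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)   # fp32 accumulator
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```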

Community Impact and Competitive Deployment

AutoKernel's autonomous capabilities extend beyond synthetic evaluation. Community deployment yielded a first-place result on the vectorsum_v2 B200 leaderboard, and, most notably, a single-prompt Triton FP4 matmul kernel exceeded CUTLASS performance by 1.63–2.15× across diverse shapes, peaking at 2,898 TFLOPS. This underscores the efficacy of encoding expert heuristics in agent-accessible playbooks and positions AutoKernel as a practical alternative to both manual tuning and closed vendor toolchains.

Theoretical and Practical Implications

AutoKernel demonstrates that an iterative, agent-driven optimization process, constrained by rigorous correctness gates and guided by profile-driven prioritization (via Amdahl's law), suffices to automate substantive portions of the GPU kernel engineering cycle. This validates the translatability of the "autoresearch loop" to low-level code optimization and indicates that expert-level kernel performance is tractable without multi-agent coordination or human-in-the-loop feedback, provided sufficient agent instruction and explicit evaluation.

The architecture is modular for future extensions: potential directions include distributed, population-based search, RL-guided mutation strategies leveraging hardware performance counters, multi-kernel fusion, and automated PTX or SASS analysis integration for deeper hardware-coupled optimization.

Practically, the system unlocks scalable deployment of optimized kernels for models and workloads as soon as they are conceived, reducing ecosystem lag due to library absence. The explicit export functionality for HuggingFace Kernels further accelerates broad distribution and community adoption of agent-generated performant kernels.

Conclusion

AutoKernel establishes an agent-driven, correctness-anchored, profile-prioritized framework for autonomous kernel optimization encompassing both rapid prototyping (Triton) and hardware-level specialization (CUDA C++). The system achieves significant empirical speedups over state-of-the-art baselines in memory-bound regimes and demonstrates competitiveness with expert-tuned libraries in community evaluations. Its design favors transparency, extensibility, and robustness, providing a compelling baseline and testbed for future research in autonomous code generation and iterative hardware-aware software synthesis (2603.21331).
