CUDA Agent: Autonomous Kernel Optimization

Updated 3 March 2026

CUDA Agent is a reinforcement-learning system that uses multi-agent collaboration to synthesize, optimize, and verify CUDA kernels.
It leverages large language models, automated profiling, and evolutionary strategies to deliver hardware-aware, continual kernel improvements.
Empirical benchmarks demonstrate significant speedups and high correctness across heterogeneous GPUs, highlighting its effectiveness in complex computational tasks.

A CUDA Agent is a multi-agent or reinforcement-learning-based system that autonomously generates, optimizes, and verifies CUDA kernels for high-performance GPU execution. Modern CUDA Agents leverage LLMs, automated profiling, evolutionary strategies, and feedback mechanisms to surpass traditional compiler heuristics and narrow expert-driven pipelines. The CUDA Agent paradigm now encompasses architectures enabling continual improvement, hardware-aware reasoning, long-term memory, and training via large-scale, reward-driven RL loops for kernel synthesis and optimization (Dai et al., 27 Feb 2026, Zhang et al., 23 Oct 2025, Chen et al., 18 Dec 2025, Dong et al., 15 Feb 2026, Wei et al., 9 Sep 2025, Du et al., 29 Dec 2025).

1. Architectural Paradigms and Agent Roles

CUDA Agents are structured as systems of cooperating agents, each assigned a specialized responsibility and coordinated within a tightly integrated optimization workflow:

Task decomposition: Agents divide the overall kernel optimization process into generative, evaluative, and guidance-driven roles. For example, CudaForge employs a dual-agent loop: a Coder agent emits candidate CUDA kernels, while a Judge agent analyzes compilation outcomes, correctness, and profiler feedback to generate either correction or optimization directives (Zhang et al., 23 Oct 2025).
Specialized module hierarchy: Advanced agents introduce additional modularity, including SCE Managers (evolutionary control), Strategy Translators (semantic-level mutation), Kernel Revisors (code, correctness, and profiling), and Roofline Prophets (hardware guidance) as in cuPilot (Chen et al., 18 Dec 2025). Astra, as another example, decomposes kernel optimization into CodingAgent, PlanningAgent, ProfilingAgent, and TestingAgent, running in a multi-turn feedback loop (Wei et al., 9 Sep 2025).
Memory augmentation and long-term knowledge: Persisting cumulative experience in structured knowledge bases (KBs), as in KernelBlaster, allows agents to retrieve relevant optimizations based on new profiling states, enhancing cross-task generalization and search coverage (Dong et al., 15 Feb 2026).
Skill-augmented, tool-based environments: CUDA Agent integrates LLM-driven actions (e.g., code editing, compiling, profiling, searching) with system-provided reward signals, orchestrated by a robust, script-isolated tool suite to prevent reward manipulation (Dai et al., 27 Feb 2026).
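The memory-augmentation role described above can be illustrated with a minimal sketch. It assumes a profiling state is reduced to a small numeric signature and that retrieval is a nearest-neighbour lookup over stored states; the names `KBEntry`, `KnowledgeBase`, `record`, and `retrieve` are hypothetical and are not taken from KernelBlaster, whose actual storage format and retrieval method are not specified here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class KBEntry:
    signature: Tuple[float, ...]  # profiling state, e.g. (occupancy, DRAM utilization)
    optimization: str             # textual description of the applied optimization
    score: float                  # speedup observed after applying it


@dataclass
class KnowledgeBase:
    entries: List[KBEntry] = field(default_factory=list)

    def record(self, signature: Sequence[float], optimization: str, score: float) -> None:
        """Persist one state -> optimization -> score observation."""
        self.entries.append(KBEntry(tuple(signature), optimization, score))

    def retrieve(self, signature: Sequence[float], k: int = 2) -> List[str]:
        """Return optimizations from the k nearest stored states, best score first."""
        nearest = sorted(
            self.entries, key=lambda e: math.dist(e.signature, signature)
        )[:k]
        ranked = sorted(nearest, key=lambda e: e.score, reverse=True)
        return [e.optimization for e in ranked]


kb = KnowledgeBase()
kb.record((0.30, 0.90), "shared-memory tiling", 1.8)
kb.record((0.35, 0.85), "vectorized float4 loads", 1.4)
kb.record((0.90, 0.20), "instruction-level parallelism", 1.2)

# A new kernel with low occupancy and high DRAM utilization retrieves the
# two optimizations recorded for similar (memory-bound-looking) states.
print(kb.retrieve((0.32, 0.88), k=2))
```

Ranking the retrieved neighbours by observed score is one simple design choice; a system could equally sample among them to preserve exploration.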

2. Iterative, Feedback-Driven Optimization Workflow

At the core of CUDA Agent systems lies a cyclic process that tightly couples code generation and performance feedback:

Initialization: An LLM agent generates an initial CUDA kernel from a high-level specification (often a PyTorch or NumPy operator). This can involve heuristic strategy selection, template filling, or sampling from a strategy pool (Chen et al., 18 Dec 2025, Du et al., 29 Dec 2025).
Evaluation and Profiling: The kernel undergoes correctness verification and is instrumented with profiler tools (e.g., Nsight Compute). Hardware signals (occupancy, memory throughput, SM cycles, stall sources) are extracted (Zhang et al., 23 Oct 2025, Dong et al., 15 Feb 2026).
Diagnosis and Feedback: Profiling results and detected errors trigger focused diagnostic feedback. Agents (e.g., Judge in CudaForge or PlanningAgent in Astra) issue correction directives or targeted optimization suggestions (e.g., “memory-bound; fuse loads with shared memory”) (Zhang et al., 23 Oct 2025, Wei et al., 9 Sep 2025).
Action Selection: The system samples from a prioritized or population-based set of code transformations or high-level strategies, often using reinforced, evolutionary, or memory-guided selection (Chen et al., 18 Dec 2025, Dong et al., 15 Feb 2026).
Update and Iteration: Kernels are regenerated or mutated according to agent guidance. The loop repeats until a convergence criterion (e.g., a speedup threshold or full correctness) is met or a bounded number of rounds is exhausted, typically up to 10 rounds in CudaForge and 200 in CUDA Agent (Zhang et al., 23 Oct 2025, Dai et al., 27 Feb 2026).

A representative algorithmic loop from CudaForge:

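feedback = None  # no feedback exists before the first round
for r in range(1, N + 1):
    # Coder proposes a kernel, conditioned on the previous round's feedback
    code_r = Coder.generate(task_desc, pytorch_ref, feedback)
    success, out_or_err = compile_and_run(code_r, tests)
    if not success:
        # compilation or correctness failure: Judge returns a correction directive
        feedback = Judge.correct(error_log=out_or_err, ...)
    else:
        # correct kernel: profile it, then Judge returns an optimization directive
        ncu_metrics = ncu_profile(code_r)
        feedback = Judge.optimize(gpu_specs, ncu_metrics, code=code_r)
        record_speedup(r, out_or_err.latency, T_baseline)
    if convergence_detected():
        break
# keep the fastest kernel among those that passed correctness checks
best_kernel = select_highest_speedup_correct_kernel()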

(Zhang et al., 23 Oct 2025)

3. Optimization Strategies and Hardware Awareness

CUDA Agents operationalize a variety of CUDA-specific optimization techniques, dynamically chosen based on profiler data and analytical models:

Strategy encoding: cuPilot utilizes explicit “strategy” descriptions (JSON/DSL lists of high-level CUDA optimizations), evolving strategies via LLM-guided selection, crossover, and mutation. Strategies are applied to kernel templates and performance profiles are collected for multi-objective scoring (Chen et al., 18 Dec 2025).
Roofline guidance: Several agents (cuPilot, CudaForge) embed roofline analysis, using operational intensity $I = \text{Flops}/\text{BytesTransferred}$ and device peaks to classify kernels as memory-bound or compute-bound, which steers prompts toward suitable optimizations (e.g., shared-memory tiling for memory-bound kernels vs. instruction-level parallelism for compute-bound ones). In essence, attainable performance is $\text{Roofline}(I) = \min(C_{\max}, B_{\max} \cdot I)$, where $C_{\max}$ is the peak compute throughput and $B_{\max}$ the peak memory bandwidth of the device (Chen et al., 18 Dec 2025).
Key optimization actions include:
- Loop tiling and blocking for shared-memory access patterns
- Software pipelining of loads/computes
- Vectorization using float4/int4 loads, __half2 intrinsics
- Warp-level reductions with shfl instructions in lieu of shared-memory trees
- Kernel fusion to bypass intermediate memory movement
- Grid/block configuration tuning for SM occupancy balancing
- Fast-math intrinsic substitution for function evaluation (Zhang et al., 23 Oct 2025, Wei et al., 9 Sep 2025, Chen et al., 18 Dec 2025, Du et al., 29 Dec 2025)
Memory-augmented RL exploration: KernelBlaster augments optimization selection using a knowledge base of state→optimization→predicted score mappings; profiled signatures are clustered, and agents sample among historically successful actions using “textual gradients” derived from realized speedup discrepancies (Dong et al., 15 Feb 2026).
Reward specification and stabilization: CUDA Agent uses a robust discrete reward structure, not continuous speedup, to ensure signal reliability and prevent reward hacking. The reward function is defined as:

$r = \begin{cases} -1, & \text{if correctness fails}, \\ 3, & \text{if } b(t, t_\mathrm{eager}) \land b(t, t_\mathrm{compile}), \\ 2, & \text{if } b(t, t_\mathrm{eager}), \\ 1, & \text{otherwise} \end{cases}$

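with $b(t,t_0) = \mathbf{1}[(t_0-t)/t_0 > 5\%]$, $t$ the measured kernel runtime, and $t_\mathrm{eager}$, $t_\mathrm{compile}$ the runtimes of the PyTorch eager and torch.compile baselines (Dai et al., 27 Feb 2026).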
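The discrete reward above translates directly into code. The following is a minimal sketch of the formula as stated, not the CUDA Agent implementation; the function names `beats` and `reward` and the `margin` parameter are hypothetical, and runtimes are assumed to be positive floats in the same unit.

```python
def beats(t: float, t_ref: float, margin: float = 0.05) -> bool:
    """b(t, t0): True when t is more than `margin` (5%) faster than t_ref."""
    return (t_ref - t) / t_ref > margin


def reward(correct: bool, t: float, t_eager: float, t_compile: float) -> int:
    """Discrete reward: -1 on failure, else 1..3 depending on baselines beaten."""
    if not correct:
        return -1
    if beats(t, t_eager) and beats(t, t_compile):
        return 3
    if beats(t, t_eager):
        return 2
    return 1


# A correct kernel at 0.5 ms vs. 1.0 ms (eager) and 0.8 ms (compiled) beats both.
print(reward(True, 0.5, 1.0, 0.8))
# A correct kernel at 0.9 ms beats eager by 10% but is slower than the compiled baseline.
print(reward(True, 0.9, 1.0, 0.8))
```

Because the reward only changes when a 5% threshold is crossed, small timing noise near a baseline does not move the signal, which is the stabilizing property the text describes.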

4. Empirical Results: KernelBench and Comparative Metrics

Benchmarks on KernelBench Levels 1–3 form the standard evaluation protocol for CUDA Agent frameworks. Key published results include:

Framework	Pass Rate	Faster Rate vs torch.compile	Geometric Mean Speedup	LLM-agnostic	Multi-GPU Generalization	API/Dev Cost
CUDA Agent	100% (L1/L2), 94%	97% (L1), 100% (L2), 90% (L3)	1.87× (L1), 2.80× (L2)	Yes	Yes (H100, A100)	128 H20 GPUs, high
CudaForge	97.6%	70.8% (fast_1)	1.68× (avg), 1.107× (med)	Yes	Yes	$0.3 per kernel
cuPilot	--	--	3.09× (overall), 4.06× (GEMM)	Yes	Yes	--
KernelBlaster	--	--	1.43× (L1), 2.50× (L2)	Yes	Yes	moderate, open source
Astra	100%	--	1.32× (avg, SGLang)	Yes	--	--

Pass Rate: Fraction of tasks compiling and passing correctness
Faster Rate: Fraction of correct tasks faster than baseline
Geometric Mean Speedup: Speedup over baseline on correct tasks
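The three metrics defined above can be computed from per-task results as in the sketch below. It follows the definitions as worded here (Faster Rate and geometric mean are taken over correct tasks only); the record layout with `correct` and `speedup` keys and the function name `summarize` are hypothetical, and individual papers may normalize these metrics differently.

```python
import math
from typing import Dict, List, Tuple


def summarize(results: List[Dict]) -> Tuple[float, float, float]:
    """Return (pass_rate, faster_rate, geometric_mean_speedup).

    Each result has 'correct' (bool) and, for correct tasks,
    'speedup' (baseline_time / kernel_time, so > 1.0 means faster).
    """
    if not results:
        raise ValueError("no results to summarize")
    correct = [r for r in results if r["correct"]]
    pass_rate = len(correct) / len(results)
    if not correct:
        return pass_rate, 0.0, float("nan")
    faster_rate = sum(r["speedup"] > 1.0 for r in correct) / len(correct)
    # Geometric mean via the mean of logs, which avoids overflow in long products.
    log_mean = sum(math.log(r["speedup"]) for r in correct) / len(correct)
    return pass_rate, faster_rate, math.exp(log_mean)


results = [
    {"correct": True, "speedup": 2.0},
    {"correct": True, "speedup": 0.5},
    {"correct": False},
    {"correct": True, "speedup": 4.0},
]
pass_rate, faster_rate, geomean = summarize(results)
print(f"pass={pass_rate:.2f} faster={faster_rate:.2f} geomean={geomean:.2f}x")
```

The geometric mean is the usual choice for speedup ratios because a 2× gain and a 0.5× loss cancel out, which an arithmetic mean would not reflect.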

CUDA Agent outperforms both static heuristics (torch.compile) and proprietary LLMs (Claude Opus 4.5, Gemini 3 Pro) on Level-3 problems (full neural blocks): 90% Faster Rate and 1.52× mean speedup, with an overall mean of 2.11× (Dai et al., 27 Feb 2026). CudaForge demonstrates cost-effective optimization: ~26.5 min per kernel on RTX6000 and $0.3 API cost, compared to $5 and 6 H100 GPU-hours for other agents (Zhang et al., 23 Oct 2025). cuPilot’s strategy-level evolution achieves strong speedups on GEMM (4.06×) and activations (4.16×), along with high utilization of hardware units (Chen et al., 18 Dec 2025). KernelBlaster enables continual cross-task optimization over multiple GPU generations by retrieving and updating state-action knowledge (Dong et al., 15 Feb 2026).

5. Generalization, Modularity, and Extensibility

CUDA Agents are designed to generalize across hardware backends, LLM model variants, and domain-specific languages:

Hardware adaptation: Supplying agent modules with device-specific specifications (e.g., GPU architecture, register-file and shared-memory sizes) lets code generation and profiling adapt to A100, RTX 4090, RTX 3090, L40S, and Hopper-class GPUs. CudaForge and KernelBlaster report 100% correctness and proportional speedups on heterogeneous hardware (Zhang et al., 23 Oct 2025, Dong et al., 15 Feb 2026).
LLM model agnosticism: Replacing agentic actors with alternative LLMs (GPT-5, Claude-Sonnet-4, GPT-OSS-120B, etc.) maintains high correctness and performance, indicating modular separation and prompt-driven compatibility (Zhang et al., 23 Oct 2025).
Multi-DSL and platform support: AKG kernel agent supports Triton, TileLang, CPP, and CUDA-C, enabling migration and automatic kernel porting across GPU, CPU, and NPU targets. The Unified Sketch intermediate enables backend-agnostic codegen, broadening application domains (Du et al., 29 Dec 2025).
Population-based and evolutionary search: Multi-island genetic strategies, as in AKG’s Evolve and cuPilot’s population evolution, ensure diversity and escape from local minima.
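To make the hardware-adaptation point concrete, the sketch below shows how device-specific peaks could feed the roofline model from Section 3, so that the same kernel is classified differently on different GPUs. It is an illustration only: `DeviceSpec`, `roofline`, and `classify` are hypothetical names, and the peak values are round illustrative numbers, not vendor specifications for any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSpec:
    name: str
    peak_flops: float      # C_max in FLOP/s
    peak_bandwidth: float  # B_max in bytes/s


def roofline(spec: DeviceSpec, intensity: float) -> float:
    """Attainable FLOP/s: min(C_max, B_max * I)."""
    return min(spec.peak_flops, spec.peak_bandwidth * intensity)


def classify(spec: DeviceSpec, flops: float, bytes_moved: float) -> str:
    """Compare operational intensity I against the device's ridge point."""
    intensity = flops / bytes_moved
    ridge = spec.peak_flops / spec.peak_bandwidth
    return "memory-bound" if intensity < ridge else "compute-bound"


# Illustrative (made-up) devices: ridge points of 10 and 40 FLOP/byte.
gpu_a = DeviceSpec("gpu-a", peak_flops=20e12, peak_bandwidth=2e12)
gpu_b = DeviceSpec("gpu-b", peak_flops=40e12, peak_bandwidth=1e12)

# A kernel performing 20 FLOPs per byte moved lands on different sides
# of the ridge depending on the device it runs on.
for gpu in (gpu_a, gpu_b):
    print(gpu.name, classify(gpu, flops=20e9, bytes_moved=1e9))
```

An agent could place this label in the prompt to bias the next round toward tiling and vectorized loads for memory-bound kernels, or toward instruction-level parallelism for compute-bound ones.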

6. Limitations and Prospects

Current limitations and ongoing research directions for CUDA Agents include:

Resource and infrastructure demands: Large-scale RL training (CUDA Agent) requires extensive GPU clusters and containerized environments. Efforts to reduce compute demands via off-policy RL, distilled agents, or smaller LLM backends are active (Dai et al., 27 Feb 2026).
Context window scaling: The prompt-based architecture imposes upper bounds on kernel or model size per iteration, motivating work on hierarchical decomposition and mixed-size agent routing (Dong et al., 15 Feb 2026).
Integration with existing compilers: There is limited published comparison against advanced compiler frameworks (TVM, Triton) in RL loops, though future CUDA Agents may incorporate mixed-mode compiler actions.
Extensible memory and knowledge bases: As persistent state-action memories grow (e.g., 50 KB in KernelBlaster), retrieval latency may eventually bottleneck agent throughput. Hybrid approaches using lightweight RL policies distilled from memory contents are under consideration.
Cross-accelerator and cross-domain generalization: Multi-agent frameworks have begun to target AMD ROCm and NPUs, but device-specific heuristics must still be rediscovered or encoded.

7. Applications Beyond Kernel Synthesis

Beyond autonomous kernel optimization, the CUDA Agent paradigm extends to:

Physics and multi-agent simulation: Titan leverages a CUDA-accelerated, asynchronous agent model for real-time robotics and soft-body simulation, achieving up to 389 million primitive updates/s and efficiently orchestrating hundreds of agents/objects in simulated environments, with flexible topology adaptation and real-time reinforcement learning integration (Austin et al., 2019).
Automated code migration and heterogeneity: Modular agentic designs with unified intermediate representations (e.g., Unified Sketches) facilitate migration of optimized kernels across disparate accelerator architectures, supporting rapid deployment in heterogeneous inference serving and cross-domain training.

CUDA Agents thus represent an integration point for automated deep optimization, LLM-guided synthesis, adaptive search, and robust performance modeling, enabling expert-level, hardware-specific code generation with minimal manual intervention (Dai et al., 27 Feb 2026, Zhang et al., 23 Oct 2025, Chen et al., 18 Dec 2025, Dong et al., 15 Feb 2026, Du et al., 29 Dec 2025, Wei et al., 9 Sep 2025, Austin et al., 2019).