Agentic Kernel Generation
- Agentic kernel generation is an automated approach that employs autonomous LLM agents in iterative loops to generate, validate, and optimize computational kernels.
- It leverages multi-modal feedback—including static analysis, JIT profiling, and empirical tests—to ensure functional correctness and performance improvement.
- These systems support heterogeneous hardware and DSLs, achieving significant speedups and broad operator coverage through adaptive, closed-loop workflows.
Agentic kernel generation refers to the automated synthesis, validation, and optimization of computational kernels or system-level primitives using agentic workflows—primarily those driven by LLMs operating in iterative, feedback-driven, and often multi-agent loops. These systems transform kernel enablement from a static human-engineered activity to an adaptive, scalable, and context-sensitive process, targeting diverse hardware and software stacks from AI accelerators to OS subsystems (Hammond et al., 3 Dec 2025, Wang et al., 31 Jul 2025, Zhang et al., 23 Oct 2025, Liao et al., 29 Dec 2025, Du et al., 29 Dec 2025, Dong et al., 19 Oct 2025, Zhang et al., 19 Nov 2025, Zheng et al., 1 Sep 2025).
1. Core Concepts and Definitions
Agentic kernel generation systems are distinguished by their use of autonomous agents—typically implemented via LLMs—which iteratively generate, assess, and refine kernel implementations. Unlike naive prompt-based or one-shot code generation, these systems use closed feedback loops incorporating both programmatic and empirical checks:
- Agentic loop: The central workflow, frequently modeled as a finite-state machine (FSM) or as a search (tree or graph) traversed by cooperative agents. Stages include code generation, static analysis, compilation, hardware execution, and feedback extraction.
- Coverage orientation: Prioritization of correct functional coverage across large kernel/operator sets, supporting all data types, signature patterns, and argument shapes (Hammond et al., 3 Dec 2025).
- Multi-modal feedback: Integration of static program checks (linting, AST analysis), dynamic runtime profiling (JIT, hardware counters), empirical correctness (test harnesses), and knowledge retrieval from documentation or historical experience (Zhang et al., 23 Oct 2025, Liao et al., 29 Dec 2025).
- Iterative refinement: Use of LLM-based agents or subagents specialized for code synthesis, error diagnosis, optimization suggestion, or plan decomposition, operating in feedback loops inspired by human engineering workflows.
- Heterogeneous and cross-platform support: Compatibility with multiple hardware backends (e.g., NVIDIA, AMD, Meta MTIA, NPUs, CPUs) and diverse kernel DSLs (Triton, CUDA, CuTe, TileLang) (Liao et al., 29 Dec 2025, Du et al., 29 Dec 2025).
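The closed agentic loop described above can be sketched as a small state machine. This is an illustrative toy, not any system's actual implementation: the stage functions (`generate`, `lint`, `compile_and_test`) are hypothetical stand-ins for an LLM call, a static analyzer, and a hardware test harness.

```python
def generate(prompt: str) -> str:
    # Stand-in for an LLM call; a real system conditions on accumulated feedback.
    return "def add(x, y): return x + y"

def lint(code: str):
    # Stand-in static check: block an "unsafe" construct, mirroring anti-cheating linters.
    ok = "eval(" not in code
    return ok, "clean" if ok else "unsafe eval() construct"

def compile_and_test(code: str):
    # Stand-in for JIT compilation plus empirical correctness against a reference.
    ns = {}
    try:
        exec(code, ns)                     # "compile" stage
        ok = ns["add"](2, 3) == 5          # test-harness stage
        return ok, "pass" if ok else "output mismatch"
    except Exception as e:
        return False, str(e)

def agentic_loop(task: str, max_iters: int = 4):
    """Generate -> lint -> compile/test -> feed errors back, until pass or budget."""
    feedback = ""
    for _ in range(max_iters):
        code = generate(task + feedback)   # generation conditioned on prior feedback
        ok, msg = lint(code)
        if not ok:
            feedback = f"\nLinter: {msg}"  # static-analysis feedback
            continue
        ok, msg = compile_and_test(code)
        if ok:
            return code                    # terminal "validated" state
        feedback = f"\nTest: {msg}"        # empirical feedback
    return None                            # budget exhausted
```

The essential property is that every transition back to generation carries structured feedback, which is what distinguishes these loops from one-shot prompting.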
2. Architectures and Methodological Taxonomy
Several design archetypes for agentic kernel generation have converged in recent work:
| System | Architecture | Agents / Submodules | Hardware / DSLs |
|---|---|---|---|
| TritorX (Hammond et al., 3 Dec 2025) | FSM per operator | LLM generator, Linter, Compiler, Test harness, Log Summarizer | Meta MTIA, Triton |
| GEAK (Wang et al., 31 Jul 2025) | Multi-agent pipeline | Generator, Evaluator, Reflector, Optimizer | AMD MI300X, Triton |
| CudaForge (Zhang et al., 23 Oct 2025) | Two-agent (Coder, Judge) | Correction, Optimization (hardware feedback) | CUDA, NVIDIA GPUs |
| KernelEvolve (Liao et al., 29 Dec 2025) | Graph search (universal operator) | Node selection, Universal operator, Eval, Retriever | NVIDIA, AMD, MTIA; Triton, CuTe, MLIR |
| AKG (Du et al., 29 Dec 2025) | Closed-loop, modular | Designer, Coder, Verifier, Conductor | Triton, CUDA-C, TileLang, CPP |
| STARK (Dong et al., 19 Oct 2025) | Tree search, multi-agent | Search controller, Plan agent, Code agent, Debug/Profiler | CUDA |
| AccelOpt (Zhang et al., 19 Nov 2025) | Beam-search loop | Planner, Executor, Summarizer, Memory | AWS Trainium/NKI |
| SchedCP (Zheng et al., 1 Sep 2025) | Multi-agent, decoupled OS | Observation, Planning, Execution, Learning | Linux eBPF, Schedulers |
Most implementations structure kernel generation as an iterative process: (1) candidate generation via an LLM (often context-conditioned), (2) static or dynamic formal verification, (3) JIT compilation or hardware execution, and (4) response-driven prompt or memory updates. Architectures range from explicit FSMs (Hammond et al., 3 Dec 2025), beam or tree search (Zhang et al., 19 Nov 2025, Dong et al., 19 Oct 2025, Liao et al., 29 Dec 2025), to multi-agent modular systems (Du et al., 29 Dec 2025, Wang et al., 31 Jul 2025, Zhang et al., 23 Oct 2025).
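The beam- and tree-search variants share a common skeleton: expand a frontier of candidate kernels, score each by measured fitness, and keep the top few. A generic sketch, with toy `expand` and `score` functions standing in for LLM-driven mutation and hardware profiling:

```python
import heapq

def beam_search(seed, expand, score, beam_width=2, depth=3):
    """Generic beam search over kernel variants.
    expand(c) -> list of mutated candidates; score(c) -> fitness (higher is better)."""
    beam = [seed]
    for _ in range(depth):
        pool = [child for c in beam for child in expand(c)]
        beam = heapq.nlargest(beam_width, pool, key=score) or beam
    return max(beam, key=score)

# Toy instantiation: candidates are integers, "expansion" perturbs them,
# and fitness peaks at 10 (a stand-in for measured speedup).
best = beam_search(0, expand=lambda c: [c + 1, c + 2],
                   score=lambda c: -abs(c - 10))
```

In real systems the expensive parts are `expand` (an LLM call) and `score` (a hardware run), so beam width and depth become the primary cost knobs.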
3. Formal Decision Criteria and Optimization Objectives
Agentic kernel generation systems formalize correctness and fitness criteria as binary and continuous objectives grounded in hardware-realized execution:
- Lint and static correctness: a candidate passes only if every linter rule yields zero violations.
- Functional correctness: an operator passes if its outputs match a canonical reference backend within a numerical tolerance across all relevant test inputs.
- Coverage: the fraction of operators or benchmarks with complete pass rates; full coverage indicates correctness on every tracked operator.
- Performance objectives: speedup relative to a reference implementation, as in TritorX's fitness function.
- Termination: the loop completes upon reaching target coverage, an improvement stall, or exhaustion of the artifact budget.
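One plausible formalization of these criteria, with illustrative symbols (test suite $T$, reference $f_{\mathrm{ref}}$, tolerance $\epsilon$, operator set $O$) rather than any one paper's notation:

```latex
% Functional correctness: candidate k passes if it matches the reference
% within tolerance \epsilon on every test input x in the suite T:
\mathrm{pass}(k) \;=\; \bigwedge_{x \in T}
  \bigl[\, \lVert k(x) - f_{\mathrm{ref}}(x) \rVert_\infty \le \epsilon \,\bigr]

% Coverage over an operator set O:
\mathrm{Cov} \;=\; \frac{1}{|O|} \sum_{o \in O} \mathbf{1}\!\left[\mathrm{pass}(k_o)\right]

% Fitness as speedup over the reference latency:
F(k) \;=\; \frac{t_{\mathrm{ref}}}{t_k}
```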
Agent selection and expansion often use softmax, $\epsilon$-greedy, or Monte Carlo Tree Search policies over observed fitness or coverage scores (Liao et al., 29 Dec 2025, Dong et al., 19 Oct 2025). In evaluation, systems report metrics such as median speedup, percent exceeding baseline, pass@K, and per-operator correctness.
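The softmax and $\epsilon$-greedy selection policies are standard bandit-style rules; a minimal sketch (function names and signatures are illustrative, not any system's API):

```python
import math
import random

def softmax_select(nodes, fitness, temperature=1.0, rng=random):
    """Sample a search node with probability proportional to exp(fitness / T)."""
    scores = [fitness(n) / temperature for n in nodes]
    m = max(scores)                                # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return rng.choices(nodes, weights=weights, k=1)[0]

def epsilon_greedy_select(nodes, fitness, epsilon=0.1, rng=random):
    """With probability epsilon explore uniformly; otherwise exploit the best node."""
    if rng.random() < epsilon:
        return rng.choice(nodes)
    return max(nodes, key=fitness)
```

Temperature and $\epsilon$ trade exploration of untried kernel variants against exploitation of the fastest known ones.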
4. Feedback Mechanisms and Context Management
Effective agentic kernel pipelines depend on multi-level and multi-modal feedback, including:
- Static linter/AST analysis blocking unsafe or “cheating” constructs (e.g., host fallback, recursive ATen calls) (Hammond et al., 3 Dec 2025).
- JIT compile/test failures summarized and filtered for prompt brevity; secondary LLMs often condense error logs (Hammond et al., 3 Dec 2025, Zhang et al., 23 Oct 2025).
- Empirical profiling: Distributed execution on real hardware (FPGA/ASIC/GPU/CPU/NPU), capturing performance counters, occupancy, memory throughput (Liao et al., 29 Dec 2025, Wang et al., 31 Jul 2025, Zhang et al., 23 Oct 2025).
- Retrospective experience/memory: Explicit archives of slow–fast kernel pairs, with summarizing LLMs to extract transferable transformations (Zhang et al., 19 Nov 2025).
- Contextual retrieval: Retrieval-augmented prompts fuse runtime bottlenecks, prior kernel variants, and documentation slices to inform the next iteration (Liao et al., 29 Dec 2025, Du et al., 29 Dec 2025).
Context management strategies include prompt truncation, focused tokenization (e.g., bottleneck-extracted artifacts only), and dynamic context windows specific to each agent’s role (planning, coding, debugging) (Dong et al., 19 Oct 2025).
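Log condensation for re-prompting can be as simple as keeping the head, the tail, and any error-marked lines of a long compiler or test log. A hedged sketch of this truncation idea (marker list and cutoffs are illustrative):

```python
def condense_log(log: str, keep_head: int = 3, keep_tail: int = 8,
                 markers=("error", "undefined", "mismatch")) -> str:
    """Condense a long compiler/test log for re-prompting: keep the first few
    lines, any interior line matching an error marker, and the tail, where
    tracebacks and final diagnostics usually land."""
    lines = log.splitlines()
    if len(lines) <= keep_head + keep_tail:
        return log                                   # short logs pass through intact
    hits = [l for l in lines[keep_head:-keep_tail]
            if any(m in l.lower() for m in markers)]
    return "\n".join(lines[:keep_head] + hits + ["..."] + lines[-keep_tail:])
```

Systems that instead delegate condensation to a secondary LLM trade this determinism for better summaries of unfamiliar failure modes.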
5. Empirical Evaluation and Benchmark Results
Scalable agentic kernel systems report the following empirical capabilities:
- Operator and primitive coverage: TritorX generated correct wrappers for 481/568 PyTorch ATen operators on MTIA (84.7% OpInfo coverage) (Hammond et al., 3 Dec 2025). KernelEvolve achieved correctness on 250 KernelBench problems and 160 ATen operators across three platforms (Liao et al., 29 Dec 2025).
- Performance: Agentic systems yield consistent speedups. CudaForge attains a median speedup over PyTorch baselines on diverse GPUs (Zhang et al., 23 Oct 2025). KernelEvolve achieves large speedups on specific tasks, and AKG reports speedups on Triton-CUDA kernels (Du et al., 29 Dec 2025, Liao et al., 29 Dec 2025).
- Efficiency and cost: CudaForge requires only about \$0.3 in API cost and 26.5 min of wall-clock time per kernel, substantially below prior agentic baselines (Zhang et al., 23 Oct 2025).
- Robustness & generality: Complex operator sets, broad datatypes, model-in-the-loop testing (NanoGPT, DLRM, MM1, MM2), and cross-platform adaptability are directly validated (Hammond et al., 3 Dec 2025, Liao et al., 29 Dec 2025).
- Ablations: Removal of critical agents (linter, compilation log summarizer, optimizer) degrades coverage and performance significantly (Hammond et al., 3 Dec 2025, Wang et al., 31 Jul 2025).
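The pass@K metric reported in these evaluations is commonly computed with the unbiased estimator of Chen et al. (2021): given $n$ sampled generations of which $c$ are correct, pass@$k$ = $1 - \binom{n-c}{k}/\binom{n}{k}$. A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of them
    correct, passes the test suite."""
    if n - c < k:
        return 1.0          # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```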
6. Key Design Trade-offs and Future Directions
- Coverage-first vs. performance-first: Systems like TritorX and KernelEvolve prioritize broad operator support and functional correctness, often deferring fine-grained autotuning. Others (GEAK, AccelOpt) explicitly tune for hardware efficiency post-correctness (Hammond et al., 3 Dec 2025, Liao et al., 29 Dec 2025, Zhang et al., 19 Nov 2025).
- FSM vs. fully agentic execution: FSMs offer stringent control and reproducibility; agentic architectures with tool-using LLMs promise more flexible, adaptive workflows. The field is evolving toward hybrid agentic orchestration with tool APIs as first-class interfaces (Hammond et al., 3 Dec 2025, Du et al., 29 Dec 2025).
- Context and memory management: Retrieval-augmented prompting and lightweight summarization ensure scalability and cost control; explicit long-term memory or archive-guided planning accelerates convergence and transfers optimization patterns (Liao et al., 29 Dec 2025, Zhang et al., 19 Nov 2025, Du et al., 29 Dec 2025).
- Extensibility: Agentic frameworks are structured for rapid integration of new hardware backends and DSLs by swapping out DocSpecs, knowledge bases, or hardware constraints in context (Du et al., 29 Dec 2025, Liao et al., 29 Dec 2025).
- Safety and validation: Production agentic deployment mandates strict anti-cheating policies, comprehensive test harnesses, and staged verification to guarantee correctness under all observed usage (Hammond et al., 3 Dec 2025, Zhang et al., 23 Oct 2025).
Possible directions include reinforcement or active learning for plan selection, embedding-based retrieval for optimization memory, multi-platform joint search, and coordinated agentic optimization across kernel, OS, and system stack subsystems (Liao et al., 29 Dec 2025, Zheng et al., 1 Sep 2025, Zhang et al., 19 Nov 2025).
7. Broader Context and Philosophical Underpinnings
The agentic kernel generation paradigm signifies a transition from monolithic, single-shot code generation toward open-ended, self-adaptive, and robust code synthesis systems, moving beyond traditional AI-hardware co-design cycles (Hammond et al., 3 Dec 2025, Liao et al., 29 Dec 2025, Du et al., 29 Dec 2025). The term "agentic kernel" also echoes research in cognitive architectures, where a minimal "functional kernel" enables autonomous emergence of higher-level cognitive functions through reflexive, schema-based self-organization (Serov, 2022). This analogy underscores the trajectory of future agentic kernel platforms: to provide the substrate from which both routine and emergent computation can be self-organized and optimized—potentially closing the last-mile gap in hardware–software co-evolution.