
AKG Kernel Agent Framework

Updated 5 January 2026
  • AKG Kernel Agent is a modular, multi-agent system that automates the synthesis, migration, and tuning of computational kernels across various hardware and DSL environments.
  • It employs specialized agents—Conductor, Designer, Coder, and Verifier—that collaboratively generate hardware-agnostic code and optimize performance using adaptive feedback.
  • Its document-driven framework and iterative, search-based optimization significantly reduce manual engineering, accelerating efficient deployment in AI workloads.

The AKG Kernel Agent is a modular, multi-agent system for automated synthesis, migration, and performance tuning of computational kernels across diverse hardware and software backends. Its architecture integrates LLMs with document-driven DSL tooling and adaptive search algorithms, providing correctness, portability, and tunable performance for state-of-the-art AI workloads. The framework supports cross-platform code generation through a hardware-agnostic intermediate representation and tightly orchestrated agent collaboration, markedly reducing reliance on manual kernel engineering and accelerating deployment to new accelerators and environments (Du et al., 29 Dec 2025).

1. Motivation and Scope

Modern AI models, including LLMs, recommendation systems, and multimodal architectures, demand computation kernels that are highly optimized for both correctness and throughput. Legacy approaches to kernel development—manually coding and tuning low-level routines for each CPU, GPU, or NPU architecture—are not sustainable in the face of rapidly evolving hardware, frequent changes to quantization formats, and increasingly heterogeneous computing requirements. Even with the abstraction gains introduced by domain-specific languages (DSLs) such as Triton, TileLang, CUDA-C, and C++ wrappers, domain experts remain a bottleneck due to the intricacy of memory hierarchies, parallelization schemes, and operator-specific fusion opportunities. Single-agent LLM-based kernel generation pipelines suffer from incomplete hardware knowledge and brittle performance when spanning multiple DSLs and architecture targets. AKG Kernel Agent addresses this fundamental scalability issue by decomposing the kernel generation and optimization pipeline into multiple, specialized and interacting agents, each with access to runtime documentation, knowledge stores, and execution feedback (Du et al., 29 Dec 2025).

2. Multi-Agent System Architecture

AKG Kernel Agent employs a closed-loop, agentic architecture structured around four principal agents, coordinated by an orchestrator (“Conductor”) and supported by a document-driven integration (DDI) database. The system is modular and extensible, supporting both new DSLs and hardware without changes to the controller logic.

Principal Agents:

  • Conductor: Maintains execution history, tracks the state of each kernel generation attempt, adaptively delegates subtasks to Designer or Coder, and classifies errors from the verification process.
  • Designer: Produces a “Unified Sketch” $\mathcal{S}$, a hardware-agnostic intermediate representation (IR) encoding declarations, core operations, control flow, and optimization primitives (e.g., tiling, pipeline hints), based on the operator spec and hardware features.
  • Coder: Translates $\mathcal{S}$ into an executable kernel in a target DSL; uses a hierarchical retrieval subsystem for syntax and idiom correctness, guided by context- and error-driven hints from Conductor.
  • Verifier: Compiles and tests the generated kernel, ensuring correctness (element-wise error below a datatype-specific $\tau$) and measuring runtime performance for speedup reporting versus baseline implementations.

Supporting Component:

  • Document Store (DDI Framework): Contains DSL/hardware documentation in DocSpec format, API signatures, reference code and optimization tips, all of which are accessible to agents at runtime (Du et al., 29 Dec 2025).

Orchestration Pipeline:

Designer → Coder → Verifier → Conductor, with DocSpec retrieval and feedback integration at each stage.
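The closed agent loop described above can be sketched as a small control routine. This is a minimal illustration, not the framework's actual API: the agent interfaces (`generate_sketch`, `translate`, `compile_and_test`) and the error-class strings are assumptions based on the agent roles described in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attempt:
    """State the Conductor tracks for one kernel-generation attempt."""
    sketch: Optional[str] = None
    code: Optional[str] = None
    hints: List[str] = field(default_factory=list)

def run_pipeline(spec, backend, designer, coder, verifier, max_rounds=5):
    """Conductor-style loop: Designer -> Coder -> Verifier, with adaptive
    routing of failures back to the responsible agent."""
    attempt = Attempt()
    for _ in range(max_rounds):
        if attempt.sketch is None:
            attempt.sketch = designer.generate_sketch(spec, backend)
        attempt.code = coder.translate(attempt.sketch, backend, attempt.hints)
        result = verifier.compile_and_test(attempt.code, spec)
        if result.passed:
            return attempt.code, result.speedup
        # Design-level failures are rerouted to the Designer; syntax and
        # API-misuse failures stay with the Coder (illustrative taxonomy).
        attempt.hints = result.error_classes
        if any(e in ("Algorithm", "MemoryPattern") for e in result.error_classes):
            attempt.sketch = None  # force a redesign
    return None, 0.0
```

In this sketch the Conductor's routing decision is just a membership test on error classes; the real system additionally tracks execution history and injects targeted suggestions from failure logs.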

3. Workflow, Algorithms, and Adaptivity

The end-to-end workflow initiates when the Conductor receives a kernel synthesis request (operator spec + backend). The steps are:

  1. Sketch Generation: Designer generates a Unified Sketch $\mathcal{S} = (\mathcal{D}, \mathcal{O}, \mathcal{C}, \mathcal{H})$ given hardware characteristics.
  2. Code Synthesis: Coder translates $\mathcal{S}$ into target DSL code, consulting retrieved documentation and examples filtered by operator type, shape, DSL, and backend using embedding-based retrieval.
  3. Verification: Verifier compiles and tests the code, quantifies the element-wise errors

$$e_i = \begin{cases} \dfrac{|y_i^{gen} - y_i^{ref}|}{|y_i^{ref}|}, & |y_i^{ref}| > \epsilon \\ |y_i^{gen} - y_i^{ref}|, & \text{otherwise} \end{cases}$$

and enforces $\frac{|\{i : e_i > \tau\}|}{N} \le \tau$. Speedup is measured as $\text{Speedup} = T_{base}/T_{gen}$ after warm-up.

  4. Feedback and Repair: The Conductor classifies errors (Syntax, ApiMisuse, Algorithm, MemoryPattern, etc.), reroutes the problem to the Coder or Designer as appropriate, and injects targeted suggestions based on failure logs.
  5. Iterative Optimization: AKG executes iterative search using an island model. Multiple parallel populations of candidate solutions evolve via stratified sampling, inspiration selection, and periodic migration. The best implementation across all islands is retained.
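The verification criterion in step 3 translates directly into code. The sketch below follows the error definition above; the default tolerance values are illustrative placeholders, since the paper describes $\tau$ as datatype-specific.

```python
def elementwise_errors(y_gen, y_ref, eps=1e-6):
    """Relative error where the reference is non-negligible (|y_ref| > eps),
    absolute error otherwise -- matching the cases in the e_i definition."""
    errs = []
    for g, r in zip(y_gen, y_ref):
        if abs(r) > eps:
            errs.append(abs(g - r) / abs(r))
        else:
            errs.append(abs(g - r))
    return errs

def verifier_passes(y_gen, y_ref, tau=1e-3, eps=1e-6):
    """Accept iff the fraction of elements with error above tau is at most tau."""
    errs = elementwise_errors(y_gen, y_ref, eps)
    return sum(e > tau for e in errs) / len(errs) <= tau
```

Note that the same $\tau$ bounds both the per-element error and the allowed fraction of failing elements, as in the acceptance rule above.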

Key Algorithms:

  • Adaptive Routing: Orchestrates task assignment to Coder or Designer based on error classification.
  • Hierarchical Retrieval: Embedding-driven selection of relevant code/documentation for Coder, enforcing semantic consistency with shape/operator/hardware features.
  • Island Model: Parallel evolution of candidate sketches and code generations.
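The island model can be sketched as follows, assuming placeholder `mutate` and `score` callables standing in for LLM-driven candidate generation and benchmarking; the function name and parameter defaults are illustrative, not the framework's actual interface.

```python
def island_search(seed, mutate, score, K=4, R=8, P=4, migrate_every=2):
    """Parallel evolution of K island populations over R rounds, keeping the
    top-P candidates per island and periodically migrating elites in a ring."""
    islands = [[seed] for _ in range(K)]
    for r in range(1, R + 1):
        for pop in islands:
            parent = max(pop, key=score)                  # inspiration selection
            pop.extend(mutate(parent) for _ in range(P))  # P new candidates
            pop.sort(key=score, reverse=True)
            del pop[P:]                                   # keep top-P per island
        if r % migrate_every == 0:
            # Elite migration: each island receives its ring neighbour's best.
            bests = [pop[0] for pop in islands]
            for k in range(K):
                islands[k].append(bests[(k - 1) % K])
    # Best implementation across all islands is retained.
    return max((pop[0] for pop in islands), key=score)
```

Stratified sampling of parents, as used by AKG, would replace the greedy `max` selection here; the ring-topology migration is one common choice among several.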

4. Supported Domain-Specific Languages and Portability Mechanisms

AKG Kernel Agent is architected for cross-platform compatibility. Its DocSpec-driven design supports:

  • Triton (CUDA, Ascend NPUs)
  • TileLang (CUDA)
  • CUDA-C
  • C++ with framework wrappers
  • AscendC (Du et al., 29 Dec 2025)

Portability Ensured by:

  • Unified Sketch IR: Hardware-agnostic, semantically annotated intermediate representation.
  • DocSpec Packages: Four-part schema for each DSL containing basic concepts, API catalog, expert suggestions, and reference examples.
  • Automated DocSpec Generation: Raw hardware/DSL documentation is preprocessed into agent-readable DocSpecs.
  • Dynamic Loading: Agents selectively load relevant documentation segments at runtime.

Addition of new DSLs or hardware backends involves supplying a new DocSpec with associated guidance, instantly leveraging the retrieval and generation pipelines without invasive code changes.
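The four-part DocSpec schema and the register-and-load workflow might look like the following sketch; the class and method names are illustrative assumptions, and the substring match stands in for the embedding-based retrieval the framework actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DocSpec:
    """Four-part schema per DSL: basic concepts, API catalog,
    expert suggestions, and reference examples."""
    dsl: str
    basic_concepts: str = ""
    api_catalog: Dict[str, str] = field(default_factory=dict)  # signature -> doc
    expert_suggestions: List[str] = field(default_factory=list)
    reference_examples: List[str] = field(default_factory=list)

class DocStore:
    def __init__(self):
        self._specs: Dict[str, DocSpec] = {}

    def register(self, spec: DocSpec):
        """Adding a DSL/backend needs only a DocSpec -- no controller changes."""
        self._specs[spec.dsl] = spec

    def load_segment(self, dsl: str, query: str) -> List[str]:
        """Dynamic loading: return only the API entries relevant to the query."""
        spec = self._specs[dsl]
        return [doc for sig, doc in spec.api_catalog.items() if query in sig]
```

The point of the sketch is the extension path: a new backend is a data-only addition (one `register` call), which is what keeps the retrieval and generation pipelines unchanged.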

5. Optimization, Feedback, and Evaluation

AKG employs a multi-round, parallel, and iterative search for optimal kernel implementations:

  • Parallel Search (“Island Model”): $K$ islands run $R$ rounds of $P$ parallel candidates per island, employing stratified sampling, elite migration, and benchmarking against ground truth.
  • Verifier and Metrics: correctness is measured by the pass@k rate and speedup by the geometric mean:

$$\text{pass}@k = \frac{1}{N}\sum_{i=1}^{N}\left[1 - \frac{\binom{n-c_i}{k}}{\binom{n}{k}}\right]$$

and

$$\mathrm{GM_{speedup}} = \exp\left(\frac{1}{n}\sum_i \ln s_i\right)$$

for $n$ kernels.
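Both metrics above implement directly. In the pass@k estimator, `n_samples` is the number of generations per problem and `correct_counts` holds each problem's count of correct samples $c_i$:

```python
import math

def pass_at_k(n_samples, correct_counts, k):
    """Unbiased pass@k: mean over problems of 1 - C(n - c_i, k) / C(n, k)."""
    total = 0.0
    for c in correct_counts:
        if n_samples - c < k:
            total += 1.0  # any draw of k samples must include a correct one
        else:
            total += 1.0 - math.comb(n_samples - c, k) / math.comb(n_samples, k)
    return total / len(correct_counts)

def gm_speedup(speedups):
    """Geometric mean of per-kernel speedups: exp of the mean log speedup."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is the appropriate aggregate here because speedups are ratios: a 2× gain on one kernel and a 0.5× regression on another average to exactly 1×.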

Empirical Results on KernelBench (Du et al., 29 Dec 2025):

Backend-DSL      GM Speedup   ≥0.8× Baseline (%)   ≥1.0× Baseline (%)   pass@4 (Dynamic Shapes)
Triton-Ascend    1.46×        77.6                 65.5                 85.4
Triton-CUDA      1.06×        79.0                 68.0                 90.9
CPP-CPU          1.04×        64.8                 54.9                 --

AKG demonstrates robust correctness (pass@4 ≈ 85–91%) and speedup (up to 1.46× over PyTorch eager, with much higher outliers for specific operator classes).

6. Modularity, Extensibility, and Analysis

AKG Kernel Agent's document-driven, modular architecture allows rapid integration of new DSLs and backends simply by supplying updated DocSpecs—no retraining or core changes required. The Unified Sketch approach decouples high-level algorithm and optimization intent from syntax, enabling one design to support multiple targets.

Insights:

  • Portability vs. Peak Performance: While AKG’s auto-generated kernels frequently surpass naive eager implementations and approach hand-tuned kernels for a wide range of ops (notably fusion and data-movement–dominated cases), vendor libraries such as cuBLAS or cuDNN retain an edge on heavily optimized conv/GEMM operations.
  • Quality Dependence: Efficacy of the code generator is strongly correlated with the relevance and completeness of documentation and reference examples in the system’s DocSpecs.
  • Scaling Tradeoffs: Parallel, search-based optimization increases LLM calls and benchmarking overhead, but consistently yields higher speedups compared to single-shot approaches. Potential future directions include the use of lightweight cost models or learned predictors to prune infeasible candidates before full synthesis and testing.

Limitations and Future Work:

  • Convolution and sliding-window patterns in non-vendor DSLs remain challenging and would benefit from enhanced sketch representations (e.g., tensor core/Winograd primitives).
  • Reinforcement-learning and reward-shaping strategies could further accelerate convergence.
  • Automated DocSpec generation and expansion would permit scaling to dozens of DSLs and hardware variants.

AKG Kernel Agent is situated within a new class of agentic kernel generation frameworks. Comparable systems include:

  • GEAK: Agentic framework for Triton-based kernel generation targeting AMD Instinct GPUs, featuring an explicit Reflexion-style reasoning loop and parallel candidate sampling. Demonstrates up to 63% execution accuracy and 2.59× speedup in TritonBench evaluations (Wang et al., 31 Jul 2025).
  • PRAGMA: Incorporates fine-grained hardware profiling in its reasoning loop, with performance-driven agent feedback and historical best-version tracking, demonstrating speedups up to 2.81× on CPUs and 2.3× on GPUs (Lei et al., 9 Nov 2025).
  • STARK: Implements a multi-agent, context-driven strategic optimization loop, maintaining a search tree and leveraging grounded instruction passing to achieve up to 16× improvements on GPU kernels (Dong et al., 19 Oct 2025).

These frameworks, along with AKG Kernel Agent, represent a significant trend toward the systematic automation of kernel synthesis and optimization via multi-agent, LLM-driven systems equipped with formally encoded knowledge and runtime profiles.

