
KForge: Agent-Driven GPU Kernel Synthesis

Updated 24 November 2025
  • KForge is a platform-agnostic, agent-driven framework that automates the synthesis of high-performance GPU kernels for diverse AI hardware.
  • It employs two collaborative LLM agents: one for kernel generation with iterative refinement and another for performance analysis and optimization guidance.
  • Empirical results demonstrate significant speedups and high correctness rates, with effective cross-platform adaptation through single-shot knowledge transfer.

KForge is a platform-agnostic, agent-driven framework for program synthesis of GPU kernels targeting a broad range of AI hardware accelerators. It leverages two collaborative LLM agents—one for kernel generation and iterative refinement, another for performance analysis and optimization guidance—to automate synthesis of high-performance, functionally correct GPU kernels across divergent compiler and hardware ecosystems such as NVIDIA CUDA and Apple Metal. KForge’s architecture is explicitly designed for rapid adaptation to new platforms, requiring only a single-shot example or reference implementation to bootstrap knowledge transfer and delivering both correctness and meaningful performance gains without platform-specific human intervention (Sereda et al., 17 Nov 2025).

1. System Architecture: Two Collaborative LLM Agents

KForge introduces a modular agentic structure in which kernel synthesis and optimization are decomposed into two distinct, tightly coupled LLM-based agents:

  • Kernel Generation Agent ($F$):

This agent accepts a textual prompt $p$ (task description, single-shot example, and optional reference implementation), the previous kernel $k_{t-1}$, and a performance recommendation $r_{t-1}$, synthesizing a new candidate kernel $k_t$. The agent is responsible for emitting the initial kernel, handling compilation and correctness errors by refining the code, and subsequently incorporating performance recommendations for further optimization.

  • Performance Analysis Agent ($G$):

Operating on the latest synthesized program $k$, a textual directive $o$ (what metrics to analyze), and multiple views of profiling data $\{v^0, \ldots, v^n\}$ (including CSV/API exports or screenshots), this agent parses raw profiler or GUI output, identifies principal performance bottlenecks (such as suboptimal occupancy or memory bandwidth utilization), and emits a single actionable recommendation $r$ for the generator.

  • Feedback Loop Workflow:

At each iteration $t$, $F$ generates $k_t$, which is compiled and tested for correctness. Upon passing, profiling data are collected and submitted to $G$, which returns a performance recommendation $r_t$. This recommendation is then appended to the next input of $F$, closing the loop between functional correctness and targeted optimization. The iterative workflow is thus formalized as:

$$k_t \leftarrow F(p,\, k_{t-1},\, r_{t-1}), \qquad r_t \leftarrow G(o,\, k_t,\, \{v^0, \ldots, v^n\})$$

This architecture enables a two-phase agentic workflow: a functional pass for correctness and an optimization pass for performance (Sereda et al., 17 Nov 2025).
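
This loop can be sketched compactly in code. The following is a minimal, illustrative sketch (not KForge's actual implementation); `generate`, `compile_and_test`, `profile`, and `analyze` are hypothetical callables standing in for the generator agent $F$, the build/test harness, the profiler, and the analysis agent $G$:

```python
# Minimal sketch of KForge's two-agent feedback loop. All four callables
# are hypothetical stand-ins; in the paper, F and G are LLM agents.
def kforge_loop(prompt, generate, compile_and_test, profile, analyze,
                max_iters=10):
    kernel, rec = None, None
    for _ in range(max_iters):
        # F: synthesize k_t from the prompt, the previous kernel, and the
        # latest recommendation r_{t-1}.
        kernel = generate(prompt, prev_kernel=kernel, rec=rec)

        # Functional pass: compile and test; feed errors/mismatches back.
        ok, feedback = compile_and_test(kernel)
        if not ok:
            prompt += f"\nFix the following error:\n{feedback}"
            rec = None
            continue

        # Optimization pass: G distills profiling views v^0..v^n into a
        # single actionable recommendation r_t.
        views = profile(kernel)  # CSV exports and/or GUI screenshots
        rec = analyze("identify the main bottleneck", kernel, views)
    return kernel
```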

2. Iterative Refinement: Functional and Optimization Passes

KForge’s synthesis is organized as a dual-phase iterative refinement loop:

Functional Pass

  1. Prompt Construction: The base prompt $p_0$ consists of the high-level task (e.g., “implement RMSNorm”), representative API usage, and a one-shot vector-addition example in both PyTorch and the target kernel language.
  2. Kernel Synthesis and Compilation: The generator issues $k_1 \leftarrow F(p_0)$. Failures during compilation trigger feedback via compiler error output $e_c$, appended to the prompt, yielding $k_2$ upon subsequent invocation.
  3. Runtime and Functional Testing: Synthesized models are evaluated on random test inputs against a PyTorch reference. Mismatches $\delta$ are incorporated into the next prompt until correctness is achieved (a minimal harness is sketched below).
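
A minimal version of this correctness check might look as follows (the input shape, trial count, and tolerance are illustrative assumptions, not the paper's settings):

```python
import torch

def check_correctness(candidate_fn, reference_mod, n_trials=5, tol=1e-3):
    """Compare a synthesized kernel against the PyTorch reference on
    random inputs; returns None on success, or a mismatch report delta."""
    for _ in range(n_trials):
        x = torch.randn(8, 1024)  # illustrative shape
        expected = reference_mod(x)
        actual = candidate_fn(x)
        if not torch.allclose(actual, expected, atol=tol, rtol=tol):
            max_err = (actual - expected).abs().max().item()
            return f"output mismatch: max abs error {max_err:.3e}"
    return None  # kernel is functionally correct
```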

Optimization Pass

  1. Profiling Data Acquisition:
    • On CUDA, profiling uses cudaProfilerApi and Nsight Systems, producing CSVs of per-kernel execution time $T_k$, memory bandwidth $B$, occupancy $O$, and launch overhead $T_o$.
    • On Metal, this stage leverages Xcode Instruments automation to capture screenshots of per-kernel memory and timeline data.
  2. Performance Analysis and Recommendation: Profiling artifacts are passed to $G$, which deduces bottlenecks and returns a concise, precise recommendation (e.g., “increase threadgroup size from 64 to 256 and unroll inner loop by factor of 4”).
  3. Prompt Update for Generator: This recommendation is added as a directive to the subsequent prompt, guiding $F$ in the next iteration.
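
On the CUDA side, a structured per-kernel view can also be gathered programmatically; the sketch below uses `torch.profiler` as an accessible stand-in for the paper's cudaProfilerApi/Nsight Systems pipeline (not KForge's actual tooling):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def collect_profile_view(candidate_fn, x):
    # Requires a CUDA device; produces one tabular "view" v^i of
    # per-kernel GPU time, analogous to an Nsight CSV export.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        candidate_fn(x)
        torch.cuda.synchronize()
    return prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)
```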

Performance Metrics:

Baseline ($t_b$) and generated kernel ($t_g$) run times are used to compute per-problem speedup:

$$\text{speedup}_i = \frac{t_b}{t_g}$$

The system evaluates the fraction of correct and fast kernels:

$$\text{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\text{correct}_i \wedge (\text{speedup}_i > p)\right]$$

For example, $\text{fast}_1$ is the rate of on-par performance, and $\text{fast}_{1.5}$ is the superior-performance rate.

The objectives can be informally described as minimizing end-to-end latency $T_{\text{total}} = T_o + T_c + T_m$ (launch, compute, and memory time, respectively), subject to thresholds on occupancy and bandwidth (Sereda et al., 17 Nov 2025).
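
Computing $\text{fast}_p$ from per-problem results is mechanical; a small illustration with made-up numbers:

```python
def fast_p(correct, speedup, p):
    """Fraction of problems that are both correct and more than p times
    faster than baseline (the fast_p metric defined above)."""
    return sum(c and s > p for c, s in zip(correct, speedup)) / len(correct)

# Made-up results for three problems (speedup_i = t_b / t_g):
correct = [True, True, False]
speedup = [0.9, 1.2, 2.0]
print(fast_p(correct, speedup, 1.0))   # 0.33: only the 1.2x kernel counts
print(fast_p(correct, speedup, 1.5))   # 0.0: the 2.0x kernel is incorrect
```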

3. Cross-Platform Synthesis and Knowledge Transfer

A signature feature is KForge’s platform-agnostic design and its practical realization of knowledge transfer:

  • Reference Implementations:

When synthesizing kernels for Apple Metal, the prompt includes an existing CUDA implementation. The generator agent demonstrates the ability to analogize between semantics, mapping CUDA's block and thread indices (e.g., `blockIdx`, `threadIdx`, `__syncthreads()`) to Metal's threadgroup and synchronization primitives.

  • Adaptation Example:

Kernel translation involves parsing the CUDA AST and systematically substituting CUDA intrinsics and launch config for Metal equivalents, such as:

```
function translate_cuda_to_metal(cuda_src):
    parse cuda_src into AST
    for each kernel in AST:
        rename __global__ -> kernel
        replace blockIdx.x, blockDim.x -> threadgroup_position_in_grid.x, threads_per_threadgroup.x
        replace threadIdx.x -> thread_position_in_threadgroup.x
        replace __syncthreads() -> threadgroup_barrier(mem_flags::mem_threadgroup)
        map exp() -> fast::exp()
        map __shfl_xor_sync() -> simd_shuffle()
    emit Metal-flavored source
```
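
As a toy illustration of such a substitution table (hypothetical code; a faithful translator operates on the AST and must also rewrite launch syntax and address-space qualifiers, which this ignores):

```python
import re

# Truncated CUDA -> Metal rewrite table from the pseudocode above.
CUDA_TO_METAL = [
    (r"__global__", "kernel"),
    (r"\bblockIdx\.x\b", "threadgroup_position_in_grid.x"),
    (r"\bblockDim\.x\b", "threads_per_threadgroup.x"),
    (r"\bthreadIdx\.x\b", "thread_position_in_threadgroup.x"),
    (r"__syncthreads\(\)", "threadgroup_barrier(mem_flags::mem_threadgroup)"),
]

def translate_naive(cuda_src: str) -> str:
    for pattern, replacement in CUDA_TO_METAL:
        cuda_src = re.sub(pattern, replacement, cuda_src)
    return cuda_src
```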

  • Effectiveness:

Metal kernels synthesized with a CUDA reference attain 20–30% higher correctness rates and require fewer refinement iterations than those synthesized without a reference implementation.

This cross-system capability empirically validates the ability of LLM agents to transfer platform-dependent knowledge through a shared interface (Sereda et al., 17 Nov 2025).

4. Evaluation Methodology and Empirical Results

KForge is benchmarked on KernelBench, comprising 250 diverse PyTorch module workloads, partitioned by complexity:

  • Level 1: simple kernels,
  • Level 2: fusion candidates,
  • Level 3: full networks.

The testbed utilizes four NVIDIA H100 SXM5 GPUs for CUDA and five Mac Studio M4 Max 32-core GPU nodes for Metal. Thirty Metal-incompatible ops are excluded, yielding 220 valid Metal targets. Baselines include PyTorch eager mode and torch.compile where available.

Measurement Protocol

Each kernel undergoes 100 execution runs per trial (with 10 warmups), resetting the compilation context each time. Median values of $t_g$ and $t_b$ are collected.
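
A timing harness consistent with this protocol might look like the following (CUDA-event timing is one reasonable implementation choice, not necessarily the paper's):

```python
import statistics
import torch

def median_runtime_ms(fn, x, warmup=10, runs=100):
    # Warm up to exclude compilation and cache effects from measurement.
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return statistics.median(times)

# speedup_i = median_runtime_ms(baseline_fn, x) / median_runtime_ms(generated_fn, x)
```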

Quantitative Results

  • CUDA Iterative Refinement (openai-gpt-5 baseline):
    • Level 1: $\text{fast}_1 \approx 60\%$
    • Level 2: $\text{fast}_1 \approx 45\%$
    • Level 3: $\text{fast}_1 \approx 35\%$, $\text{fast}_{1.5} \approx 20\%$
  • CUDA + Profiling Info (Level 3):
    • $\text{fast}_{1.5}$ improves from 7% to 11%
  • Metal with CUDA Reference + Profiling:

| Model | $\text{fast}_1$ L1 | $\text{fast}_1$ L2 | $\text{fast}_1$ L3 | $\text{fast}_{1.5}$ L1 | $\text{fast}_{1.5}$ L2 | $\text{fast}_{1.5}$ L3 |
|---------------|------|------|------|-------|-------|------|
| claude-opus-4 | 13%  | 50%  | 16%  | 6.6%  | 15.4% | 2.0% |
| openai-o3     | 42%  | 47%  | 28%  | 8.8%  | 11.5% | 4.0% |
| openai-gpt-5  | 54%  | 76%  | 44%  | 16.5% | 25.6% | 4.0% |

A case study of a SqueezeNet Fire-module batch-size sweep reveals that at smaller batches (8–32), KForge achieves 1.2–3.9× speedups over both eager mode and torch.compile, while at larger batches (64–128), torch.compile surpasses KForge on throughput, though KForge continues to match or exceed eager-mode performance. Overall reporting uses the percentage of problems whose speedup exceeds given thresholds (Sereda et al., 17 Nov 2025).

5. Analysis of Contributions, Limitations, and Prospective Directions

Main Contributions

  1. Iterative LLM-Based Refinement: KForge replicates expert kernel engineering via a two-phase, LLM-driven refinement loop integrating both correctness and structured performance optimization.
  2. Agentic Modularity and Multimodal Inputs: The strict division between code generation and performance analysis enables systematic ingestion of both structured data (CSV/API) and unstructured inputs (GUI screenshots).
  3. Cross-Platform Generality: KForge is empirically shown to synthesize correct, performant kernels across fundamentally different accelerators (NVIDIA CUDA, Apple Metal), leveraging single-shot knowledge transfer.

Limitations

  • Profiling data complexity presents challenges: deriving optimal recommendations from hundreds of raw metrics can result in noisy or suboptimal proposals.
  • Iterative refinement tends to yield incremental optimizations; discovering new algorithmic strategies remains difficult.
  • Default tensor sizes are relatively small in benchmarks, making launch overhead disproportionately prominent and masking downstream compute/memory improvements.

Future Directions

  1. Support for Training Synthesis: Extending the framework to generate backward passes for training workloads.
  2. Leveraging IR Layers: Conditioning on compiler IRs (e.g., Triton, LLVM) to exploit expert-system optimizations.
  3. Formal Verification Integration: Ensuring generalization across tensor shapes and data types via formal guarantees.
  4. Richer Performance Objectives: Developing finer-grained, continuous speedup or roofline-guided reward metrics.

In sum, KForge represents the first LLM-driven, platform-agnostic kernel synthesis system to integrate collaborative agentic reasoning between code generation and profiling analysis. It delivers high correctness rates (over 90% on simple/fusion kernels), substantial speedups (over 20% of Level 3 kernels exceed 1.5× baseline performance), and achieves cross-platform transfer with single-shot references, obviating the need for hand-crafted accelerator-specific inputs (Sereda et al., 17 Nov 2025).
