KForge: Agent-Driven GPU Kernel Synthesis
- KForge is a platform-agnostic, agent-driven framework that automates the synthesis of high-performance GPU kernels for diverse AI hardware.
- It employs two collaborative LLM agents: one for kernel generation with iterative refinement and another for performance analysis and optimization guidance.
- Empirical results demonstrate significant speedups and high correctness rates, with effective cross-platform adaptation through single-shot knowledge transfer.
KForge is a platform-agnostic, agent-driven framework for program synthesis of GPU kernels targeting a broad range of AI hardware accelerators. It leverages two collaborative LLM agents—one for kernel generation and iterative refinement, another for performance analysis and optimization guidance—to automate synthesis of high-performance, functionally correct GPU kernels across divergent compiler and hardware ecosystems such as NVIDIA CUDA and Apple Metal. KForge’s architecture is explicitly designed for rapid adaptation to new platforms, requiring only a single-shot example or reference implementation to bootstrap knowledge transfer and delivering both correctness and meaningful performance gains without platform-specific human intervention (Sereda et al., 17 Nov 2025).
1. System Architecture: Two Collaborative LLM Agents
KForge introduces a modular agentic structure in which kernel synthesis and optimization are decomposed into two distinct, tightly coupled LLM-based agents:
- Generation Agent ($G$):
This agent accepts a textual prompt (task description, single-shot example, and optional reference implementation), the previous kernel $k_{i-1}$, and the performance recommendation $r_{i-1}$, synthesizing a new candidate kernel $k_i$. The agent is responsible for emitting the initial kernel, handling compilation and correctness errors by refining the code, and subsequently incorporating performance recommendations for further optimization.
- Performance Analysis Agent ($P$):
Operating on the latest synthesized program $k_i$, a textual directive (what metrics to analyze), and multiple views of profiling data $d_i$ (including CSV/API exports or screenshots), this agent parses raw profiler or GUI output, identifies principal performance bottlenecks (such as suboptimal occupancy or memory bandwidth utilization), and emits a single actionable recommendation $r_i$ for the generator.
- Feedback Loop Workflow:
At each iteration $i$, $G$ generates $k_i$, which is compiled and tested for correctness. Upon passing, profiling data $d_i$ are collected and submitted to $P$, which returns a performance recommendation $r_i$. This recommendation is then appended to the next input of $G$, closing the loop between functional correctness and targeted optimization (a code sketch of this loop appears below). The iterative workflow is thus formalized as:

$$k_i = G(\text{prompt},\, k_{i-1},\, r_{i-1}), \qquad r_i = P(k_i,\, d_i).$$
This architecture enables a two-phase agentic workflow: a functional pass for correctness and an optimization pass for performance (Sereda et al., 17 Nov 2025).
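A minimal sketch of this loop, with the agents and tooling injected as callables; the function names, signatures, and retry policy are illustrative assumptions, not the paper's interfaces:

```python
from typing import Callable, Optional, Tuple

# Hypothetical sketch of the feedback loop formalized above
# (k_i = G(prompt, k_{i-1}, r_{i-1}); r_i = P(k_i, d_i)). The injected callables
# and their signatures are illustrative stand-ins, not the paper's actual API.
def kforge_loop(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],  # generation agent G
    check: Callable[[str], Tuple[bool, str]],                      # compile + correctness test
    profile: Callable[[str], str],                                 # profiling data d_i
    recommend: Callable[[str, str], str],                          # analysis agent P -> r_i
    max_iters: int = 10,
) -> Optional[str]:
    kernel: Optional[str] = None
    recommendation: Optional[str] = None
    for _ in range(max_iters):
        kernel = generate(prompt, kernel, recommendation)          # k_i
        ok, feedback = check(kernel)
        if not ok:
            # Functional pass: feed compiler/test errors back into the prompt.
            prompt += f"\nPrevious attempt failed:\n{feedback}"
            recommendation = None
            continue
        # Optimization pass: one actionable recommendation per iteration.
        recommendation = recommend(kernel, profile(kernel))        # r_i
    return kernel
```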
2. Iterative Refinement: Functional and Optimization Passes
KForge’s synthesis is organized as a dual-phase iterative refinement loop:
Functional Pass
- Prompt Construction: The base prompt consists of the high-level task (e.g., “implement RMSNorm”), representative API usage, and a one-shot vector-addition example in both PyTorch and the target kernel language.
- Kernel Synthesis and Compilation: The generator issues an initial candidate $k_0 = G(\text{prompt})$. Failures during compilation trigger feedback via the compiler error output $e_i$, which is appended to the prompt, yielding a revised candidate $k_{i+1}$ upon subsequent invocation.
- Runtime and Functional Testing: Synthesized models are evaluated on random test inputs against a PyTorch reference. Mismatches are incorporated into the next prompt until correctness is achieved.
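The runtime and functional test can be pictured as a comparison against the PyTorch reference on random inputs. The sketch below is illustrative only; the tolerances, trial count, and device are assumptions rather than values from the paper:

```python
import torch

# Hedged sketch of the functional-correctness gate: run the synthesized kernel
# (wrapped as a Python callable) and the PyTorch reference on random inputs
# and compare outputs. Tolerances, trial count, and device are assumed values.
def is_functionally_correct(candidate, reference, input_shapes,
                            trials=5, rtol=1e-3, atol=1e-3, device="cuda"):
    for _ in range(trials):
        inputs = [torch.randn(*shape, device=device) for shape in input_shapes]
        with torch.no_grad():
            expected = reference(*inputs)
            actual = candidate(*inputs)
        if not torch.allclose(actual, expected, rtol=rtol, atol=atol):
            return False
    return True
```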
Optimization Pass
- Profiling Data Acquisition:
- On CUDA, profiling uses the CUDA Profiler API and Nsight Systems, producing CSVs of per-kernel execution time $t_{\text{exec}}$, memory bandwidth $B$, occupancy $O$, and launch overhead $t_{\text{launch}}$.
- On Metal, this stage leverages Xcode Instruments automation to capture screenshots of per-kernel memory and timeline data.
- Performance Analysis and Recommendation: Profiling artifacts are passed to $P$, which deduces bottlenecks and returns a concise, precise recommendation $r_i$ (e.g., “increase threadgroup size from 64 to 256 and unroll inner loop by factor of 4”).
- Prompt Update for Generator: This recommendation is added as a directive to the subsequent prompt, guiding $G$ in the next iteration; a sketch of condensing raw profiler output for $P$ appears below.
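One plausible way to condense structured profiler exports into a compact summary for $P$ is sketched below; the CSV column names are assumptions about the export format, not documented Nsight Systems fields:

```python
import csv
from collections import defaultdict

# Hedged sketch: aggregate a profiler CSV export into a short textual summary
# for the analysis agent. The column names "Kernel Name" and "Duration (ns)"
# are assumed, not guaranteed fields of any particular profiler export.
def summarize_profile(csv_path: str, top_k: int = 5) -> str:
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["Kernel Name"]] += float(row["Duration (ns)"])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    lines = [f"{name}: {ns / 1e6:.3f} ms total" for name, ns in ranked]
    return "Top kernels by total GPU time:\n" + "\n".join(lines)
```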
Performance Metrics:
Baseline ($t_{\text{base}}$) and generated kernel ($t_{\text{gen}}$) run times are used to compute the per-problem speedup:

$$s = \frac{t_{\text{base}}}{t_{\text{gen}}}.$$
The system evaluates the fraction of kernels that are both correct and fast:

$$\text{fast}_p = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\text{correct}_i \wedge s_i > p\right].$$

For example, $\text{fast}_1$ is the rate of on-par (or better) performance, and higher thresholds such as $\text{fast}_{1.5}$ capture the superior-performance rate.
The objectives can be informally described as minimizing end-to-end latency $t_{\text{gen}}$, subject to thresholds on occupancy $O$ and bandwidth $B$ (Sereda et al., 17 Nov 2025).
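A minimal illustration of these metrics, assuming per-problem results are available as (correct, t_base, t_gen) tuples; the data format is an assumption for illustration:

```python
# Sketch of the metrics above: per-problem speedup s = t_base / t_gen, and the
# fraction fast_p of problems that are both correct and faster than threshold p.
def speedup(t_base: float, t_gen: float) -> float:
    return t_base / t_gen

def fast_p(results, p: float = 1.0) -> float:
    # `results` is an assumed list of (correct, t_base, t_gen) tuples.
    hits = sum(1 for correct, t_base, t_gen in results
               if correct and speedup(t_base, t_gen) > p)
    return hits / len(results)
```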
3. Cross-Platform Synthesis and Knowledge Transfer
A signature feature is KForge’s platform-agnostic design and its practical realization of knowledge transfer:
- Reference Implementations:
When synthesizing kernels for Apple Metal, the prompt includes an existing CUDA implementation. The generator agent demonstrates the ability to analogize between semantics, mapping CUDA's block and thread indices (e.g., blockIdx.x, blockDim.x, threadIdx.x) to Metal's threadgroup and synchronization primitives; a prompt-assembly sketch appears at the end of this section.
- Adaptation Example:
Kernel translation involves parsing the CUDA AST and systematically replacing CUDA intrinsics and launch configuration with their Metal equivalents, such as:
function translate_cuda_to_metal(cuda_src):
    parse cuda_src into AST
    for each kernel in AST:
        rename __global__ → kernel
        replace blockIdx.x, blockDim.x → threadgroup_position_in_grid.x, threads_per_threadgroup.x
        replace threadIdx.x → thread_position_in_threadgroup.x
        replace __syncthreads() → threadgroup_barrier(mem_flags::mem_threadgroup)
        map exp() → fast::exp()
        map __shfl_xor_sync() → simd_shuffle_xor()
    emit Metal-flavored source
- Effectiveness:
Metal kernels synthesized with a CUDA reference in the prompt attain up to 20–30% higher correctness rates and require fewer refinement iterations than kernels synthesized without a reference implementation.
This cross-system capability empirically validates the ability of LLM agents to transfer platform-dependent knowledge through a shared interface (Sereda et al., 17 Nov 2025).
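A hypothetical sketch of how single-shot knowledge transfer could be packaged into the generation prompt; the template wording is illustrative, not the paper's actual prompt:

```python
# Hypothetical prompt assembly for cross-platform transfer: the Metal request
# embeds a one-shot example plus a working CUDA reference implementation.
# The template text is an illustrative assumption, not the paper's prompt.
def build_metal_prompt(task_description: str, one_shot_example: str,
                       cuda_reference: str) -> str:
    return (
        f"Task: {task_description}\n\n"
        f"One-shot example (vector addition in PyTorch and Metal):\n"
        f"{one_shot_example}\n\n"
        f"Reference CUDA implementation of the target operator:\n"
        f"{cuda_reference}\n\n"
        "Produce a functionally equivalent Metal kernel with host-side launch code."
    )
```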
4. Evaluation Methodology and Empirical Results
KForge is benchmarked on KernelBench, comprising 250 diverse PyTorch module workloads, partitioned by complexity:
- Level 1: simple kernels,
- Level 2: fusion candidates,
- Level 3: full networks.
The testbed utilizes four NVIDIA H100 SXM5 GPUs for CUDA and five Mac Studio M4 Max 32-core GPU nodes for Metal. Thirty Metal-incompatible ops are excluded, yielding 220 valid Metal targets. Baselines include PyTorch eager mode and torch.compile where available.
Measurement Protocol
Each kernel undergoes 100 execution runs per trial (with 10 warmups), resetting the compilation context each time. Median values of $t_{\text{base}}$ and $t_{\text{gen}}$ are collected.
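A sketch of this measurement protocol is shown below; synchronization assumes a CUDA device (on Apple silicon, torch.mps.synchronize() would play the same role), and resetting the compilation context is elided:

```python
import statistics
import time
import torch

# Hedged sketch of the timing protocol: warm-up runs followed by timed runs,
# reporting the median wall-clock latency in milliseconds.
def median_latency_ms(fn, inputs, warmup: int = 10, runs: int = 100) -> float:
    for _ in range(warmup):
        fn(*inputs)
    torch.cuda.synchronize()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*inputs)
        torch.cuda.synchronize()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```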
Quantitative Results
- CUDA Iterative Refinement (openai-gpt-5 baseline):
- Level 1:
- Level 2:
- Level 3: ,
- CUDA + Profiling Info (Level 3):
- improves from 7% to 11%
- Metal with CUDA Reference + Profiling:
| Model | L1 | L2 | L3 | L1 | L2 | L3 |
|---------------|-----|-----|-----|-------|-------|------|
| claude-opus-4 | 13% | 50% | 16% | 6.6%  | 15.4% | 2.0% |
| openai-o3     | 42% | 47% | 28% | 8.8%  | 11.5% | 4.0% |
| openai-gpt-5  | 54% | 76% | 44% | 16.5% | 25.6% | 4.0% |
A case study of a SqueezeNetFire batch-size sweep reveals that at smaller batches (8–32), KForge achieves 1.2–3.9× speedups over eager mode and torch.compile, while for larger batches (64–128), torch.compile surpasses KForge on throughput, though KForge continues to match or exceed eager-mode performance. Percentages of problems with speedup above fixed thresholds are used for overall reporting (Sereda et al., 17 Nov 2025).
5. Analysis of Contributions, Limitations, and Prospective Directions
Main Contributions
- Iterative LLM-Based Refinement: KForge replicates expert kernel engineering via a two-phase, LLM-driven refinement loop integrating both correctness and structured performance optimization.
- Agentic Modularity and Multimodal Inputs: The strict division between code generation and performance analysis enables systematic ingestion of both structured data (CSV/API) and unstructured inputs (GUI screenshots).
- Cross-Platform Generality: KForge is empirically shown to synthesize correct, performant kernels across fundamentally different accelerators (NVIDIA CUDA, Apple Metal), leveraging single-shot knowledge transfer.
Limitations
- Profiling data complexity presents challenges: deriving optimal recommendations from hundreds of raw metrics can result in noisy or suboptimal proposals.
- Iterative refinement tends to yield incremental optimizations; discovering new algorithmic strategies remains difficult.
- Default tensor sizes are relatively small in benchmarks, making launch overhead disproportionately prominent and masking downstream compute/memory improvements.
Future Directions
- Support for Training Synthesis: Extending the framework to generate backward passes for training workloads.
- Leveraging IR Layers: Conditioning on compiler IRs (e.g., Triton, LLVM) to exploit expert-system optimizations.
- Formal Verification Integration: Ensuring generalization across tensor shapes and data types via formal guarantees.
- Richer Performance Objectives: Developing finer-grained, continuous speedup or roofline-guided reward metrics.
In sum, KForge represents the first LLM-driven, platform-agnostic kernel synthesis system to integrate collaborative agentic reasoning between code generation and profiling analysis. It delivers high correctness rates (over 90% on simple/fusion kernels), substantial speedups (over 20% of Level 3 kernels exceed 1.5× baseline performance), and achieves cross-platform transfer with single-shot references, obviating the need for hand-crafted, accelerator-specific inputs (Sereda et al., 17 Nov 2025).