TritonForge GPU Kernel Optimization
- TritonForge is a profiling-guided, modular framework that automates LLM-driven optimizations of Triton GPU kernels using static analysis and runtime feedback.
- It integrates dynamic profiling data with iterative code transformation to achieve expert-competitive throughput improvements in ML and HPC workloads.
- Its closed-loop system combines kernel analysis, test generation, and automated error remediation to deliver significant speedups with minimal manual intervention.
TritonForge is a profiling-guided, modular framework for automated optimization of Triton GPU kernels in machine learning and high-performance computing workloads. By integrating static kernel analysis, runtime profiling, and iterative code transformation driven by LLMs, TritonForge formalizes and streamlines the process of accelerating Triton kernels for varied architectures. Central to this approach is a closed evaluation loop, leveraging dynamic feedback from hardware-level metrics to guide code-level optimizations and achieve expert-competitive throughput without manual intervention (Li et al., 9 Dec 2025).
1. Architecture and Workflow
TritonForge's system architecture comprises several discrete, interoperable modules, each addressing a distinct stage of the optimization pipeline. The core workflow proceeds as follows:
- Kernel Analysis: The system ingests a Triton kernel K and a hardware profile H, inspecting the code for tiling parameters, memory access patterns, and loop structures.
- Test Generation: An LLM agent generates unit and performance tests, annotating input tensors and marking relevant regions via NVTX labels for robust profiling.
- Runtime Profiling: The kernel is executed under profiling tools (e.g., Nsight Compute), producing a report R comprising latency, occupancy, bandwidth utilization, and stall metrics.
- Transformation Engine: A second LLM agent, equipped with fault-remediation capabilities, proposes transformed kernel variants K′, informed by K, H, and R.
- Build and Evaluation: The transformed kernel is compiled and executed; errors, if encountered, are remediated through further model prompts. New runtime metrics are collected.
- Arbiter and Hint Refinement: The Arbiter compares the new metrics R′ to the best-so-far metrics R★ and either accepts the variant, iterates, or terminates the search. Performance deltas are converted into targeted hints for the next transformation cycle.
This pipeline is represented by the following ASCII diagram:

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ 1. Kernel K │──▶│ 2. Test Gen │──▶│ 3. Profile  │──▶│ 4. Arbiter  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       ▲                                                       │
       │               ┌────────────────┐                      │
       └──────────────│ 5. Transformer │◀─────────────────────┘
                       └────────────────┘
```
2. Profiling Metrics and Bottleneck Diagnosis
Optimization decisions in TritonForge are driven by quantitative profiling data:
- Latency: Kernel-only duration, $t_{\text{kernel}}$.
- Bandwidth Utilization: $U_{\text{BW}} = \dfrac{\text{bytes moved}}{t_{\text{kernel}} \cdot \text{BW}_{\text{peak}}}$
- Compute Utilization: $U_{\text{FLOP}} = \dfrac{\text{FLOPs executed}}{t_{\text{kernel}} \cdot \text{FLOPs}_{\text{peak}}}$
- Occupancy: achieved warps relative to the hardware maximum, e.g., $\text{occ} = \dfrac{\text{warps}_{\text{active}}}{\text{warps}_{\text{max}}}$ per SM.
- Stall Reason Ranking: Nsight metrics such as “DRAM memory efficiency” and detailed stall categories (“membar,” “pipe_alu”) are ranked to prioritize transformation targets.
Bottleneck identification leverages these metrics:
- $U_{\text{BW}} \gg U_{\text{FLOP}}$ signals a memory-bound kernel.
- $U_{\text{FLOP}} \gg U_{\text{BW}}$, together with low occupancy, implies a compute-bound kernel or register-pressure issues.
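The diagnosis rules above can be sketched as a small classifier; the 0.5 occupancy threshold and the function name are illustrative assumptions, not values from the paper:

```python
def classify_bottleneck(bw_util, flop_util, occupancy, occ_threshold=0.5):
    """Toy bottleneck diagnosis following the rules above.

    bw_util / flop_util: fractions of peak bandwidth / peak FLOPs (0..1).
    occupancy: achieved warps / maximum warps per SM (0..1).
    occ_threshold is an illustrative cutoff, not from the paper.
    """
    if bw_util > flop_util:
        # Bandwidth dominates: the kernel is starved on memory traffic.
        return "memory-bound"
    if flop_util > bw_util and occupancy < occ_threshold:
        # Compute dominates but few warps are resident: likely
        # register pressure limiting occupancy.
        return "compute-bound or register pressure"
    return "compute-bound"
```

In TritonForge this decision would be fed to the transformation engine as a hint; here it simply returns a label.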
3. Code Transformation Methods
TritonForge utilizes a set of parameterized, hardware-aware code transformation strategies, selection of which is governed by LLM-driven cost models and heuristics:
A. Tiling
Loop tiling is employed to increase cache and register reuse:
- Cost heuristic: tile sizes are selected so that the tiles' combined working set fits within shared memory and the register file.
Pseudocode transformation:
```python
# Before
for i in range(0, M):
    for j in range(0, N):
        for k in range(0, K):
            C[i, j] += A[i, k] * B[k, j]

# After: tiled with tile sizes (T_M, T_N, T_K)
for i0 in range(0, M, T_M):
    for j0 in range(0, N, T_N):
        acc = tl.zeros((T_M, T_N), dtype=tl.float32)
        for k0 in range(0, K, T_K):
            A_tile = A[i0:i0+T_M, k0:k0+T_K]
            B_tile = B[k0:k0+T_K, j0:j0+T_N]
            acc += tl.dot(A_tile, B_tile)
        C[i0:i0+T_M, j0:j0+T_N] = acc
```
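The same restructuring in plain NumPy, as a runnable stand-in for illustration (`tl.dot` replaced by `@`; on a GPU the payoff is tile reuse from fast memory, while here the tiling only reorders the loops, so the result is unchanged):

```python
import numpy as np

def tiled_matmul(A, B, T_M=4, T_N=4, T_K=4):
    """Tiled matrix multiply mirroring the pseudocode above.

    Produces the same result as A @ B; the tile loops model the
    blocking a Triton kernel would use for cache/register reuse.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, T_M):
        for j0 in range(0, N, T_N):
            for k0 in range(0, K, T_K):
                # Accumulate one tile-product into the output tile.
                C[i0:i0+T_M, j0:j0+T_N] += (
                    A[i0:i0+T_M, k0:k0+T_K] @ B[k0:k0+T_K, j0:j0+T_N]
                )
    return C
```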
B. Memory Coalescing
Transforms strided accesses to contiguous block loads for improved throughput:
- Transformation: reorder the offset computation so the fastest-varying index walks the contiguous (stride-1) dimension of memory.

```python
# Before: adjacent program instances compute addresses stride_row
# apart -> strided, uncoalesced loads
idx = offs_m * stride_row + offs_k * stride_col
x = tl.load(src_ptr + idx)

# After (assuming the k dimension is contiguous, stride 1): adjacent
# program instances load adjacent elements -> fully coalesced
idx = offs_m[:, None] * stride_row + offs_k[None, :]
x = tl.load(src_ptr + idx)
C. Prefetching and Pipelining
Applies num_stages and num_warps to overlap memory operations with computation, favoring at least two stages if bandwidth stalls exceed 25%.
D. Vectorization
Replaces scalar loads with vectorized operations:
- Example: annotating offset tensors with `tl.multiple_of(offs, 4)` and `tl.max_contiguous(offs, 4)` lets the compiler emit wide, vectorized loads for `tl.load(ptr + offs, mask=...)`.
- Constrained by memory alignment and register availability.
E. Autotuning
Decorates kernels with @triton.autotune, enabling a runtime search over BLOCK_SIZE and num_warps to explore the configuration space.
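@triton.autotune benchmarks each listed config and caches the fastest; the plain-Python sketch below mimics that grid search, with `fake_runtime` as an illustrative stand-in for real measurements (the preferred config is invented for the example):

```python
import itertools

def autotune(configs, measure):
    """Return the config with the lowest measured runtime, as
    @triton.autotune does after benchmarking each candidate."""
    return min(configs, key=measure)

# Candidate launch configurations, analogous to triton.Config entries.
configs = [
    {"BLOCK_SIZE": bs, "num_warps": nw}
    for bs, nw in itertools.product([64, 128, 256], [4, 8])
]

# Stand-in cost model: pretends BLOCK_SIZE=128, num_warps=8 is fastest.
def fake_runtime(cfg):
    return abs(cfg["BLOCK_SIZE"] - 128) + 4 * abs(cfg["num_warps"] - 8)

best = autotune(configs, fake_runtime)
```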
4. LLM Integration
The framework is model-agnostic, interfacing with any code-generation LLM via standardized prompt templates and structured JSON I/O. Gemini-2.5-Pro was used predominantly but may be replaced by other code-centric models such as Codex or Claude Sonnet.
- Test Generation: LLM prompt to create three unit tests with NVTX-labeled bulk/edge cases.
- Kernel Optimization: LLM prompt provides hardware specs and profiling results, requesting code variants using tiling, prefetching, or coalesced loads.
- Remediation: On compilation or runtime errors, LLMs receive error tracebacks and generate repairs, such as resolving undefined names or correcting tensor shapes.
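The structured JSON I/O might be assembled as follows; the source does not specify the schema, so every field name here is a hypothetical illustration:

```python
import json

def build_optimization_request(kernel_src, hw_profile, profile_report):
    """Assemble a hypothetical structured prompt payload for the
    transformation LLM. The schema is invented for illustration."""
    return json.dumps({
        "task": "optimize_triton_kernel",
        "hardware": hw_profile,          # specs fed to the prompt
        "profiling": profile_report,     # Nsight-derived metrics
        "kernel": kernel_src,            # current kernel source
        "allowed_transforms": ["tiling", "prefetching", "coalescing"],
    })

req = build_optimization_request(
    "@triton.jit\ndef k(...): ...",
    {"gpu": "A100", "smem_kb": 192},
    {"latency_ms": 1.9, "bw_util": 0.82, "stall_top": "membar"},
)
```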
5. Iterative Evaluation and Search Loop
Optimization progresses through an iterative loop, formalized in the following algorithmic pseudocode:
```
Input: hardware profile H, initial kernel K, baseline profile R
K★, R★ ← K, R
for iter in 1 … max_iters:
    K′ ← LLM_Proposal(H, K, R)
    success ← try_compile_and_run(K′)
    if not success:
        K′ ← LLM_Remediation(K′, logs)
        if it still fails to compile: break
    R′ ← Profile(K′)
    decision ← Arbiter(R′, R★)
    if decision == accept: K★ ← K′; R★ ← R′
    if decision == finish: break
    hint ← make_hint(R, R′)
    K ← K′ with hint        # feed the refined prompt next round
    R ← R′
return K★, R★
```
Round-level feedback (performance deltas and error traces) dynamically inform each subsequent transformation proposal, yielding either acceptance, iteration, or exit.
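With the LLM and profiler stubbed out, the loop reduces to a greedy accept-if-faster search; the function below is an illustrative model of this behavior, not TritonForge's actual API:

```python
def optimize(kernel, baseline_latency, propose, profile,
             max_iters=5, min_gain=0.01):
    """Greedy accept-if-faster search mirroring the pseudocode above.

    propose(kernel, latency) stands in for LLM_Proposal plus hint
    refinement; profile(kernel) stands in for Nsight-based measurement.
    min_gain plays the Arbiter's "finish" role: stop once a round
    improves latency by less than 1%.
    """
    best_k, best_t = kernel, baseline_latency
    for _ in range(max_iters):
        cand = propose(best_k, best_t)
        t = profile(cand)
        gain = (best_t - t) / best_t
        if gain > 0:            # Arbiter: accept the faster variant
            best_k, best_t = cand, t
        if gain < min_gain:     # Arbiter: finish — no meaningful progress
            break
    return best_k, best_t
```

For example, with kernels modeled as integer "versions" whose latencies plateau, the loop accepts improvements and then halts once gains fall under `min_gain`.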
6. Empirical Evaluation and Benchmarks
TritonForge was assessed on the TritonBench suite, encompassing 184 real-world kernels across matrix operations, convolutions, and attention mechanisms. The main results are:
- Success criterion: achieving a measured speedup over the unmodified baseline kernel.
- Overall success rate: 71.2% (131/184 kernels compiled and profiled end-to-end).
- Mean speedup (successful kernels): 1.76×.
- Maximum observed speedup: substantially above the mean (anomalous outliers excluded).
- Success by kernel size:
| Category | Number | Success % | Avg Speedup |
|---|---|---|---|
| Q1 (small) | 26 | 42.3% | 2.25× |
| Q2 | 35 | 40.0% | 1.40× |
| Q3 | 35 | 42.8% | 1.62× |
| Q4 (large) | 22 | 45.5% | 2.05× |
| Tails | 13 | 61.5% | 1.60× |
| Overall | 131 | 42.7% | 1.76× |
Additional findings:
- Most successful kernels achieved moderate speedups, while a smaller fraction reached substantially larger gains.
- Ablation studies determined that profiling-guided optimization doubled the success rate compared to LLM-only approaches lacking feedback, and that iterative refinement outperformed one-shot tuning for complex kernels.
7. Limitations and Prospects
Identified limitations include:
- Exploration inefficiency: LLMs tend to revisit syntactically similar variants, crowding out more impactful transformations.
- Diminishing returns: Beyond the third iteration, additional optimization rounds yield steadily smaller incremental success rates.
- Algorithmic stagnation: LLMs predominantly propose superficial code changes rather than fundamentally novel algorithmic schemes.
- Latency bottlenecks: Overall runtime per kernel averages 20 minutes, primarily attributable to LLM inference overhead.
Future enhancements proposed:
- Early convergence detection via adaptive stopping rules responsive to diminishing speedup.
- Diversity-driven prompting to induce exploration of orthogonal transformations.
- Integration of lightweight, learned cost predictors to eliminate low-value candidate proposals pre-profile.
- Hybridization with genetic or evolutionary search paradigms to complement LLM suggestions.
- Reduction of LLM latency through lighter-weight inference models or edge deployment.
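The first proposal above, an adaptive stopping rule, might look like the following sketch; the window size and threshold are illustrative assumptions, not values from the paper:

```python
def should_stop(round_gains, window=2, eps=0.03):
    """Adaptive early stopping: halt the search once each of the last
    `window` rounds improved the best-so-far latency by less than `eps`
    (fractional gain). Both parameters are illustrative.
    """
    if len(round_gains) < window:
        return False  # not enough history to judge convergence
    return all(g < eps for g in round_gains[-window:])
```

Plugged into the Section 5 loop, this would replace the fixed `max_iters` cutoff with a convergence test on recent per-round gains.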
TritonForge empirically demonstrates that leveraging hardware-centric feedback in a model-agnostic, LLM-driven code optimization loop produces material performance gains on Triton kernels, substantiating the framework as a foundation for subsequent research in automated GPU kernel optimization (Li et al., 9 Dec 2025).