TritonForge GPU Kernel Optimization

Updated 30 January 2026

TritonForge is a profiling-guided, modular framework that automates LLM-driven optimizations of Triton GPU kernels using static analysis and runtime feedback.
It integrates dynamic profiling data with iterative code transformation to achieve expert-competitive throughput improvements in ML and HPC workloads.
Its closed-loop system combines kernel analysis, test generation, and automated error remediation to deliver significant speedups with minimal manual intervention.

TritonForge is a profiling-guided, modular framework for automated optimization of Triton GPU kernels in machine learning and high-performance computing workloads. By integrating static kernel analysis, runtime profiling, and iterative code transformation driven by LLMs, TritonForge formalizes and streamlines the process of accelerating Triton kernels for varied architectures. Central to this approach is a closed evaluation loop, leveraging dynamic feedback from hardware-level metrics to guide code-level optimizations and achieve expert-competitive throughput without manual intervention (Li et al., 9 Dec 2025).

1. Architecture and Workflow

TritonForge's system architecture comprises several discrete, interoperable modules, each addressing a distinct stage of the optimization pipeline. The core workflow proceeds as follows:

Kernel Analysis: The system ingests a Triton kernel $K$ and hardware profile $H$ , inspecting the code for tiling parameters, memory access patterns, and loop structures.
Test Generation: An LLM agent generates unit and performance tests, annotating input tensors and marking relevant regions via NVTX labels for robust profiling.
Runtime Profiling: The kernel is executed under profiling tools (e.g., Nsight Compute), producing report $R$ comprising latency, occupancy, bandwidth utilization, and stall metrics.
Transformation Engine: A second LLM agent, equipped with fault remediation capabilities, proposes transformed kernel variants $K'$ , informed by $H$ , $K$ , and $R$ .
Build and Evaluation: The transformed kernel is compiled and executed; errors, if encountered, are remediated through further model prompts. New runtime metrics $R'$ are collected.
Arbiter and Hint Refinement: The Arbiter compares $R'$ to the best-so-far metrics $R^*$ and either accepts, iterates, or terminates search. Performance deltas $\Delta(R', R)$ are converted into targeted hints for the next transformation cycle.

This pipeline is represented by the following ASCII diagram:

┌────────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│ 1. Kernel K    │→──▶│ 2. Test Gen │→──▶│ 3. Profile   │→──▶│ 4. Arbiter  │
└────────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
                                                          ▲
                                                          │
                              ┌───────────────┐           │
                              │ 5. Transformer│◀──────────┘
                              └───────────────┘

2. Profiling Metrics and Bottleneck Diagnosis

Optimization decisions in TritonForge are driven by quantitative profiling data:

Latency: Kernel-only duration, $T_{lat} = \mathrm{mean}_n[t_{end}-t_{start}]$ .
Bandwidth Utilization: $BW_{util} = \frac{\text{measured bytes}}{\text{elapsed time} \times \text{peak bandwidth}}$
Compute Utilization: $Compute_{util} = \frac{\text{executed ops}}{\text{elapsed time} \times \text{peak compute}}$
Occupancy: $\text{occupancy} = \frac{\# \text{active warps per SM}}{\# \text{max warps per SM}}$ , e.g.,

$\text{occupancy} = \frac{(\text{resident CTAs} \times \text{warps per CTA} \times 32)}{(64 \times 32)}$

Stall Reason Ranking: Nsight metrics such as “DRAM memory efficiency” and detailed stall categories (“membar,” “pipe_alu”) are ranked to prioritize transformation targets.

Bottleneck identification leverages these metrics:

$BW_{util} \ll 100\%$ signals a memory-bound kernel.
$Compute_{util} \ll 100\%$ , and low occupancy imply compute-bound kernels or register pressure issues.

3. Code Transformation Methods

TritonForge utilizes a set of parameterized, hardware-aware code transformation strategies, selection of which is governed by LLM-driven cost models and heuristics:

A. Tiling

Loop tiling is employed to increase cache and register reuse:

Cost heuristic:

$\text{cost}_{tile}(m, n) \approx \alpha \cdot \frac{tile_M \cdot tile_N}{\text{register usage}} + \beta \cdot \frac{tile_M \cdot tile_N}{\text{cache capacity}}$

Pseudocode transformation:

# Before
for i in range(0, M):
  for j in range(0, N):
    C[i,j] += A[i,k]*B[k,j]

# After, tile size (T_M×T_N)
for i0 in range(0, M, T_M):
  for j0 in range(0, N, T_N):
    A_tile = A[i0:i0+T_M, k0:k0+T_K]
    B_tile = B[k0:k0+T_K, j0:j0+T_N]
    C_tile = tl.dot(A_tile, B_tile)

B. Memory Coalescing

Transforms strided accesses to contiguous block loads for improved throughput:

Transformation:

# Before
idx = offs_m * stride_row + offs_k * stride_col
x = tl.load(src_ptr + idx)

# After
tmp = tl.load_blocked(src_ptr + offs_m*accum + intrawarp_idx)
# Fully coalesced

C. Prefetching and Pipelining

Applies num_stages and num_warps to overlap memory operations with computation, favoring at least two stages if bandwidth stalls exceed 25%.

D. Vectorization

Replaces scalar loads with vectorized operations:

Example: tl.load(..., mask=..., num_warps=..., vector_width=4)
Constrained by memory alignment and register availability.

E. Autotuning

Adorns kernels with @triton.autotune, enabling runtime search over BLOCK_SIZE and num_warps for configuration space optimization.

4. LLM Integration

The framework is model-agnostic, interfacing with any code-generation LLM via standardized prompt templates and structured JSON I/O. Gemini-2.5-Pro was predominantly used but may be replaced by other code-centric models such as Codex or ClaudeSonnet.

Test Generation: LLM prompt to create three unit tests with NVTX-labeled bulk/edge cases.
Kernel Optimization: LLM prompt provides hardware specs and profiling results, requesting code variants using tiling, prefetching, or coalesced loads.
Remediation: On compilation or runtime errors, LLMs receive error tracebacks and generate repairs, such as resolving undefined names or correcting tensor shapes.

5. Iterative Evaluation and Search Loop

Optimization progresses through an iterative loop, formalized in the following algorithmic pseudocode:

Input: H, initial kernel K, baseline profile R
K★, R★ ← K, R
for iter in 1…max_iters:
  K′ ← LLM_Proposal(H, K, R)
  success ← try_compile_and_run(K′)
  if not success:
    K′ ← LLM_Remediation(K′, logs)
    if compile still fails → break
  R′ ← Profile(K′)
  decision ← Arbiter(R′, R★)
  if decision == accept:
    K★←K′; R★←R′
  if decision == finish:
    break
  hint ← make_hint(R, R′)
  K ← hint  # feed refined prompt
  R ← R′
return K★, R★

Round-level feedback (performance deltas and error traces) dynamically inform each subsequent transformation proposal, yielding either acceptance, iteration, or exit.

6. Empirical Evaluation and Benchmarks

TritonForge was assessed on the TritonBench suite, encompassing 184 real-world kernels across matrix operations, convolutions, and attention mechanisms. The main results are:

Success criteria: Speedup $\geq 1.05\times$ over baseline.
Overall success rate: $42.7\%$ (131/184 kernels compiled and profiled end-to-end).
Mean speedup (successful kernels): $\mu = 1.76\times$ with $95\%$ confidence interval $[1.70\times, 1.82\times]$ .
Maximum observed speedup: Up to $5\times$ (excluding anomalous outliers).
Success by kernel size:

Category	Number	Success %	Avg Speedup
Q1 (small)	26	42.3%	2.25×
Q2	35	40.0%	1.40×
Q3	35	42.8%	1.62×
Q4 (large)	22	45.5%	2.05×
Tails	13	61.5%	1.60×
Overall	131	42.7%	1.76×

Additional findings:

$60\%$ of successful kernels achieved speedup within $[1.05\times, 1.2\times]$ .
Approximately $10\%$ exceeded $2\times$ but were below $5\times$ .
Ablation studies determined profiling-guided optimization doubled the success rate compared to LLM-only approaches lacking feedback, and iterative refinement outperformed one-shot tuning for complex kernels.

7. Limitations and Prospects

Identified limitations include:

Exploration inefficiency: LLMs tend to revisit syntactically similar variants, marginalizing impactful transformations.
Diminishing returns: Beyond the third iteration, additional optimization rounds yield a $\sim10\%$ decrease in incremental success rate.
Algorithmic stagnation: LLMs predominantly propose superficial code changes rather than fundamentally novel algorithmic schemes.
Latency bottlenecks: Overall runtime per kernel averages $\sim$ 20 minutes, primarily attributable to LLM inference overhead.

Future enhancements proposed:

Early convergence detection via adaptive stopping rules responsive to diminishing speedup.
Diversity-driven prompting to induce exploration of orthogonal transformations.
Integration of lightweight, learned cost predictors to eliminate low-value candidate proposals pre-profile.
Hybridization with genetic or evolutionary search paradigms to complement LLM suggestions.
Reduction of LLM latency through lower-inference models or edge deployment.

TritonForge empirically demonstrates that leveraging hardware-centric feedback in a model-agnostic, LLM-driven code optimization loop produces material performance gains on Triton kernels, substantiating the framework as a foundation for subsequent research in automated GPU kernel optimization (Li et al., 9 Dec 2025).

Markdown Upgrade to Chat

References (1)

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to TritonForge.