
CUDA-L2: Automated HGEMM CUDA Kernel Optimization

Updated 5 December 2025
  • CUDA-L2 is an automated system that boosts FP16 matrix multiplication performance by integrating large language models with reinforcement learning.
  • It employs a tightly coupled code synthesis–profiling–policy gradient loop to explore vast CUDA kernel configuration spaces and achieve state-of-the-art speedups.
  • Evaluated on NVIDIA A100 GPUs, CUDA-L2 outperforms standard libraries by up to 28.7% in server mode, underscoring its impact on LLM inference and training.

CUDA-L2 is a system for the automated optimization of half-precision General Matrix Multiply (HGEMM) CUDA kernels, surpassing the performance of NVIDIA's extensively tuned matmul libraries. Developed around the integration of LLMs and reinforcement learning (RL), CUDA-L2 demonstrates that LLM-guided RL can systematically exceed the state of the art in performance-critical GPU workloads, notably those underpinning LLM inference and training (Su et al., 2 Dec 2025). The system embodies a tightly coupled code synthesis–profiling–policy gradient feedback loop, enabling effective exploration of kernel configuration spaces far beyond what is practical with manual tuning or traditional heuristics.

1. HGEMM Optimization Problem

HGEMM computes $C = \alpha A B + \beta C$ for matrices $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, $C \in \mathbb{R}^{M \times N}$, with $\alpha = 1$, $\beta = 0$ so that $C = AB$, using FP16 arithmetic. For modern LLM pipelines (inference and training), HGEMM is the dominant operation, often exceeding $70\%$ of total FLOPs, making its efficient GPU execution critical for large-scale deployments. However, the problem is non-trivial: optimal tiling, pipelining, and memory orchestration depend acutely on the dimension triple $(M,N,K)$, and the cardinality of the practical configuration space (e.g., $M,N,K \in \{64, 128, \dots, 16384\}$, yielding on the order of $10^3$ cases) precludes exhaustive manual optimization (Su et al., 2 Dec 2025).
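The scale of this search space can be made concrete with a short sketch. The snippet below enumerates the power-of-two $(M,N,K)$ grid and times a single FP16 matmul via torch.matmul (which dispatches to cuBLAS, as in the paper's baselines); the helper `time_hgemm` is our own illustration, not part of CUDA-L2.

```python
import itertools
import torch

# Power-of-two dimensions from 64 to 16384, matching the evaluation grid.
dims = [2 ** e for e in range(6, 15)]            # 64, 128, ..., 16384
grid = list(itertools.product(dims, dims, dims))
print(f"{len(grid)} (M, N, K) configurations")   # 729, on the order of 10^3

def time_hgemm(M, N, K, iters=20):
    """Rough FP16 matmul timing on the current CUDA device (baseline path)."""
    A = torch.randn(M, K, device="cuda", dtype=torch.float16)
    B = torch.randn(K, N, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(A, B)                       # dispatches to cuBLAS
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters       # milliseconds per launch

if torch.cuda.is_available():
    print(time_hgemm(4096, 4096, 4096))
```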

2. CUDA-L2 System Architecture

CUDA-L2 fuses a pretrained LLM with a reinforcement learning loop to propose, compile, and empirically assess CUDA kernels. The workflow proceeds as follows:

  • Continued Pretraining: The LLM is initialized and then further pretrained on a corpus of CUDA kernel code (CUTLASS, CuTe, PyTorch, web data, inline documentation).
  • General-Kernel RL: The LLM is exposed to approximately 1,000 CUDA kernel types in a contrastive RL regime where measured speedup informs reward.
  • HGEMM-Specific RL: Focused RL over $1,000$ distinct $(M,N,K)$ configurations, guiding the policy toward high-throughput code generation for matrix multiplication.

Each RL episode comprises: contextualized prompt construction (encoding current dimensions, baseline/Nsight Compute profile metrics, relevant code/documentation snippets, and contrastive code-latency examples), LLM code generation (yielding kernel parameters such as block/tile sizes and pipeline stages), compilation and execution, result validation, reward computation (reflecting speedup, numerical accuracy, code brevity), and a GRPO-style policy update (Su et al., 2 Dec 2025).
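The shape of one episode can be rendered as a short loop. Everything below is a hedged sketch: `build_prompt`, `compile_and_run`, `compute_reward`, `grpo_update`, and the `history` object are placeholder names for the stages described above, not CUDA-L2 APIs (concrete sketches of `compute_reward` and `build_prompt` appear in Sections 3 and 4).

```python
HEAVY_PENALTY = -1.0  # assumed value; the paper only specifies a heavy negative reward

def run_episode(policy_llm, shape, history, alpha, beta):
    """One RL episode for a single (M, N, K) shape; every helper here is hypothetical."""
    # 1. Retrieval-augmented, contrastive prompt: dimensions, Nsight Compute metrics,
    #    relevant code/doc snippets, and prior (code, latency) examples.
    prompt = build_prompt(shape, history.profile, history.snippets,
                          history.contrastive_examples)

    # 2. The policy LLM proposes a CUDA kernel (block/tile sizes, pipeline stages, ...).
    kernel_src = policy_llm.generate(prompt)

    # 3. Compile, run, and validate the candidate against an FP32 reference.
    result = compile_and_run(kernel_src, shape)

    # 4. Reward: measured speedup with accuracy/length penalties (Section 3), or a
    #    heavy negative reward for non-compiling or numerically incorrect kernels.
    if result.valid:
        reward = compute_reward(result.timed_runs, kernel_src, alpha, beta)
    else:
        reward = HEAVY_PENALTY

    # 5. GRPO-style policy update using the episode's reward.
    grpo_update(policy_llm, prompt, kernel_src, reward)
    history.record(kernel_src, result)
    return reward
```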

3. Reinforcement Learning Formulation

The key RL components are as follows:

  • State: Encodes the problem dimensions $(M,N,K)$, NCU profiling outputs (L2 bandwidth, SM occupancy, cache hit rates), and historical kernel performance.
  • Action: A vector of kernel choices: block tile sizes (BM, BN, BK), number of pipeline stages ($n_\mathrm{stage}$), various memory swizzle and prefetch parameters, etc.
  • Reward: Averaged speedup over $N$ timed runs, penalized for numerical deviation from the FP32 reference (weighted by $\alpha$) and for code length (weighted by $\beta$):

$$r(\text{code}) = \frac{1}{N}\sum_{i=1}^{N} \left[ \frac{t^{i}_{\mathrm{ref}}}{t^{i}_{\mathrm{exec}}} - \alpha \, \mathrm{diff}^{i} \right] - \beta \, \mathrm{len}(\text{code})$$

Non-compiling or incorrect kernels incur a heavy negative reward. Valid kernels propagate their empirical performance into the LLM's policy update via GRPO (Su et al., 2 Dec 2025).
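A minimal reading of this reward in code, with `alpha`, `beta`, and the failure penalty left as unspecified hyperparameters (the paper does not give their values), might look as follows.

```python
def compute_reward(timed_runs, code, alpha, beta, failure_penalty=-1.0):
    """Reward for one candidate kernel.

    timed_runs: list of (t_ref, t_exec, diff) tuples, where diff measures the
    kernel's numerical deviation from the FP32 reference on that run.
    An empty/None list marks a non-compiling or incorrect kernel.
    """
    if not timed_runs:
        return failure_penalty  # heavy negative reward for invalid kernels
    terms = [t_ref / t_exec - alpha * diff for t_ref, t_exec, diff in timed_runs]
    return sum(terms) / len(terms) - beta * len(code)
```

For instance, a kernel that is consistently 1.3× faster than the reference with negligible numerical deviation scores roughly 1.3 minus a small length penalty.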

4. LLM Integration and Prompt Design

Prompt generation is retrieval-augmented and contrastive:

  • The prompt specifies the target HGEMM dimensions, highlights baseline NCU profiling data (e.g., L2 bandwidth, SM occupancy), includes relevant code idioms (from CUTLASS/CuTe), and supplies recent kernel examples with associated timing.
  • The LLM generates CUDA C++ code (typically a global kernel with explicit block/tile/pipeline structure). The generated code is automatically compiled and benchmarked. Execution metrics and reward outcomes are appended to the RL context (Su et al., 2 Dec 2025).
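As a rough illustration only, the prompt for one episode might be assembled along these lines; the template and field names below are our own, not the paper's.

```python
PROMPT_TEMPLATE = """\
Optimize an FP16 HGEMM CUDA kernel for M={M}, N={N}, K={K} on an NVIDIA A100.

Baseline Nsight Compute profile:
  L2 bandwidth: {l2_bw} GB/s, SM occupancy: {occupancy}%, L2 hit rate: {l2_hit}%

Relevant CUTLASS/CuTe idioms:
{snippets}

Recent candidate kernels with measured latencies (fastest first):
{examples}

Return a single __global__ CUDA C++ kernel with explicit block/tile sizes and
pipeline stages, plus its launch configuration.
"""

def build_prompt(shape, profile, snippets, examples):
    """Assemble a retrieval-augmented, contrastive prompt (illustrative only)."""
    M, N, K = shape
    return PROMPT_TEMPLATE.format(
        M=M, N=N, K=K,
        l2_bw=profile["l2_bandwidth_gbps"],
        occupancy=profile["sm_occupancy_pct"],
        l2_hit=profile["l2_hit_rate_pct"],
        snippets="\n".join(snippets),
        examples="\n\n".join(f"// {lat_ms:.3f} ms\n{code}"
                             for code, lat_ms in examples),
    )
```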

5. Experimental Setup and Evaluation

The evaluation is conducted on NVIDIA A100 (Ampere) GPUs, using CUDA 11.8, cuBLAS 11.10, and PyTorch 2.1.0, with Nsight Compute for profiling. The test space covers 1,000 $(M,N,K)$ triplets (powers of two from 64 up to 16,384). Two inference modes are used:

  • Offline: Consecutive kernel launches with no idle time.
  • Server: Randomized launch intervals to simulate on-demand serving (intervals excluded from timing).
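The two modes differ only in how launches are spaced. A hedged sketch of both timing schemes is shown below, using CUDA events and excluding the injected idle time from the server-mode measurement; the interval distribution is our assumption.

```python
import random
import time
import torch

def bench(fn, iters=100, server_mode=False, max_idle_ms=5.0):
    """Per-launch latency in ms; server-mode idle gaps are excluded from timing."""
    total_ms = 0.0
    torch.cuda.synchronize()
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        total_ms += start.elapsed_time(end)
        if server_mode:
            # Randomized inter-launch interval (assumed distribution), excluded
            # from the measurement to mimic on-demand serving.
            time.sleep(random.uniform(0.0, max_idle_ms) / 1000.0)
    return total_ms / iters

if torch.cuda.is_available():
    A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    offline_ms = bench(lambda: torch.matmul(A, B))
    server_ms = bench(lambda: torch.matmul(A, B), server_mode=True)
```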

CUDA-L2 is compared to four baselines:

  • torch.matmul: PyTorch dispatch to cuBLAS.
  • cuBLAS: Both NN and TN layouts, with the faster of the two reported.
  • cuBLASLt-heuristic: The rank-0 algorithm returned by cublasLtMatmulAlgoGetHeuristic.
  • cuBLASLt-AutoTuning: Exhaustive search over available algorithms, selecting the fastest.

Average speedup of CUDA-L2 over each baseline:

| Mode    | torch.matmul | cuBLAS | cuBLASLt-heuristic | cuBLASLt-AutoTuning |
|---------|--------------|--------|--------------------|---------------------|
| Offline | +22.0%       | +19.2% | +16.8%             | +11.4%              |
| Server  | +28.7%       | +26.0% | +22.4%             | +15.9%              |

Speedup is calculated as $(T_{\mathrm{baseline}} - T_{\mathrm{CUDA\text{-}L2}})/T_{\mathrm{baseline}} \times 100\%$. In a hybrid “best-of-two” setup leveraging the faster of CUDA-L2 or the baseline, the speedup over cuBLASLt-AutoTuning increases from +11.4% to +13.2% (Su et al., 2 Dec 2025).
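The speedup metric and the best-of-two variant reduce to a few lines. The sketch below assumes per-shape latency lists for the baseline and for CUDA-L2; averaging per-shape speedups across the grid is our assumption about how the aggregate is formed.

```python
def speedup_pct(t_baseline, t_cuda_l2):
    """Relative speedup of CUDA-L2 over a baseline, in percent."""
    return (t_baseline - t_cuda_l2) / t_baseline * 100.0

def best_of_two_speedup_pct(baseline_times, cuda_l2_times):
    """Best-of-two: for each shape, dispatch to whichever kernel is faster,
    then average the per-shape speedups."""
    per_shape = [speedup_pct(b, min(b, c))
                 for b, c in zip(baseline_times, cuda_l2_times)]
    return sum(per_shape) / len(per_shape)
```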

6. Analysis of Discovered Kernel Configurations

Analysis of the kernel configurations discovered by CUDA-L2 reveals the following (a heuristic sketch of these trends appears after the list):

  • Tiling Parameters: BM/BN (block tile sizes for M/N) exhibit correlations of $\rho \approx 0.65$–$0.70$ with their respective dimensions, while BK (K-tiling) shows only weak correlation ($\rho \approx 0.25$), indicative of a trade-off between register pressure and pipeline depth.
  • Pipelining: The number of pipeline stages $n_\mathrm{stage}$ grows with problem size; problems with $K \leq 128$ use 2–3 stages, while those with $K > 8192$ use 6 or more.
  • Swizzle and Prefetch: Block swizzling is engaged for large problems ($\geq 2^{27}$ operations), with stride proportional to problem size. The system autonomously discovers optimizations such as zero-padding for non-divisible tile sizes, double buffering via register ping-pong, multi-iteration prefetching, wide register-to-shared copies, and staggered prefetch ordering.
  • Ablation: Removing advanced tactics (pipelining, swizzle) reduces performance by 5–10% in critical regimes (Su et al., 2 Dec 2025).
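These trends can be summarized as a rough heuristic; the thresholds and tile choices below are illustrative extrapolations from the reported correlations, not configurations published by the paper.

```python
def heuristic_config(M, N, K):
    """Illustrative parameter heuristic mirroring the reported trends (not from the paper)."""
    # BM/BN track their own dimensions (rho ~ 0.65-0.70); BK is only weakly tied to K.
    BM = 128 if M >= 1024 else 64
    BN = 128 if N >= 1024 else 64
    BK = 32

    # Pipeline depth grows with problem size: 2-3 stages for small K, 6+ beyond K = 8192.
    if K <= 128:
        n_stage = 2
    elif K <= 8192:
        n_stage = 4
    else:
        n_stage = 6

    # Block swizzling only for large problems; ">= 2^27 ops" is read here as M*N*K.
    use_swizzle = (M * N * K) >= 2 ** 27
    return {"BM": BM, "BN": BN, "BK": BK, "n_stage": n_stage, "swizzle": use_swizzle}
```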

7. Limitations and Prospective Directions

CUDA-L2 currently targets FP16 GEMM on Ampere (A100) GPU architecture. Future extensions will address additional architectures (Hopper H100, Ada Lovelace RTX 40xx, Blackwell B200) and other precisions (TF32, BF16). Hierarchical RL and meta-learning approaches are under investigation to further reduce adaptation time to novel hardware or precision domains. A plausible implication is that the auto-tuning and LLM-generated kernel paradigm may generalize to other domain-critical GPU primitives, including convolutions and attention mechanisms (Su et al., 2 Dec 2025).


CUDA-L2 represents a significant step in the automated optimization of GPU kernels, demonstrating the capacity of LLM-guided RL to outperform established vendor libraries for highly structured computational workloads in high-performance computing and machine learning (Su et al., 2 Dec 2025).
