CUDA-L2: Automated HGEMM CUDA Kernel Optimization
- CUDA-L2 is an automated system that boosts FP16 matrix multiplication performance by integrating large language models with reinforcement learning.
- It employs a tightly coupled code synthesis–profiling–policy gradient loop to explore vast CUDA kernel configuration spaces and achieve state-of-the-art speedups.
- Evaluated on NVIDIA A100 GPUs, CUDA-L2 outperforms standard libraries by up to 28.7% in server mode, underscoring its impact on LLM inference and training.
CUDA-L2 is a system for the automated optimization of half-precision General Matrix Multiply (HGEMM) CUDA kernels, surpassing the performance of NVIDIA's extensively tuned matmul libraries. Built around the integration of LLMs and reinforcement learning (RL), CUDA-L2 demonstrates that LLM-guided RL can systematically exceed the state of the art on performance-critical GPU workloads, notably those underpinning LLM inference and training (Su et al., 2 Dec 2025). The system embodies a tightly coupled code synthesis–profiling–policy gradient feedback loop, enabling effective exploration of kernel configuration spaces far beyond what is practical with manual tuning or traditional heuristics.
1. HGEMM Optimization Problem
HGEMM computes the update $C \leftarrow \alpha A B + \beta C$ for matrices $A$ of shape $M \times K$, $B$ of shape $K \times N$, and $C$ of shape $M \times N$, using FP16 arithmetic. For modern LLM pipelines (inference and training), HGEMM is the dominant operation, accounting for the bulk of total FLOPs, which makes its efficient GPU execution critical for large-scale deployments. However, the problem is non-trivial: optimal tiling, pipelining, and memory orchestration depend acutely on the dimension triple $(M, N, K)$, and the cardinality of the practical configuration space (e.g., dimensions ranging over powers of two from 64 to 16,384, yielding on the order of 1,000 cases) precludes exhaustive manual optimization (Su et al., 2 Dec 2025).
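As a rough, self-contained illustration of this configuration space and the FP32-reference validation used throughout (not the authors' code; the tolerances and the `run_kernel` hook are assumptions), consider the following sketch:

```python
import itertools
import torch

# Enumerate an (M, N, K) space of power-of-two shapes in the evaluated range.
SHAPES = [2 ** p for p in range(6, 15)]               # 64 ... 16384
CONFIGS = list(itertools.product(SHAPES, repeat=3))
print(f"{len(CONFIGS)} candidate (M, N, K) triples")  # 729 power-of-two triples

def check_fp16_kernel(run_kernel, M, N, K, atol=5e-2, rtol=5e-2):
    """Validate a candidate FP16 GEMM against an FP32 reference.

    `run_kernel` is a placeholder for a compiled candidate kernel that takes
    FP16 tensors A (M x K) and B (K x N) and returns C (M x N).
    """
    A = torch.randn(M, K, device="cuda", dtype=torch.float16)
    B = torch.randn(K, N, device="cuda", dtype=torch.float16)
    C = run_kernel(A, B)
    C_ref = A.float() @ B.float()                     # FP32 reference on the GPU
    return torch.allclose(C.float(), C_ref, atol=atol, rtol=rtol)
```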
2. CUDA-L2 System Architecture
CUDA-L2 fuses a pretrained LLM with a reinforcement learning loop to propose, compile, and empirically assess CUDA kernels. The workflow proceeds as follows:
- Continued Pretraining: The LLM is initialized then further pretrained on a corpus of CUDA kernels (CUTLASS, CuTe, PyTorch, web data, inline docs).
- General-Kernel RL: The LLM is exposed to approximately 1,000 CUDA kernel types in a contrastive RL regime where measured speedup informs reward.
- HGEMM-Specific RL: Focused RL over approximately 1,000 distinct (M, N, K) configurations, guiding the policy toward high-throughput code generation for matrix multiplication.
Each RL episode comprises: contextualized prompt construction (encoding current dimensions, baseline/Nsight Compute profile metrics, relevant code/documentation snippets, and contrastive code-latency examples), LLM code generation (yielding kernel parameters such as block/tile sizes and pipeline stages), compilation and execution, result validation, reward computation (reflecting speedup, numerical accuracy, code brevity), and a GRPO-style policy update (Su et al., 2 Dec 2025).
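A minimal sketch of one such episode, with all helpers (`build_prompt`, `compile_kernel`, `benchmark`, `matches_fp32_reference`, `compute_reward`, `grpo_update`) as hypothetical placeholders for the components described above, might look like this:

```python
# Hypothetical sketch of a single CUDA-L2-style RL episode (not the authors' code).
def run_episode(llm, policy_state, shape, history, baseline_profile):
    M, N, K = shape
    # 1. Retrieval-augmented, contrastive prompt: dimensions, NCU metrics,
    #    code/documentation snippets, and recent (kernel, latency) examples.
    prompt = build_prompt(M, N, K, baseline_profile, history)

    # 2. LLM proposes a CUDA kernel (block/tile sizes, pipeline stages, ...).
    kernel_src = llm.generate(prompt)

    # 3. Compile and run; failures incur a heavy negative reward.
    binary = compile_kernel(kernel_src)
    if binary is None:
        return grpo_update(policy_state, prompt, kernel_src, reward=-1.0)

    latency_ms, output = benchmark(binary, M, N, K)
    if not matches_fp32_reference(output, M, N, K):
        return grpo_update(policy_state, prompt, kernel_src, reward=-1.0)

    # 4. Reward reflects speedup, numerical accuracy, and code brevity.
    r = compute_reward(latency_ms, output, kernel_src, M, N, K)
    history.append((kernel_src, latency_ms))
    return grpo_update(policy_state, prompt, kernel_src, reward=r)
```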
3. Reinforcement Learning Formulation
The key RL components are as follows:
- State: Encodes the problem dimensions $(M, N, K)$, NCU profiling outputs (L2 bandwidth, SM occupancy, cache hit rates), and historical kernel performances.
- Action: A vector of kernel choices: block tile sizes (BM, BN, BK), the number of pipeline stages, and various memory swizzle and prefetch parameters.
- Reward: Averaged speedup over timed runs, penalized for numerical deviation from the FP32 reference (weighted by $\lambda_{\mathrm{acc}}$) and for code length (weighted by $\lambda_{\mathrm{len}}$): $R = \overline{\mathrm{speedup}} - \lambda_{\mathrm{acc}}\,\epsilon_{\mathrm{FP32}} - \lambda_{\mathrm{len}}\,L_{\mathrm{code}}$.
Non-compiling or incorrect kernels incur a heavy negative reward. Valid kernels propagate their empirical performance into the LLM's policy update via GRPO (Su et al., 2 Dec 2025).
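A minimal sketch of such a reward, with the weights and the specific error/length measures as assumptions rather than the paper's values, could be:

```python
# Hypothetical reward sketch; LAMBDA_* values and the error metric are assumptions.
LAMBDA_ACC = 1.0      # weight on numerical deviation from the FP32 reference
LAMBDA_LEN = 1e-4     # weight on code length (encourages brevity)
FAIL_REWARD = -1.0    # heavy negative reward for non-compiling/incorrect kernels

def reward(baseline_latencies_ms, kernel_latencies_ms, C_fp16, C_fp32_ref,
           kernel_src, compiled_ok, numerically_ok):
    if not (compiled_ok and numerically_ok):
        return FAIL_REWARD
    # Speedup averaged over the timed runs.
    speedups = [b / k for b, k in zip(baseline_latencies_ms, kernel_latencies_ms)]
    mean_speedup = sum(speedups) / len(speedups)
    # Numerical deviation from the FP32 reference (max absolute error here).
    err = (C_fp16.float() - C_fp32_ref).abs().max().item()
    return mean_speedup - LAMBDA_ACC * err - LAMBDA_LEN * len(kernel_src)
```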
4. LLM Integration and Prompt Design
Prompt generation is retrieval-augmented and contrastive:
- The prompt specifies the target HGEMM dimensions, highlights baseline NCU profiling data (e.g., L2 bandwidth, SM occupancy), includes relevant code idioms (from CUTLASS/CuTe), and supplies recent kernel examples with associated timing.
- The LLM generates CUDA C++ code (typically a `__global__` kernel with explicit block/tile/pipeline structure). The generated code is automatically compiled and benchmarked, and execution metrics and reward outcomes are appended to the RL context (Su et al., 2 Dec 2025); a prompt-construction sketch follows this list.
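A minimal sketch of such a retrieval-augmented, contrastive prompt builder, with field names and formatting as assumptions rather than the paper's actual template, is shown below:

```python
# Hypothetical prompt construction; the layout is illustrative only.
def build_prompt(M, N, K, ncu_metrics, retrieved_snippets, recent_kernels):
    """recent_kernels: list of (cuda_source, latency_ms) contrastive examples."""
    lines = [
        f"Optimize an FP16 HGEMM kernel for M={M}, N={N}, K={K} on an A100.",
        "Baseline Nsight Compute profile:",
        *(f"  {name}: {value}" for name, value in ncu_metrics.items()),
        "Relevant CUTLASS/CuTe idioms:",
        *retrieved_snippets,
        "Recent kernels with measured latency (lower is better):",
    ]
    for src, latency_ms in recent_kernels:
        lines.append(f"--- latency: {latency_ms:.3f} ms ---")
        lines.append(src)
    lines.append("Generate a complete CUDA C++ __global__ kernel plus launch code.")
    return "\n".join(lines)
```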
5. Experimental Setup and Evaluation
The evaluation is conducted on NVIDIA A100 (Ampere) GPUs, using CUDA 11.8, cuBLAS 11.10, and PyTorch 2.1.0, with Nsight Compute for profiling. The test space covers 1,000 (M, N, K) triplets with dimensions drawn from powers of two between 64 and 16,384. Two inference modes are used:
- Offline: Consecutive kernel launches with no idle time.
- Server: Randomized intervals between launches to simulate on-demand serving, with the idle intervals excluded from timing (see the timing sketch after this list).
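A minimal timing harness illustrating the two modes (the warmup count, run count, and idle-interval distribution are assumptions, not the paper's settings) might look as follows:

```python
import random
import time
import torch

def time_kernel(fn, n_runs=50, server_mode=False):
    """Return mean latency in ms, timing only the kernel launches themselves."""
    for _ in range(5):                                  # warmup
        fn()
    torch.cuda.synchronize()
    total_ms = 0.0
    for _ in range(n_runs):
        if server_mode:
            time.sleep(random.uniform(0.0, 0.005))      # idle gap, excluded from timing
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        total_ms += start.elapsed_time(end)
    return total_ms / n_runs

# Example: offline vs. server-mode latency of the torch.matmul (cuBLAS) path.
A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
offline_ms = time_kernel(lambda: torch.matmul(A, B))
server_ms = time_kernel(lambda: torch.matmul(A, B), server_mode=True)
```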
CUDA-L2 is compared to four baselines:
- torch.matmul: PyTorch dispatch to cuBLAS.
- cuBLAS: Both NN and TN layouts, reporting the faster of the two.
- cuBLASLt-heuristic: The rank-0 result returned by `cublasLtMatmulAlgoGetHeuristic`.
- cuBLASLt-AutoTuning: Exhaustive algorithm search, selecting fastest.
| Mode | vs. torch.matmul | vs. cuBLAS | vs. cuBLASLt-heuristic | vs. cuBLASLt-AutoTuning |
|---|---|---|---|---|
| Offline | +22.0% | +19.2% | +16.8% | +11.4% |
| Server | +28.7% | +26.0% | +22.4% | +15.9% |
Speedup is calculated as $t_{\text{baseline}} / t_{\text{CUDA-L2}} - 1$, expressed as a percentage. In a hybrid “best-of-two” setup that dispatches the faster of the CUDA-L2 kernel and the baseline for each configuration, the speedup over cuBLASLt-AutoTuning increases from +11.4% to +13.2% (Su et al., 2 Dec 2025).
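A minimal sketch of this metric and the best-of-two hybrid (names and the averaging step are illustrative assumptions) is:

```python
# Hypothetical sketch of the reported speedup metric and the best-of-two hybrid.
def speedup_pct(t_baseline_ms, t_candidate_ms):
    """Percentage speedup of a candidate kernel over a baseline."""
    return (t_baseline_ms / t_candidate_ms - 1.0) * 100.0

def best_of_two_speedup_pct(t_baseline_ms, t_cudal2_ms):
    """Per-configuration hybrid: dispatch whichever kernel is faster."""
    return speedup_pct(t_baseline_ms, min(t_baseline_ms, t_cudal2_ms))

# Averaging either quantity over all measured (M, N, K) triples yields
# table-style aggregate numbers such as +11.4% (plain) vs. +13.2% (hybrid).
```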
6. Insights from Automated Kernel Search
Analysis of the discovered kernel configurations reveals:
- Tiling Parameters: BM/BN (block tile sizes for the M/N dimensions) correlate clearly with their respective dimensions, while BK (K-tiling) shows only a weak correlation with K, indicative of a trade-off between register pressure and pipeline depth.
- Pipelining: The number of pipeline stages grows with problem size; small configurations use 2–3 stages, while the largest use 6 or more.
- Swizzle and Prefetch: Block swizzling is engaged once problems exceed an operation-count threshold, with stride proportional to problem size. The system autonomously discovers optimizations such as zero-padding for non-divisible tile sizes, double buffering via register ping-pong, multi-iteration prefetching, wide register-to-shared copies, and staggered prefetch order; a configuration heuristic reflecting these trends is sketched after this list.
- Ablation: Removing advanced tactics (pipelining, swizzle) reduces performance by 5–10% in critical regimes (Su et al., 2 Dec 2025).
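The following hypothetical heuristic mirrors the reported trends (larger BM/BN for larger M/N, more pipeline stages and swizzling for bigger problems); the thresholds and values are illustrative assumptions, not configurations actually discovered by CUDA-L2:

```python
# Illustrative configuration heuristic; all thresholds are assumptions.
def pick_config(M, N, K):
    def tile(dim):
        return 256 if dim >= 8192 else 128 if dim >= 1024 else 64
    bm, bn = tile(M), tile(N)          # BM/BN track the M/N dimensions
    bk = 32                            # BK is only weakly tied to K
    flops = 2 * M * N * K
    stages = 2 if flops < 2**34 else 4 if flops < 2**40 else 6
    swizzle = flops >= 2**40           # engage block swizzling for large problems
    swizzle_stride = max(1, (M * N) // (bm * bn * 128)) if swizzle else 0
    return {"BM": bm, "BN": bn, "BK": bk, "stages": stages,
            "swizzle": swizzle, "swizzle_stride": swizzle_stride}

print(pick_config(512, 512, 512))        # small: shallow pipeline, no swizzle
print(pick_config(16384, 16384, 16384))  # large: deep pipeline, swizzling on
```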
7. Limitations and Prospective Directions
CUDA-L2 currently targets FP16 GEMM on the Ampere (A100) GPU architecture. Future extensions will address additional architectures (Hopper H100, Ada Lovelace RTX 40xx, Blackwell B200) and other precisions (TF32, BF16). Hierarchical RL and meta-learning approaches are under investigation to further reduce adaptation time to novel hardware or precision domains. A plausible implication is that the auto-tuning and LLM-generated kernel paradigm may generalize to other domain-critical GPU primitives, including convolutions and attention mechanisms (Su et al., 2 Dec 2025).
CUDA-L2 represents a significant step in the automated optimization of GPU kernels, demonstrating the capacity of LLM-guided RL to outperform established vendor libraries for highly structured computational workloads in high-performance computing and machine learning (Su et al., 2 Dec 2025).