CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning (2512.02551v1)
Abstract: In this paper, we propose CUDA-L2, a system that combines LLMs and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to NVIDIA's state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
Explain it Like I'm 14
Plain-English Summary of “CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning”
Overview
This paper is about speeding up a core math operation used in AI models called matrix multiplication. Think of a matrix as a big grid of numbers. Multiplying two matrices is like combining rows from one grid with columns from another to make a new grid. This operation is everywhere in modern AI, especially LLMs.
The authors built a system called CUDA-L2 that uses an AI model (an LLM) plus a training method called reinforcement learning to automatically write super-fast code for matrix multiplication on NVIDIA GPUs. Their code often runs faster than the best official NVIDIA libraries (cuBLAS and cuBLASLt), which are already highly optimized by experts.
Key Objectives and Questions
Here are the main goals the paper tries to achieve:
- Can an AI system automatically write GPU code for matrix multiplication that beats expert-tuned libraries?
- Can it handle many different matrix sizes (1,000 combinations) that show up in real LLMs?
- Will it stay correct (give the right answers) while getting faster?
- How well does it work in two real-world scenarios: nonstop heavy use (offline) and on-demand requests (server)?
Methods and Approach (in everyday terms)
The team combined a code-writing AI with reinforcement learning (RL). Reinforcement learning is like training by trial and error: the AI writes a piece of code, runs it, gets a score based on how fast it is (and how correct), and then tries again, learning to improve.
Here’s how they made it work:
- They trained their AI on lots of CUDA (GPU) code, not just one small benchmark. This gives it broader “coding instincts.”
- They trained in stages: first on many general GPU tasks, then focusing more narrowly on matrix multiplication.
- They used a “speed reward”: faster code gets a higher score. If the code is wrong or suspicious, the score gets penalized.
- They checked correctness carefully. For example, they compared results against a trusted CPU version and used tests that force exact answers for certain inputs.
- They prevented cheating in timing measurements (so the AI can’t fake being fast) and used standard GPU timing tools.
- They tested across 1,000 different matrix sizes (M, N, K picked from common sizes used in LLMs) and two usage modes:
  - Offline mode: run kernels back-to-back nonstop (like an assembly line that never stops).
  - Server mode: run kernels at random times (like orders arriving at a restaurant—sometimes busy, sometimes quiet).
- They also included deep performance data (like memory speed and how busy the GPU cores are) to help the AI learn smarter optimizations.
Main Findings and Why They Matter
Big picture: CUDA-L2 beat strong baselines, including NVIDIA’s cuBLAS/cuBLASLt. Here are the headline numbers (average improvements across 1,000 matrix sizes):
- In offline mode:
- About 22% faster than PyTorch’s torch.matmul.
- About 19% faster than cuBLAS (using the best layout).
- About 17% faster than cuBLASLt using its heuristic pick.
- About 11% faster than cuBLASLt AutoTuning (which tries up to ~100 algorithm variants).
- In server mode (on-demand requests), the gains are even bigger:
- About 29% faster than torch.matmul.
- About 26% faster than cuBLAS.
- About 22% faster than cuBLASLt heuristic.
- About 16% faster than cuBLASLt AutoTuning.
Why this matters:
- These libraries are already extremely optimized by experts. Beating them suggests AI-guided automation can push performance further by exploring huge design spaces that are impractical for humans to search by hand.
- Faster matrix multiplication means faster and cheaper AI inference (answering queries) and training. That can save money, reduce energy use, and unlock smoother real-time experiences.
Extra observations:
- The speed advantage is largest for smaller matrices. As matrices get very big, the GPU becomes fully loaded, leaving less room for improvement.
- If users can choose the fastest of either CUDA-L2 or the baseline per case, they get even more speed overall.
Implications and Potential Impact
- Automating performance tuning: CUDA-L2 shows that AI plus reinforcement learning can discover clever GPU tricks—like better memory layouts, smarter prefetching (loading data before it’s needed), and more efficient pipelines—across many matrix sizes.
- Real-world benefit: Faster kernels help make LLMs and other AI models run quicker and cheaper, especially under real server conditions.
- Broader reach: The current system targets NVIDIA A100 GPUs, but the approach is designed to extend to other GPU families (Ada, Hopper, Blackwell, etc.).
- Future promise: As AI-driven optimization improves and covers more hardware types, developers may rely less on manual tuning and more on automated systems that continually adapt and improve.
In short, this paper shows that AI can write high-performance GPU code that beats top expert libraries for a critical AI operation. That’s a big step toward smarter, faster, and more efficient AI systems.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper that future researchers could address:
- Architecture generalization: results are limited to A100; evaluate portability and required re-tuning on RTX 3090/4090 (Ampere/Ada), H100 (Hopper), and B200 (Blackwell), with consistent driver/CUDA versions and clocks/ECC settings.
- Precision and accumulator coverage: extend beyond FP16 inputs and C=AB (α=1, β=0) to BF16, TF32, FP32, INT8, and mixed-precision accumulators (FP16 vs FP32); quantify accuracy–performance trade-offs.
- GEMM forms and epilogues: support nonzero α/β, fused epilogues (bias, activation, scaling), strided/batched/Grouped GEMM; measure end-to-end gains in LLM layers (attention, MLP).
- Shape generalization: test irregular, non-power-of-two sizes (e.g., 3072, 6144, 12288±256), very small (<64), extreme aspect ratios; report whether the learned policies generalize or require per-shape specialization.
- Layout coverage: include NT and TT layouts and column-major inputs; account for data-layout conversions and measure their end-to-end cost.
- Padding strategy details: specify how zero-padding is implemented (masked loads vs memory reallocation), how extra rows are discarded, and its impact on memory footprint and allocator behavior in real frameworks.
- Automatic kernel selection: provide a low-overhead runtime policy to choose between CUDA-L2 and baselines per configuration; quantify selection accuracy, overhead, and guardrails against regressions.
- Framework integration: demonstrate drop-in use in PyTorch/TensorRT/TVM with CUDA Graph capture; report model-level throughput/latency improvements on representative LLM inference/training pipelines.
- Server-mode realism: define the arrival process (distribution, load), concurrency (multi-stream/multi-tenant), batching, and scheduling; include intervals in end-to-end latency and test cache/clock dynamics under realistic workloads.
- Concurrency restrictions: forbidding additional CUDA streams prevents overlap; explore timing methods that remain robust without disabling streams and evaluate performance under multi-stream overlap.
- Compilation/JIT cost: report kernel generation/compile times, binary size, caching strategy, and cold-start penalties; investigate PTX/SASS JIT or runtime specialization to reduce deployment overhead.
- Correctness robustness: expand tests to random real-valued inputs, NaNs/Infs, denormals, extreme magnitudes; verify determinism and backprop correctness; use compute-sanitizer racecheck/synccheck in addition to memcheck.
- Numerical error policy: justify deviation thresholds with relative/ULP metrics; compare accuracy against cuBLAS/cuBLASLt across accumulator choices; expose accuracy–latency trade-offs.
- Component ablations: quantify the individual impact of continued pretraining, contrastive RL, NCU metric conditioning, and retrieval augmentation; analyze interactions and necessity of each piece.
- Training efficiency and cost: disclose RL episodes per configuration, wall-clock time, compute budget, and sample efficiency; study scaling laws with the number of configurations and architectures.
- Reward shaping sensitivity: analyze the effects of code-length and deviation penalties on performance; ensure brevity bias does not suppress necessary optimizations; provide tuning guidelines.
- Profiling-in-the-loop overhead: detail how NCU metrics are collected (frequency, subset), instrumentation cost, and whether profiling perturbs training/evaluation; consider lightweight proxies.
- Benchmark fairness and coverage: verify optimal cuBLAS/cuBLASLt settings (workspace, math modes; see the baseline-call sketch after this list); add baselines from tuned CUTLASS kernels, Triton, TVM/Ansor, and recent RL/LLM kernel generators (e.g., AI CUDA Engineer); include Grouped GEMM.
- Thermal/power rigor: control and report power states, clocks, ECC/MIG, ambient; provide confidence intervals and statistical significance; measure energy efficiency (performance per watt).
- Resource/portability constraints: report shared memory/register usage and occupancy; assess portability to GPUs with different SM/shared memory limits; add launch guardrails and fallbacks.
- Data movement accounting: include costs for transposes/memory-format conversions required by each layout in end-to-end measurements.
- Maintainability and safety: assess readability and long-term stability of generated code under CUDA/toolchain upgrades; incorporate static analysis, unit tests, and sanitizers into the pipeline.
- Beyond matmul: evaluate transferability of the RL pipeline to other production-critical kernels (attention, softmax, layernorm, activations, reductions) at scale.
- Deployment/versioning: define how architecture-specific kernels are versioned, validated, and updated across driver/CUDA changes under a unified interface.
- Max(CUDA-L2, baseline) overhead: quantify runtime probing/selection cost and cold-start impact; propose lightweight predictors to avoid measurement-based selection on the critical path.
- Reproducibility: release the full benchmarking harness, exact environment specs (driver/CUDA/NCCL, clocks, ECC), random seeds, and per-configuration results with confidence intervals.
- Novelty validation: distinguish “discovered” techniques (e.g., staggered prefetch, ping-pong buffering) from existing CUTLASS/CuTe patterns; provide citations or empirical novelty analyses.
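To make "math modes" and the cuBLAS baseline configuration concrete, here is a minimal sketch of how an FP16 Tensor Core GEMM is typically invoked through cublasGemmEx. It is illustrative only: handle creation, workspace configuration, error checking, and the TN variant are omitted, and the choice of FP16 accumulation here is an assumption (FP32 accumulation via CUBLAS_COMPUTE_32F is equally valid for FP16 inputs).

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B for FP16 inputs with Tensor Cores enabled. cuBLAS is
// column-major, so this corresponds to the NN layout with an m x k A,
// a k x n B, and an m x n C (error checking omitted for brevity).
void hgemm_cublas_nn(cublasHandle_t handle,
                     const __half* A, const __half* B, __half* C,
                     int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,        // NN; a TN baseline would pass CUBLAS_OP_T for A
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,                // lda = m (column-major)
                 B, CUDA_R_16F, k,                // ldb = k
                 &beta,
                 C, CUDA_R_16F, m,                // ldc = m
                 CUBLAS_COMPUTE_16F,              // FP16 accumulate (assumption; FP32 also valid)
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // math mode enabling Ampere FP16 Tensor Cores
}
```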
Glossary
- A100: NVIDIA data center GPU model from the Ampere generation, commonly used for AI workloads. "While the current version of CUDA-L2 only focuses on A100 GPUs"
- Ada Lovelace: NVIDIA GPU architecture generation (e.g., RTX 4090) succeeding Ampere. "including Ada Lovelace, Hopper and Blackwell."
- Ampere: NVIDIA GPU architecture generation (e.g., A100) with FP16 Tensor Cores. "with the CUBLAS_GEMM_DEFAULT_TENSOR_OP operation to enable Ampere FP16 Tensor Cores"
- bank conflict: A shared memory access hazard where multiple threads target the same memory bank, causing serialization. "Shared memory with bank conflict avoidance, which uses a swizzle pattern to reorganize data layout, preventing conflicts from multiple threads accessing the same memory bank."
- Baseline-Bounded Deviation: A correctness criterion that bounds numerical error by the maximum deviation among trusted baselines. "Baseline-Bounded Deviation"
- Blackwell: NVIDIA GPU architecture generation (e.g., B200) succeeding Hopper. "Blackwell (e.g., B200)."
- block swizzle: A thread-block scheduling/reordering technique to improve L2 cache locality and hit rates. "Block swizzle is parameterized by swizzle_stride, which determines the stride pattern used to reorder block indices."
- CUBLAS_GEMM_DEFAULT_TENSOR_OP: cuBLAS operation mode that enables Tensor Core usage for GEMM. "with the CUBLAS_GEMM_DEFAULT_TENSOR_OP operation to enable Ampere FP16 Tensor Cores"
- compute-sanitizer: NVIDIA tool to detect GPU memory errors and other issues. "compute-sanitizer --tool memcheck is used to check for memory access violations."
- cublasGemmEx: cuBLAS API for GEMM supporting various data types and Tensor Core configurations. "We use the cublasGemmEx function offered by NVIDIA’s cuBLAS library"
- cublasLtMatmulAlgoGetHeuristic: cuBLASLt API that returns heuristic-ranked GEMM algorithm candidates. "uses NVIDIA's cublasLtMatmulAlgoGetHeuristic API"
- CUDA: NVIDIA’s parallel computing platform and programming model for GPUs. "CUDA-L2, a system that combines LLMs and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels."
- cuBLAS: NVIDIA’s high-performance GPU-accelerated BLAS library for linear algebra operations. "NVIDIA’s cuBLAS library provides a strong optimized baseline"
- cuBLASLt: A lower-level, tunable interface to NVIDIA’s GEMM implementations, exposing algorithm choices. "cuBLASLt provides a lower-level, more controllable interface to NVIDIA’s optimized GEMM kernels"
- cuBLASLt-AutoTuning: Exhaustive selection process in cuBLASLt to choose the fastest GEMM algorithm among many candidates. "and +11.4% over the most competitive cuBLASLt-AutoTuning model"
- cuBLASLt-heuristic: cuBLASLt setup that picks the top heuristic-recommended GEMM algorithm. "+16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion"
- CUTLASS: NVIDIA’s CUDA templates for GEMM and related operations, used to build high-performance kernels. "such as newer CUTLASS versions"
- CuTe: CUTLASS’s CuTe library providing abstractions for tiled MMA and pipeline operations. "the CuTe library"
- epilogue: Final kernel phase that writes accumulated results from registers back to memory. "Finally, in the epilogue phase, the accumulated results are written back from registers to shared memory, and then from shared memory to global memory."
- FFN: Feed-Forward Network layers in transformer architectures. "used in attention and FFN layers of widely open-sourced models like Qwen, Llama and DeepSeek."
- FP16: 16-bit floating-point precision (“half”), often accelerated by Tensor Cores. "accumulator precision, where using FP16 versus FP32 accumulators (both valid choices for FP16 inputs) leads to different register pressure"
- FP32: 32-bit floating-point precision used for higher numerical accuracy. "accumulator precision, where using FP16 versus FP32 accumulators (both valid choices for FP16 inputs) leads to different register pressure"
- GEMM: General Matrix Multiply, core linear algebra operation used in deep learning. "{\it cuBLASLt} provides a lower-level, more controllable interface to NVIDIA’s optimized GEMM kernels"
- GRPO: Group Relative Policy Optimization, a reinforcement learning method used here for updating LLM parameters. "GRPO [guo2025deepseek, shao2024deepseekmath] is adopted for LLM parameter updates."
- Grouped GEMM: Batch of smaller GEMMs executed together for throughput improvements. "This is evidenced by the 13% speedup from targeted Grouped GEMM optimizations for the deepseek-R1 model"
- HGEMM: Half-precision GEMM specialized for FP16 inputs. "Half-precision General Matrix Multiplication (HGEMM) is one of the most widely used matmul kernels in current LLMs."
- Hopper: NVIDIA GPU architecture generation (e.g., H100). "Hopper (e.g., H100)"
- L2 cache: On-GPU cache level used to buffer global memory data and improve access locality. "which improves L2 cache hit rates by reordering thread block execution"
- MLPerf: Industry-standard benchmark suite for measuring machine learning performance. "Following MLPerf [mattson2020mlperf, reddi2020mlperf], a widely-used benchmark criteria for machine learning performance"
- NCU (NVIDIA Nsight Compute): NVIDIA profiler providing detailed kernel metrics like occupancy and throughput. "including more comprehensive NCU (NVIDIA Nsight Compute) profiling metrics (e.g., memory throughput, SM occupancy, cache efficiency)"
- NN layout: GEMM layout where both A and B are in “normal” (row-major) form. "Comparing TN (transposed-normal) and NN (normal-normal) layouts, the NN layout slightly outperforms TN across all libraries."
- offline mode: Benchmark scenario with back-to-back kernel execution without pauses. "(a) Offline scenario: kernel executed consecutively without time intervals."
- PTX: NVIDIA’s intermediate assembly language for CUDA kernels. "CUDA C/C++, CuTe, inline PTX assembly, CUDA intrinsics, and CUTLASS templates can be used"
- register pressure: Resource constraint when kernels use many registers, potentially reducing occupancy and performance. "leads to different register pressure and thus different optimization strategies."
- reinforcement learning (RL): Learning paradigm optimizing kernel performance using rewards (e.g., speed). "we propose CUDA-L2, a system that combines LLMs and reinforcement learning (RL) to automatically optimize matmul CUDA kernels."
- retrieval-augmented context: Technique to enrich model inputs with external documents or code for better decisions. "incorporating retrieval-augmented context to accommodate new knowledge or architectural characteristics not covered in the foundation model."
- server mode: Benchmark scenario with random intervals simulating real-time inference. "(b) Server scenario: kernel executed at intervals, simulating real-time inference."
- shared memory: On-chip memory used to stage tiles and reduce global memory accesses. "Tiles of size BM × BK from A and BK × BN from B are loaded from global memory into shared memory"
- SM occupancy: Fraction of active warps per Streaming Multiprocessor indicating execution parallelism. "including more comprehensive NCU (NVIDIA Nsight Compute) profiling metrics (e.g., memory throughput, SM occupancy, cache efficiency)"
- swizzle pattern: Address permutation strategy to avoid shared memory bank conflicts or improve cache locality. "Shared memory with bank conflict avoidance, which uses a swizzle pattern to reorganize data layout"
- tensor cores: Specialized GPU units for fast matrix multiply-accumulate on low-precision data. "Once in registers, tensor cores perform the actual matrix multiply-accumulate operations."
- torch.matmul: PyTorch high-level matrix multiplication API that dispatches to backend libraries. "PyTorch's torch.matmul naturally constitutes a baseline."
- Triton: Python-based DSL for writing GPU kernels. "Therefore, CUDA C/C++, CuTe, inline PTX assembly, CUDA intrinsics, and CUTLASS templates can be used, but not Python-based DSLs like Triton."
- WMMA: Warp Matrix Multiply Accumulate intrinsics for Tensor Core operations (a minimal WMMA sketch appears after this glossary). "For smaller matrices, it generates lightweight kernels directly using raw WMMA."
- warp occupancy: Measure of how many warps are active on an SM, affecting throughput. "such as memory throughput, compute utilization, warp occupancy, and cache hit rates"
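To make several of the kernel-level terms above concrete (tensor cores, WMMA, shared memory staging, the epilogue), here is a deliberately minimal single-warp WMMA sketch. It is illustrative only, not the paper's kernel: it computes just one 16x16 output tile, loads directly from global memory instead of staging tiles in shared memory, and omits the swizzling, prefetching, and multi-stage pipelining that tuned HGEMM kernels rely on.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes the top-left 16x16 tile of C = A * B, with row-major
// FP16 A (16 x K, leading dimension K), B (K x N, leading dimension N),
// and C (leading dimension N); K must be a multiple of 16. Tuned HGEMM
// kernels instead stage BM x BK / BK x BN tiles in shared memory with a
// swizzled (bank-conflict-free) layout, prefetch into double buffers, and
// finish with an epilogue that moves accumulators back through shared
// memory to global memory; all of that is omitted here.
__global__ void wmma_16x16_tile(const __half* A, const __half* B, __half* C,
                                int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));    // FP16 accumulator

    for (int kk = 0; kk < K; kk += 16) {
        wmma::load_matrix_sync(a_frag, A + kk, K);       // rows 0..15, cols kk..kk+15
        wmma::load_matrix_sync(b_frag, B + kk * N, N);   // rows kk..kk+15, cols 0..15
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor core multiply-accumulate
    }

    // Epilogue (simplified): write the accumulator fragment straight to global memory.
    wmma::store_matrix_sync(C, c_frag, N, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. wmma_16x16_tile<<<1, 32>>>(A, B, C, N, K), this covers only one output tile; real kernels parallelize over many such tiles per thread block and per grid.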
Practical Applications
Applications of CUDA-L2: Practical, Real-World Uses
Below are the practical applications derived from the paper’s findings, methods, and innovations. They are grouped into Immediate Applications (deployable now) and Long-Term Applications (requiring additional research, scaling, or development). Each item includes sectors, likely tools or products, and assumptions or dependencies that may affect feasibility.
Immediate Applications
- Drop-in acceleration of LLM inference on A100-based services — Replace existing HGEMM paths with CUDA-L2 kernels to reduce latency and increase throughput for attention and FFN layers commonly used in Qwen, Llama, DeepSeek.
- Sectors: software/cloud, AI infrastructure
- Tools/workflows: PyTorch custom ops, TensorRT plugin, ONNX Runtime EP, dynamic runtime selection “max(CUDA-L2, baseline)” for each (M,N,K)
- Assumptions/dependencies: FP16 HGEMM; A100 GPUs; target shapes among the 1,000 MNK configurations; integration and correctness validation in the hosting framework
- Lower compute costs and energy consumption for AI serving — Use server-mode speedups (+15.9% vs cuBLASLt-AutoTuning, +28.7% vs torch.matmul) to cut GPU-hours and power bills for real-time inference endpoints.
- Sectors: energy, finance, cloud operations
- Tools/workflows: cost/performance dashboards, power telemetry, autoscaling policies adjusted to improved throughput
- Assumptions/dependencies: workloads dominated by matmul; benefits greatest for small-to-mid matrix sizes; datacenter uses A100
- High-availability, low-latency consumer AI experiences — Faster chatbots, search assistants, coding copilots, and customer support agents due to better server-mode matmul performance under sporadic request arrivals.
- Sectors: consumer software, e-commerce, customer service
- Tools/workflows: model serving platforms (NVIDIA Triton Inference Server, Ray Serve, SageMaker endpoints), traffic-aware kernel selection
- Assumptions/dependencies: end-to-end latency is matmul-bound; integration preserves numerical tolerances required by application
- Throughput boosts for batch jobs and micro-batch inference — In offline mode, exploit +11–22% average gains to process more requests or larger contexts per unit time.
- Sectors: cloud batch processing, content moderation, data labeling
- Tools/workflows: pipeline schedulers (Airflow, Argo), batch drivers tuned for CUDA-L2 kernels
- Assumptions/dependencies: batch jobs utilize FP16 matmul; shapes align with optimized configurations; diminishing gains for very large matrices
- On-prem AI acceleration for robotics and industrial automation — Improve perception and planning model inference on A100 servers in factories and labs.
- Sectors: robotics, manufacturing, logistics
- Tools/workflows: ROS-based AI nodes, real-time inference runtimes, CUDA-L2-backed HGEMM ops in perception stacks
- Assumptions/dependencies: precision requirements aligned with FP16; deterministic correctness bounds acceptable
- “Max-of-portfolio” kernel selection in production runtimes — Implement an automated fallback to whichever kernel (CUDA-L2 or cuBLAS/cuBLASLt) is fastest per MNK triplet, with caching (see the selector sketch after this item).
- Sectors: software, AI infrastructure
- Tools/workflows: heuristic selector, caching of per-shape winners, startup warmup to precompute winners
- Assumptions/dependencies: shape distribution is stable or cache refreshed; selector overhead negligible
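A minimal sketch of such a per-shape selector, assuming two hypothetical launchers (launch_cuda_l2, launch_cublas_baseline) that each enqueue one HGEMM for a given (M, N, K) on the default stream; probe counts, warmup, and thread-safety are simplified relative to what a production runtime would need.

```cpp
#include <cuda_runtime.h>
#include <map>
#include <tuple>

// Hypothetical launchers: each enqueues one HGEMM for the given shape.
void launch_cuda_l2(int M, int N, int K);
void launch_cublas_baseline(int M, int N, int K);

using Shape = std::tuple<int, int, int>;
static std::map<Shape, bool> g_use_cuda_l2;   // cached winner per (M, N, K)

static float time_launch(void (*launch)(int, int, int), int M, int N, int K) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch(M, N, K);                               // simple warmup
    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i) launch(M, N, K);  // small probe batch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// max(CUDA-L2, baseline): probe both once per shape, cache the winner, and
// dispatch from the cache afterwards so probing stays off the steady-state path.
void hgemm_dispatch(int M, int N, int K) {
    Shape s{M, N, K};
    auto it = g_use_cuda_l2.find(s);
    if (it == g_use_cuda_l2.end()) {
        float t_l2   = time_launch(launch_cuda_l2, M, N, K);
        float t_base = time_launch(launch_cublas_baseline, M, N, K);
        it = g_use_cuda_l2.emplace(s, t_l2 <= t_base).first;
    }
    it->second ? launch_cuda_l2(M, N, K) : launch_cublas_baseline(M, N, K);
}
```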
- CI-driven performance engineering — Integrate CUDA-L2’s evaluation harness (Nsight Compute metrics + correctness tests) into CI to catch regressions and auto-tune new shapes seen in production traces.
- Sectors: software engineering, MLOps
- Tools/workflows: Nsight Compute integration, speedup/variance gates, correctness checks (binary-input exact match and baseline-bounded deviation; a sketch of these gates follows this item)
- Assumptions/dependencies: access to representative traces; CI runners with A100; tolerance thresholds defined per product
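A sketch of the two correctness gates named above, operating on host-side float copies of the kernel outputs; the exact input construction, set of trusted baselines, and thresholds used by the paper may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Gate 1: exact match on specially constructed inputs. When the inputs are
// chosen so every product and partial sum is exactly representable in the
// accumulation precision (e.g., small binary {0, 1} matrices), the result is
// independent of accumulation order and the candidate must match the
// reference bit-for-bit after conversion to float.
bool exact_match(const std::vector<float>& candidate,
                 const std::vector<float>& reference) {
    for (size_t i = 0; i < candidate.size(); ++i)
        if (candidate[i] != reference[i]) return false;
    return true;
}

// Gate 2: baseline-bounded deviation. The candidate's deviation from a
// high-precision reference (e.g., an FP32-accumulated GEMM) must not exceed
// the largest deviation observed among trusted baselines (cuBLAS, cuBLASLt,
// torch.matmul), optionally with a small slack factor.
bool baseline_bounded(const std::vector<float>& candidate,
                      const std::vector<float>& reference,
                      const std::vector<std::vector<float>>& baselines,
                      float slack = 1.0f) {
    float max_baseline_dev = 0.f, candidate_dev = 0.f;
    for (const auto& b : baselines)
        for (size_t i = 0; i < b.size(); ++i)
            max_baseline_dev = std::max(max_baseline_dev,
                                        std::fabs(b[i] - reference[i]));
    for (size_t i = 0; i < candidate.size(); ++i)
        candidate_dev = std::max(candidate_dev,
                                 std::fabs(candidate[i] - reference[i]));
    return candidate_dev <= slack * max_baseline_dev;
}
```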
- Academic teaching and reproducible systems research — Use the open repository to teach LLM-guided RL for systems optimization, including timing safeguards and profiling-driven context.
- Sectors: academia
- Tools/workflows: course labs, research projects benchmarking CUDA-L2 vs. cuBLAS/cuBLASLt; GRPO training demonstrations
- Assumptions/dependencies: A100 access; curricula covering CUDA/CuTe/CUTLASS; safe sandboxing for kernel generation/execution
- Vendor-neutral benchmarking augmentation — Emulate MLPerf-like offline/server distinctions and correctness gates to evaluate matmul kernels in realistic scenarios.
- Sectors: benchmarking, standards organizations
- Tools/workflows: offline vs server test harness (a timing sketch follows this item), random-order execution, warmup behavior capture; public reporting of win rates and speedup distributions
- Assumptions/dependencies: agreement on methodologies; hardware availability; governance for benchmark updates
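A minimal sketch of such an offline vs server harness, assuming a hypothetical run_kernel() that enqueues one HGEMM on the default stream; the gap distribution, iteration counts, and warmup policy are illustrative, and the paper's actual harness (including its anti-timing-hack checks) may differ.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <random>
#include <thread>

// Hypothetical launcher: enqueues one HGEMM (candidate or baseline) on the
// default stream for the configuration under test.
void run_kernel();

// Offline mode: back-to-back launches, timed as one batch with CUDA events.
float time_offline_ms(int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    run_kernel();                                   // simple warmup
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) run_kernel();   // no gaps between launches
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;                              // average time per launch
}

// Server mode: random idle gaps between launches simulate sporadic requests;
// each launch is timed individually so the gaps are excluded from kernel time.
float time_server_ms(int iters) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> gap_us(50, 500);  // illustrative gap range
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float total_ms = 0.f;
    for (int i = 0; i < iters; ++i) {
        std::this_thread::sleep_for(std::chrono::microseconds(gap_us(rng)));
        cudaEventRecord(start);
        run_kernel();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}
```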
- Library contributions and kernel portfolio hardening — Feed discovered techniques (e.g., double-buffered fragments, staggered prefetch, direct wide epilogue copies, block swizzle choices) back into CuTe/CUTLASS and internal codebases.
- Sectors: software tooling, open-source ecosystems
- Tools/workflows: PRs to CUTLASS/CuTe, internal kernel repos with per-architecture variants
- Assumptions/dependencies: maintainers accept contributions; methods validated across edge cases and numerics
Long-Term Applications
- Cross-architecture portability and scale-out — Extend CUDA-L2 to Ada, Hopper, Blackwell, and consumer Ampere; maintain per-architecture RL tuning with retrieval-augmented contexts reflecting hardware characteristics.
- Sectors: software/cloud, semiconductor ecosystems
- Tools/workflows: multi-arch build/test pipelines, per-GPU kernel catalogs, automated feature flagging
- Assumptions/dependencies: access to diverse GPUs; architecture-specific instruction sets and memory hierarchies; reward functions adapted per arch
- Generalization beyond HGEMM — Apply the LLM+RL pipeline to other performance-critical ops (BF16 GEMM, INT8 GEMM, convolutions, attention kernels, reductions, embeddings).
- Sectors: deep learning frameworks, HPC
- Tools/workflows: operator registries, per-operator RL training datasets, cross-op correctness suites
- Assumptions/dependencies: reference implementations; precision/quantization nuances; larger search spaces and reward design
- Online, adaptive auto-tuning in production — Continuously learn and update kernels from live traffic (shape distributions, cache temperature, power/thermal telemetry) and deploy improvements safely with canarying.
- Sectors: cloud, MLOps
- Tools/workflows: online RL or bandits, canary rollout, rollback triggers, telemetry-driven reward shaping
- Assumptions/dependencies: safe exploration budgets; robust anti-hacking timing measures; governance for correctness regressions
- Compiler and framework integration — Embed RL-discovered schedules into nvcc, CUTLASS/CuTe templates, or PyTorch JIT passes; co-design with autotuners to pick schedules at compile-time or install-time.
- Sectors: compilers, ML frameworks
- Tools/workflows: schedule databases, intermediate representation (IR) hooks, installer-time shape probing
- Assumptions/dependencies: compiler extensibility; IR stability; license and IP considerations
- Energy-aware kernel selection and carbon budgeting — Choose kernels optimizing performance-per-watt or meeting power caps; integrate with datacenter sustainability targets and carbon reporting.
- Sectors: energy, policy, cloud sustainability
- Tools/workflows: power telemetry, per-kernel energy models, SLOs blending latency and energy
- Assumptions/dependencies: accurate power sensing; standardized reporting; multi-objective optimization policies
- Formal verification and safety envelopes for numerics — Strengthen correctness with formal methods (beyond baseline-bounded deviation) for healthcare imaging, autonomous driving, and finance risk models.
- Sectors: healthcare, automotive, finance
- Tools/workflows: SMT-based checks for floating-point properties, tolerance-aware acceptance tests, regulatory audit trails
- Assumptions/dependencies: domain-specific correctness definitions; regulator acceptance of floating-point variability bounds
- Hardware–software co-design feedback loops — Use CUDA-L2’s learned strategies (e.g., bank conflict swizzles, multi-stage pipelines) to inform future GPU microarchitectural features and memory subsystem designs.
- Sectors: semiconductor, systems research
- Tools/workflows: simulation harnesses, joint design reviews, synthetic kernels emphasizing bottlenecks
- Assumptions/dependencies: vendor collaboration; non-disclosure constraints; alignment with silicon roadmaps
- Distributed and grouped GEMM optimization — Extend tuning to multi-GPU and grouped GEMM scenarios (e.g., tensor-parallel attention blocks) to maximize speedups across pipeline/model-parallel topologies.
- Sectors: large-scale training/inference
- Tools/workflows: NCCL-aware scheduling, collective-optimized matmul variants, topology-aware swizzle/prefetch strategies
- Assumptions/dependencies: interconnect characteristics (NVLink/PCIe), synchronization overheads, grouped GEMM shape distributions
- Standards and procurement guidelines — Encourage adoption of server-mode benchmarks and anti-timing-hack practices in industry standards; guide procurement to prioritize systems with automated kernel optimization.
- Sectors: policy, enterprise IT
- Tools/workflows: updated benchmark specs, RFP criteria including automated optimization capabilities
- Assumptions/dependencies: consensus in standards bodies; clear vendor disclosures; measurable compliance
- Consumer benefits via cost pass-through — As infra costs drop, offer lower-priced AI subscriptions or higher quotas; enable richer real-time features in consumer apps.
- Sectors: consumer software, education
- Tools/workflows: pricing models reflecting lower GPU-hours, feature rollouts leveraging lower latency
- Assumptions/dependencies: cloud providers pass savings downstream; user demand aligns with latency-sensitive features
Notes on feasibility across all applications:
- Speedups are largest for small-to-mid matrix sizes; gains diminish for very large matrices that already saturate GPU compute.
- Current implementation targets FP16 HGEMM on A100; porting to other precisions (BF16/FP32) and architectures requires additional engineering.
- Integration must maintain numerical correctness within application-tolerated bounds; formal guarantees are advisable in safety-critical domains.
- Reward and timing safeguards (e.g., disallowing extra CUDA streams, restricting generated kernels to .cu sources, and enforcing proper synchronization before timing) should be preserved in any extended optimization pipeline.