
CUDA-L2: Optimizing Half-Precision GEMM Kernels

Updated 15 December 2025
  • The CUDA-L2 system is a framework that optimizes HGEMM operations by integrating reinforcement learning, compiler search algorithms, and analytic code generation across diverse architectures.
  • It leverages a multidimensional design space involving tiling factors, loop configurations, and hardware constraints to maximize throughput and energy efficiency.
  • Empirical studies show that CUDA-L2 outperforms vendor-tuned libraries with up to 29% throughput improvements and significant energy savings in various deployment scenarios.

General Matrix-Matrix Multiplication (GEMM) in half-precision format, referred to as HGEMM, is a core computational kernel ubiquitous across deep learning, scientific computing, and high-performance signal processing. Optimizing HGEMM spans diverse targets, including heterogeneous FPGAs, embedded GPUs, and modern compiler toolchains, and demands rigorous treatment of throughput, energy efficiency, communication optimality, and automated kernel generation. This entry synthesizes the theoretical formulation, practical search strategies, and empirical findings from recent research, spanning machine learning-guided design space exploration, RL-driven kernel synthesis, analytic code generation, compiler IR scheduling, and hypergraph partitioning for sparse multiplication.

1. Formal Definition and Optimization Objectives

HGEMM executes the matrix operation

$$C = \alpha AB + \beta C, \quad \text{with} \quad A \in \mathbb{R}^{M \times K}, \; B \in \mathbb{R}^{K \times N}, \; C \in \mathbb{R}^{M \times N}$$

where $\alpha=1$ and $\beta=0$ in most optimization scenarios, yielding $C=AB$. The optimization problem seeks to maximize throughput (TFLOPS) and, on relevant hardware, energy efficiency (GFLOPS/W), subject to hardware-specific constraints. On heterogeneous platforms like Versal ACAP, the objective involves finding resource mappings $\theta$ that optimize performance

$$P(\theta) = \frac{2MNK}{T(\theta)}$$

and energy efficiency

$$E(\theta) = \frac{P(\theta)}{\text{Power}(\theta)}$$

where $T(\theta)$ is latency and $\text{Power}(\theta)$ is average power consumption for mapping $\theta$ (Papalamprou et al., 10 Nov 2025). In multi-objective settings, the Pareto frontier in $(P, E)$ is constructed, enabling tunable trade-offs.

On GPU platforms, kernel optimization aims to minimize runtime $t_\text{custom}$ across broad configuration spaces, maximizing

$$\text{TFLOPS} = \frac{2MNK}{10^{12}\, t_\text{custom}}$$

subject to architectural and numerical constraints (Su et al., 2 Dec 2025). For sparse variants, communication volume and processor-load balance become dominant objectives (Ballard et al., 2016).
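
As a concrete illustration of these objectives, the following minimal Python sketch computes the throughput, energy-efficiency, and TFLOPS metrics from a measured latency and power draw; the function names, units, and example numbers are illustrative assumptions, not values from the cited work.

```python
# Minimal sketch of the optimization metrics defined above.
# Latency is assumed to be in seconds and power in watts; names and
# example numbers are illustrative, not taken from the cited frameworks.

def throughput_flops(M: int, N: int, K: int, latency_s: float) -> float:
    """P(theta) = 2*M*N*K / T(theta), in FLOP/s."""
    return 2.0 * M * N * K / latency_s

def energy_efficiency(M: int, N: int, K: int,
                      latency_s: float, power_w: float) -> float:
    """E(theta) = P(theta) / Power(theta), in FLOP/s per watt."""
    return throughput_flops(M, N, K, latency_s) / power_w

def tflops(M: int, N: int, K: int, latency_s: float) -> float:
    """Throughput in TFLOPS, as used for GPU kernel comparisons."""
    return throughput_flops(M, N, K, latency_s) / 1e12

# Example: a hypothetical 4096^3 HGEMM finishing in 0.9 ms at 300 W.
if __name__ == "__main__":
    M = N = K = 4096
    t, p = 0.9e-3, 300.0
    print(f"{tflops(M, N, K, t):.1f} TFLOPS, "
          f"{energy_efficiency(M, N, K, t, p) / 1e9:.1f} GFLOPS/W")
```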

2. Design Space: Variables and Physical Constraints

Optimization is parameterized by a multidimensional design space:

  • Tiling factors ($P_d$, $B_d$): parallelism and buffer tile sizes along the GEMM axes, controlling the assignment of sub-tiles to compute engines (Papalamprou et al., 10 Nov 2025, Su et al., 2 Dec 2025).
  • Loop nest configuration: tile dimensions $(\tau_m, \tau_k, \tau_n)$, loop permutations, unroll factors $(u_m, u_k, u_n)$, and vectorization widths $(v_m, v_k, v_n)$ (Zhang et al., 2019); a toy enumeration of such a design space is sketched after this list.
  • Kernel parameters on GPU: WMMA abstraction, shared-memory swizzle, pipelining depth, layout choice, register buffering, zero-padding for irregular tile sizes, thread-block scheduling order, and prefetch strategies (Su et al., 2 Dec 2025).
  • Resource constraints: on-chip buffer capacity, SIMD register count, memory bandwidth, and critical path instruction throughput (Veras et al., 2016, Papalamprou et al., 10 Nov 2025, Zhang et al., 2019).
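
For concreteness, the sketch below enumerates a heavily pruned, hypothetical version of such a design space in Python; the parameter names and candidate values are placeholders and are far smaller than the spaces explored in the cited work.

```python
# Hypothetical, heavily pruned HGEMM design space for illustration only;
# real frameworks explore far larger, hardware-specific spaces.
from itertools import product
from typing import NamedTuple

class KernelConfig(NamedTuple):
    tile_m: int        # tau_m: output tile rows
    tile_n: int        # tau_n: output tile columns
    tile_k: int        # tau_k: reduction tile depth
    unroll_k: int      # u_k: inner-loop unroll factor
    vector_width: int  # SIMD/vector width along the n axis

def enumerate_configs():
    """Yield every combination of the (toy) candidate parameter values."""
    for tile_m, tile_n, tile_k, unroll_k, vec in product(
        (64, 128, 256),   # tile_m candidates
        (64, 128, 256),   # tile_n candidates
        (16, 32, 64),     # tile_k candidates
        (1, 2, 4),        # unroll_k candidates
        (4, 8, 16),       # vector_width candidates
    ):
        yield KernelConfig(tile_m, tile_n, tile_k, unroll_k, vec)

print(sum(1 for _ in enumerate_configs()), "candidate configurations")  # 243
```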

Physical constraints include:

  • Buffer size: For the Versal PL, $(B_M B_K + B_K B_N + B_M B_N)$ tile elements times the datawidth must fit in $M_\text{PL}$ (a feasibility sketch follows this list).
  • Scratchpad: For AIE, 12 KB per tile (below 32 KB/AIE) (Papalamprou et al., 10 Nov 2025).
  • Register and pipeline limits for CPU SIMD inner-kernels (Veras et al., 2016).
  • Memory bandwidth and communication bounds for sparse operations represented as hypergraph partitioning (Ballard et al., 2016).
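
The buffer-size constraint above translates directly into a feasibility predicate. The sketch below is a simplified check assuming a uniform 2-byte (half-precision) datawidth and a hypothetical on-chip capacity; it is not the exact rule used by the cited framework.

```python
# Simplified feasibility check for the on-chip buffer constraint above.
# DATAWIDTH_BYTES and PL_CAPACITY_BYTES are assumed example values.
DATAWIDTH_BYTES = 2         # uniform half-precision (FP16) storage
PL_CAPACITY_BYTES = 2**20   # assumed on-chip buffer budget M_PL (1 MiB)

def fits_in_pl(B_M: int, B_K: int, B_N: int,
               capacity_bytes: int = PL_CAPACITY_BYTES) -> bool:
    """A-tile + B-tile + C-tile footprints must fit in the PL buffer."""
    elements = B_M * B_K + B_K * B_N + B_M * B_N
    return elements * DATAWIDTH_BYTES <= capacity_bytes

# Keep only buffer-feasible tilings from a candidate list.
candidates = [(256, 64, 256), (512, 128, 512), (1024, 256, 1024)]
print([t for t in candidates if fits_in_pl(*t)])  # the last tiling is rejected
```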

3. Search and Optimization Algorithms

State-of-the-art methods apply data-driven and analytical approaches:

  • Machine Learning Guided DSE: XGBoost models trained on thousands of on-board measurements predict latency, power, and resource usage for every $(P_d, B_d)$ tuple. These predictions inform a rapid sweep over candidate mappings, from which Pareto-optimal points are extracted (Papalamprou et al., 10 Nov 2025).
  • RL with LLM Guidance: CUDA-L2 couples RL policy optimization (GRPO) with LLM-based code generation. States comprise problem dimensions and telemetry (e.g., NCU profiling metrics); actions are complete kernel parameterizations; rewards combine speedup, numerical correctness, and code brevity (see the reward sketch after the summary table below). Contrastive prompting and multi-stage RL enable effective exploration with fewer than 1000 configurations (Su et al., 2 Dec 2025).
  • Analytic Code Generation: Micro-kernel generation via outer-product decomposition, instruction mix and pipeline modeling. The optimal tile shapes and unroll schedule are derived analytically per register/pipeline constraints, yielding SIMD code competitive with manual expert-tuned kernels (Veras et al., 2016).
  • Compiler-level Search (G-BFS, N-A2C): Lightweight heuristic-guided best-first search and neighborhood RL (A2C) identify optimal scheduling for HGEMM at the compiler IR level. Cost models factor in FLOPS, data movement, loop overhead, and vectorization efficiency, yielding rapid convergence with minimal search (Zhang et al., 2019).
  • Hypergraph Partitioning for SpGEMM: For sparse HGEMM, fine- and coarse-grained hypergraphs encode all multiplies and their data dependencies. Partitioning these models under balance tolerances minimizes per-processor (or per-memory-level) communication, matching theoretical lower bounds (Ballard et al., 2016).

| Method | Target HW | Optimized Objective |
|---|---|---|
| ML-guided DSE | Versal ACAP | GFLOPS, GFLOPS/W |
| RL+LLM | Nvidia CUDA | Throughput, code size |
| Analytic Gen | CPU SIMD | Cycle-accurate GFLOPS |
| G-BFS/N-A2C | TVM (GPU/CPU) | Compile-time latency |
| Hypergraph Part | Multi-core | Comm. volume, balance |
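
To make the reward structure concrete, the following Python sketch shows one plausible way to combine speedup, numerical correctness, and code brevity into a scalar reward. The weights, tolerance, and functional form are assumptions for illustration, not the actual reward definition of CUDA-L2.

```python
# Illustrative reward shaping for RL-driven kernel synthesis.
# The outright penalty, tolerance, and brevity weight are assumed values,
# not the reward actually used by CUDA-L2.

def kernel_reward(t_baseline_s: float, t_custom_s: float,
                  max_abs_error: float, code_lines: int,
                  error_tol: float = 1e-2,
                  brevity_weight: float = 1e-3) -> float:
    """Combine speedup, numerical correctness, and code brevity."""
    if max_abs_error > error_tol:
        return -1.0                          # incorrect kernels are penalized outright
    speedup = t_baseline_s / t_custom_s      # > 1 means faster than the baseline
    return speedup - brevity_weight * code_lines

# Example: a candidate 1.18x faster than the baseline, 240 lines long.
print(kernel_reward(t_baseline_s=1.00e-3, t_custom_s=0.85e-3,
                    max_abs_error=3e-3, code_lines=240))
```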

4. Empirical Results and Comparative Performance

Research demonstrates substantial gains over existing autotuned and manually crafted baselines:

  • Versal ACAP: ML-guided mapping yields 1.23× (up to 2.5×) throughput and 1.25× (up to 2.7×) energy efficiency versus analytical frameworks (CHARM, ARIES), and outperforms Jetson Orin by up to 2.3×/2.0× on compute-bound shapes (Papalamprou et al., 10 Nov 2025).
  • CUDA-L2: RL+LLM auto-synthesis surpasses torch.matmul and cuBLAS/cuBLASLt-AutoTuning by 11–22% average throughput across 1000 configurations in offline mode, and by 15–29% in server (inference) mode, indicating that RL-driven LLMs can supersede human heuristics (Su et al., 2 Dec 2025).
  • Compiler-level: G-BFS and N-A2C cut HGEMM runtime by 24%–40% over XGBoost/RNN auto-tuners while traversing only 0.1% of the search space, with the lowest variance and search time per schedule (Zhang et al., 2019).
  • Analytic Generation: On modern CPUs, analytic code generation approaches expert-tuned OpenBLAS performance (within 2%–5%), exceeding ATLAS and scaling efficiently (Veras et al., 2016).
  • Sparse Multiplication: Fine-grained or monochrome-2D partitioning attains near-minimal communication for irregular sparsity patterns (Markov clustering, AMG, normal equations), with some 1D models up to 20–80× suboptimal (Ballard et al., 2016).

5. Theoretical Foundations and Key Constraints

  • Performance Modeling: Predictive models incorporate both raw and derived features; tile parallelism ($N_\text{AIE}$), the FLOP-per-engine ratio ($\rho$), and size-to-tile ratios ($d/P_d$, $d/B_d$) are critical for generalizing across unseen matrix shapes (Papalamprou et al., 10 Nov 2025).
  • Register and Pipeline Bounds: SIMD code requires simultaneous register allocation for accumulators, A/B tiles, and permutation temporaries. Through analytic enumeration, optimal tile shapes satisfy both throughput and register/pipeline constraints (Veras et al., 2016).
  • Cost Models: Roofline-based cost structure combines arithmetic intensity and memory bandwidth with hardware-specific tuning parameters. Additional loop-overhead and vectorization penalty terms yield more realistic compiler-level models (Zhang et al., 2019).
  • Hypergraph Lower Bounds: Communication volume minimization is equivalent to balanced partitioning in hypergraph models, and broadcast/reduce schedules approach these lower bounds up to $\log P$ factors (Ballard et al., 2016).
  • Pareto Front Extraction: The set of optimal designs balancing throughput and energy is constructed via pointwise comparison in $(P, E)$, or by scalarizing with $F(\theta; \alpha)$ for a user-specified weighting; a minimal extraction sketch follows this list (Papalamprou et al., 10 Nov 2025).
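
A minimal Python sketch of the pointwise-dominance extraction described above is given below. The convex-combination form used for the scalarization $F(\theta; \alpha)$ and the candidate numbers are assumptions for illustration, not the exact formulation of the cited work.

```python
# Pareto-front extraction over (throughput P, energy efficiency E) pairs.
# Higher is better in both objectives; candidate values are hypothetical.
from typing import Dict, List, Tuple

def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep mappings theta whose (P, E) pair is not dominated by any other."""
    front = []
    for name, (p, e) in points.items():
        dominated = any(p2 >= p and e2 >= e and (p2 > p or e2 > e)
                        for n2, (p2, e2) in points.items() if n2 != name)
        if not dominated:
            front.append(name)
    return front

def scalarize(p: float, e: float, alpha: float,
              p_max: float, e_max: float) -> float:
    """F(theta; alpha): convex combination of normalized objectives (assumed form)."""
    return alpha * (p / p_max) + (1.0 - alpha) * (e / e_max)

# Hypothetical candidate mappings: theta -> (GFLOPS, GFLOPS/W).
candidates = {"theta_a": (9500.0, 55.0), "theta_b": (8200.0, 71.0),
              "theta_c": (9100.0, 52.0), "theta_d": (7900.0, 68.0)}
print(pareto_front(candidates))  # ['theta_a', 'theta_b']
```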

6. Implementation Practices and Insights

Empirical and theoretical work distills actionable best practices:

  • Tile selection and buffering: Larger PL tiles improve data reuse; energy-optimal designs use fewer compute engines and trade minor throughput loss for substantial energy savings (Papalamprou et al., 10 Nov 2025).
  • RL Parameterization: Explicit kernel abstractions (WMMA vs. CuTe), pipelining depth, and swizzle parameters yield nontrivial performance differences; contrastive RL rewards and neighborhood-constrained exploration are decisive (Su et al., 2 Dec 2025).
  • Search space design: Restricting tile depths and vector widths to hardware capabilities, combined with asynchronous cost measurement via analytic models, enables low-overhead, high-quality compiler-based scheduling; a roofline-style cost-model sketch follows this list (Zhang et al., 2019).
  • Hypergraph model selection: Coarse-grained (1D/2D) models suffice for well-structured sparsity; irregular sparsity benefits from fine-grained partitioning, justifying higher algorithmic and computational complexity (Ballard et al., 2016).
  • Derived feature inclusion: Features like FLOP-per-engine and size-to-tile ratios are essential for accurate model extrapolation to new problem instances (Papalamprou et al., 10 Nov 2025).
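
As an illustration of the analytic cost modeling referenced above (and of the roofline-based structure described in Section 5), the sketch below estimates kernel runtime as the maximum of a compute-bound term and a bandwidth-bound term plus a loop-overhead term. The peak rates, traffic model, and overhead constant are assumptions, not the exact cost model of the cited work.

```python
# Roofline-style analytic cost estimate for a tiled HGEMM schedule.
# Peak rates, the traffic model, and the per-iteration overhead are
# assumed values, not the cited compiler-level cost model.

def hgemm_cost_estimate(M: int, N: int, K: int,
                        tile_m: int, tile_n: int, tile_k: int,
                        peak_flops: float = 100e12,   # assumed FP16 peak, FLOP/s
                        peak_bw: float = 1.5e12,      # assumed bandwidth, B/s
                        loop_overhead_s: float = 1e-9) -> float:
    """Estimate runtime as max(compute time, memory time) + loop overhead."""
    flops = 2.0 * M * N * K
    tiles = (M // tile_m) * (N // tile_n)
    # FP16 traffic: stream the A and B panels once per output tile, write C once.
    bytes_moved = tiles * (tile_m * K + K * tile_n) * 2 + M * N * 2
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    inner_iters = tiles * (K // tile_k)
    return max(compute_time, memory_time) + inner_iters * loop_overhead_s

# Compare two candidate tilings for a 4096^3 HGEMM.
for tm, tn, tk in [(128, 128, 32), (256, 128, 64)]:
    t = hgemm_cost_estimate(4096, 4096, 4096, tm, tn, tk)
    print(f"tile {tm}x{tn}x{tk}: {t * 1e3:.2f} ms")
```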

7. Contextual Significance and Future Implications

The HGEMM optimization problem encapsulates the multidisciplinary integration of numerical linear algebra, hardware-aware code generation, machine learning, and combinatorial algorithms. Recent findings indicate that ML-driven and RL+LLM frameworks systematically outperform vendor heuristics and manual kernel synthesis, especially as configuration spaces and hardware architectures scale. This suggests a paradigm shift toward hybrid analytical/data-driven optimization for critical computational kernels. For sparse variants, hypergraph-based communication modeling remains a robust foundation for balancing parallel efficiency and memory hierarchy utilization.

A plausible implication is that future HGEMM and SpGEMM optimization frameworks will adopt reinforcement learning and large-scale empirical profiling as integrated components, automating kernel generation and mapping across heterogeneous architectures, and closing the gap to hardware limits with minimal manual intervention.
