CUDA-L2: Optimizing Half-Precision GEMM Kernels
- CUDA-L2 is a framework that optimizes HGEMM operations by coupling reinforcement learning with LLM-guided code generation; this entry situates it alongside compiler search algorithms and analytic code generation across diverse architectures.
- The optimization searches a multidimensional design space of tiling factors and loop configurations, subject to hardware constraints, to maximize throughput and energy efficiency.
- Empirical studies show that CUDA-L2 outperforms vendor-tuned libraries with up to 29% throughput improvements and significant energy savings in various deployment scenarios.
General Matrix-Matrix Multiplication (GEMM) in half-precision format, referred to as HGEMM, is a core computational kernel ubiquitous across deep learning, scientific computing, and high-performance signal processing. HGEMM optimization spans diverse targets, from heterogeneous FPGAs and embedded GPUs to modern compiler toolchains, and demands rigorous treatment of throughput, energy efficiency, communication optimality, and automated kernel generation. This entry synthesizes the theoretical formulation, practical search strategies, and empirical findings of recent research, spanning machine-learning-guided design space exploration, RL-driven kernel synthesis, analytic code generation, compiler IR scheduling, and hypergraph partitioning for sparse multiplication.
1. Formal Definition and Optimization Objectives
HGEMM executes the matrix operation
$$C \leftarrow \alpha A B + \beta C,$$
where $A \in \mathbb{F}_{16}^{M \times K}$ and $B \in \mathbb{F}_{16}^{K \times N}$ are half-precision operands; in most optimization scenarios $\alpha = 1$ and $\beta = 0$, yielding $C = AB$. The optimization problem seeks to maximize throughput (TFLOPS) and, on relevant hardware, energy efficiency (GFLOPS/W), subject to hardware-specific constraints. On heterogeneous platforms like Versal ACAP, the objective involves finding resource mappings $m$ that optimize performance
$$\mathrm{GFLOPS}(m) = \frac{2\,M N K}{L(m)\cdot 10^{9}}$$
and energy efficiency
$$\mathrm{GFLOPS/W}(m) = \frac{\mathrm{GFLOPS}(m)}{P_{\mathrm{avg}}(m)},$$
where $L(m)$ is latency and $P_{\mathrm{avg}}(m)$ is average power consumption for mapping $m$ (Papalamprou et al., 10 Nov 2025). In multi-objective settings, the Pareto frontier in the (GFLOPS, GFLOPS/W) plane is constructed, enabling tunable trade-offs.
On GPU platforms, kernel optimization aims to minimize runtime $t$ across broad configuration spaces, maximizing
$$\mathrm{TFLOPS} = \frac{2\,M N K}{t \cdot 10^{12}},$$
subject to architectural and numerical constraints (Su et al., 2 Dec 2025). For sparse variants, communication volume and processor load balance become the dominant objectives (Ballard et al., 2016).
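As a concrete illustration of these objectives, the sketch below computes throughput and energy efficiency for a measured candidate; the helper names and the example latency/power figures are illustrative assumptions, not measurements from the cited works.

```python
# Minimal sketch, assuming a generic timing/power measurement harness: given a
# candidate mapping's latency and average power, compute the two objectives
# from Section 1. The example numbers are illustrative, not reported results.

def throughput_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    """2*M*N*K floating-point operations divided by runtime, in TFLOPS."""
    return (2.0 * M * N * K) / (latency_ms * 1e-3) / 1e12

def energy_efficiency_gflops_per_w(M: int, N: int, K: int,
                                   latency_ms: float, avg_power_w: float) -> float:
    """Throughput per watt, in GFLOPS/W."""
    gflops = (2.0 * M * N * K) / (latency_ms * 1e-3) / 1e9
    return gflops / avg_power_w

if __name__ == "__main__":
    M = N = K = 4096  # a compute-bound square HGEMM shape
    print(throughput_tflops(M, N, K, latency_ms=1.8))            # ~76 TFLOPS
    print(energy_efficiency_gflops_per_w(M, N, K, 1.8, 55.0))    # ~1388 GFLOPS/W
```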
2. Design Space: Variables and Physical Constraints
Optimization is parameterized by a multidimensional design space:
- Tiling factors along the $M$, $N$, and $K$ axes: parallelism and buffer tile sizes, controlling the assignment of sub-tiles to compute engines (Papalamprou et al., 10 Nov 2025, Su et al., 2 Dec 2025).
- Loop nest configuration: tile dimensions, loop permutations, unroll factors, and vectorization widths (Zhang et al., 2019).
- Kernel parameters on GPU: WMMA abstraction, shared-memory swizzle, pipelining depth, layout choice, register buffering, zero-padding for irregular tile sizes, thread-block scheduling order, and prefetch strategies (Su et al., 2 Dec 2025); a schematic configuration record is sketched after this list.
- Resource constraints: on-chip buffer capacity, SIMD register count, memory bandwidth, and critical path instruction throughput (Veras et al., 2016, Papalamprou et al., 10 Nov 2025, Zhang et al., 2019).
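A schematic configuration record of the kind such a design space implies is sketched below; the field names, default values, and the shared-memory footprint helper are illustrative assumptions rather than the exact parameterization used by any cited system.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class KernelConfig:
    """Illustrative encoding of one point in an HGEMM kernel design space."""
    tile_m: int = 128                                # thread-block tile along M
    tile_n: int = 128                                # thread-block tile along N
    tile_k: int = 32                                 # K-depth per mainloop step
    wmma_shape: Tuple[int, int, int] = (16, 16, 16)  # Tensor Core fragment shape
    pipeline_stages: int = 3                         # software-pipelining depth
    swizzle_bits: int = 3                            # shared-memory swizzle pattern
    unroll: int = 4                                  # inner-loop unroll factor
    vector_width: int = 8                            # FP16 elements per 128-bit load
    block_order: str = "swizzled"                    # thread-block scheduling order

    def smem_bytes(self, dtype_bytes: int = 2) -> int:
        """Shared-memory footprint of the staged A and B tiles."""
        a_tile = self.tile_m * self.tile_k
        b_tile = self.tile_k * self.tile_n
        return self.pipeline_stages * (a_tile + b_tile) * dtype_bytes
```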
Physical constraints include:
- Buffer size: for the Versal PL, the tile dimensions times the data width must fit within the available on-chip buffer capacity.
- Scratchpad: for the AIE array, a 12 KB budget per tile, below the 32 KB of local memory per AIE (Papalamprou et al., 10 Nov 2025).
- Register and pipeline limits for CPU SIMD inner-kernels (Veras et al., 2016).
- Memory bandwidth and communication bounds for sparse operations represented as hypergraph partitioning (Ballard et al., 2016).
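The following sketch shows how candidate tile shapes might be filtered against such capacity limits; the programmable-logic buffer budget is a placeholder, while the 12 KB and 32 KB AIE figures are the ones quoted above.

```python
# Hedged sketch: prune candidate mappings that violate on-chip capacity limits.
# PL_BUFFER_BYTES is a hypothetical budget; the AIE figures come from the text.

AIE_SCRATCHPAD_BYTES = 32 * 1024      # local memory available per AIE
PL_BUFFER_BYTES = 4 * 1024 * 1024     # hypothetical programmable-logic buffer budget

def fits_on_chip(tile_m: int, tile_n: int, tile_k: int,
                 dtype_bytes: int = 2,
                 aie_tile_bytes: int = 12 * 1024) -> bool:
    """True if the A/B/C tile footprints respect both capacity bounds."""
    pl_footprint = (tile_m * tile_k + tile_k * tile_n + tile_m * tile_n) * dtype_bytes
    return pl_footprint <= PL_BUFFER_BYTES and aie_tile_bytes <= AIE_SCRATCHPAD_BYTES

# Keep only feasible tile shapes from a coarse sweep.
candidates = [(m, n, k) for m in (64, 128, 256)
                        for n in (64, 128, 256)
                        for k in (32, 64, 128)]
feasible = [c for c in candidates if fits_on_chip(*c)]
```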
3. Search and Optimization Algorithms
State-of-the-art methods apply data-driven and analytical approaches:
- Machine Learning Guided DSE: XGBoost models trained on thousands of on-board measurements predict latency, power, and resource usage for every candidate mapping tuple. These predictions drive a rapid sweep over candidate mappings, from which Pareto-optimal points are extracted (Papalamprou et al., 10 Nov 2025).
- RL with LLM Guidance: CUDA-L2 couples RL policy optimization (GRPO) with LLM-based code generation. States comprise problem dimensions and telemetry (e.g., NCU profiling metrics); actions are complete kernel parameterizations; rewards combine speedup, numerical correctness, and code brevity (a hedged reward sketch appears after this list). Contrastive prompting and multi-stage RL keep exploration to fewer than 1000 configurations (Su et al., 2 Dec 2025).
- Analytic Code Generation: Micro-kernels are generated via outer-product decomposition combined with instruction-mix and pipeline modeling. Optimal tile shapes and unroll schedules are derived analytically from register and pipeline constraints, yielding SIMD code competitive with manually tuned expert kernels (Veras et al., 2016).
- Compiler-level Search (G-BFS, N-A2C): Lightweight heuristic-guided best-first search and neighborhood RL (A2C) identify high-quality schedules for HGEMM at the compiler IR level. Cost models factor in FLOPS, data movement, loop overhead, and vectorization efficiency, yielding rapid convergence with minimal search (Zhang et al., 2019); a roofline-style cost sketch appears after the table below.
- Hypergraph Partitioning for SpGEMM: For sparse HGEMM, fine- and coarse-grained hypergraphs encode all scalar multiplications and their data dependencies. Partitioning these models under balance tolerances minimizes per-processor (or per-memory-level) communication, matching theoretical lower bounds (Ballard et al., 2016).
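The reward sketch referenced above is shown below; the gating threshold, brevity weight, and helper arguments are hypothetical and do not reproduce CUDA-L2's actual reward implementation, but they illustrate how speedup, correctness, and code brevity can be combined into a single scalar signal.

```python
# Hedged sketch of a reward signal for RL-driven kernel synthesis: speedup over
# a baseline, gated by a numerical-correctness check, plus a brevity bonus.
# The tolerance, weights, and token budget are hypothetical.

def kernel_reward(candidate_ms: float, baseline_ms: float,
                  max_abs_err: float, code_tokens: int,
                  err_tol: float = 1e-2, brevity_weight: float = 0.05) -> float:
    if max_abs_err > err_tol:       # numerically incorrect kernels earn no credit
        return -1.0
    speedup = baseline_ms / candidate_ms
    brevity = brevity_weight * max(0.0, 1.0 - code_tokens / 4000.0)
    return (speedup - 1.0) + brevity

# Example: a candidate ~8% faster than the baseline, correct to 1e-3.
r = kernel_reward(candidate_ms=0.92, baseline_ms=1.00,
                  max_abs_err=1e-3, code_tokens=2500)
```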
| Method | Target HW | Optimized Objective |
|---|---|---|
| ML-guided DSE | Versal ACAP | GFLOPS, GFLOPS/W |
| RL+LLM | Nvidia CUDA | Throughput, code size |
| Analytic Gen | CPU SIMD | Cycle-accurate GFLOPS |
| G-BFS/N-A2C | TVM (GPU/CPU) | Compile-time latency |
| Hypergraph Part | Multi-core | Comm. volume, balance |
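The roofline-style cost sketch referenced above combines arithmetic intensity with a bandwidth ceiling and simple loop-overhead and vectorization penalties; all peak figures and penalty constants are illustrative assumptions, not the cost model of Zhang et al. (2019).

```python
# Hedged sketch of a roofline-style cost model for ranking candidate schedules:
# arithmetic intensity against a bandwidth ceiling, scaled by loop-overhead and
# vectorization-efficiency penalties. All constants are illustrative.

def predicted_tflops(M: int, N: int, K: int, bytes_moved: float,
                     peak_tflops: float = 300.0, peak_bw_gbs: float = 2000.0,
                     loop_overhead: float = 0.05, vec_efficiency: float = 0.9) -> float:
    flops = 2.0 * M * N * K
    intensity = flops / bytes_moved                                # FLOPs per byte
    roofline = min(peak_tflops, intensity * peak_bw_gbs / 1000.0)  # TFLOPS ceiling
    return roofline * vec_efficiency * (1.0 - loop_overhead)
```

Candidate schedules differ in the data movement (`bytes_moved`) implied by their tiling, so ranking by such a predictor approximates the ordering a measured sweep would produce at a fraction of the cost.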
4. Empirical Results and Comparative Performance
Research demonstrates substantial gains over existing autotuned and manually crafted baselines:
- Versal ACAP: ML-guided mapping improves both throughput and energy efficiency over analytical mapping frameworks (CHARM, ARIES) and outperforms the Jetson Orin GPU on compute-bound shapes (Papalamprou et al., 10 Nov 2025).
- CUDA-L2: RL+LLM auto-synthesis surpasses torch.matmul and cuBLAS/cuBLASLt auto-tuning in average throughput across 1000 configurations in offline mode, with larger gains in server (inference) mode, suggesting that RL-driven LLMs can supersede hand-tuned heuristics (Su et al., 2 Dec 2025).
- Compiler-level: G-BFS and N-A2C reduce HGEMM runtime relative to XGBoost- and RNN-based auto-tuners while traversing only a small fraction of the search space, with the lowest variance and per-schedule search time (Zhang et al., 2019).
- Analytic Generation: On modern CPUs, analytically generated code approaches expert-tuned OpenBLAS performance, exceeds ATLAS, and scales efficiently (Veras et al., 2016).
- Sparse Multiplication: Fine-grained and monochrome-2D partitioning attain near-minimal communication for irregular sparsity patterns (Markov clustering, AMG, normal equations), whereas some 1D models are substantially suboptimal in communication volume (Ballard et al., 2016).
5. Theoretical Foundations and Key Constraints
- Performance Modeling: Predictive models incorporate raw and derived features; tile parallelism, FLOP-per-engine ratio, and size-to-tile ratio metrics are critical for generalizing across unseen matrix shapes (Papalamprou et al., 10 Nov 2025).
- Register and Pipeline Bounds: SIMD code requires simultaneous register allocation for accumulators, A/B tiles, and permutation temporaries. Through analytic enumeration, optimal tile shapes satisfy both throughput and register/pipeline constraints (Veras et al., 2016).
- Cost Models: Roofline-based cost structure combines arithmetic intensity and memory bandwidth with hardware-specific tuning parameters. Additional loop-overhead and vectorization penalty terms yield more realistic compiler-level models (Zhang et al., 2019).
- Hypergraph Lower Bounds: Communication-volume minimization is equivalent to balanced partitioning in hypergraph models, and broadcast/reduce schedules approach these lower bounds to within bounded factors (Ballard et al., 2016).
- Pareto Front Extraction: The set of optimal designs balancing throughput and energy is constructed by pointwise dominance comparison in the (GFLOPS, GFLOPS/W) plane, or by scalarizing the two objectives with a user-specified weighting (Papalamprou et al., 10 Nov 2025); a minimal extraction sketch follows this list.
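The Pareto-extraction sketch below uses a generic dominance test and a weighted-sum scalarization, under the assumption that both objectives are available (and, for scalarization, normalized to comparable scales) per design point; it is not the cited framework's exact procedure.

```python
# Minimal sketch of Pareto-front extraction over (throughput, efficiency) pairs
# by pointwise dominance, plus a weighted-sum scalarization. In practice the two
# objectives would be normalized to comparable scales before scalarizing.

from typing import List, Tuple

def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep points not dominated in both throughput and energy efficiency."""
    return [p for p in points
            if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)]

def scalarize(point: Tuple[float, float], weight: float) -> float:
    """User-specified trade-off: weight on throughput, (1 - weight) on efficiency."""
    return weight * point[0] + (1.0 - weight) * point[1]

designs = [(82.0, 1100.0), (75.0, 1600.0), (60.0, 1400.0), (80.0, 1200.0)]
front = pareto_front(designs)   # (60.0, 1400.0) is dominated and dropped
best = max(front, key=lambda p: scalarize(p, weight=0.7))
```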
6. Implementation Practices and Insights
Empirical and theoretical work distills actionable best practices:
- Tile selection and buffering: Larger PL tiles improve data reuse; energy-optimal designs use fewer compute engines and trade minor throughput loss for substantial energy savings (Papalamprou et al., 10 Nov 2025).
- RL Parameterization: Explicit kernel abstractions (WMMA vs. CuTe), pipelining depth, and swizzle parameters produce substantial performance differences; contrastive RL rewards and neighborhood-constrained exploration are decisive (Su et al., 2 Dec 2025).
- Search space design: Restricting tile depths and vector widths to what the hardware supports, together with asynchronous cost estimation via analytic models, enables low-overhead, high-quality compiler-based scheduling (Zhang et al., 2019).
- Hypergraph model selection: Coarse-grained (1D/2D) models suffice for well-structured sparsity; irregular sparsity benefits from fine-grained partitioning, justifying higher algorithmic and computational complexity (Ballard et al., 2016).
- Derived feature inclusion: Features like FLOP-per-engine and size-to-tile ratios are essential for accurate model extrapolation to new problem instances (Papalamprou et al., 10 Nov 2025).
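The sketch below illustrates derived-feature construction of the kind described above; the feature names and the choice of regressor mentioned in the comment are assumptions, not the exact feature set of the cited work.

```python
# Hedged sketch of derived-feature construction for a learned performance model.
# The feature names mirror quantities discussed above (FLOP-per-engine, size-to-
# tile ratios); the exact feature set of the cited work may differ.

def derived_features(M: int, N: int, K: int,
                     tile_m: int, tile_n: int, tile_k: int,
                     num_engines: int) -> dict:
    flops = 2.0 * M * N * K
    return {
        "flop_per_engine": flops / num_engines,
        "m_to_tile": M / tile_m,
        "n_to_tile": N / tile_n,
        "k_to_tile": K / tile_k,
        "tile_parallelism": (M // tile_m) * (N // tile_n),
    }

# These derived features are appended to the raw (M, N, K, tiling) inputs before
# training a regressor (e.g., gradient-boosted trees) to predict latency and power.
```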
7. Contextual Significance and Future Implications
The HGEMM optimization problem encapsulates the multidisciplinary integration of numerical linear algebra, hardware-aware code generation, machine learning, and combinatorial algorithms. Recent findings indicate that ML-driven and RL+LLM frameworks systematically outperform vendor heuristics and manual kernel synthesis, especially as configuration spaces and hardware architectures scale. This suggests a paradigm shift toward hybrid analytical/data-driven optimization for critical computational kernels. For sparse variants, hypergraph-based communication modeling remains a robust foundation for balancing parallel efficiency and memory hierarchy utilization.
A plausible implication is that future HGEMM and SpGEMM optimization frameworks will adopt reinforcement learning and large-scale empirical profiling as integrated components, automating kernel generation and mapping across heterogeneous architectures, and closing the gap to hardware limits with minimal manual intervention.