Analytical GEMM Kernel Selection Model
- The paper introduces a deterministic, architecture-aware framework that predicts and optimizes GEMM performance, energy, and resource utilization.
- The methodology combines queuing theory, cost modeling, and multi-output regression to enable efficient kernel selection without extensive autotuning.
- Empirical validations demonstrate significant speedup, improved energy efficiency, and near-optimal performance across diverse platforms including CPUs, GPUs, and edge devices.
Analytical models for GEMM kernel selection provide a deterministic, architecture-aware framework for predicting and optimizing General Matrix Multiplication (GEMM) performance, resource utilization, and energy efficiency. These models systematically encode hardware constraints, memory hierarchies, and kernel parameterizations to select optimal configurations without extensive empirical autotuning. Modern approaches blend queuing-theoretic cost modeling, hierarchical memory analysis, and, more recently, machine learning–based regression to address the combinatorial complexity of GEMM tuning across diverse platforms, including edge devices, CPUs, and GPUs (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).
1. Mathematical Structures and Feature Spaces in Analytical GEMM Models
Analytical models formalize the relationship between GEMM kernel parameters and key performance or energy metrics via explicit functions of both architectural constants and tile/block configurations. A typical model predicts runtime and power via regression or cost-model equations parametrized by the matrix sizes and the kernel/block design, $(T, P) = f(M, N, K, \theta)$, where $\theta$ encompasses tile size, thread block structure, shared memory usage, and associated pipeline parameters (Xiaoteng et al., 2024).
Feature vectors for prediction models encompass both raw and derived attributes:
- Matrix dimensions ($M$, $N$, $K$),
- Total operation volume ($M \cdot N \cdot K$),
- Output size ($M \cdot N$),
- Arithmetic intensity,
- Tile size and derived grid/block shapes,
- Shared memory consumption,
- Streaming Multiprocessor (SM) or Compute Unit (CU) occupancy,
- Memory efficiency (e.g., actual/peak bandwidth),
- Memory layout encodings,
- Kernel-specific scalars, pipeline stages, and warps per block (Xiaoteng et al., 2024, Swann et al., 3 Dec 2025).
All numerical features are typically standardized prior to model fitting.
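As a concrete illustration, the raw and derived features above can be assembled and z-score standardized as follows; the function names, the fp32 traffic estimate, and the chosen feature subset are illustrative assumptions, not the cited models' exact pipelines:

```python
import math

def build_feature_vector(M, N, K, tile_size, smem_bytes_per_block):
    """Raw and derived GEMM features (illustrative subset)."""
    flops = 2 * M * N * K                        # total multiply-accumulate volume
    bytes_moved = 4 * (M * K + K * N + M * N)    # fp32 traffic lower bound
    return {
        "M": M, "N": N, "K": K,
        "output_size": M * N,
        "arithmetic_intensity": flops / bytes_moved,
        "tile_size": tile_size,
        "grid_blocks": math.ceil(M / tile_size) * math.ceil(N / tile_size),
        "smem_per_block": smem_bytes_per_block,
    }

def standardize(rows):
    """Z-score each feature across the sample set (constant features map to 0)."""
    keys = rows[0].keys()
    stats = {}
    for k in keys:
        vals = [r[k] for r in rows]
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        stats[k] = (mu, sd)
    return [{k: (r[k] - stats[k][0]) / stats[k][1] for k in keys} for r in rows]
```

The standardization step mirrors the note above: every numeric feature is centered and scaled before model fitting so that tile sizes and operation volumes live on comparable scales.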
2. Model Construction: Cost Models and Machine Learning Approaches
Classical analytical models—such as the queuing-theoretic framework for micro-kernel selection—explicitly count loads, arithmetic operations, and pipeline bottlenecks, reducing kernel shape selection to a throughput-maximization problem: choose the register block shape $m_r \times n_r$ that maximizes sustained throughput, subject to the register-pressure constraint that the accumulator block, together with the staged vectors of the input operands, fits within the machine's total vector register count (Veras et al., 2016).
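A minimal sketch of this constrained enumeration, assuming a simplified register-footprint formula (an $(m_r/v) \times n_r$ vector accumulator plus one staged vector each for A and B) and illustrative budgets:

```python
def feasible_register_blocks(n_regs=16, vec_width=4, max_dim=16):
    """Enumerate micro-kernel shapes (m_r, n_r) whose register footprint fits.

    Assumed footprint (a common simplification, not the paper's exact count):
    an (m_r/v) x n_r vector accumulator, plus one vector of A and one of B.
    """
    shapes = []
    for m_r in range(vec_width, max_dim + 1, vec_width):
        for n_r in range(1, max_dim + 1):
            used = (m_r // vec_width) * n_r + (m_r // vec_width) + 1
            if used <= n_regs:
                shapes.append((m_r, n_r))
    # Proxy for throughput: compute per operand load, m_r*n_r / (m_r + n_r).
    return max(shapes, key=lambda s: (s[0] * s[1]) / (s[0] + s[1]))
```

With 16 vector registers of width 4, this sketch prefers a squarish block over a skinny one, reflecting the data-reuse argument behind register blocking.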
Modern models increasingly incorporate supervised learning, particularly multi-output regression using Random Forests, where each output (runtime, power, energy, TFLOPS) is predicted by an ensemble of trees fit independently to bootstrapped samples of the feature–target relation. The use of a MultiOutputRegressor wrapper enables joint prediction of correlated metrics (Xiaoteng et al., 2024). The key metrics for model evaluation are $R^2$ scores, mean absolute percentage error, and median percentage error, with state-of-the-art results achieving high $R^2$ for runtime and $R^2 = 0.78$ for power on advanced GPU architectures.
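A minimal sketch of the MultiOutputRegressor setup on synthetic stand-in data (the feature and target definitions here are fabricated for illustration, not the paper's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for standardized GEMM features: M, N, K, tile size.
X = rng.uniform(0.0, 1.0, size=(200, 4))
# Correlated targets standing in for runtime, power, energy, TFLOPS.
runtime = X[:, 0] * X[:, 1] * X[:, 2] / (1.0 + X[:, 3])
power = 50.0 + 100.0 * X[:, 3]
y = np.column_stack([runtime, power, runtime * power, 1.0 / (runtime + 0.1)])

# One independent Random Forest per target, wrapped for joint prediction.
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, y)
preds = model.predict(X[:5])   # shape (5, 4): one row per query, one column per metric
```

Each call to `predict` returns all four metrics at once, which is what lets the selection loop later in this article score a configuration on runtime and energy together.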
3. Algorithmic Blocking and Analytical Trade-offs
GEMM blocking strategies partition the global matrix multiplication into hierarchical tiles matched to the CPU/GPU cache or memory organization. Analytical models expose the following trade-offs:
- Small tiles (e.g., TS=4): maximize occupancy but incur high thread-scheduling overhead and become memory-bound.
- Large tiles (e.g., TS=32): cause shared memory oversubscription, sharply reducing SM occupancy (as few as one block per SM).
- Intermediate tiles (notably TS=16): empirically balance parallelism and data reuse, achieving maximal TFLOPS while respecting shared memory limits and providing robust performance across a wide matrix size range (Xiaoteng et al., 2024).
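The shared-memory side of this trade-off can be sketched with illustrative constants (100 KB of shared memory per SM, fp32 operands, a double-buffered pipeline); the footprint formula is an assumption, not the cited paper's exact accounting:

```python
def blocks_per_sm(tile_size, k_tile=32, smem_per_sm=100 * 1024,
                  max_blocks_per_sm=16, bytes_per_elem=4, stages=2):
    """Shared-memory-limited occupancy for a TS x TS output tile.

    Each block stages an A tile (TS x k_tile) and a B tile (k_tile x TS),
    multiplied by the pipeline depth. All constants are illustrative.
    """
    smem_per_block = stages * 2 * tile_size * k_tile * bytes_per_elem
    return min(max_blocks_per_sm, smem_per_sm // smem_per_block)
```

Evaluating the sketch across the candidate tile sizes shows occupancy falling monotonically as the per-block shared-memory footprint grows, which is the mechanism behind the large-tile penalty described above.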
Roofline-style models in GPU-centric analytical frameworks estimate, for each candidate block configuration $b$, a latency of the form $T(b) = \max\big(T_{\mathrm{compute}}(b),\ \max_{\ell} \mathrm{bytes}_{\ell}(b)/BW_{\ell}\big)$, where $\ell$ iterates over memory hierarchy levels; the minimum-latency (i.e., maximum-throughput) configuration is selected, subject to occupancy and register constraints (Swann et al., 3 Dec 2025).
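A sketch of this minimum-latency selection under a roofline form of the model (the bandwidths, traffic figures, and candidate records below are all illustrative):

```python
def block_latency(flops, bytes_by_level, peak_flops, bw_by_level):
    """Roofline-style latency: compute time vs. the slowest memory level."""
    t_compute = flops / peak_flops
    t_mem = max(bytes_by_level[lvl] / bw_by_level[lvl] for lvl in bytes_by_level)
    return max(t_compute, t_mem)

def select_min_latency(candidates, peak_flops, bw_by_level):
    """Pick the candidate block configuration with the lowest modeled latency."""
    return min(candidates,
               key=lambda c: block_latency(c["flops"], c["bytes"],
                                           peak_flops, bw_by_level))
```

Because the model is closed-form, scoring every candidate is a handful of arithmetic operations per configuration, which is what makes microsecond-scale selection possible.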
4. Model-driven Kernel Selection Algorithms
Analytical models enable deterministic, autotuner-free GEMM kernel selection by evaluating a constrained set of candidate tile/block shapes against calibrated, closed-form cost or regression predictions. Selection algorithms typically follow this scheme:
- Enumerate feasible tile sizes and kernel variants respecting resource constraints.
- For each configuration, predict runtime and/or energy using the analytical or ML-derived model.
- Score each configuration (optionally combining speedup and energy saving via developer-set weights).
- Select the configuration with the lowest predicted cost or maximal multi-objective score.
A representative selection pseudocode is:
```python
def select_best_gemm_config(M, N, K, w_perf=0.5, w_energy=0.5):
    tile_sizes = [4, 8, 16, 32]
    kernels = get_available_kernels()
    best_score = float("inf")
    best_cfg = None
    best_preds = None
    for TS in tile_sizes:
        for kernel in kernels:
            x = build_feature_vector(M, N, K, TS, kernel)
            T_pred, P_pred = model.predict([x])[0][:2]
            E_pred = T_pred * P_pred
            speedup = baseline_T / T_pred
            energy_saving = (baseline_E - E_pred) / baseline_E
            # Negate so that minimizing the score maximizes the weighted benefit.
            score = -(w_perf * speedup + w_energy * energy_saving)
            if score < best_score:
                best_score = score
                best_cfg = dict(tile_size=TS, kernel=kernel)
                best_preds = (T_pred, P_pred, E_pred)
    return best_cfg, best_preds
```
For analytical models such as tritonBLAS, the configuration search involves enumerating tile triplets $(B_M, B_N, B_K)$, evaluating each via the performance model, and selecting the minimum-latency candidate (Swann et al., 3 Dec 2025).
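A hedged sketch of such a triplet search — the candidate sets and the cost-model interface are assumptions for illustration, not tritonBLAS's actual API:

```python
from itertools import product

def enumerate_tile_triplets(cost_model, M, N, K,
                            block_ms=(64, 128, 256),
                            block_ns=(64, 128, 256),
                            block_ks=(32, 64)):
    """Exhaustively score tile triplets (B_M, B_N, B_K) with a cost model.

    cost_model(M, N, K, bm, bn, bk) -> predicted latency (lower is better).
    """
    return min(product(block_ms, block_ns, block_ks),
               key=lambda t: cost_model(M, N, K, *t))
```

Because the candidate set is small (here 3 x 3 x 2 = 18 triplets) and each evaluation is closed-form, the whole search is a tight loop rather than an empirical autotuning campaign.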
5. Validation and Empirical Performance
Analytical models and their associated selection routines are validated against empirical GEMM benchmarking. On NVIDIA Ada Lovelace (RTX 4070), a Random Forest–based model achieves accurate runtime prediction with low mean absolute error, and power prediction with $R^2 = 0.78$ and low median error (Xiaoteng et al., 2024). Correct tile selection (TS=16) yields substantial speedup, reduced power draw, and large energy savings compared to naive tile baselines.
On modern GPUs (e.g., MI300X), analytical frameworks like tritonBLAS select kernels whose realized performance closely tracks optimal empirical autotuners across $150,000$ random shapes, sharply reducing selection time from minutes to microseconds (Swann et al., 3 Dec 2025). On heterogeneous edge devices, simulator-based analytical models calibrated to platform-specific bandwidths, cache, and register banks predict execution time with small error relative to actual measurements, successfully guiding the selection among blocking schemes and micro-kernel shapes (Ramírez et al., 2024).
6. Adaptation to Diverse Architectures and Model Limitations
Portability of analytical models across architectures requires only retuning of hardware parameters:
- SIMD width, register file size, and pipeline throughput (for CPUs/vector ISAs) (Veras et al., 2016),
- Cache capacities and bandwidths, tensor-core tile shape/latency, CU counts (for GPUs) (Swann et al., 3 Dec 2025),
- Measured DMA rates, register windows, and arithmetic peak (for edge processors) (Ramírez et al., 2024).
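One way to package these per-platform constants is a plain parameter record that the model equations consume; the field names and the sample values below are illustrative placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareParams:
    """Per-platform constants an analytical GEMM model is re-tuned with."""
    simd_width: int          # lanes per vector instruction (CPU); 0 if N/A
    n_vector_regs: int       # architectural vector register count; 0 if N/A
    smem_per_cu_bytes: int   # shared/local memory per SM or CU (GPU)
    n_cus: int               # SM / CU count
    peak_flops: float        # arithmetic peak (FLOP/s)
    dram_bw: float           # measured main-memory bandwidth (B/s)

# Retargeting the model means swapping the constants, not the equations.
avx2_cpu = HardwareParams(8, 16, 0, 1, 5.0e11, 4.0e10)
mi300x_like = HardwareParams(0, 0, 64 * 1024, 304, 1.6e15, 5.3e12)
```

The frozen dataclass makes the portability claim concrete: the selection logic is identical across targets, and only the `HardwareParams` instance changes.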
A plausible implication is that the main limitations stem from factors not explicitly modeled: cache associativity, replacement policy, data layout irregularities, mixed-precision or irregular compute, and inter-GPU communication. Model accuracy is sustained where tiling/blocking dominates performance and architectural bottlenecks are well characterized. When the candidate space is sufficiently rich and the hardware constants are kept current, analytical models approach the empirically optimal kernel, rendering autotuning largely redundant under these conditions.
7. Comparative Summary of Approaches
| Analytical Model | Platform Focus | Key Predictors | Selection Mechanism | Achieved Fidelity |
|---|---|---|---|---|
| Random Forest (ML) | NVIDIA GPU (Ada) | M, N, K, tile/block params, occupancy | Multi-output regression | High $R^2$ (runtime), $0.78$ (power) |
| Register-blocking model | x86/AVX2 CPU | Register shape, instruction mix | Queuing-theoretic cost | Close to measured per-core peak |
| Simulator + bandwidth | Edge (GAP8, IoT) | Panel/block sizes, measured bandwidths | Tiling/packing enumeration | Small error in total runtime |
| tritonBLAS | Discrete GPU | HW constants: cache, MFMA, CU count | Closed-form search | Near parity with empirical autotuner |
These analytical models collectively define the state of the art in autotuner-free GEMM kernel selection and are foundational for modern high-performance linear algebra, deep learning, and edge computing workloads (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).