
Analytical GEMM Kernel Selection Model

Updated 17 January 2026
  • The paper introduces a deterministic, architecture-aware framework that predicts and optimizes GEMM performance, energy, and resource utilization.
  • The methodology combines queuing theory, cost modeling, and multi-output regression to enable efficient kernel selection without extensive autotuning.
  • Empirical validations demonstrate significant speedup, improved energy efficiency, and near-optimal performance across diverse platforms including CPUs, GPUs, and edge devices.

Analytical models for GEMM kernel selection provide a deterministic, architecture-aware framework for predicting and optimizing General Matrix Multiplication (GEMM) performance, resource utilization, and energy efficiency. These models systematically encode hardware constraints, memory hierarchies, and kernel parameterizations to select optimal configurations without extensive empirical autotuning. Modern approaches blend queuing-theoretic cost modeling, hierarchical memory analysis, and, more recently, machine learning–based regression to address the combinatorial complexity of GEMM tuning across diverse platforms, including edge devices, CPUs, and GPUs (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).

1. Mathematical Structures and Feature Spaces in Analytical GEMM Models

Analytical models formalize the relationship between GEMM kernel parameters and key performance or energy metrics via explicit functions of both architectural constants and tile/block configurations. A typical model predicts runtime \hat T and power \hat P using regression or cost-model equations parametrized by matrix sizes (M, N, K) and kernel/block design parameters t_b:

\hat T(M,N,K,t_b) = f_T(M,N,K,t_b)

\hat P(M,N,K,t_b) = f_P(M,N,K,t_b)

\hat E(M,N,K,t_b) = \hat T(M,N,K,t_b) \cdot \hat P(M,N,K,t_b)

where t_b encompasses tile size, thread block structure, shared memory usage, and associated pipeline parameters (Xiaoteng et al., 2024).

Feature vectors for prediction models encompass both raw and derived attributes:

  • Matrix dimensions (M, N, K),
  • Total operation volume (MNK),
  • Output size (MN),
  • Arithmetic intensity,
  • Tile size and derived grid/block shapes,
  • Shared memory consumption,
  • Streaming Multiprocessor (SM) or Compute Unit (CU) occupancy,
  • Memory efficiency (e.g., actual/peak bandwidth),
  • Memory layout encodings,
  • Kernel-specific scalars, pipeline stages, and warps per block (Xiaoteng et al., 2024, Swann et al., 3 Dec 2025).

All numerical features are typically standardized prior to model fitting.

2. Model Construction: Cost Models and Machine Learning Approaches

Classical analytical models—such as the queuing-theoretic framework for micro-kernel selection—explicitly count loads, arithmetic operations, and pipeline bottlenecks, reducing the kernel shape selection to a throughput maximization constrained by register pressure:

\text{Performance} = \frac{2 m_R n_R}{\max_i \left( \frac{L_i}{\text{throughput}_i} \right)}

subject to

R_C + R_A + R_R \leq R_V

where m_R, n_R denote the register block shape, L_i the number of instructions issued to pipeline i, and R_V the total vector register count (Veras et al., 2016).
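Under the definitions above, the throughput-maximization objective can be sketched directly (illustrative Python; the pipeline instruction counts, throughputs, and register usage are caller-supplied assumptions rather than values from the paper):

```python
def microkernel_performance(m_r, n_r, instr_per_pipe, throughput_per_pipe,
                            regs_used, reg_file_size):
    """Predicted FLOPs/cycle for an (m_r x n_r) register micro-kernel.

    The slowest pipeline (instructions / issue throughput) bounds the
    kernel; shapes exceeding the vector register file are infeasible.
    """
    if regs_used > reg_file_size:
        return 0.0  # violates the register constraint R_C + R_A + R_R <= R_V
    bottleneck = max(L / t for L, t in zip(instr_per_pipe, throughput_per_pipe))
    return 2.0 * m_r * n_r / bottleneck
```

Enumerating feasible (m_r, n_r) shapes and keeping the maximizer reproduces the shape-selection step of the queuing-theoretic model.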

Modern models increasingly incorporate supervised learning, particularly multi-output regression using Random Forests, where each output (runtime, power, energy, TFLOPS) is predicted by an ensemble of trees fit independently to bootstrapped samples of the feature–target relation. The use of a MultiOutputRegressor wrapper enables joint prediction of correlated metrics (Xiaoteng et al., 2024). The key metrics for model evaluation are R^2 scores, mean absolute percentage error, and median percentage error, with state-of-the-art results achieving R^2 = 0.98 for runtime and 0.78 for power on advanced GPU architectures.
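The evaluation metrics named above can be computed with stdlib-only helpers (a sketch assuming nonzero targets):

```python
from statistics import median

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def median_pct_error(y_true, y_pred):
    """Median absolute percentage error."""
    return 100.0 * median(abs((t - p) / t) for t, p in zip(y_true, y_pred))
```

Reporting both the mean and the median percentage error is useful because a few hard-to-predict shapes can inflate the mean while leaving the median nearly unchanged.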

3. Algorithmic Blocking and Analytical Trade-offs

GEMM blocking strategies partition the global matrix multiplication into hierarchical tiles matched to the CPU/GPU cache or memory organization. Analytical models expose the following trade-offs:

  • Small tiles (TS < 8): maximize occupancy but incur high thread-scheduling overhead and become memory-bound.
  • Large tiles (TS > 16): oversubscribe shared memory, sharply reducing SM occupancy (as low as 1 block per SM for TS = 32).
  • Intermediate tiles (especially TS = 16): empirically balance parallelism and data reuse, achieving maximal TFLOPS while respecting shared memory limits and delivering robust performance across a wide range of matrix sizes (Xiaoteng et al., 2024).
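A minimal occupancy sketch makes the tile-size trade-off concrete (Python; the shared-memory, thread, and block limits are illustrative, roughly Ada-class, and the A/B staging-tile assumption is ours, not taken from the paper):

```python
def blocks_per_sm(tile_size, smem_per_sm=101376, max_blocks=16,
                  max_threads=1536, bytes_per_elem=4):
    """Resident blocks per SM for a TS x TS tiled GEMM (illustrative limits)."""
    # Two TS x TS staging tiles (one for A, one for B) in shared memory per block
    smem_per_block = 2 * tile_size * tile_size * bytes_per_elem
    threads_per_block = tile_size * tile_size  # one thread per output element
    by_smem = smem_per_sm // smem_per_block
    by_threads = max_threads // threads_per_block
    return min(max_blocks, by_smem, by_threads)
```

With these (assumed) limits, TS = 32 drops to a single resident block per SM while TS = 16 sustains several, which is the qualitative behavior the trade-off list describes.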

Roofline-style models in GPU-centric analytical frameworks calculate, for each block (M_b, N_b, K_b):

T_\text{comp} = \frac{2 M_b N_b K_b}{\mathrm{FLOPS}_\text{GPU}}

T_\text{mem} = \sum_L \frac{\text{data}_L}{\mathrm{BW}_L} + L_\text{lat}

T_\text{step} = \max(T_\text{comp}, T_\text{mem})

where L iterates over memory hierarchy levels; the minimum-latency (i.e., maximum throughput) configuration is selected, subject to occupancy and register constraints (Swann et al., 3 Dec 2025).
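The per-block step time can be sketched directly from these equations (Python; the bw_levels encoding of per-level traffic and bandwidth is an illustrative assumption):

```python
def step_time(M_b, N_b, K_b, peak_flops, bw_levels, lat=0.0):
    """Roofline-style per-block step time.

    bw_levels is a list of (bytes moved at this level, bandwidth) pairs,
    one per memory hierarchy level; the step is bound by the slower of
    compute and memory.
    """
    t_comp = 2.0 * M_b * N_b * K_b / peak_flops          # T_comp
    t_mem = sum(data / bw for data, bw in bw_levels) + lat  # T_mem
    return max(t_comp, t_mem)                            # T_step
```

Small blocks are typically memory-bound (t_mem dominates), while sufficiently large blocks cross the roofline ridge and become compute-bound.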

4. Model-driven Kernel Selection Algorithms

Analytical models enable deterministic, autotuner-free GEMM kernel selection by evaluating a constrained set of candidate tile/block shapes against calibrated, closed-form cost or regression predictions. Selection algorithms typically follow this scheme:

  1. Enumerate feasible tile sizes and kernel variants respecting resource constraints.
  2. For each configuration, predict runtime and/or energy using the analytical or ML-derived model.
  3. Score each configuration (optionally combining speedup and energy saving via developer-set weights).
  4. Select the configuration with the lowest predicted cost or maximal multi-objective score.

A representative selection pseudocode is:

def select_best_gemm_config(M, N, K, w_perf=0.5, w_energy=0.5):
    # model, get_available_kernels, build_feature_vector, baseline_T, and
    # baseline_E are assumed to be provided by the surrounding framework
    tile_sizes = [4, 8, 16, 32]
    kernels = get_available_kernels()
    best_score = float("inf")
    best_cfg = None
    best_preds = None

    for TS in tile_sizes:
        for kernel in kernels:
            x = build_feature_vector(M, N, K, TS, kernel)
            T_pred, P_pred = model.predict([x])[0][:2]
            E_pred = T_pred * P_pred
            speedup = baseline_T / T_pred
            energy_saving = (baseline_E - E_pred) / baseline_E
            # negate the weighted benefit so that a lower score is better
            score = -(w_perf * speedup + w_energy * energy_saving)

            if score < best_score:
                best_score = score
                best_cfg = dict(tile_size=TS, kernel=kernel)
                best_preds = (T_pred, P_pred, E_pred)

    return best_cfg, best_preds
(Xiaoteng et al., 2024)

For analytical models such as tritonBLAS, the configuration search involves enumerating tile triplets (Mb,Nb,Kb)(M_b,N_b,K_b), evaluating performance via the model, and selecting the minimum-latency candidate (Swann et al., 3 Dec 2025).
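Such a search can be sketched as a simple enumeration over candidate triplets (Python; predict_time stands in for the calibrated analytical model, and the feasibility filter is an illustrative assumption):

```python
def select_tile_triplet(M, N, K, candidates, predict_time):
    """Enumerate (M_b, N_b, K_b) candidates and keep the minimum predicted
    latency, in the spirit of a closed-form analytical search."""
    best, best_t = None, float("inf")
    for M_b, N_b, K_b in candidates:
        if M_b > M or N_b > N or K_b > K:
            continue  # skip tiles larger than the problem itself
        t = predict_time(M, N, K, M_b, N_b, K_b)
        if t < best_t:
            best, best_t = (M_b, N_b, K_b), t
    return best, best_t
```

Because the model is closed-form, evaluating even hundreds of candidate triplets costs microseconds, which is what makes autotuner-free selection practical.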

5. Validation and Empirical Performance

Analytical models and their associated selection routines are validated against empirical GEMM benchmarking. On NVIDIA Ada Lovelace (RTX 4070), a Random Forest–based model achieves runtime prediction R^2 = 0.9808 with mean absolute percentage error 15.57\%; power prediction achieves R^2 = 0.7783 with median error 5.42\% (Xiaoteng et al., 2024). Correct tile selection (TS = 16) can yield up to 3.2\times speedup, a 22\% reduction in power, and energy savings exceeding 40\% compared to naive tile baselines.

On modern GPUs (e.g., MI300X), analytical frameworks like tritonBLAS select kernels whose realized performance is within 95\% of optimal empirical autotuners across 150,000 random shapes, sharply reducing selection time from minutes to microseconds (Swann et al., 3 Dec 2025). On heterogeneous edge devices, simulator-based analytical models calibrated to platform-specific bandwidths, cache, and register banks predict execution time within 2\% of actual measurements, successfully guiding the selection among blocking schemes and micro-kernel shapes (Ramírez et al., 2024).

6. Adaptation to Diverse Architectures and Model Limitations

Portability of analytical models across architectures requires only retuning of hardware parameters such as peak FLOPS, per-level memory bandwidths, cache and shared memory capacities, and register file sizes.

A plausible implication is that the main limitations stem from factors not explicitly modeled: cache associativity, replacement policy, data-layout irregularities, mixed-precision or irregular compute, and inter-GPU communication. Model accuracy is sustained where tiling/blocking dominates performance and architectural bottlenecks are well characterized. When the candidate space is sufficiently rich and hardware constants are kept current, analytical models approach the empirically optimal kernel, rendering autotuning largely redundant under these conditions.

7. Comparative Summary of Approaches

| Analytical Model | Platform Focus | Key Predictors | Selection Mechanism | Achieved Fidelity |
|---|---|---|---|---|
| Random Forest (ML) | NVIDIA GPU (Ada) | M, N, K, tile/block params, occupancy | Multi-output regression | R^2 = 0.98 (runtime), 0.78 (power) |
| Register-blocking model | x86/AVX2 CPU | Register shape, instruction mix | Queuing-theoretic cost | Within 3\% of measured per-core peak |
| Simulator + bandwidth | Edge (GAP8, IoT) | Panel/block sizes, measured bandwidths | Tiling/packing enumeration | <2\% error in total runtime |
| tritonBLAS | Discrete GPU | HW constants: cache, MFMA, CU count | Closed-form search | >95\% of empirical autotuner |

These analytical models collectively define the state of the art in autotuner-free GEMM kernel selection and are foundational for modern high-performance linear algebra, deep learning, and edge computing workloads (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).
