Analytical GEMM Kernel Selection Model
- The paper introduces a deterministic, architecture-aware framework that predicts and optimizes GEMM performance, energy, and resource utilization.
- The methodology combines queuing theory, cost modeling, and multi-output regression to enable efficient kernel selection without extensive autotuning.
- Empirical validations demonstrate significant speedup, improved energy efficiency, and near-optimal performance across diverse platforms including CPUs, GPUs, and edge devices.
Analytical models for GEMM kernel selection provide a deterministic, architecture-aware framework for predicting and optimizing General Matrix Multiplication (GEMM) performance, resource utilization, and energy efficiency. These models systematically encode hardware constraints, memory hierarchies, and kernel parameterizations to select optimal configurations without extensive empirical autotuning. Modern approaches blend queuing-theoretic cost modeling, hierarchical memory analysis, and, more recently, machine learning–based regression to address the combinatorial complexity of GEMM tuning across diverse platforms, including edge devices, CPUs, and GPUs (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).
1. Mathematical Structures and Feature Spaces in Analytical GEMM Models
Analytical models formalize the relationship between GEMM kernel parameters and key performance or energy metrics via explicit functions of both architectural constants and tile/block configurations. A typical model predicts runtime and power via regression or cost-model equations parametrized by the matrix sizes and the kernel/block design, $(T, P) = f(M, N, K, \theta)$, where $\theta$ encompasses tile size, thread block structure, shared memory usage, and associated pipeline parameters (Xiaoteng et al., 2024).
Feature vectors for prediction models encompass both raw and derived attributes:
- Matrix dimensions ($M$, $N$, $K$),
- Total operation volume ($M \cdot N \cdot K$),
- Output size ($M \cdot N$),
- Arithmetic intensity,
- Tile size and derived grid/block shapes,
- Shared memory consumption,
- Streaming Multiprocessor (SM) or Compute Unit (CU) occupancy,
- Memory efficiency (e.g., actual/peak bandwidth),
- Memory layout encodings,
- Kernel-specific scalars, pipeline stages, and warps per block (Xiaoteng et al., 2024, Swann et al., 3 Dec 2025).
All numerical features are typically standardized prior to model fitting.
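As a concrete illustration, the raw and derived features above can be assembled and z-score standardized as follows; the function names, the fp32 traffic estimate, and the chosen feature subset are illustrative assumptions, not the cited models' exact pipelines:

```python
import math

def build_feature_vector(M, N, K, tile_size, smem_bytes_per_block):
    """Raw and derived GEMM features (illustrative subset)."""
    flops = 2 * M * N * K                        # total multiply-accumulate volume
    bytes_moved = 4 * (M * K + K * N + M * N)    # fp32 traffic lower bound
    return {
        "M": M, "N": N, "K": K,
        "output_size": M * N,
        "arithmetic_intensity": flops / bytes_moved,
        "tile_size": tile_size,
        "grid_blocks": math.ceil(M / tile_size) * math.ceil(N / tile_size),
        "smem_per_block": smem_bytes_per_block,
    }

def standardize(rows):
    """Z-score each feature across the sample set (constant features map to 0)."""
    keys = rows[0].keys()
    stats = {}
    for k in keys:
        vals = [r[k] for r in rows]
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        stats[k] = (mu, sd)
    return [{k: (r[k] - stats[k][0]) / stats[k][1] for k in keys} for r in rows]
```

The standardization step mirrors the note above: every numeric feature is centered and scaled before model fitting so that tile sizes and operation volumes live on comparable scales.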
2. Model Construction: Cost Models and Machine Learning Approaches
Classical analytical models—such as the queuing-theoretic framework for micro-kernel selection—explicitly count loads, arithmetic operations, and pipeline bottlenecks, reducing kernel shape selection to a throughput-maximization problem: choose the register block shape $m_r \times n_r$ that maximizes sustained throughput, subject to the register-pressure constraint that the accumulator block, together with the staged vectors of the input operands, fits within the machine's total vector register count (Veras et al., 2016).
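A minimal sketch of this constrained enumeration, assuming a simplified register-footprint formula (an $(m_r/v) \times n_r$ vector accumulator plus one staged vector each for A and B) and illustrative budgets:

```python
def feasible_register_blocks(n_regs=16, vec_width=4, max_dim=16):
    """Enumerate micro-kernel shapes (m_r, n_r) whose register footprint fits.

    Assumed footprint (a common simplification, not the paper's exact count):
    an (m_r/v) x n_r vector accumulator, plus one vector of A and one of B.
    """
    shapes = []
    for m_r in range(vec_width, max_dim + 1, vec_width):
        for n_r in range(1, max_dim + 1):
            used = (m_r // vec_width) * n_r + (m_r // vec_width) + 1
            if used <= n_regs:
                shapes.append((m_r, n_r))
    # Proxy for throughput: compute per operand load, m_r*n_r / (m_r + n_r).
    return max(shapes, key=lambda s: (s[0] * s[1]) / (s[0] + s[1]))
```

With 16 vector registers of width 4, this sketch prefers a squarish block over a skinny one, reflecting the data-reuse argument behind register blocking.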
Modern models increasingly incorporate supervised learning, particularly multi-output regression using Random Forests, where each output (runtime, power, energy, TFLOPS) is predicted by an ensemble of trees fit independently to bootstrapped samples of the feature–target relation. The use of a MultiOutputRegressor wrapper enables joint prediction of correlated metrics (Xiaoteng et al., 2024). The key metrics for model evaluation are $R^2$ scores, mean absolute percentage error, and median percentage error, with state-of-the-art results achieving high $R^2$ for runtime and $R^2 = 0.78$ for power on advanced GPU architectures.
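A minimal sketch of the MultiOutputRegressor setup on synthetic stand-in data (the feature and target definitions here are fabricated for illustration, not the paper's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for standardized GEMM features: M, N, K, tile size.
X = rng.uniform(0.0, 1.0, size=(200, 4))
# Correlated targets standing in for runtime, power, energy, TFLOPS.
runtime = X[:, 0] * X[:, 1] * X[:, 2] / (1.0 + X[:, 3])
power = 50.0 + 100.0 * X[:, 3]
y = np.column_stack([runtime, power, runtime * power, 1.0 / (runtime + 0.1)])

# One independent Random Forest per target, wrapped for joint prediction.
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, y)
preds = model.predict(X[:5])   # shape (5, 4): one row per query, one column per metric
```

Each call to `predict` returns all four metrics at once, which is what lets the selection loop later in this article score a configuration on runtime and energy together.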
3. Algorithmic Blocking and Analytical Trade-offs
GEMM blocking strategies partition the global matrix multiplication into hierarchical tiles matched to the CPU/GPU cache or memory organization. Analytical models expose the following trade-offs:
- Small tiles (e.g., TS=4): maximize occupancy but incur high thread-scheduling overhead and become memory-bound.
- Large tiles (e.g., TS=32): cause shared memory oversubscription, sharply reducing SM occupancy (as few as one block per SM).
- Intermediate tiles (notably TS=16): empirically balance parallelism and data reuse, achieving maximal TFLOPS while respecting shared memory limits and providing robust performance across a wide matrix size range (Xiaoteng et al., 2024).
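The shared-memory side of this trade-off can be sketched with illustrative constants (100 KB of shared memory per SM, fp32 operands, a double-buffered pipeline); the footprint formula is an assumption, not the cited paper's exact accounting:

```python
def blocks_per_sm(tile_size, k_tile=32, smem_per_sm=100 * 1024,
                  max_blocks_per_sm=16, bytes_per_elem=4, stages=2):
    """Shared-memory-limited occupancy for a TS x TS output tile.

    Each block stages an A tile (TS x k_tile) and a B tile (k_tile x TS),
    multiplied by the pipeline depth. All constants are illustrative.
    """
    smem_per_block = stages * 2 * tile_size * k_tile * bytes_per_elem
    return min(max_blocks_per_sm, smem_per_sm // smem_per_block)
```

Evaluating the sketch across the candidate tile sizes shows occupancy falling monotonically as the per-block shared-memory footprint grows, which is the mechanism behind the large-tile penalty described above.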
Roofline-style models in GPU-centric analytical frameworks estimate, for each candidate block configuration $b$, a latency of the form $T(b) = \max\big(T_{\mathrm{compute}}(b),\ \max_{\ell} \mathrm{bytes}_{\ell}(b)/BW_{\ell}\big)$, where $\ell$ iterates over memory hierarchy levels; the minimum-latency (i.e., maximum-throughput) configuration is selected, subject to occupancy and register constraints (Swann et al., 3 Dec 2025).
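A sketch of this minimum-latency selection under a roofline form of the model (the bandwidths, traffic figures, and candidate records below are all illustrative):

```python
def block_latency(flops, bytes_by_level, peak_flops, bw_by_level):
    """Roofline-style latency: compute time vs. the slowest memory level."""
    t_compute = flops / peak_flops
    t_mem = max(bytes_by_level[lvl] / bw_by_level[lvl] for lvl in bytes_by_level)
    return max(t_compute, t_mem)

def select_min_latency(candidates, peak_flops, bw_by_level):
    """Pick the candidate block configuration with the lowest modeled latency."""
    return min(candidates,
               key=lambda c: block_latency(c["flops"], c["bytes"],
                                           peak_flops, bw_by_level))
```

Because the model is closed-form, scoring every candidate is a handful of arithmetic operations per configuration, which is what makes microsecond-scale selection possible.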
4. Model-driven Kernel Selection Algorithms
Analytical models enable deterministic, autotuner-free GEMM kernel selection by evaluating a constrained set of candidate tile/block shapes against calibrated, closed-form cost or regression predictions. Selection algorithms typically follow this scheme:
- Enumerate feasible tile sizes and kernel variants respecting resource constraints.
- For each configuration, predict runtime and/or energy using the analytical or ML-derived model.
- Score each configuration (optionally combining speedup and energy saving via developer-set weights).
- Select the configuration with the lowest predicted cost or maximal multi-objective score.
A representative selection pseudocode is:
```python
def select_best_gemm_config(M, N, K, w_perf=0.5, w_energy=0.5):
    tile_sizes = [4, 8, 16, 32]
    kernels = get_available_kernels()
    best_score = float("inf")
    best_cfg = None
    best_preds = None
    for TS in tile_sizes:
        for kernel in kernels:
            x = build_feature_vector(M, N, K, TS, kernel)
            T_pred, P_pred = model.predict([x])[0][:2]
            E_pred = T_pred * P_pred
            speedup = baseline_T / T_pred
            energy_saving = (baseline_E - E_pred) / baseline_E
            # Negate so that minimizing the score maximizes the weighted benefit.
            score = -(w_perf * speedup + w_energy * energy_saving)
            if score < best_score:
                best_score = score
                best_cfg = dict(tile_size=TS, kernel=kernel)
                best_preds = (T_pred, P_pred, E_pred)
    return best_cfg, best_preds
```
For analytical models such as tritonBLAS, the configuration search involves enumerating tile triplets $(B_M, B_N, B_K)$, evaluating each via the performance model, and selecting the minimum-latency candidate (Swann et al., 3 Dec 2025).
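A hedged sketch of such a triplet search — the candidate sets and the cost-model interface are assumptions for illustration, not tritonBLAS's actual API:

```python
from itertools import product

def enumerate_tile_triplets(cost_model, M, N, K,
                            block_ms=(64, 128, 256),
                            block_ns=(64, 128, 256),
                            block_ks=(32, 64)):
    """Exhaustively score tile triplets (B_M, B_N, B_K) with a cost model.

    cost_model(M, N, K, bm, bn, bk) -> predicted latency (lower is better).
    """
    return min(product(block_ms, block_ns, block_ks),
               key=lambda t: cost_model(M, N, K, *t))
```

Because the candidate set is small (here 3 x 3 x 2 = 18 triplets) and each evaluation is closed-form, the whole search is a tight loop rather than an empirical autotuning campaign.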
5. Validation and Empirical Performance
Analytical models and their associated selection routines are validated against empirical GEMM benchmarking. On NVIDIA Ada Lovelace (RTX 4070), a Random Forest–based model achieves accurate runtime prediction with low mean absolute error, and power prediction with $R^2 = 0.78$ and low median error (Xiaoteng et al., 2024). Correct tile selection (TS=16) yields substantial speedup, reduced power draw, and large energy savings compared to naive tile baselines.
On modern GPUs (e.g., MI300X), analytical frameworks like tritonBLAS select kernels whose realized performance closely tracks optimal empirical autotuners across $150,000$ random shapes, sharply reducing selection time from minutes to microseconds (Swann et al., 3 Dec 2025). On heterogeneous edge devices, simulator-based analytical models calibrated to platform-specific bandwidths, cache, and register banks predict execution time with small error relative to actual measurements, successfully guiding the selection among blocking schemes and micro-kernel shapes (Ramírez et al., 2024).
6. Adaptation to Diverse Architectures and Model Limitations
Portability of analytical models across architectures requires only retuning of hardware parameters:
- SIMD width, register file size, and pipeline throughput (for CPUs/vector ISAs) (Veras et al., 2016),
- Cache capacities and bandwidths, tensor-core tile shape/latency, CU counts (for GPUs) (Swann et al., 3 Dec 2025),
- Measured DMA rates, register windows, and arithmetic peak (for edge processors) (Ramírez et al., 2024).
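One way to package these per-platform constants is a plain parameter record that the model equations consume; the field names and the sample values below are illustrative placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareParams:
    """Per-platform constants an analytical GEMM model is re-tuned with."""
    simd_width: int          # lanes per vector instruction (CPU); 0 if N/A
    n_vector_regs: int       # architectural vector register count; 0 if N/A
    smem_per_cu_bytes: int   # shared/local memory per SM or CU (GPU)
    n_cus: int               # SM / CU count
    peak_flops: float        # arithmetic peak (FLOP/s)
    dram_bw: float           # measured main-memory bandwidth (B/s)

# Retargeting the model means swapping the constants, not the equations.
avx2_cpu = HardwareParams(8, 16, 0, 1, 5.0e11, 4.0e10)
mi300x_like = HardwareParams(0, 0, 64 * 1024, 304, 1.6e15, 5.3e12)
```

The frozen dataclass makes the portability claim concrete: the selection logic is identical across targets, and only the `HardwareParams` instance changes.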
A plausible implication is that the main limitations stem from factors not explicitly modeled: cache associativity, replacement policy, data layout irregularities, mixed-precision or irregular compute, and inter-GPU communication. Model accuracy is sustained where tiling/blocking dominates performance and architectural bottlenecks are well characterized. When the candidate space is sufficiently rich and the hardware constants are kept current, analytical models approach the empirically optimal kernel, rendering autotuning largely redundant under these conditions.
7. Comparative Summary of Approaches
| Analytical Model | Platform Focus | Key Predictors | Selection Mechanism | Achieved Fidelity |
|---|---|---|---|---|
| Random Forest (ML) | NVIDIA GPU (Ada) | M, N, K, tile/block params, occupancy | Multi-output regression | High $R^2$ (runtime), $0.78$ (power) |
| Register-blocking model | x86/AVX2 CPU | Register shape, instruction mix | Queuing-theoretic cost | Close to measured per-core peak |
| Simulator + bandwidth | Edge (GAP8, IoT) | Panel/block sizes, measured bandwidths | Tiling/packing enumeration | Small error in total runtime |
| tritonBLAS | Discrete GPU | HW constants: cache, MFMA, CU count | Closed-form search | Near parity with empirical autotuner |
These analytical models collectively define the state of the art in autotuner-free GEMM kernel selection and are foundational for modern high-performance linear algebra, deep learning, and edge computing workloads (Xiaoteng et al., 2024, Veras et al., 2016, Ramírez et al., 2024, Swann et al., 3 Dec 2025).