General Matrix-Matrix Multiplication (GEMM)

Updated 13 May 2026

General Matrix-Matrix Multiplication (GEMM) is a fundamental operation that computes C := αAB + βC, underpinning applications in deep learning, simulations, and linear algebra.
Modern implementations leverage advanced blocking, fused-packing, and hardware-specific microkernels to optimize cache use, vectorization, and throughput across diverse architectures.
Unified strategies like GEMMFIP, mixed-precision, and auto-tuning balance speed, energy efficiency, and accuracy, streamlining performance across varied matrix sizes and platforms.

General Matrix-Matrix Multiplication (GEMM) is a foundational computational primitive in numerical linear algebra, scientific computing, and machine learning workloads. Its precise formulation, optimization principles, and evolving implementation strategies profoundly impact the performance of high-level applications, library design, and hardware utilization. The GEMM operation computes the update

$C := \alpha AB + \beta C$

where $A \in \mathbb{R}^{m \times k}$ , $B \in \mathbb{R}^{k \times n}$ , $C \in \mathbb{R}^{m \times n}$ , and $\alpha, \beta \in \mathbb{R}$ . Elementwise, this corresponds to

$C_{i,j} = \alpha \sum_{p=1}^k A_{i,p} B_{p,j} + \beta C_{i,j}$

which is computationally intensive for large, dense matrices as in deep learning and scientific simulations (Xu et al., 2023). Due to its centrality, enormous research effort has gone into algorithmic structuring, code generation, hardware specialization, and unified models to maximize GEMM's efficiency across architectures and problem scales.

1. GEMM Fundamentals and Blocking Principles

GEMM's high arithmetic intensity renders it optimally suited for exploiting hierarchical memory and deep vectorization. State-of-the-art implementations (GotoBLAS, BLIS) decompose GEMM into a series of nested loops around a minimal "microkernel". Key blocking parameters include:

$m_R, n_R$ : microkernel tile sizes (register-level block)
$k_C$ : inner-panel dimension (fits L1 or L0)
$m_C$ : row panel of $A$ (fits L2)
$A \in \mathbb{R}^{m \times k}$ 0: column panel of $A \in \mathbb{R}^{m \times k}$ 1 (fits L3 or LLC)

The canonical strategy partitions $A \in \mathbb{R}^{m \times k}$ 2 along its rows into panels of height $A \in \mathbb{R}^{m \times k}$ 3, further split into $A \in \mathbb{R}^{m \times k}$ 4 micropanels; $A \in \mathbb{R}^{m \times k}$ 5 is partitioned along columns analogously into $A \in \mathbb{R}^{m \times k}$ 6 micropanels. Before microkernel execution, blocks are "packed" into contiguous buffers $A \in \mathbb{R}^{m \times k}$ 7, $A \in \mathbb{R}^{m \times k}$ 8 to realize unit-stride accesses and cache set balance, which heavily reduces cache misses and associativity conflicts (Xu et al., 2023).

This five-loop blocking structure, with carefully tuned parameters for a given hardware memory hierarchy and register set, underpins all modern GEMM libraries in CPUs, GPUs, and FPGAs.

2. Unified Fused-Packing Techniques: GEMMFIP

Traditional GEMM implementations rely on runtime heuristics to choose between "no packing" (favoring small matrices) and "full packing" schemes (favoring large matrices), creating complexity and performance discontinuities. GEMMFIP (Fused-In Packing) proposes a unified solution that interleaves, and ultimately fuses, packing with the first use of data, hiding packing latency and avoiding superfluous memory traffic (Xu et al., 2023).

Key Aspects:

Interleaved Packing + Compute: Instead of packing entire panels upfront, each panel is packed on-the-fly immediately before its first use in the microkernel. Already packed subpanels are then reused from cache when beneficial.
Microkernel Fused-Packing: Packing logic is inlined within the microkernel, producing four function specializations: (i) both $A \in \mathbb{R}^{m \times k}$ 9 and $B \in \mathbb{R}^{k \times n}$ 0 packed (classical) (ii) $B \in \mathbb{R}^{k \times n}$ 1 packed, $B \in \mathbb{R}^{k \times n}$ 2 unpacked (iii) $B \in \mathbb{R}^{k \times n}$ 3 unpacked, $B \in \mathbb{R}^{k \times n}$ 4 packed (iv) both unpacked Conditional logic avoids unnecessary stores and loads.
Packing-Skip Heuristics: Simple runtime checks determine if packing is worth the cost: skip $B \in \mathbb{R}^{k \times n}$ 5 packing for $B \in \mathbb{R}^{k \times n}$ 6, skip $B \in \mathbb{R}^{k \times n}$ 7 packing for $B \in \mathbb{R}^{k \times n}$ 8, and perform a cache-residency test to determine if subpanels fit in L2.

Outcomes:

GEMMFIP delivers smooth, nearly flat performance across all sizes, matching the best characteristics of "skip packing" for small sizes and traditional BLIS for large. No explicit "crossover" point or performance valley is observed. These principles allow a single code path to efficiently address the entire space of GEMM problems, significantly simplifying library design (Xu et al., 2023).

3. Hardware and Architecture-Specific Implementations

Beyond algorithmic structuring, the exact implementation of GEMM depends on hardware features. Major architectures employ strategies that adapt the five-loop blocking and packing paradigm to specialized functional units, memory systems, and ISA capabilities:

ARM SME (Scalable Matrix Extension):

MpGEMM leverages ARM SME's 2D ZA tile accumulators and FMOPA "outer-product" instructions. Key highlights include cache-aware partitioning to keep working sets within L2, on-the-fly panel-wise transposition using tile loads/stores, and pipelined outer-product microkernels. All available tile registers and multi-vector loads (LD1W) must be maximally utilized for peak bandwidth; buffers are aligned to cache-line boundaries. This regime attains up to 94% of SME's peak (INT8 GEMM) and achieves $B \in \mathbb{R}^{k \times n}$ 9– $C \in \mathbb{R}^{m \times n}$ 0 speedup versus standard libraries on Apple M4 Pro and Armv8 SVE (Deng et al., 25 Dec 2025).

AMD Versal AIE2:

GAMA uses custom on-chip buffer placement in the 64KB SRAM banks of each AI Engine, forming "packs" of four cores to pipeline larger GEMMs and streaming partial sums across 512b cascade networks. Double-buffering overlaps PLIO <-> AIE transfers with compute. Pack size and staggered core placement are co-optimized to minimize route congestion and maximize functional utilization. Peak efficiency reaches 85–86% of the hardware maximum for INT8 and bfloat16 (Mhatre et al., 13 Apr 2025).

FPGA Binary128 Accelerator:

A pipelined 2D systolic array of micro-coded PEs, each with a full 128-bit FMA pipeline, enables high-throughput binary128 GEMM. Local on-chip buffering and feed/drain pipelines reduce off-chip bandwidth needs. Measured performance is up to 90GFlops at 91% of peak for large matrices, yielding speedups of $C \in \mathbb{R}^{m \times n}$ 1 over a 20-thread Intel i9 (Kono et al., 2023).

Systolic Arrays for Sparse GEMM:

SPOTS combines a tall 128x4 output-stationary systolic array with distributed input buffering, controlled zero-skipping logic for pruned weight layouts, and dynamic partitioning for sub-matrix tasks. This yields high energy efficiency and throughput, particularly for small or sparse convolutions (Soltaniyeh et al., 2021).

4. Mixed-Precision and Quantized GEMM

Modern deep learning workloads increasingly exploit reduced-precision computation for power and throughput gains. GEMM implementations for INT8, BF16, FP16, and mixed precision are now standard:

Dot/K-Dot Microkernels: Architectures with INT8-BF16 dot-product instructions (e.g., AVX-512 VNNI, ARM NEON/SVE2/SME2, RISC-V IME) favor microkernel shapes and packing layouts to maximize the use of such instructions. The register block, tile size, and blocking factors are chosen to fill microarchitectural resources (registers, tile storage) without spillage (Martínez et al., 13 Jun 2025).
Packing Strategies: Data layouts pack A and B along microkernel boundaries for unit-stride access and maximal SIMD vectorization. For tile engines (e.g., SME2), entire submatrices are tile-loaded directly.
Performance: Mixed-precision GEMM delivers $C \in \mathbb{R}^{m \times n}$ 2– $C \in \mathbb{R}^{m \times n}$ 3 speedups versus FP32, with energy savings exceeding $C \in \mathbb{R}^{m \times n}$ 4, and negligible accuracy loss ( $C \in \mathbb{R}^{m \times n}$ 5 for ResNet and BERT) (Martínez et al., 13 Jun 2025, Zhang et al., 20 Aug 2025).
Per-Tile Precision Assignment: Tile-centric frameworks schedule independent precision choices at the block/tile level, scheduling tasks to available compute resources while balancing throughput, energy, and accuracy. The PaRSEC runtime in GEMM-MP achieves near-theory scaling and relative error under $C \in \mathbb{R}^{m \times n}$ 6 for blended double-single applications (Zhang et al., 20 Aug 2025).

5. Portability, Code Generation, and Automatic Tuning

Robust GEMM performance across architectures, problem sizes, and data types requires systematic code generation and auto-tuning:

Apache TVM Auto-Generation: By scripting the full blocked, packed algorithm in TVM's compute/schedule DSL, families of microkernels covering all necessary (m_r,n_r) shapes can be automatically generated, compiled, and autotuned to target hardware. The generator expresses nested loop structure, packing schedules, and can auto-explore the optimal register-block configuration for any matrix shape (Alaejos et al., 2023).
Machine Learning-Based Tuning: For runtime parameter selection (e.g., thread count in multi-threaded GEMM), on-the-fly ML models (ADSALA) can dynamically optimize concurrency, yielding up to $C \in \mathbb{R}^{m \times n}$ 7– $C \in \mathbb{R}^{m \times n}$ 8 speedup for node-local GEMM under 100MB (Xia et al., 14 Jan 2026). For more complex mappings (e.g., in Versal ACAP heterogeneous platforms), gradient-boosted models trained on benchmarked data efficiently navigate pareto-optimal trade-offs in throughput and energy (Papalamprou et al., 10 Nov 2025).
Compiler-Driven Tiling Optimization: The selection of block, tile, and data layout parameters is increasingly approached as a black-box combinatorial optimization, with methods ranging from greedy-best-first search to reinforcement learning controlling parameterized TVM schedules for minimum actual runtime (Zhang et al., 2019).

6. Advanced and Special-Purpose GEMM Variants

Several research directions extend GEMM primitives beyond dense, high-throughput settings:

Communication-Avoiding GEMM: Platform- and problem-oblivious algorithms partition the work/task space along space-filling curves, attuning locality and communication optimally for shared-memory, distributed-memory, or GPU kernels. Generalized Hilbert SFC mapping, coupled with 2.5D/3D CA algorithms, allows a short (∼30 LOC) code path to match or exceed vendor libraries—up to $C \in \mathbb{R}^{m \times n}$ 9 speedup and minimal cache misses even absent platform-specific tuning (Georganas et al., 22 Jan 2026).
High-Precision Emulation using Modular Arithmetic: GEMM-based emulation using the Chinese Remainder Theorem (Ozaki Scheme II) enables high-precision arithmetic by combining results from a parameterizable number $\alpha, \beta \in \mathbb{R}$ 0 of low-precision GEMMs. This approach achieves accuracy and throughput trade-offs unattainable by direct high-precision implementations: e.g., FP64 emulation on NVIDIA GH200 exceeds 80 TFLOPS, outpacing native FP64 by up to $\alpha, \beta \in \mathbb{R}$ 1 (Ozaki et al., 10 Apr 2025).
Temporal Unary Coding for Edge GEMM: For low-precision edge AI, tuGEMM encodes operands temporally as unary pulses, counting co-incident pulses per cycle for exact MAC execution with ultra-low area/power in CMOS (4-bit, $\alpha, \beta \in \mathbb{R}$ 2, 9 mW). Serial/parallel variants trade off area and latency; for INT2/INT4 regimes, this approach surpasses stochastic unary baseline by an order of magnitude in PPA (Nair et al., 2024).
Batched Small GEMM: Specialized interfaces and shared-memory tiling strategies for multiple independent small GEMMs (e.g., finite element, small RNNs) outperform general-purpose batched library routines by up to $\alpha, \beta \in \mathbb{R}$ 3. Interface design (pointer array vs. strided 3D layouts) is critical for high throughput on GPUs (Jhurani et al., 2013).

7. Practical Guidance and Limitations

Unified fused-packing approaches such as GEMMFIP offer a robust, maintainable path to performance—packing only when beneficial, reusing classic blocking parameters, and supporting extensions to complex GEMM, Strassen, and tensor contractions (Xu et al., 2023). On CPUs, microkernel and packing strategies must be retargeted per ISA as dictated by instruction set (AXPY, DOT, TILE, etc.). For mixed-precision, performance/energy/accuracy must be balanced per application needs, using tile-based or modular approaches as appropriate (Zhang et al., 20 Aug 2025, Ozaki et al., 10 Apr 2025).

GEMMFIP-like schemes facilitate single code paths across all sizes, avoiding performance valleys and runtime complexity. Extensions to bandwidth-bound operations and multithreading require further microkernel specialization. On emerging platforms (AI engines, FPGAs, tile engines), memory placement and hardware mapping dominate throughput, mandating close co-design between algorithm, data layout, and placement (Mhatre et al., 13 Apr 2025, Deng et al., 25 Dec 2025).

For further developments, reproducible compositional frameworks (e.g., TVM), empirical search, and ML-based autotuners are indispensable in sustaining architectural portability and optimality.

References: