GEMM Pipeline: Architectures & Innovations

Updated 3 July 2026

General Matrix Multiply (GEMM) pipelines are high-performance computing kernels that perform C = αAB + βC, emphasizing data blocking, precision management, and energy efficiency.
Modern implementations optimize data packing, microkernel computation, and scheduling to exploit CPU, GPU, FPGA, and custom accelerator features.
Advanced techniques such as mixed-precision computation, CRT-based high-precision emulation, and outer-product pipelines drive near-peak performance and improved energy metrics.

General Matrix Multiply (GEMM) is the computational kernel $C \leftarrow \alpha AB + \beta C$, where $A \in \mathbb{F}^{m\times k}$, $B \in \mathbb{F}^{k\times n}$, and $C \in \mathbb{F}^{m\times n}$. GEMM underpins high-performance computing (HPC), scientific workloads, machine learning, signal processing, and deep neural networks. Modern GEMM pipelines are finely tuned to exploit architectural features of CPUs, GPUs, FPGAs, and custom accelerators, balancing compute throughput, data movement, precision, and energy efficiency. This article surveys the principal GEMM pipeline architectures, mathematical decompositions, software and hardware scheduling methods, and recent algorithmic innovations, emphasizing rigorously quantified performance, complexity, and design trade-offs as established in contemporary research.

1. Mathematical Decomposition and Blocking Strategies

GEMM implementations universally adopt data-blocking (tiling) to maximize data locality and register and cache efficiency. The canonical five-loop blocking model, formalized in GotoBLAS/BLIS and reused in TVM-based generators, partitions $A$, $B$, and $C$ into macro-panels aligned to multi-level caches (L1/L2/L3). The kernel computes $C_{i,j} = \sum_{p} A_{i,p} B_{p,j} + C_{i,j}$; at each macro-loop iteration, subblocks $A_{i,p}$ ($M_C \times K_C$), $B_{p,j}$ ($K_C \times N_C$), and $C_{i,j}$ ($M_C \times N_C$) are packed into contiguous buffers to match vector lanes and minimize conflict misses (Xu et al., 2023, Alaejos et al., 2023).

Within each panel, the microkernel, parameterized by the register tile size $(M_R, N_R)$, streams $K_C$-length micro-panels, unrolling register-level FMA loops to sustain peak SIMD/AMX/TC execution width (Alaejos et al., 2023, Zhang et al., 20 Aug 2025). For hardware tensor engines (e.g., NVIDIA Tensor Cores, ARM SME), matrix tiling aligns with architectural features: SME's ZA tiles, or AIE's microkernels (16×4 INT16 on Versal) (Deng et al., 25 Dec 2025, Lei et al., 2023).
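The five-loop structure described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the BLIS or TVM implementation: the function name `gemm_blocked`, the default block sizes, and the list-of-lists matrix representation are all hypothetical choices made so that the example is self-contained. Only $A$ and $B$ blocks are packed here, and $C$ is scaled by $\beta$ once up front.

```python
def gemm_blocked(A, B, C, alpha=1.0, beta=1.0, MC=4, KC=4, NC=4, MR=2, NR=2):
    """Compute C <- alpha*A*B + beta*C with five-loop blocking (illustrative only)."""
    m, k, n = len(A), len(B), len(B[0])
    # Apply beta once so the blocked loops below only accumulate into C.
    for i in range(m):
        for j in range(n):
            C[i][j] *= beta
    for jc in range(0, n, NC):                       # loop 5: N_C-wide column panels
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):                   # loop 4: K_C-deep slices
            kc = min(KC, k - pc)
            # Pack the K_C x N_C block of B into a contiguous buffer.
            Bp = [[B[pc + p][jc + j] for j in range(nc)] for p in range(kc)]
            for ic in range(0, m, MC):               # loop 3: M_C-tall row panels
                mc = min(MC, m - ic)
                # Pack the M_C x K_C block of A into a contiguous buffer.
                Ap = [[A[ic + i][pc + p] for p in range(kc)] for i in range(mc)]
                for jr in range(0, nc, NR):          # loop 2: N_R-wide micro-panels
                    nr = min(NR, nc - jr)
                    for ir in range(0, mc, MR):      # loop 1: M_R-tall micro-panels
                        mr = min(MR, mc - ir)
                        # Microkernel: an M_R x N_R accumulator updated over K_C steps.
                        acc = [[0.0] * nr for _ in range(mr)]
                        for p in range(kc):
                            for i in range(mr):
                                a = Ap[ir + i][p]
                                for j in range(nr):
                                    acc[i][j] += a * Bp[p][jr + j]
                        # Write the accumulator back into the C tile.
                        for i in range(mr):
                            for j in range(nr):
                                C[ic + ir + i][jc + jr + j] += alpha * acc[i][j]
    return C
```

In a production kernel the innermost accumulator would live in vector or tile registers and the packed buffers would be sized to fit specific cache levels; the loop order and the role of each block size are the only parts this sketch is meant to convey.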

The selection of block sizes $(M_C, K_C, N_C)$ is analytically optimized, balancing operational intensity

$\mathrm{OI} = \dfrac{\text{floating-point operations}}{\text{bytes moved between memory levels}}$

subject to cache and TLB constraints (Deng et al., 25 Dec 2025, Alaejos et al., 2023).
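To make the trade-off concrete, the following sketch evaluates operational intensity for one macro-block under a deliberately simple traffic model. The model is an assumption for illustration and is not taken from the cited papers: each block of $A$ and $B$ is read once, the $C$ block is read and written once, and the block performs $2 M_C N_C K_C$ floating-point operations. The function names, the 8-byte element size, and the rule that the $A$ block may occupy half of a cache level are likewise hypothetical.

```python
def block_operational_intensity(mc, kc, nc, bytes_per_elem=8):
    """FLOPs per byte for one (mc x kc) by (kc x nc) block product.

    Assumed traffic: read A block, read B block, read and write C block.
    """
    flops = 2 * mc * nc * kc
    traffic_bytes = bytes_per_elem * (mc * kc + kc * nc + 2 * mc * nc)
    return flops / traffic_bytes


def a_block_fits(mc, kc, cache_bytes, bytes_per_elem=8, fraction=0.5):
    """Check whether the packed A block fits in an assumed share of a cache level."""
    return mc * kc * bytes_per_elem <= fraction * cache_bytes


if __name__ == "__main__":
    for size in (32, 64, 128):
        oi = block_operational_intensity(size, size, size)
        fits = a_block_fits(size, size, cache_bytes=64 * 1024)
        print(f"block {size}: OI = {oi:.1f} FLOPs/byte, A block fits in 64 KiB share: {fits}")
```

Under this model larger blocks raise operational intensity, while the capacity check eventually rejects them; that tension is what an analytical block-size selection has to resolve.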

2. Pipeline Stages and Dataflow Architectures

The modern GEMM pipeline comprises several explicitly orchestrated stages:

Packing (or Fused Pack-Compute): Panels of $A$ and $B$ are copied into scratchpad buffers. GEMMFIP fuses packing into the first microkernel pass, eliminating up-front overhead for small/medium matrices and avoiding the "crossover" performance dip between SKINNY/large kernels (Xu et al., 2023).
Microkernel Computation: Highly unrolled, vectorized microkernels (manual or auto-generated) operate on packed tiles, maintaining the working $C$ block in registers. For hardware tensor engines, microkernels orchestrate FMOPA (SME), MMUL (AIE), or INT8/FP16 pipelines (Deng et al., 25 Dec 2025, Mhatre et al., 13 Apr 2025, Lei et al., 2023).
Write-back and Unpacking: The microkernel writes updated $C$ tiles to memory. In sequential pipelines, as in LP-GEMM, unpacking can be delayed and propagated so intermediate GEMMs operate on internal packed layouts, amortizing conversion (Carneiro et al., 6 Apr 2026).
Scheduling and Overlap: Batched/strided GEMMs, double-buffering, and stream-ordered launches hide host/device and packing/conversion latency (Ozaki et al., 10 Apr 2025, Mhatre et al., 13 Apr 2025).
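The fused pack-compute idea from the first stage above can be illustrated as follows. This is a simplified sketch of the general concept and not GEMMFIP's code: the function name `gemm_fused_pack`, the micro-panel layout, and the single level of blocking are assumptions. The micro-panel of $A$ is written into its packed buffer during the first microkernel pass that touches it, and every later pass reads the packed copy, so no separate packing sweep is needed.

```python
def gemm_fused_pack(A, B, MR=2, NR=2):
    """Compute C = A*B, packing each A micro-panel during its first microkernel pass."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for ir in range(0, m, MR):
        mr = min(MR, m - ir)
        packed = None                      # packed[p][i] will hold A[ir + i][p]
        for jr in range(0, n, NR):
            nr = min(NR, n - jr)
            first_pass = packed is None
            if first_pass:
                packed = [[0.0] * mr for _ in range(k)]
            for p in range(k):
                if first_pass:
                    # Fused packing: store the elements while they are loaded anyway.
                    for i in range(mr):
                        packed[p][i] = A[ir + i][p]
                a_col = packed[p]          # later passes read only the packed copy
                for i in range(mr):
                    for j in range(nr):
                        C[ir + i][jr + j] += a_col[i] * B[p][jr + j]
    return C
```

For small matrices the number of microkernel passes per panel is low, which is where a separate packing sweep would cost the most relative to the arithmetic.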

Table: Pipeline Composition in Key GEMM Implementations

Implementation	Packing Strategy	Microkernel Specialization	Write-back/Unpacking
BLIS/TVM	Pre-pack/fused	Hand/unrolled FMA	Immediate
GEMMFIP	Fused	4 variants: pack-A/B/AB	Per microkernel usage
SME MpGEMM	On-the-fly	FMOPA tile-dense	ZA stores
LP-GEMM	Propagated	Prop. µ-kernel (mid)	End GEMM only (final)
GAMA (AIE2)	Ping-pong buffers	MMUL+cascade systolic	Cascade/PLIO/write-out
Ozaki‐II CRT	INT8 modular	s × INT8/F64 GEMMs	CRT recombination

3. Precision, Mixed-Precision, and Emulation Workflows

Precision selection—single, double, INT8, BF16, and custom formats—governs performance and stability. Recent advances include:

Tile-wise Mixed-Precision: GEMM-MP assigns precision per tile; data is broadcast, converted on-receipt, and microkernel-computed in the C tile's format (Zhang et al., 20 Aug 2025). PaRSEC manages workflow/dataflow and enables overlap and heterogeneity.
Emulation of High-Precision GEMM: Ozaki Scheme II (OS II) emulates FP64 (or higher) GEMM using modular INT8 TCs and the Chinese Remainder Theorem (CRT): (1) scale and quantize $A$ and $B$, (2) perform $s$ modular GEMMs with coprime moduli, (3) reconstruct via CRT, (4) inverse-scale (Ozaki et al., 10 Apr 2025). This pipeline achieves aggregate FP64 emulation at 7.4–9.8 TFLOPS (RTX 4090), 56.6–80.2 TFLOPS (GH200), outperforming native FP64.
Dynamic Precision Control: Hardware-tunable accumulators (bfloat16, IEEE-754, posit) in FPGA-based FDP generators allow per-dot-product configuration for accuracy/area trade-off (Ledoux et al., 2023).
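Steps (2) and (3) of the CRT-based emulation can be shown on small integer matrices. The sketch below assumes the inputs are already scaled and quantized to integers, so steps (1) and (4) are omitted; the function names and the four moduli are illustrative choices and are not the parameters used by Ozaki Scheme II. Each modular product only ever involves residues smaller than its modulus, which is what allows narrow integer hardware to carry out the work.

```python
from math import prod


def matmul_mod(A, B, q):
    """Matrix product with all operands and results reduced modulo q."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum((A[i][p] % q) * (B[p][j] % q) for p in range(k)) % q
             for j in range(n)] for i in range(m)]


def crt_matmul(A, B, moduli):
    """Recover the exact integer product A*B from one modular GEMM per modulus.

    Requires pairwise coprime moduli and |(A*B)[i][j]| < prod(moduli) / 2.
    """
    M = prod(moduli)
    residues = [matmul_mod(A, B, q) for q in moduli]
    # CRT weight for modulus q: (M/q) * inverse of (M/q) modulo q.
    weights = [(M // q) * pow(M // q, -1, q) for q in moduli]
    m, n = len(A), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            x = sum(w * R[i][j] for w, R in zip(weights, residues)) % M
            # Map to the symmetric range so negative entries are recovered.
            C[i][j] = x - M if x > M // 2 else x
    return C


if __name__ == "__main__":
    A = [[3, -7], [12, 5]]
    B = [[-4, 9], [8, -2]]
    print(crt_matmul(A, B, [251, 253, 255, 256]))
```

The number of moduli sets the representable range of the result, so higher target precision translates into more modular GEMMs rather than wider arithmetic.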

4. Advanced Scheduling, Parallelization, and Communication Minimization

Efficient utilization of compute and memory bandwidth is achieved via:

Hierarchical Scheduling: Modern frameworks (e.g., GAMA, ACAP ML models) partition the compute grid (AIE rows/columns, packs) and use double-buffered PLIOs, cascade dataflows, and staggered kernel mapping to balance routing bandwidth and minimize stalls (Mhatre et al., 13 Apr 2025, Papalamprou et al., 10 Nov 2025).
Parallelization and Data Locality: SFC-CA GEMM introduces space-filling curve (SFC, generalized Hilbert curve) partitioning of blocked tiles, yielding platform- and shape-oblivious parallel schemes which preserve cache/locality and minimize L2/DDR transfers (Georganas et al., 22 Jan 2026). Combined with 2.5D/CA-algorithms (output-tiling replication, per-layer reductions), this achieves communication-optimality and near-perfect scaling to 128 cores, consistently outperforming vendor-tuned libraries.
Sparsity Utilization: Accelerators like SPOTS overlap IM2COL patch generation and GEMM in hardware; feature-map and filter sparsity eliminate unnecessary MACs via controller-flagged zero skipping, dynamic PE gating, and subarray partitioning (Soltaniyeh et al., 2021).
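As an illustration of space-filling-curve partitioning, the sketch below orders the tiles of a blocked output matrix along a classic Hilbert curve and hands each worker a contiguous segment of that order. It is a simplification and not the SFC-CA algorithm: the cited work uses a generalized Hilbert curve, whereas this example handles non-power-of-two grids by walking the enclosing power-of-two square and skipping points outside the grid, which can break adjacency at the skipped positions. All function names are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def tile_order(tiles_x, tiles_y):
    """Visit every tile of a tiles_x by tiles_y grid in Hilbert-curve order."""
    order = (max(tiles_x, tiles_y, 1) - 1).bit_length()
    points = (hilbert_d2xy(order, d) for d in range(1 << (2 * order)))
    return [(x, y) for x, y in points if x < tiles_x and y < tiles_y]


def split_contiguous(seq, workers):
    """Give each worker a contiguous run of the curve, sizes differing by at most one."""
    base, extra = divmod(len(seq), workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        chunks.append(seq[start:start + size])
        start += size
    return chunks
```

Because consecutive points on the curve are neighbouring tiles, each worker's segment covers a compact region of $C$ and therefore reuses the same panels of $A$ and $B$, independent of the matrix shape or core count.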

5. Specialized/Hardware-Oriented Innovations

Pipeline design is increasingly hardware-driven:

Tensor Core, SME, AMX, AIE pipelines: Careful microkernel generation, register-level mapping, and buffer placement are essential to avoid bank conflicts and routing congestion. SME's multi-tile ZA accumulators and ARM's multi-vector LD1*Z & FMOPA/ZA constructs yield up to a 1.23× speedup over Apple Accelerate (Deng et al., 25 Dec 2025). AMD Versal AIE2 pipelines with pack/cascade tiling, custom buffer placement, and staggered kernel wiring achieve 85–86% of the 195 TOPS bf16 peak, with up to 53.6% higher throughput efficiency than AIE1 (Mhatre et al., 13 Apr 2025).
Minimal Buffering Outer-Product Pipelines: O-POPE achieves 1 GHz operation with less than 2% buffer fraction for 2048 MACs by repurposing FPU pipeline registers as the synchronization buffer—no external FIFOs. Output-stationary outer-product execution ensures 99.97% utilization of floating-point units, delivering both high throughput and energy efficiency (Cammarata et al., 1 Jun 2026).
Domain-Specific GEMM Pipelines: MelT/MCFCT recasts audio frontends as single-stage GEMM pipelines—windowed Mel-NDFT using precomputed basis matrices enables up to 3.75× faster and 3.52× more energy-efficient inference, saturating modern accelerator throughput (Camargo et al., 31 May 2026).
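The output-stationary outer-product dataflow mentioned above can be written down directly: $C = \sum_{p} a_p b_p^{T}$, where $a_p$ is column $p$ of $A$ and $b_p^{T}$ is row $p$ of $B$. The sketch below shows only this dataflow and makes no claim about O-POPE's microarchitecture; the function name and the operand-load counter are illustrative assumptions. Every accumulator in $C$ stays in place while one column and one row are streamed in per step.

```python
def gemm_outer_product(A, B):
    """Compute C = A*B as a sum of k rank-1 updates with stationary accumulators."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]     # one stationary accumulator per output element
    operand_loads = 0
    for p in range(k):
        a_col = [A[i][p] for i in range(m)]   # stream in one column of A
        b_row = B[p]                          # stream in one row of B
        operand_loads += m + n
        for i in range(m):
            for j in range(n):
                C[i][j] += a_col[i] * b_row[j]   # m*n multiply-accumulates per step
    return C, operand_loads
```

Each step loads $m + n$ operands and performs $m \cdot n$ multiply-accumulates, which is why an outer-product array can keep a large number of MAC units busy with comparatively little input bandwidth and buffering.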

6. Algorithmic Innovations for Sequential/Chained GEMMs

Redundant repacking and reordering dominate sequential GEMM workloads, especially in transformers and MLPs. LP-GEMM introduces a decomposition into ini/mid/end kernels that eliminates redundant data layout conversion: only the first (ini) GEMM packs, all mid GEMMs stream in propagated layouts, and only end GEMMs restore canonical memory order (Carneiro et al., 6 Apr 2026). This yields a 2.25× speedup over OpenBLAS in sequential cases and 1.3–2× gains in multi-GEMM ML subgraphs.
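A minimal sketch of layout propagation through a chain $X W_1 W_2 \cdots$ is given below. It is an assumption-laden illustration and not LP-GEMM's actual kernels or layout: the packed format (row panels of height `MR`, stored so that `panel[p][i]` holds element $(i, p)$ of the panel), the choice of the left operand as the propagated one, and all function names are hypothetical. The point is that each GEMM both consumes and produces the packed format, so the chain packs once at the start and unpacks once at the end.

```python
def pack_row_panels(X, MR):
    """Canonical row-major X (m x k) -> list of panels with panel[p][i] = X[i0 + i][p]."""
    m, k = len(X), len(X[0])
    return [[[X[i0 + i][p] for i in range(min(MR, m - i0))] for p in range(k)]
            for i0 in range(0, m, MR)]


def gemm_on_panels(panels, W):
    """Multiply packed panels by canonical W (k x n); the output stays in packed form."""
    k, n = len(W), len(W[0])
    out = []
    for panel in panels:
        mr = len(panel[0])
        out.append([[sum(panel[p][i] * W[p][j] for p in range(k)) for i in range(mr)]
                    for j in range(n)])
    return out


def unpack_row_panels(panels):
    """Packed panels -> canonical row-major matrix."""
    rows = []
    for panel in panels:
        for i in range(len(panel[0])):
            rows.append([col[i] for col in panel])
    return rows


def chained_gemm(X, weights, MR=2):
    """Compute X * W1 * W2 * ... with one pack at the start and one unpack at the end."""
    panels = pack_row_panels(X, MR)        # "ini": the only packing step
    for W in weights:                      # every GEMM consumes and emits packed panels
        panels = gemm_on_panels(panels, W)
    return unpack_row_panels(panels)       # "end": the only unpacking step
```

A conventional library call would instead pack and unpack around every GEMM in the chain, so the saving grows with the number of chained products.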

7. Performance, Efficiency, and Complexity Metrics

Large-scale empirical evaluations demonstrate that pipeline choices are decisive for both arithmetic efficiency and memory system utilization:

Peak throughput realization: Hand-tuned or auto-generated microkernels on modern CPUs and AI engines routinely reach 80–90% of hardware peak in AVX/SVE/AMX/SME/TC designs (Alaejos et al., 2023, Deng et al., 25 Dec 2025, Mhatre et al., 13 Apr 2025).
Energy and area: Outer-product pipelining and buffer reuse (O-POPE) yield 8% higher energy efficiency and 1.09× performance density over previous FP engines (Cammarata et al., 1 Jun 2026). SFC-CA outperforms oneDNN by up to 2× in geometric mean, eliminating vendor "glass-jaw" dips (Georganas et al., 22 Jan 2026).
Adaptivity: Mixed-precision and CRT-emulated high-precision GEMMs allow fine control of performance/accuracy with negligible added complexity in the runtime pipeline (Zhang et al., 20 Aug 2025, Ozaki et al., 10 Apr 2025).

In sum, the state of the art in GEMM pipelines builds on multilevel blocking, hardware-aware data movement, precision adaptivity, and domain-fitted algorithmic decomposition. These principles, instantiated in software (BLIS, OpenBLAS, TVM, PaRSEC) and hardware (SME, AMX, AIE, GAMA, O-POPE), collectively deliver near-optimal performance and energy efficiency across both general-purpose and specialized computing environments.