Uniform-Stride Batched GEMM

Updated 12 December 2025

Uniform-Stride Batched GEMM is a specialized linear algebra technique that performs batches of small, fixed-size matrix multiplications with uniform data layouts and fixed memory strides.
It minimizes overhead by using stride-based access, enabling high-throughput performance through vectorization, optimized memory bandwidth, and register blocking.
This design is pivotal for deep learning, scientific simulation, and tensor contractions, achieving speedups from 3.5× to 600% compared to traditional pointer-array methods.

Uniform-stride batched GEMM (general matrix-matrix multiplication) designates a class of BLAS-like primitives and libraries that perform a batch of independent small matrix multiplications, where all submatrices share identical shape, data layout, and are stored at regular fixed strides in contiguous memory. This design is recognized for enabling high-throughput linear algebra on modern CPUs and GPUs, optimizing for bandwidth, instruction-level parallelism, and vectorization. Uniform-stride interfaces have replaced older pointer-array “batched GEMM” interfaces in several performance-critical domains, including deep learning, scientific simulation, and tensor contraction, by drastically reducing overheads and enabling hardware-efficient load/store patterns (Banchelli et al., 10 Jan 2025, Georganas et al., 2019, Shi et al., 2016, Jhurani et al., 2013, Boukaram et al., 2021).

1. Formal Definition and Memory Layout

Uniform-stride batched GEMM computes, for $i = 0, \ldots, B-1$ : $C[i] \leftarrow \alpha\, \mathrm{op}(A[i])\, \mathrm{op}(B[i]) + \beta\, C[i]$ where each $A[i]$ is $m \times k$ , $B[i]$ is $k \times n$ , $C[i]$ is $m \times n$ , and $\mathrm{op}(\cdot)$ denotes transpose or conjugate-transpose. All submatrices are packed in memory so that:

$A[i] = A_\text{base} + i \cdot \text{stride}_A$
$B[i] = B_\text{base} + i \cdot \text{stride}_B$
$C[i] = C_\text{base} + i \cdot \text{stride}_C$

With column-major layout, for each $i$ , the $(r, c)$ entry of $A[i]$ is at: $A_\text{base} + i \cdot \text{stride}_A + c \cdot \text{lda} + r$ and similarly for $B[i]$ and $C[i]$ . The strides are typically set to cover each submatrix (e.g., $\text{stride}_A = \text{lda} \cdot k$ for $A_i$ of size $m\times k$ ). No pointer arrays or explicit packing between matrices are required (Banchelli et al., 10 Jan 2025, Shi et al., 2016, Jhurani et al., 2013).

2. API Signatures and Implementation Interfaces

Uniform-stride batched GEMM interfaces are instantiated in both C/C++ and Fortran. For example:

void TGEMM_multi_uniform( 
    char      transa, char transb,
    int       m, int n, int k,
    const T*  alpha,
    const T*  A3D, int lda, int lda2,
    const T*  B3D, int ldb, int ldb2,
    const T*  beta,
    T*        C3D, int ldc, int ldc2,
    int       batchCount );

Here,

\text{lda2}

\text{ldb2}

\text{ldc2}

denote the strides between successive matrices in each batch (Jhurani et al., 2013). Analogous APIs exist in modern BLAS and vendor libraries such as cublasGemmStridedBatched for CUDA, and STRIDEDBATCHEDGEMM in MKL/CPU backends (Shi et al., 2016, Boukaram et al., 2021). These APIs contrast with pointer-of-pointers variants, which incur extra pointer dereference and cannot be vectorized or scheduled as efficiently on hardware.

3. Algorithmic and Vectorization Strategies

Uniform-stride enables batch-level fusion at the outer loop:

1
2
3

for (e = 0; e < E; ++e) {
    // fully unrolled small-MxNxK microkernel operating on A[e], B[e], C[e]
}

For small

m,n,k

(e.g.,

m,n,k\leq 20

) common in scientific applications and DL, the inner loops are unrolled into a microkernel, typically register-blocked so that each output accumulator

rC[i][j]

remains in a register or vector lane. On long-vector architectures (e.g., RISC-V VL=256),

A

is loaded as scalars, broadcast for FMA (e.g., RVV vfmacc.vf), and

B

and

C

are loaded/stored via strided instructions (e.g., vlse.v, vsse.v). Indexed loading is only used as a fallback for non-contiguous data (Banchelli et al., 10 Jan 2025).

No explicit packing is required: with uniform-stride, all batch operands can be streamed through regular/bulk loads and stores, enabling hardware prefetching, full memory coalescing (on GPU), and SIMD vectorization. Microkernel parameters (tile size, register blocking) are chosen by register pressure and vector length to maximize occupancy and instruction-level parallelism (Georganas et al., 2019).

4. Performance Considerations and Modeling

Uniform-stride batched GEMM achieves bandwidth and kernel launch efficiencies unattainable by pointer-array interfaces. The critical performance metrics are:

Floating-point throughput:

$\text{GFLOP/s} = \frac{2\,B\,m\,n\,k}{T_\text{batch}}$

Arithmetic intensity:

$\mathrm{AI} = \frac{2\,m\,n\,k}{8\,(m\,k + k\,n + m\,n)}$

with units of flops/byte for double precision.

Speedups relative to reference BLAS “loop-of-calls” or pointer-array batched GEMM range from $3.5\times$ to $32.6\times$ on RISC-V VL256 (Banchelli et al., 10 Jan 2025), and $30\%$ – $600\%$ on NVIDIA GPUs for $m\leq16$ (Jhurani et al., 2013). On AVX512 CPUs, batch-reduce GEMMs in deep learning achieve $63$– $84\%$ of architectural peak, outperforming vendor-optimized routines (e.g., MKL-DNN) by $1.1\times$ – $1.7\times$ depending on primitive and batch shape (Georganas et al., 2019).

Performance is compute-bound when arithmetic intensity exceeds the bandwidth:flop ratio (e.g., for NVIDIA V100, AI $> 0.0087$ ). For small matrices, batching is essential to achieve sufficient occupancy and amortize kernel-launch and memory setup. For very large batches or larger matrix sizes (e.g., $N>512$ ), the difference between batched and “flattened” GEMM is reduced (Shi et al., 2016, Boukaram et al., 2021).

Platform	Example size	Speedup vs Ref	Peak throughput
RISC-V VL256	$2\times2\times2$ to $20\times9\times10$	$3.5\times$ – $32.6\times$	$8$ elements/cycle (GFLOPS in batch)
NVIDIA K20c	$10\times10$ , $16\times16$	$30\%$ – $600\%$	$104$–$216$ GFLOP/s
V100	$512$–$128$	N/A	$1.0$–$7.1$ TFLOP/s
SKX-AVX512	Various (DL)	$1.1$– $1.7\times$ vs MKL	$70$– $84\%$ architectural peak

5. Applications and Primitives

Uniform-stride batched GEMM has become foundational for:

Finite element and scientific simulation: Small-matrix batched DGEMMs dominate compute in explicit time-stepping, local element-wise updates, or repeated integration kernels (e.g., SeisSol earthquake simulation with $1.8\times$ overall speedup, kernel fraction $53.5\%$ ) (Banchelli et al., 10 Jan 2025).
Tensor contractions: STRIDEDBATCHEDGEMM eliminates the need for explicit copy or transpose, accelerating mode contractions and higher-order SVD/Tucker decompositions by $2$– $100\times$ (Shi et al., 2016).
Deep learning: Batch-reduce GEMM is leveraged as the universal building block across convolutional, LSTM, and fully-connected primitives by tuning tile/blocking around a uniform-stride microkernel. All major operations in LSTM, CNN (direct convolution), and MLP layers reduce to specialized loop nests invoking the same uniform-stride kernel. Empirically, this reduces DL primitive source code and outperforms hand-tuned vendor libraries across architectures (Georganas et al., 2019).
Tile low-rank (TLR) factorization: Cholesky and $LDL^T$ factorizations for compressed tiles make heavy use of uniform-stride batched GEMM for grouped operations with identical $(m,n,k)$ , exploiting dynamic batching and packing for optimal GPU utilization (Boukaram et al., 2021).

6. Architectural Considerations, Limitations, and Tuning

Architectural bottlenecks include register spilling (for large $m,n,k$ ), the cost of indexed loads, and batch size ( $B$ ) under-utilization for very small matrices on GPU/CPU. Key heuristics and implementation parameters:

Prefer contiguous layout whenever possible with $A$ , $B$ , $C$ stored back-to-back, to minimize use of indexed loads (CPU) or enable fully coalesced memory accesses (GPU) (Banchelli et al., 10 Jan 2025, Boukaram et al., 2021).
Tile/block size, and batch thresholds ( $B_{\min}$ ) must be tuned according to register pressure, cache size, and device occupancy—e.g., $m_b=64, n_b=6$ for AVX512, $b=512$ for V100 tile low-rank kernels (Georganas et al., 2019, Boukaram et al., 2021).
For highly irregular batch workloads (e.g., varying matrix ranks in TLR), tasks are grouped by $(m,n,k)$ triples before launch; groups smaller than a threshold are handled by pointer-based fallback or serial execution (Boukaram et al., 2021).
On CPU, batching along the highest-stride mode yields further cache/direct-mapped speedup; on GPU, batches must be large enough to saturate streaming multiprocessors for maximum bandwidth. Explicit software prefetch and loop unrolling are used for optimal kernel emission (Shi et al., 2016, Georganas et al., 2019).
Limitations arise when tensors contractions require batching along the first index of a column-major matrix; "extended-op" flags or alternate packing may be required (Shi et al., 2016).

7. Comparative Analysis and Research Milestones

Uniform-stride batched GEMM represents a convergence point between high-performance library design and domain-specific architectural tuning. Early works (e.g., (Jhurani et al., 2013)) demonstrated dramatic speedups on small-matrix workloads by replacing pointer-array APIs with stride-based interfaces and register-blocked tiling, especially on early GPUs. Subsequent extensions generalized these techniques for CPUs and complex deep learning primitives, showing that a single optimized microkernel and loop structure suffices for broad classes of algorithms (Georganas et al., 2019). Adoption into vendor BLAS (e.g., cuBLAS, MKL's STRIDEDBATCHEDGEMM) has made these techniques mainstream across simulation, ML, and compressed/low-rank numerical linear algebra (Shi et al., 2016, Boukaram et al., 2021).

Theoretical performance modeling with arithmetic intensity and Amdahl's law has guided bottleneck analysis and kernel tuning. Uniform-stride designs now define state-of-the-art for batched dense linear algebra across heterogeneous architectures and remain a key topic in research on batched and tiled numerical algorithms.