
Grouped GEMM Library: Optimized Batch Ops

Updated 4 October 2025
  • Grouped GEMM Library is a computational framework that generalizes and accelerates batched matrix-matrix multiplications by slicing high-dimensional tensors into 2D operations.
  • It employs advanced slicing techniques and storage-aware scheduling to optimize memory access and maximize hardware throughput for diverse computational workloads.
  • Its design supports applications in quantum chemistry, general relativity, and machine learning by aggregating similar kernel calls and managing fallbacks to maintain near-peak performance.

A Grouped GEMM Library refers to a class of computational frameworks and implementation strategies that generalize, optimize, and accelerate grouped or batched general matrix–matrix multiplication (GEMM) operations. These libraries are central to high-performance computing (HPC), scientific simulations, and modern AI workloads, including deep learning and LLMs, where they must efficiently perform many small or heterogeneous GEMM operations that differ in size, shape, or numerical precision. Several foundational works have established systematic methodologies for mapping these grouped operations onto highly tuned computational kernels—most notably the GEMM routine from BLAS—and have developed advanced techniques for optimizing memory access, kernel selection, and mapping higher-dimensional tensor contractions to batched or grouped GEMM calls (Napoli et al., 2013).

1. Mathematical Foundations and Classification of Grouped GEMM Operations

The mathematical basis of a grouped GEMM library lies in the extension of the standard matrix-matrix product to arbitrarily grouped or batched forms. The fundamental GEMM operation is typically expressed as

C = \alpha\,AB + \beta\,C,

where A, B, and C are conformant matrices, and α, β are scalars.
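
In grouped form, many such updates with heterogeneous shapes are executed together. As a minimal sketch (NumPy, with hypothetical problem sizes; `grouped_gemm` is an illustrative name, not a library API):

```python
import numpy as np

def grouped_gemm(groups, alpha=1.0, beta=0.0):
    """Apply C_i <- alpha * A_i @ B_i + beta * C_i for each (A_i, B_i, C_i).

    Each group may have different (m, k, n) shapes, which is what
    distinguishes grouped GEMM from uniformly batched GEMM.
    """
    for A, B, C in groups:
        C[...] = alpha * (A @ B) + beta * C

# Three problems with heterogeneous shapes (illustrative sizes only).
rng = np.random.default_rng(0)
groups = [
    (rng.standard_normal((m, k)), rng.standard_normal((k, n)), np.zeros((m, n)))
    for m, k, n in [(4, 8, 4), (16, 3, 5), (2, 2, 9)]
]
grouped_gemm(groups, alpha=2.0)
```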

In grouped or batched contexts, libraries must efficiently schedule and compute multiple such operations, typically distinguished by their operand shapes, numerical types, or memory layouts. In the context of tensor contractions, grouped GEMM arises as a systematic slicing of higher-dimensional contractions into a sequence of two-dimensional GEMM calls. Specifically, consider a contraction in Einstein notation:

R^{\alpha\ldots} = T^{\alpha\beta\ldots\gamma}\, S_{\gamma\ldots},

which, after appropriate index reordering ("slicing"), can be expressed as a set of independent GEMMs across a batch of slices (Napoli et al., 2013).
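
To make the slicing concrete, consider a rank-3 by rank-2 contraction R[a, b, e] = Σ_g T[a, b, g] S[g, e]. Slicing T along the free index a yields one independent GEMM per slice; the NumPy sketch below (illustrative shapes) checks the sliced evaluation against a direct einsum evaluation:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((6, 4, 5))   # indices (a, b, g)
S = rng.standard_normal((5, 3))      # indices (g, e)

# Direct evaluation of R[a, b, e] = sum_g T[a, b, g] * S[g, e].
R_ref = np.einsum('abg,ge->abe', T, S)

# Sliced evaluation: one 2D GEMM per value of the free index a.
# Each slice T[a] has one free index (b) and one contracted index (g),
# exactly the pattern that maps onto BLAS 3.
R = np.empty((6, 4, 3))
for a in range(T.shape[0]):
    R[a] = T[a] @ S          # (4, 5) @ (5, 3) -> (4, 3)

assert np.allclose(R, R_ref)
```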

The referenced work classifies these contractions into three classes based on the number of free (uncontracted) indices per operand, Δ(t_i) = N(t_i) − p, where N(t_i) is the total number of indices of operand t_i and p is the number of contracted indices:

| Class | Δ(t_1) | Δ(t_2) | Output Shape | Preferred Routine | Slicing Strategy |
|-------|--------|--------|--------------|-------------------|------------------|
| 1 | 0 | 0 | Scalar | BLAS 1 (dot) | Full slicing; no 2D slices |
| 2 | ≥1 | 0 | Vector/Tensor | BLAS 2 (GEMV) | All free indices but one sliced |
| 3 | ≥1 | ≥1 | Tensor | BLAS 3 (GEMM) | Slicing to 2D with one free and one contracted index per slice |

The only case where GEMM is explicitly used is Class 3, and in the context of grouped operations, this means scheduling a collection of GEMMs where slices may vary in shape and storage layout (Napoli et al., 2013).
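
Each class maps to a different BLAS level, which can be illustrated with one representative contraction per class. In the NumPy sketch below (arbitrary shapes), `np.dot` and `@` stand in for the corresponding BLAS 1/2/3 routines:

```python
import numpy as np

rng = np.random.default_rng(2)

# Class 1: no free indices on either operand -> scalar (BLAS 1, dot).
x, y = rng.standard_normal(8), rng.standard_normal(8)
s = np.dot(x, y)                      # x_g y_g -> scalar

# Class 2: free indices on one operand only -> vector (BLAS 2, GEMV).
A, v = rng.standard_normal((4, 8)), rng.standard_normal(8)
w = A @ v                             # A_{ig} v_g -> w_i

# Class 3: free indices on both operands -> matrix (BLAS 3, GEMM).
B = rng.standard_normal((8, 5))
C = A @ B                             # A_{ig} B_{gj} -> C_{ij}
```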

2. Principles of Slicing and Mapping to GEMM

A core principle for grouped GEMM libraries is "slicing" higher-dimensional tensor operands to extract a maximal number of GEMM-compatible 2D slices. For an efficient GEMM mapping, three formal requirements are established (Napoli et al., 2013):

  • R1: The stride-1 (memory-contiguous) dimension must not be sliced.
  • R2: Each operand must be sliced along all but two modes (yielding 2D slices for GEMM).
  • R3: Each 2D slice must correspond to exactly one free and one contracted index.

Deviating from R1 (slicing the fast dimension) forces a "copy + GEMM" strategy (F1 in the paper), incurring a performance penalty. If R2 or R3 cannot be enforced, the contraction must fall back to lower-level routines (BLAS 2 or 1), degrading throughput.
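
Whether a slice violates R1 can be checked directly from its memory layout: in a row-major array the last axis is the stride-1 dimension, and slicing it produces non-contiguous 2D views that a GEMM kernel can only consume after packing. A minimal NumPy check, with illustrative shapes:

```python
import numpy as np

T = np.zeros((6, 4, 5))  # row-major: axis 2 is the stride-1 dimension

good = T[0]        # slice along axis 0: keeps the stride-1 axis intact
bad = T[:, :, 0]   # slice along axis 2: violates R1

print(good.flags['C_CONTIGUOUS'])  # True  -> direct GEMM possible
print(bad.flags['C_CONTIGUOUS'])   # False -> needs "copy + GEMM"

# The copy that an R1 violation forces (the F1 fallback in the paper):
bad_packed = np.ascontiguousarray(bad)
```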

Optimization then involves determining a "slicing vector" that aligns the unsliced axes with the largest sizes and the stride-1 dimension, minimizing copying and maximizing memory bandwidth efficiency. The slicing strategy, including explicit rules (recipes) for different contraction classes and index permutations, is detailed in the canonical work (Napoli et al., 2013), with explicit case studies in quantum chemistry and general relativity applications.
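
One plausible form of this optimization is a small search over candidate axis pairs, scoring each by whether it preserves the stride-1 axis and by the sizes of the axes left unsliced. The sketch below is a hypothetical heuristic for illustration, not the paper's algorithm; `choose_unsliced_axes` and the weighting are invented here:

```python
def choose_unsliced_axes(shape, free_axes, contracted_axes):
    """Pick the (free, contracted) axis pair to leave unsliced.

    Hypothetical heuristic: favor pairs that keep the stride-1 axis
    (the last axis, row-major) unsliced per R1, then maximize the
    resulting GEMM dimensions.
    """
    stride1 = len(shape) - 1
    best, best_score = None, -1.0
    for f in free_axes:
        for c in contracted_axes:
            score = shape[f] * shape[c]
            if stride1 in (f, c):   # R1: never slice the fast dimension
                score *= 10         # arbitrary weight; illustrative only
            if score > best_score:
                best, best_score = (f, c), score
    return best

# T with indices (a, b, g): a, b free; g contracted and stride-1.
print(choose_unsliced_axes((6, 4, 5), free_axes=[0, 1], contracted_axes=[2]))
# -> (0, 2): keep a and g unsliced, slice over b
```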

3. Data Storage, Memory Alignment, and Performance Implications

Performance in grouped GEMM libraries depends critically on the data storage scheme. If tensors are generated and stored so that contracted modes do not correspond to the stride-1 dimension, slicing can exploit contiguous memory access without extra copying (Napoli et al., 2013). Otherwise, additional data movement or packing steps are required, reducing effective GEMM throughput.

Attaining near-peak hardware performance (≥90% of theoretical) is possible when these storage and slicing principles are respected. Explicit recommendations are made to design scientific codes and tensor libraries so that the fastest-changing dimension is always free or corresponds to a non-contracted mode. The main performance penalty comes from storage layouts forcing the use of "copy + GEMM" or falling back to lower BLAS levels.

Grouped GEMM also improves memory efficiency and cache utilization, since multiple slices can be prefetched or cached together, increasing temporal locality. For maximal efficiency, grouping should also consider hardware constraints at the microkernel and cache-blocking level.
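
The cost of a layout-induced copy can be observed directly. The sketch below (NumPy, illustrative sizes) times GEMMs over contiguous slices against GEMMs over strided slices, which NumPy must internally pack before multiplying; exact ratios are hardware-dependent, and the point is only that the packing step is pure overhead:

```python
import time
import numpy as np

rng = np.random.default_rng(3)
T = rng.standard_normal((256, 256, 64))  # axis 2 is the stride-1 dimension
S = rng.standard_normal((256, 256))

# Layout respecting R1: slices along the slowest axis are contiguous.
U = np.ascontiguousarray(T.transpose(2, 0, 1))  # U[i] is contiguous (256, 256)

def bench(fn, reps=5):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

direct = bench(lambda: [U[i] @ S for i in range(64)])          # direct GEMM
penalty = bench(lambda: [T[:, :, i] @ S for i in range(64)])   # implicit copy + GEMM

print(f"contiguous slices: {direct:.4f}s   strided slices: {penalty:.4f}s")
```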

4. Recipes for Efficient Grouped GEMM Library Design

The grouped GEMM library design space involves:

  • Index Slicing Recipes: Given a collection of tensor contractions, choose the index permutations ("slicing vectors") that ensure R1-R3 are satisfied for as many contractions as possible.
  • Storage-Aware Scheduling: At library generation or runtime, adapt to the storage layout, possibly reordering or copying slices to match contiguous access for the given GEMM kernel.
  • Kernel Aggregation: Aggregate GEMM calls of similar size or shape to amortize overhead and align with vectorization or parallelization strategies (see the sketch after this list).
  • Fallback Management: For contractions that cannot be mapped to BLAS 3, identify where BLAS 2 (GEMV) or BLAS 1 (dot) must be used, and plan accordingly for expected performance drops.
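
A minimal sketch of kernel aggregation, assuming a list of independent (A, B) problems: same-shape problems are bucketed, stacked into 3D arrays, and dispatched as one batched matmul per bucket, amortizing per-call overhead. `aggregate_gemms` is an illustrative name, not a library API:

```python
from collections import defaultdict
import numpy as np

def aggregate_gemms(problems):
    """Group GEMM problems by shape so same-shape calls can be batched."""
    buckets = defaultdict(list)
    for idx, (A, B) in enumerate(problems):
        buckets[(A.shape, B.shape)].append(idx)

    results = [None] * len(problems)
    for indices in buckets.values():
        As = np.stack([problems[i][0] for i in indices])
        Bs = np.stack([problems[i][1] for i in indices])
        Cs = As @ Bs                  # one batched GEMM per shape bucket
        for i, C in zip(indices, Cs):
            results[i] = C
    return results

rng = np.random.default_rng(4)
probs = [(rng.standard_normal((4, 8)), rng.standard_normal((8, 4))) for _ in range(3)]
probs += [(rng.standard_normal((2, 3)), rng.standard_normal((3, 5)))]
out = aggregate_gemms(probs)   # two buckets: 3 batched calls + 1 singleton
```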

Explicit implementation examples are provided in (Napoli et al., 2013), ranging from simple matrix multiplication and higher-order double contractions to 4th-rank tensor contractions in quantum chemistry. These demonstrate the sharp efficiency contrast between optimal and suboptimal slicing/storage schemes.

5. Application Domains and Practical Considerations

Grouped GEMM libraries are indispensable in domains where high-order tensor contractions dominate computational load:

  • Quantum Chemistry: Coupled cluster and configuration interaction methods involve repeated multi-mode contractions where both implementation and storage strategy are decisive to reach hardware efficiency limits.
  • General Relativity and Multiphysics Simulations: High-rank tensor contractions with irregular index patterns occur, with performance highly sensitive to optimal slicing and storage.
  • Machine Learning: Deep learning backends and tensor compilers increasingly rely on generalized contraction patterns. Modern frameworks select slicing and batching strategies at the operator or graph level to map to grouped GEMM.

For these applications, the guidelines derived from (Napoli et al., 2013) provide foundational recipes for designing both user-level scientific code and the underlying tensor/BLAS libraries.

6. Future Directions and Open Challenges

Several research challenges and directions are suggested (Napoli et al., 2013):

  • Automatic Slicing Selection: Developing methods for automating the choice of optimal slicing and data layout at compile time or runtime, possibly leveraging compiler analyses or domain-specific languages.
  • Higher-Level BLAS Abstractions: Extending BLAS beyond level 3 by natively supporting multidimensional tensor contractions, thus avoiding explicit slicing in user code.
  • Domain-Specific Library Development: Building libraries that exploit symmetry, sparsity, or other domain characteristics in grouped GEMM contexts.
  • Integration with Tensor Compilers and Autotuning: Embedding slicing and storage awareness into emerging ML operator compilers and autotuning systems to maximize performance portability and resilience to dataset variance.

This synthesis is based entirely on the key findings and implementation details from "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions" (Napoli et al., 2013).

References (1)

  • Napoli et al. (2013). "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions."
