tinygemm: Efficient GEMM Specializations for Accelerators
- tinygemm is a family of compact, high-performance GEMM libraries tailored for low-latency execution on small matrices and batched inference.
- It employs uniform-stride batching, adaptive tiling via code generation, and quantized GEMM with learned low-bit formats to maximize throughput.
- Deployments on NVIDIA GPUs, ARMv8 CPUs, and LLM inference environments achieve significant speedups over conventional BLAS, cuBLAS, and PyTorch backends.
tinygemm is a family of compact, high-efficiency libraries and design patterns for General Matrix-Matrix Multiplication (GEMM), specialized for low-latency execution on small matrices or small-batch inference on modern hardware accelerators. Deployed on NVIDIA GPUs (from Kepler through Ampere and later), ARMv8 CPUs, and transformer LLM serving stacks, the term “tinygemm” encompasses: (1) a uniform-stride, hand-specialized batched GEMM for small matrices, (2) architectures developed via code generation and adaptive tiling for small-block GEMM, and (3) quantized GEMM with learned low-bit formats (any4, any3, any2) optimized for LLMs. Implementations in this lineage, notably (Jhurani et al., 2013), (Yao et al., 2022), and (Elhoushi et al., 7 Jul 2025), combine bespoke hardware mappings, load/store coalescing, and minimal-overhead API/ABI design to achieve substantial throughput improvements over generic BLAS, cuBLAS, and PyTorch backends on small problem sizes. Applications include finite element simulations, embedded ML, and low-latency LLM inference.
1. Uniform-Stride Batched GEMM for Small Matrices (NVIDIA GPUs)
The earliest “tinygemm” implementation (Jhurani et al., 2013) targets NVIDIA GPUs running CUDA Toolkit 5.0 (Tesla K20c, sm_35), focusing on batched GEMM where each input/output matrix is sub-16×16. The design replaces the cuBLAS pointer-array API

```c
cublasStatus_t cublasTgemmBatched(handle, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount);
```

with a uniform-stride interface that passes one contiguous 3D array per operand, described by a leading dimension (`lda`) and a per-matrix stride (`lda2`):

```c
cudaError_t TGEMM_multi_uniform(transa, transb, m, n, k, alpha, A3D, lda, lda2, B3D, ldb, ldb2, beta, C3D, ldc, ldc2, batchCount);
```
This formulation avoids host/device pointer lists and pointer indirection, enabling the kernel to use simple arithmetic addressing. CUDA kernels are specialized per matrix size via template instantiation, using blocks of threads in which each thread computes one element of C. Tiles of A and B are loaded into shared memory, and all dot products are register-accumulated with full loop unrolling. Functor-based compile-time logic manages transposition, conjugation, and scaling cases, resulting in branch-free CUDA execution.
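To make the addressing scheme concrete, the following is a minimal, hypothetical CUDA sketch of a uniform-stride batched kernel in this spirit (it is not the published tinygemm code; the kernel name, launch configuration, and the 8×8×8 specialization are illustrative assumptions). Each batch entry is handled by one thread block, each thread accumulates one element of C in registers, and operands are located purely by base-pointer plus stride arithmetic.

```cuda
// Hypothetical sketch, not the published tinygemm kernel: uniform-stride
// batched GEMM for a compile-time matrix size. Operands are contiguous 3D
// arrays; the b-th matrix is found by base + b * second-stride, so no
// pointer arrays are needed on host or device.
#include <cuda_runtime.h>

template <int M, int N, int K>
__global__ void tiny_gemm_uniform(const float* __restrict__ A, int lda, long lda2,
                                  const float* __restrict__ B, int ldb, long ldb2,
                                  float* __restrict__ C, int ldc, long ldc2,
                                  float alpha, float beta, int batchCount)
{
    const int batch = blockIdx.x;
    if (batch >= batchCount) return;

    // Arithmetic addressing instead of pointer indirection.
    const float* Ab = A + (long)batch * lda2;
    const float* Bb = B + (long)batch * ldb2;
    float*       Cb = C + (long)batch * ldc2;

    const int row = threadIdx.x;   // 0..M-1, blockDim = (M, N)
    const int col = threadIdx.y;   // 0..N-1

    // Stage the small column-major operands in shared memory once per block.
    __shared__ float sA[K][M];
    __shared__ float sB[N][K];
    for (int k = col; k < K; k += N) sA[k][row] = Ab[row + k * lda];
    for (int k = row; k < K; k += M) sB[col][k] = Bb[k + col * ldb];
    __syncthreads();

    // Fully unrolled, register-accumulated dot product: one C element per thread.
    float acc = 0.f;
#pragma unroll
    for (int k = 0; k < K; ++k) acc += sA[k][row] * sB[col][k];

    float out = alpha * acc;
    if (beta != 0.f) out += beta * Cb[row + col * ldc];
    Cb[row + col * ldc] = out;
}

// Example: 8x8x8 matrices, column-major, packed contiguously (stride 64 floats).
inline void launch_tiny_gemm_8x8x8(const float* A, const float* B, float* C,
                                   float alpha, float beta, int batchCount)
{
    dim3 block(8, 8);
    tiny_gemm_uniform<8, 8, 8><<<batchCount, block>>>(
        A, 8, 64, B, 8, 64, C, 8, 64, alpha, beta, batchCount);
}
```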
| Property | cuBLAS Batched | tinygemm “Uniform” |
|---|---|---|
| Arg. type | pointer-of-pointer | base+strides (contiguous) |
| Launch overhead | High | Low |
| Stride uniformity | No | Yes |
Peak single-precision performance observed is 104 GF/s and 216 GF/s, depending on matrix shape, at large batch counts. Speedups over cuBLAS’s batched routine span roughly 30%–600%, depending on data type and shape; the gain is attributed both to the interface design and to write-optimized GPU kernels. Extension to larger sizes is possible with new template specializations and “factorized blocking.”
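For scale, a rough back-of-the-envelope calculation (my arithmetic, assuming 16×16×16 operands purely for illustration):

$$
2 \cdot 16^3 = 8192 \ \text{flops per product}, \qquad
\frac{216 \times 10^{9}\ \text{flop/s}}{8192\ \text{flops}} \approx 2.6 \times 10^{7}\ \text{small GEMMs per second}.
$$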
2. Code Generation and Input-Aware Adaptive Tiling on ARMv8
The second generation of “tinygemm” leverages the IAAT (Input-Aware Adaptive Tuning) framework (Yao et al., 2022) to optimize small GEMM tasks on ARMv8 microarchitectures. The approach is two-phase:
- Install-time: Auto-generate a rich table of hand-tuned SIMD/assembly microkernels covering nearly all tile-shape pairs up to the target sizes, for each precision (SGEMM/DGEMM/CGEMM/ZGEMM) and transposition case.
- Run-time: Decompose the input GEMM into tiles matched to pre-generated kernel shapes, choosing the tiling that minimizes data movement under an analytic cost model.
Loads and FMAs are costed in the model; the largest feasible tiles are chosen to maximize L2-to-register efficiency, subject to SIMD alignment.
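A hedged sketch of what such a data-movement model can look like (notation mine, not taken from the paper): for an $M \times N \times K$ problem tiled with $m_r \times n_r$ register tiles,

$$
\mathrm{FMAs} = MNK, \qquad
\mathrm{loads}(m_r, n_r) \;\approx\; \Big\lceil \tfrac{M}{m_r} \Big\rceil \Big\lceil \tfrac{N}{n_r} \Big\rceil \, K \,(m_r + n_r),
$$

so the tiler favors the largest $(m_r, n_r)$ available in the generated kernel table that keeps loads per FMA low while respecting register-file capacity and SIMD-width alignment.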
Each microkernel executes its assigned tile without a boundary-handling patch (edge tiles get their own microkernels), eliminating the classic “pack” overhead that is a dominant runtime bottleneck for very small sizes in OpenBLAS/BLIS. Kernels are designed to maximize register utilization and keep both FMA pipelines of the ARM core busy.
Performance benchmarks on Kunpeng 920 (ARMv8.2) report consistent speedups over OpenBLAS, ARMPL, and BLIS for both SGEMM and DGEMM across small square shapes (M = N = K). The approach attains a large fraction of the architectural roofline in all but the transposed (TN) cases, where limitations in vector load coalescing necessitate a scalar fallback.
3. 4-Bit and Lower-Bit GEMM for LLM Inference with any4/any3/any2
The most recent iteration of tinygemm, detailed in (Elhoushi et al., 7 Jul 2025), targets LLM inference with quantized weights. Here, the library supports:
- Standard formats: fp16, bf16, int4, fp4, nf4
- Learned formats: any4, any3, any2, which introduce per-row learned LUTs (16/8/4 entries) for 4/3/2-bit quantization.
Quantization proceeds via group-wise asymmetric scaling: each element is encoded as a scaled, bias-corrected value plus a LUT lookup, with the LUTs derived via weighted K-means optimized against forward-pass activation statistics from a single curated calibration sample.
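A plausible formalization of this encoding (symbols and layout are my assumptions, not notation from the paper): for output row $i$, group $g$, and 4-bit code $q_{ij} \in \{0,\dots,15\}$,

$$
\hat{w}_{ij} \;=\; s_{i,g} \cdot \mathrm{LUT}_i[\,q_{ij}\,] \;+\; z_{i,g},
$$

where $s_{i,g}$ and $z_{i,g}$ are the group-wise asymmetric scale and bias and $\mathrm{LUT}_i$ is the row’s 16-entry table learned by weighted K-means (8 or 4 entries for any3/any2).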
CUDA kernels use "mma" tensor-core instructions, with kernel launch path switched based on batch :
- For , weights are loaded as left matrices to exploit the mma kernel (maximum tile).
- For , conventional mma right-map is used.
All key quantization and dequantization steps (including LUT selection) are performed in-register, using CUDA warp-wide shuffles (), with global loads fully coalesced (16 B) and no shared-memory usage for small .
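As an illustration of the in-register LUT technique, here is a hedged CUDA sketch (kernel and buffer names are invented; this is not the tinygemm kernel and it omits the fused matmul): the first 16 lanes of a warp each hold one LUT entry in a register, and `__shfl_sync` with the 4-bit code as the source lane performs the table lookup entirely in the register file.

```cuda
// Hedged sketch (not the actual tinygemm kernel): dequantize packed 4-bit
// codes against a 16-entry per-row LUT using warp shuffles only. Launch as
// dequant_rows<<<rows, 32>>>(...), with cols_packed a multiple of 32 so the
// warp stays converged for __shfl_sync.
#include <cuda_fp16.h>

__device__ __forceinline__ half dequant4_lut(unsigned packed8, int idx,
                                             half lut_entry, half scale, half bias)
{
    // Extract the idx-th 4-bit code from a 32-bit word holding eight codes.
    const int code = (packed8 >> (4 * idx)) & 0xF;
    // Register-file "table lookup": read the LUT entry held by lane `code`.
    const half centroid = __shfl_sync(0xFFFFFFFFu, lut_entry, code);
    // Group-wise asymmetric scaling: scale * LUT[code] + bias.
    return __hadd(__hmul(scale, centroid), bias);
}

__global__ void dequant_rows(const unsigned* __restrict__ qweight, // eight 4-bit codes per word
                             const half* __restrict__ lut,         // 16 fp16 entries per row
                             const half* __restrict__ scales,      // one scale per row (single group, for brevity)
                             const half* __restrict__ biases,      // one bias per row
                             half* __restrict__ out, int cols_packed)
{
    const int row  = blockIdx.x;
    const int lane = threadIdx.x;                        // one warp per row
    const half lut_entry = lut[row * 16 + (lane & 15)];  // lanes 0..15 carry the row's LUT
    const half s = scales[row];
    const half b = biases[row];

    for (int c = lane; c < cols_packed; c += 32) {
        const unsigned packed8 = qweight[row * cols_packed + c];  // coalesced 32-bit loads
        for (int i = 0; i < 8; ++i)
            out[(row * cols_packed + c) * 8 + i] = dequant4_lut(packed8, i, lut_entry, s, b);
    }
}
```

In the library itself this lookup is fused directly into the mma pipeline rather than materializing dequantized weights, as described below.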
The tinygemm C++/Python API enables transparent switching between quantized GEMM and native fp16/bf16 for embedding or output layers. Host-side routines upload indices, group scales/biases, and per-row LUTs; at inference, a unified GEMM invocation fuses dequantization and matmul.
4. Practical Applications and Performance Characteristics
tinygemm finds direct application where:
- Batched small-matrix GEMM dominates (finite elements, microkernel ML).
- LLM transformers are deployed at small batch for latency-critical serving.
- The target hardware exposes SIMD, tensor-core, or similar acceleration where packing, pointer indirection, or boundary scalar code limits mainstream BLAS kernel utility.
On an A100 (PyTorch 2.3, M = 1–8), the int4, nf4, and any4 kernels deliver speedups over the PyTorch bf16 tensor-core ("mma") baseline. End-to-end LLM inference throughput likewise improves over the bf16 baseline at small batch sizes, primarily from eliminating separate dequantization passes and avoiding non-coalesced loads.
5. Integration Patterns, Limitations, and Portability
Integration of tinygemm depends on architecture and target API:
- On NVIDIA GPUs: Replace batched cuBLAS or PyTorch matmul with tinygemm’s API, providing uniform-stride activations and prepacked weight/LUT buffers (see the packing sketch after this list).
- On ARM: Build system compiles IAAT-generated kernel tables and integrates a runtime tiler/dispatcher. No JIT required; all kernel functions are static at link time.
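A hedged sketch of the NVIDIA-side preparation step (all helper names here are hypothetical, not the tinygemm API): weights are packed into 4-bit LUT indices offline, the code/LUT buffers are uploaded once, and inference then only launches the fused kernel against those buffers.

```cuda
// Hypothetical pre-packing flow; names are illustrative, not the tinygemm API.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdint>
#include <vector>

// Assign each (already scale/bias-normalized) weight to its nearest LUT centroid
// and pack eight 4-bit codes per 32-bit word.
static std::vector<uint32_t> pack_row_any4(const std::vector<float>& w,
                                           const std::vector<float>& lut /* 16 centroids */)
{
    std::vector<uint32_t> packed((w.size() + 7) / 8, 0u);
    for (size_t j = 0; j < w.size(); ++j) {
        int best = 0;
        for (int c = 1; c < 16; ++c)
            if (std::fabs(w[j] - lut[c]) < std::fabs(w[j] - lut[best])) best = c;
        packed[j / 8] |= uint32_t(best) << (4 * (j % 8));
    }
    return packed;
}

int main()
{
    // Toy row of 64 weights; in any4 the 16 centroids come from weighted K-means,
    // here they are hand-picked purely to keep the example self-contained.
    std::vector<float> w(64), lut(16);
    for (int j = 0; j < 64; ++j) w[j]   = 0.03f * j - 1.0f;
    for (int c = 0; c < 16; ++c) lut[c] = 0.125f * c - 1.0f;

    const std::vector<uint32_t> codes = pack_row_any4(w, lut);

    // One-time upload of prepacked weight codes and the per-row LUT; at inference
    // a fused dequant+matmul kernel reads these buffers directly.
    uint32_t* d_codes = nullptr;
    float*    d_lut   = nullptr;
    cudaMalloc(&d_codes, codes.size() * sizeof(uint32_t));
    cudaMalloc(&d_lut,   lut.size()   * sizeof(float));
    cudaMemcpy(d_codes, codes.data(), codes.size() * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lut,   lut.data(),   lut.size()   * sizeof(float),    cudaMemcpyHostToDevice);

    cudaFree(d_codes);
    cudaFree(d_lut);
    return 0;
}
```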
Current known limitations by context:
- NVIDIA GPU (K20c lineage): Best performance for very small matrices (roughly 16×16 and below); scaling to larger sizes requires new kernel specializations.
- ARMv8 IAAT: Transpose-N fallback is scalar and less efficient; porting to SVE/x86 requires ISA-specific codegen.
- LLM any4/any3/any2: Focused on Ampere and later GPUs; for larger batch sizes, shared-memory tiling is not yet enabled. Very large N can also incur LUT memory overhead (16 fp16 entries per row).
- Empirical LLM support extends up to 70B-parameter models; MLLMs and next-generation ASICs are subject to ongoing evaluation.
Potential extensions noted include run-time adaptive hybrid pack/no-pack selection, multi-threaded block dispatch, and integration with orthogonal quantization methods (AWQ, GPTQ).
6. Technical Summary and Comparative Features
The architectural commonality in all “tinygemm” incarnations is the removal of general pointer indirection, scalar cleanups, and packing costs in exchange for either:
- Uniform, stride-based memory layouts with per-size optimized kernels (Jhurani et al., 2013),
- Exhaustively code-generated microkernels plus input-aware tiling (Yao et al., 2022), or
- Fused quantized dequantization and matmul using on-the-fly LUTs (Elhoushi et al., 7 Jul 2025).
This specialization is justified by the growing share of ML and HPC workloads dominated by small-matrix GEMM bottlenecks. The table below summarizes the evolution:
| Implementation | Hardware | API strategy | Kernel specialization | Formats supported | Notable speedup |
|---|---|---|---|---|---|
| (Jhurani et al., 2013) | NVIDIA K20c | 3D base+stride arrays | per-size templates (up to 16×16) | fp32, fp64, cfloat/cdouble | 1.3×+ vs. cuBLAS |
| (Yao et al., 2022) (IAAT) | ARMv8 | Static kernel table | auto-generated, all small tile shapes | s/d/c/z types | 1.8×+ vs. BLIS |
| (Elhoushi et al., 7 Jul 2025) | Ampere+ GPU | Descriptor+LUT API | mma tensor-core, in-register LUTs | fp16/bf16, int4, any4/any3/any2 | 2×+ vs. PyTorch bf16 |
A plausible implication is that further generalization to new ISA backends (SVE, POWER9) and dynamic kernel dispatch (for “medium” problem sizes) will dominate future tinygemm research and deployment. At present, tinygemm is an exemplar of the effectiveness of hardware-proximate, workload-specific GEMM specialization in contemporary ML and simulation pipelines.