tinygemm: Efficient GEMM Specializations for Accelerators
- tinygemm is a family of compact, high-performance GEMM libraries tailored for low-latency execution on small matrices and batched inference.
- It employs uniform-stride batching, adaptive tiling via code generation, and quantized GEMM with learned low-bit formats to maximize throughput.
- Deployments on NVIDIA GPUs, ARMv8 CPUs, and LLM inference environments achieve significant speedups over conventional BLAS, cuBLAS, and PyTorch backends.
tinygemm is a family of compact, high-efficiency libraries and design patterns for General Matrix-Matrix Multiplication (GEMM), specialized for low-latency execution on small matrices or small-batch inference on modern hardware accelerators. Deployed on NVIDIA GPUs (from Kepler through Ampere and later), ARMv8 CPUs, and transformer LLM serving stacks, the term “tinygemm” encompasses: (1) a uniform-stride, hand-specialized batched GEMM for small matrices, (2) architectures developed via code generation and adaptive tiling for small-block GEMM, and (3) quantized GEMM with learned low-bit formats (any4, any3, any2) optimized for LLMs. Implementations in this lineage, notably (Jhurani et al., 2013), (Yao et al., 2022), and (Elhoushi et al., 7 Jul 2025), combine bespoke hardware mappings, load/store coalescing, and minimal-overhead API/ABI design to achieve substantial throughput improvements over generic BLAS, cuBLAS, and PyTorch backends on small problem sizes. Applications include finite element simulations, embedded ML, and low-latency LLM inference.
1. Uniform-Stride Batched GEMM for Small Matrices (NVIDIA GPUs)
The earliest “tinygemm” implementation (Jhurani et al., 2013) targets NVIDIA GPUs running CUDA Toolkit 5.0 (Tesla K20c, sm_35), focusing on batched GEMM where each input/output matrix is sub-16×16. The design replaces the cuBLAS pointer-array API

```c
cublasStatus_t cublasTgemmBatched(handle, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount);
```

with a uniform-stride interface that passes one contiguous 3D array per operand, described by a leading dimension (`lda`) and a per-matrix stride (`lda2`):

```c
cudaError_t TGEMM_multi_uniform(transa, transb, m, n, k, alpha, A3D, lda, lda2, B3D, ldb, ldb2, beta, C3D, ldc, ldc2, batchCount);
```
This formulation avoids host/device pointer lists and pointer indirection, enabling the kernel to use simple arithmetic addressing. CUDA kernels are specialized per matrix size via template instantiation, using blocks of threads in which each thread computes one element of C. Tiles of A and B are loaded into shared memory, and all dot products are register-accumulated with full loop unrolling. Functor-based compile-time logic manages transposition, conjugation, and scaling cases, resulting in branch-free CUDA execution.
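To make the addressing scheme concrete, the following is a minimal, hypothetical CUDA sketch of a uniform-stride batched kernel in this spirit (it is not the published tinygemm code; the kernel name, launch configuration, and the 8×8×8 specialization are illustrative assumptions). Each batch entry is handled by one thread block, each thread accumulates one element of C in registers, and operands are located purely by base-pointer plus stride arithmetic.

```cuda
// Hypothetical sketch, not the published tinygemm kernel: uniform-stride
// batched GEMM for a compile-time matrix size. Operands are contiguous 3D
// arrays; the b-th matrix is found by base + b * second-stride, so no
// pointer arrays are needed on host or device.
#include <cuda_runtime.h>

template <int M, int N, int K>
__global__ void tiny_gemm_uniform(const float* __restrict__ A, int lda, long lda2,
                                  const float* __restrict__ B, int ldb, long ldb2,
                                  float* __restrict__ C, int ldc, long ldc2,
                                  float alpha, float beta, int batchCount)
{
    const int batch = blockIdx.x;
    if (batch >= batchCount) return;

    // Arithmetic addressing instead of pointer indirection.
    const float* Ab = A + (long)batch * lda2;
    const float* Bb = B + (long)batch * ldb2;
    float*       Cb = C + (long)batch * ldc2;

    const int row = threadIdx.x;   // 0..M-1, blockDim = (M, N)
    const int col = threadIdx.y;   // 0..N-1

    // Stage the small column-major operands in shared memory once per block.
    __shared__ float sA[K][M];
    __shared__ float sB[N][K];
    for (int k = col; k < K; k += N) sA[k][row] = Ab[row + k * lda];
    for (int k = row; k < K; k += M) sB[col][k] = Bb[k + col * ldb];
    __syncthreads();

    // Fully unrolled, register-accumulated dot product: one C element per thread.
    float acc = 0.f;
#pragma unroll
    for (int k = 0; k < K; ++k) acc += sA[k][row] * sB[col][k];

    float out = alpha * acc;
    if (beta != 0.f) out += beta * Cb[row + col * ldc];
    Cb[row + col * ldc] = out;
}

// Example: 8x8x8 matrices, column-major, packed contiguously (stride 64 floats).
inline void launch_tiny_gemm_8x8x8(const float* A, const float* B, float* C,
                                   float alpha, float beta, int batchCount)
{
    dim3 block(8, 8);
    tiny_gemm_uniform<8, 8, 8><<<batchCount, block>>>(
        A, 8, 64, B, 8, 64, C, 8, 64, alpha, beta, batchCount);
}
```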
| Property | cuBLAS Batched | tinygemm “Uniform” |
|---|---|---|
| Arg. type | pointer-of-pointer | base+strides (contiguous) |
| Launch overhead | High | Low |
| Stride uniformity | No | Yes |
Peak single-precision performance observed is 104 GF/s and 216 GF/s, depending on matrix shape, at large batch counts. Speedups over cuBLAS’s batched routine span roughly 30%–600%, depending on data type and shape; the gain is attributed both to the interface design and to write-optimized GPU kernels. Extension to larger sizes is possible with new template specializations and “factorized blocking.”
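For scale, a rough back-of-the-envelope calculation (my arithmetic, assuming 16×16×16 operands purely for illustration):

$$
2 \cdot 16^3 = 8192 \ \text{flops per product}, \qquad
\frac{216 \times 10^{9}\ \text{flop/s}}{8192\ \text{flops}} \approx 2.6 \times 10^{7}\ \text{small GEMMs per second}.
$$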
2. Code Generation and Input-Aware Adaptive Tiling on ARMv8
The second generation of “tinygemm” leverages the IAAT (Input-Aware Adaptive Tuning) framework (Yao et al., 2022) to optimize small GEMM tasks on ARMv8 microarchitectures. The approach is two-phase:
- Install-time: Auto-generate a rich table of hand-tuned SIMD/assembly microkernels covering nearly all tile-shape pairs up to the target sizes, for each precision (SGEMM/DGEMM/CGEMM/ZGEMM) and transposition case.
- Run-time: Decompose the input GEMM into tiles matched to pre-generated kernel shapes, choosing the tiling that minimizes data movement under an analytic cost model.
Loads and FMAs are costed in the model; the largest feasible tiles are chosen to maximize L2-to-register efficiency, subject to SIMD alignment.
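A hedged sketch of what such a data-movement model can look like (notation mine, not taken from the paper): for an $M \times N \times K$ problem tiled with $m_r \times n_r$ register tiles,

$$
\mathrm{FMAs} = MNK, \qquad
\mathrm{loads}(m_r, n_r) \;\approx\; \Big\lceil \tfrac{M}{m_r} \Big\rceil \Big\lceil \tfrac{N}{n_r} \Big\rceil \, K \,(m_r + n_r),
$$

so the tiler favors the largest $(m_r, n_r)$ available in the generated kernel table that keeps loads per FMA low while respecting register-file capacity and SIMD-width alignment.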
Each microkernel executes its assigned tile without a boundary-handling patch (edge tiles get their own microkernels), eliminating the classic “pack” overhead that is a dominant runtime bottleneck for very small sizes in OpenBLAS/BLIS. Kernels are designed to maximize register utilization and keep both FMA pipelines of the ARM core busy.
Performance benchmarks on Kunpeng 920 (ARMv8.2) report consistent speedups over OpenBLAS, ARMPL, and BLIS for both SGEMM and DGEMM across small square shapes (M = N = K). The approach attains a large fraction of the architectural roofline in all but the transposed (TN) cases, where limitations in vector load coalescing necessitate a scalar fallback.
3. 4-Bit and Lower-Bit GEMM for LLM Inference with any4/any3/any2
The most recent iteration of tinygemm, detailed in (Elhoushi et al., 7 Jul 2025), targets LLM inference with quantized weights. Here, the library supports:
- Standard formats: fp16, bf16, int4, fp4, nf4
- Learned formats: any4, any3, any2, which introduce per-row learned LUTs (16/8/4 entries) for 4/3/2-bit quantization.
Quantization proceeds via group-wise asymmetric scaling: each element is encoded as a scaled, bias-corrected value plus a LUT lookup, with the LUTs derived via weighted K-means optimized against forward-pass activation statistics from a single curated calibration sample.
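A plausible formalization of this encoding (symbols and layout are my assumptions, not notation from the paper): for output row $i$, group $g$, and 4-bit code $q_{ij} \in \{0,\dots,15\}$,

$$
\hat{w}_{ij} \;=\; s_{i,g} \cdot \mathrm{LUT}_i[\,q_{ij}\,] \;+\; z_{i,g},
$$

where $s_{i,g}$ and $z_{i,g}$ are the group-wise asymmetric scale and bias and $\mathrm{LUT}_i$ is the row’s 16-entry table learned by weighted K-means (8 or 4 entries for any3/any2).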
CUDA kernels use "mma" tensor-core instructions, with kernel launch path switched based on batch :
- For , weights are loaded as left matrices to exploit the mma kernel (maximum tile).
- For , conventional mma right-map is used.
All key quantization and dequantization steps (including LUT selection) are performed in-register, using CUDA warp-wide shuffles (), with global loads fully coalesced (16 B) and no shared-memory usage for small .
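As an illustration of the in-register LUT technique, here is a hedged CUDA sketch (kernel and buffer names are invented; this is not the tinygemm kernel and it omits the fused matmul): the first 16 lanes of a warp each hold one LUT entry in a register, and `__shfl_sync` with the 4-bit code as the source lane performs the table lookup entirely in the register file.

```cuda
// Hedged sketch (not the actual tinygemm kernel): dequantize packed 4-bit
// codes against a 16-entry per-row LUT using warp shuffles only. Launch as
// dequant_rows<<<rows, 32>>>(...), with cols_packed a multiple of 32 so the
// warp stays converged for __shfl_sync.
#include <cuda_fp16.h>

__device__ __forceinline__ half dequant4_lut(unsigned packed8, int idx,
                                             half lut_entry, half scale, half bias)
{
    // Extract the idx-th 4-bit code from a 32-bit word holding eight codes.
    const int code = (packed8 >> (4 * idx)) & 0xF;
    // Register-file "table lookup": read the LUT entry held by lane `code`.
    const half centroid = __shfl_sync(0xFFFFFFFFu, lut_entry, code);
    // Group-wise asymmetric scaling: scale * LUT[code] + bias.
    return __hadd(__hmul(scale, centroid), bias);
}

__global__ void dequant_rows(const unsigned* __restrict__ qweight, // eight 4-bit codes per word
                             const half* __restrict__ lut,         // 16 fp16 entries per row
                             const half* __restrict__ scales,      // one scale per row (single group, for brevity)
                             const half* __restrict__ biases,      // one bias per row
                             half* __restrict__ out, int cols_packed)
{
    const int row  = blockIdx.x;
    const int lane = threadIdx.x;                        // one warp per row
    const half lut_entry = lut[row * 16 + (lane & 15)];  // lanes 0..15 carry the row's LUT
    const half s = scales[row];
    const half b = biases[row];

    for (int c = lane; c < cols_packed; c += 32) {
        const unsigned packed8 = qweight[row * cols_packed + c];  // coalesced 32-bit loads
        for (int i = 0; i < 8; ++i)
            out[(row * cols_packed + c) * 8 + i] = dequant4_lut(packed8, i, lut_entry, s, b);
    }
}
```

In the library itself this lookup is fused directly into the mma pipeline rather than materializing dequantized weights, as described below.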
The tinygemm C++/Python API enables transparent switching between quantized GEMM and native fp16/bf16 for embedding or output layers. Host-side routines upload indices, group scales/biases, and per-row LUTs; at inference, a unified GEMM invocation fuses dequantization and matmul.
4. Practical Applications and Performance Characteristics
tinygemm finds direct application where:
- Batched small-matrix GEMM dominates (finite elements, microkernel ML).
- LLM transformers are deployed at small batch for latency-critical serving.
- The target hardware exposes SIMD, tensor-core, or similar acceleration where packing, pointer indirection, or boundary scalar code limits mainstream BLAS kernel utility.
On an A100 (PyTorch 2.3, M = 1–8), the int4, nf4, and any4 kernels deliver speedups over the PyTorch bf16 tensor-core ("mma") baseline. End-to-end LLM inference throughput likewise improves over the bf16 baseline at small batch sizes, primarily from eliminating separate dequantization passes and avoiding non-coalesced loads.
5. Integration Patterns, Limitations, and Portability
Integration of tinygemm depends on architecture and target API:
- On NVIDIA GPUs: Replace batched cuBLAS or PyTorch matmul with tinygemm’s API, providing uniform-stride activations and prepacked weight/LUT buffers (see the packing sketch after this list).
- On ARM: Build system compiles IAAT-generated kernel tables and integrates a runtime tiler/dispatcher. No JIT required; all kernel functions are static at link time.
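A hedged sketch of the NVIDIA-side preparation step (all helper names here are hypothetical, not the tinygemm API): weights are packed into 4-bit LUT indices offline, the code/LUT buffers are uploaded once, and inference then only launches the fused kernel against those buffers.

```cuda
// Hypothetical pre-packing flow; names are illustrative, not the tinygemm API.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdint>
#include <vector>

// Assign each (already scale/bias-normalized) weight to its nearest LUT centroid
// and pack eight 4-bit codes per 32-bit word.
static std::vector<uint32_t> pack_row_any4(const std::vector<float>& w,
                                           const std::vector<float>& lut /* 16 centroids */)
{
    std::vector<uint32_t> packed((w.size() + 7) / 8, 0u);
    for (size_t j = 0; j < w.size(); ++j) {
        int best = 0;
        for (int c = 1; c < 16; ++c)
            if (std::fabs(w[j] - lut[c]) < std::fabs(w[j] - lut[best])) best = c;
        packed[j / 8] |= uint32_t(best) << (4 * (j % 8));
    }
    return packed;
}

int main()
{
    // Toy row of 64 weights; in any4 the 16 centroids come from weighted K-means,
    // here they are hand-picked purely to keep the example self-contained.
    std::vector<float> w(64), lut(16);
    for (int j = 0; j < 64; ++j) w[j]   = 0.03f * j - 1.0f;
    for (int c = 0; c < 16; ++c) lut[c] = 0.125f * c - 1.0f;

    const std::vector<uint32_t> codes = pack_row_any4(w, lut);

    // One-time upload of prepacked weight codes and the per-row LUT; at inference
    // a fused dequant+matmul kernel reads these buffers directly.
    uint32_t* d_codes = nullptr;
    float*    d_lut   = nullptr;
    cudaMalloc(&d_codes, codes.size() * sizeof(uint32_t));
    cudaMalloc(&d_lut,   lut.size()   * sizeof(float));
    cudaMemcpy(d_codes, codes.data(), codes.size() * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lut,   lut.data(),   lut.size()   * sizeof(float),    cudaMemcpyHostToDevice);

    cudaFree(d_codes);
    cudaFree(d_lut);
    return 0;
}
```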
Current known limitations by context:
- NVIDIA GPU (K20c lineage): Best performance for very small matrices (roughly 16×16 and below); scaling to larger sizes requires new kernel specializations.
- ARMv8 IAAT: Transpose-N fallback is scalar and less efficient; porting to SVE/x86 requires ISA-specific codegen.
- LLM any4/any3/any2: Focused on Ampere and later GPUs; for larger batch sizes, shared-memory tiling is not yet enabled. Very large N can also incur LUT memory overhead (16 fp16 entries per row).
- Empirical LLM support extends up to 70B-parameter models; MLLMs and next-generation ASICs are subject to ongoing evaluation.
Potential extensions noted include run-time adaptive hybrid pack/no-pack selection, multi-threaded block dispatch, and integration with orthogonal quantization methods (AWQ, GPTQ).
6. Technical Summary and Comparative Features
The architectural commonality in all “tinygemm” incarnations is the removal of general pointer indirection, scalar cleanups, and packing costs in exchange for either:
- Uniform, stride-based memory layouts with per-size optimized kernels (Jhurani et al., 2013),
- Exhaustively code-generated microkernels plus input-aware tiling (Yao et al., 2022), or
- Fused quantized dequantization and matmul using on-the-fly LUTs (Elhoushi et al., 7 Jul 2025).
This specialization is justified by the growing share of ML and HPC workloads dominated by small-matrix GEMM bottlenecks. The table below summarizes the evolution:
| Implementation | Hardware | API strategy | Kernel specialization | Formats supported | Notable speedup |
|---|---|---|---|---|---|
| (Jhurani et al., 2013) | NVIDIA K20c | 3D base+stride arrays | per-size templates (up to 16×16) | fp32, fp64, cfloat/cdouble | 1.3×+ vs. cuBLAS |
| (Yao et al., 2022) (IAAT) | ARMv8 | Static kernel table | auto-generated, all small tile shapes | s/d/c/z types | 1.8×+ vs. BLIS |
| (Elhoushi et al., 7 Jul 2025) | Ampere+ GPU | Descriptor+LUT API | mma tensor-core, in-register LUTs | fp16/bf16, int4, any4/any3/any2 | 2×+ vs. PyTorch bf16 |
A plausible implication is that further generalization to new ISA backends (SVE, POWER9) and dynamic kernel dispatch (for “medium” problem sizes) will dominate future tinygemm research and deployment. At present, tinygemm is an exemplar of the effectiveness of hardware-proximate, workload-specific GEMM specialization in contemporary ML and simulation pipelines.