
tinygemm: Efficient GEMM Specializations for Accelerators

Updated 13 November 2025
  • tinygemm is a family of compact, high-performance GEMM libraries tailored for low-latency execution on small matrices and batched inference.
  • It employs uniform-stride batching, adaptive tiling via code generation, and quantized GEMM with learned low-bit formats to maximize throughput.
  • Deployments on NVIDIA GPUs, ARMv8 CPUs, and LLM inference environments achieve significant speedups over conventional BLAS, cuBLAS, and PyTorch backends.

tinygemm is a family of compact, high-efficiency libraries and design patterns for General Matrix-Matrix Multiplication (GEMM) specialized for low-latency execution on small matrices or small-batch inference on modern hardware accelerators. Deployed in the contexts of NVIDIA GPUs (Kepler, Ampere, and later), ARMv8 CPUs, and transformer LLM serving, the term “tinygemm” encompasses: (1) a uniform-stride, hand-specialized batched GEMM for small $m$, (2) architectures developed via code generation and adaptive tiling for small-block GEMM, and (3) quantized GEMM with learned low-bit formats (any4, any3, any2) optimized for LLMs. Implementations in this lineage (notably (Jhurani et al., 2013), (Yao et al., 2022), and (Elhoushi et al., 7 Jul 2025)) combine bespoke hardware mappings, load/store coalescing, and minimal-overhead API/ABI design to achieve up to order-of-magnitude throughput improvements over generic BLAS, cuBLAS, and PyTorch backends for small $(M,N,K)$. Applications include finite element simulations, embedded ML, and low-latency LLM inference.

1. Uniform-Stride Batched GEMM for Small Matrices (NVIDIA GPUs)

The earliest “tinygemm” implementation (Jhurani et al., 2013) targets NVIDIA GPUs running CUDA Toolkit 5.0 (Tesla K20c, sm_35), focusing on batched GEMM where each input/output matrix is at most $16 \times 16$. The design replaces the cuBLAS pointer-array API

cublasStatus_t cublasTgemmBatched(handle, transa, transb, m, n, k, alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount);
with a “uniform” interface taking contiguous 3D arrays and explicit strides:

cudaError_t TGEMM_multi_uniform(transa, transb, m, n, k, alpha, A3D, lda, lda2, B3D, ldb, ldb2, beta, C3D, ldc, ldc2, batchCount);

This formulation avoids host/device pointer lists and pointer indirection, enabling the kernel to use simple arithmetic addressing. CUDA kernels are specialized per $m$ (template instantiation, $m \leq 16$), using blocks of $m^2$ threads, each computing one $C_{ij}$ element. Tiles of $A$ and $B$ are loaded into shared memory, and all dot products are accumulated in registers with full loop unrolling. Functor-based compile-time logic handles transposition, conjugation, and the $\alpha/\beta$ cases, yielding branch-free CUDA execution.
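
The per-$m$ specialization strategy can be illustrated with a minimal sketch. The kernel below is not the published implementation; it assumes single precision, no transposition, $k = m$, and $\alpha = 1$, $\beta = 0$, but it shows the key ingredients: stride arithmetic instead of pointer arrays, one thread block of $m^2$ threads per batch entry, shared-memory tiles, and a fully unrolled, register-accumulated dot product.

// Minimal sketch of the specialization strategy (not the published kernel).
// One thread block per batch entry, M*M threads, each accumulating one C(i,j).
// Assumes single precision, column-major storage, no transposition, k == m,
// alpha = 1, beta = 0.
template <int M>
__global__ void tiny_gemm_uniform(const float* __restrict__ A3D, int lda, int lda2,
                                  const float* __restrict__ B3D, int ldb, int ldb2,
                                  float* __restrict__ C3D, int ldc, int ldc2)
{
    const int b = blockIdx.x;                  // batch index: plain stride arithmetic,
    const float* A = A3D + (size_t)b * lda2;   // no pointer-array indirection
    const float* B = B3D + (size_t)b * ldb2;
    float*       C = C3D + (size_t)b * ldc2;

    const int i = threadIdx.x % M;             // row of C handled by this thread
    const int j = threadIdx.x / M;             // column of C handled by this thread

    __shared__ float As[M][M], Bs[M][M];       // whole operands fit in shared memory
    As[i][j] = A[i + j * lda];                 // column-major loads, coalesced in i
    Bs[i][j] = B[i + j * ldb];
    __syncthreads();

    float acc = 0.f;
#pragma unroll
    for (int p = 0; p < M; ++p)                // fully unrolled, register-accumulated
        acc += As[i][p] * Bs[p][j];

    C[i + j * ldc] = acc;                      // alpha = 1, beta = 0 case
}

A host launch would instantiate the template per size, e.g. tiny_gemm_uniform<10><<<batchCount, 100>>>(...), mirroring the per-$m$ template-instantiation approach described above.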

| Property | cuBLAS batched | tinygemm “uniform” |
| --- | --- | --- |
| Argument type | pointer-of-pointers | base + strides (contiguous) |
| Launch overhead | High | Low |
| Stride uniformity | No | Yes |

Peak performance observed: $104$ GF/s at $m=10$ and $216$ GF/s at $m=16$ in single precision (batch count $100{,}000$). Speedups over cuBLAS’s batched routine range from $1.3\times$ to $6\times$ ($30$–$600\,\%$), depending on type and shape; the performance gain is attributed both to the interface design and to write-optimized GPU kernels. Extension to $m > 16$ is possible with new template specializations and “factorized blocking.”

2. Code Generation and Input-Aware Adaptive Tiling on ARMv8

The second generation of “tinygemm” leverages the IAAT (Input-Aware Adaptive Tuning) framework (Yao et al., 2022) to optimize small GEMM tasks on ARMv8 microarchitectures. The approach is two-phase:

  • Install-time: Auto-generate a rich table of hand-tuned SIMD/assembly microkernels for nearly all $(m_c, n_c)$ pairs up to the target sizes, covering SGEMM/DGEMM/CGEMM/ZGEMM and all NN/NT/TN/TT transpose combinations.
  • Run-time: Decompose $(M,N)$ into tiles matched to pre-generated kernel shapes, choosing the tiling that minimizes data movement:

(m+n)K + 2mn

Loads and FMAs are costed in this model; larger tiles are preferred to maximize register and L2 reuse, subject to SIMD alignment constraints.

Each microkernel executes its assigned $(m_i, n_j)$ tile without a boundary-handling patch (edge tiles have their own exact-shape microkernels), eliminating the classic “pack” overhead, a dominant runtime bottleneck (up to $67\,\%$ of runtime for very small sizes in OpenBLAS/BLIS). Kernels are designed to maximize register utilization and keep the dual FMA pipelines on ARM busy.
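
As a rough illustration of the run-time phase, the following host-side sketch (names are illustrative placeholders, not the IAAT API) scores candidate tile shapes from the pre-generated kernel table with the data-movement model above and picks the cheapest one per element of $C$.

// Hedged sketch of the run-time tiling idea (illustrative names, not IAAT's API):
// given the table of pre-generated microkernel shapes, pick the (mc, nc) tile that
// minimizes the data-movement cost (mc + nc) * K + 2 * mc * nc, normalized by the
// tile's output size so differently shaped tiles are comparable.
#include <vector>
#include <limits>

struct KernelShape { int mc, nc; };   // one entry per generated microkernel

KernelShape pick_tile(const std::vector<KernelShape>& table, int M, int N, int K)
{
    KernelShape best{1, 1};
    double best_cost = std::numeric_limits<double>::max();
    for (const KernelShape& s : table) {
        if (s.mc > M || s.nc > N) continue;             // tile must fit the problem
        double traffic = double(s.mc + s.nc) * K         // A and B panel loads
                       + 2.0 * s.mc * s.nc;              // C read + write
        double cost = traffic / (double(s.mc) * s.nc);   // traffic per C element
        if (cost < best_cost) { best_cost = cost; best = s; }
    }
    return best;   // edge tiles get their own exact-shape microkernels, so no cleanup code
}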

Performance benchmarks (Kunpeng920, ARMv8.2):

| Shape ($M=N=K$) | Speedup vs. OpenBLAS | Speedup vs. ARMPL | Speedup vs. BLIS |
| --- | --- | --- | --- |
| SGEMM NN ($\leq 80$) | $1.81\times$ | $2.30\times$ | $20.17\times$ |
| DGEMM NN | $1.48\times$ | $1.66\times$ | $15\times$ |

This approach attains up to $90\,\%$ of the architectural roofline in all but the TN cases, where limitations in vector load coalescing necessitate a scalar fallback.

3. 4-Bit and Lower-Bit GEMM for LLM Inference with any4/any3/any2

The most recent iteration of tinygemm, detailed in (Elhoushi et al., 7 Jul 2025), targets LLM inference with quantized weights. Here, the library supports:

  • Standard formats: fp16, bf16, int4, fp4, nf4
  • Learned formats: any4, any3, any2, which introduce per-row learned LUTs (16/8/4 entries) for 4/3/2-bit quantization.

Quantization proceeds via group-wise asymmetric scaling:

\alpha_{i,G} = \frac{\max_{j \in G} w_{i,j} - \min_{j \in G} w_{i,j}}{Q_{\max} - Q_{\min}}, \qquad \beta_{i,G} = \min_{j \in G} w_{i,j}

Each element is encoded as a scaled, bias-corrected LUT index; the LUTs are derived via weighted K-means, with weights taken from forward-pass activation statistics of a single curated calibration sample.
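
A minimal host-side sketch of the group-wise scaling step is shown below. It uses a uniform 4-bit grid for clarity, whereas any4 replaces that grid with the learned per-row LUT; the function and variable names are placeholders rather than the tinygemm API.

// Illustrative sketch of the group-wise asymmetric scaling above
// (uniform int4 grid shown; the learned any4 LUT replaces the uniform grid).
// Names are placeholders, not the tinygemm API.
#include <vector>
#include <cstdint>
#include <algorithm>
#include <cmath>

// Quantize one row of weights with group size G into 4-bit codes plus per-group scale/bias.
void quantize_row_int4(const std::vector<float>& w, int G,
                       std::vector<uint8_t>& codes,
                       std::vector<float>& scales, std::vector<float>& biases)
{
    const float Qmax = 15.0f, Qmin = 0.0f;            // 4-bit unsigned code range
    codes.resize(w.size());
    for (size_t g0 = 0; g0 < w.size(); g0 += G) {
        size_t g1 = std::min(g0 + (size_t)G, w.size());
        float wmin = *std::min_element(w.begin() + g0, w.begin() + g1);
        float wmax = *std::max_element(w.begin() + g0, w.begin() + g1);
        float alpha = (wmax - wmin) / (Qmax - Qmin);   // alpha_{i,G}
        float beta  = wmin;                            // beta_{i,G}
        if (alpha == 0.0f) alpha = 1.0f;               // guard constant groups
        scales.push_back(alpha);
        biases.push_back(beta);
        for (size_t j = g0; j < g1; ++j)
            codes[j] = (uint8_t)std::lround(
                std::clamp((w[j] - beta) / alpha, Qmin, Qmax));
    }
}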

CUDA kernels use "mma" tensor-core instructions, with the kernel launch path switched on the batch size $M$:

  • For $M \leq 8$, weights are loaded as the left matrix to exploit the $16 \times 16$ mma tile (the maximum tile size).
  • For $M > 8$, the conventional $8 \times 16$ mma mapping with weights on the right is used.

All key quantization and dequantization steps (including the LUT lookup) are performed in registers using CUDA warp-wide shuffles (__shfl_sync), with global loads fully coalesced (16 B) and no shared-memory usage for small $M$.
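
The warp-shuffle LUT lookup can be sketched as follows. This is an illustrative device function, not the actual tinygemm kernel or its register layout; it assumes each lane of a warp holds one LUT entry (promoted to float here) in a register, with only lanes 0–15 meaningful for any4, and that dequantization composes the LUT value with the per-group scale and bias from the quantization step above.

// Hedged device-side sketch of in-register LUT dequantization with warp shuffles.
// Each lane holds one LUT entry in a register; a 4-bit code is dequantized by
// shuffling the value from the lane holding that entry, so no shared memory
// and no extra global traffic are needed.
__device__ __forceinline__ float dequant_any4(uint32_t code4,     // 4-bit index, 0..15
                                              float    lut_entry, // this lane's LUT value
                                              float    scale, float bias)
{
    // Fetch the LUT entry held by lane `code4` within the warp.
    float v = __shfl_sync(0xffffffffu, lut_entry, (int)code4);
    // Apply the per-group affine correction alpha * v + beta.
    return v * scale + bias;
}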

The tinygemm C++/Python API enables transparent switching between quantized GEMM and native fp16/bf16 for embedding or output layers. Host-side routines upload indices, group scales/biases, and per-row LUTs; at inference, a unified GEMM invocation fuses dequantization and matmul.

4. Practical Applications and Performance Characteristics

tinygemm finds direct application where:

  • Batched small-matrix GEMM dominates the workload (e.g., finite element methods, small embedded ML kernels).
  • LLM transformers are served at small batch sizes for latency-critical inference.
  • The target hardware exposes SIMD, tensor-core, or similar acceleration, but packing, pointer indirection, or boundary scalar code limits the utility of mainstream BLAS kernels.

On A100 (PyTorch 2.3, M=1–8):

| Format | tinygemm speedup over PyTorch bf16 "mma" |
| --- | --- |
| int4 | $\approx 3\times$ |
| nf4, any4 | $\approx 2\times$ |

End-to-end LLM inference throughput shows $1.8\times$–$2.5\times$ gains vs. the bf16 baseline (batch size $1$–$4$), primarily from elimination of quantization overhead and avoidance of non-coalesced loads.

5. Integration Patterns, Limitations, and Portability

Integration of tinygemm depends on architecture and target API:

  • On NVIDIA GPUs: Replace batched cuBLAS or PyTorch matmul with tinygemm’s API, providing uniform-stride activations and prepacked weight/LUT buffers.
  • On ARM: The build system compiles the IAAT-generated kernel table and integrates a run-time tiler/dispatcher. No JIT is required; all kernel functions are resolved statically at link time.

Current known limitations by context:

  • NVIDIA GPU (K20c): Best performance for $m \leq 16$; scaling to larger $m$ requires new kernel specializations.
  • ARMv8 IAAT: The TN fallback is scalar and less efficient; porting to SVE or x86 requires ISA-specific code generation.
  • LLM any4/any3/any2: Focused on Ampere and later GPUs; for $M \gg 8$, shared-memory tiling is not yet enabled. Very large $N$ can incur LUT memory overhead ($16$ fp16 values per row).
  • Empirical LLM results cover models up to $70$B parameters; MLLMs and next-generation ASICs remain under evaluation.

Potential extensions noted include run-time adaptive hybrid pack/no-pack selection, multi-threaded block dispatch for $M, N > 80$, and integration with orthogonal quantization methods (AWQ, GPTQ).

6. Technical Summary and Comparative Features

The architectural commonality across all “tinygemm” incarnations is the removal of general pointer indirection, scalar cleanup code, and packing costs, traded for shape-specialized kernels: per-$m$ template instantiation on NVIDIA GPUs, install-time generated microkernel tables on ARMv8, and LUT-fused tensor-core kernels for quantized LLM inference.

This specialization is justified by the prevalence of small-matrix GEMM bottlenecks in current ML and HPC workloads. The table below summarizes the evolution:

| Implementation | Hardware | API strategy | Kernel specialization | Formats supported | Notable speedup |
| --- | --- | --- | --- | --- | --- |
| (Jhurani et al., 2013) | NVIDIA K20c | 3D base + stride arrays | $m \leq 16$, templated | fp32, fp64, cfloat/cdouble | $1.3$–$6\times$ vs. cuBLAS |
| (Yao et al., 2022) (IAAT) | ARMv8 | Static kernel table | Auto-generated, all $(m,n)$ | s/d/c/z types | $1.8$–$20\times$ vs. BLIS |
| (Elhoushi et al., 7 Jul 2025) | Ampere+ GPU | Descriptor + LUT API | mma tensor-core, LUTs | fp16/bf16, int4, any4/any3/any2 | $2$–$3\times$ vs. PyTorch bf16 |

A plausible implication is that further generalization to new ISA backends (SVE, Power9) and dynamic kernel dispatch (for “medium” $M, N$) will dominate future tinygemm research and deployment. At present, tinygemm exemplifies the effectiveness of hardware-proximate, workload-specific GEMM specialization in contemporary ML and simulation pipelines.
