MpGEMM: Mixed-Precision GEMM

Updated 2 July 2026

MpGEMM is a mixed-precision matrix multiplication approach that multiplies low-bit quantized weights with high-precision activations without requiring full dequantization.
It leverages innovations like bit-plane decomposition, LUT-based kernels, and operator fusion to reduce computational overhead and improve energy efficiency.
MpGEMM implementations span CPUs, GPUs, FPGAs, and ASICs, enabling efficient inference of large language models and deep neural networks across varied hardware platforms.

MpGEMM refers to "mixed-precision general matrix multiplication," a class of algorithms and hardware schemes in which matrices of differing bit-widths (typically low-bit weights, e.g., 1–4 bits or ternary, and higher-precision activations, e.g., FP16 or INT8) are multiplied without first dequantizing all operands to a common type. MpGEMM has become central to the efficient inference of quantized LLMs and deep neural networks on modern CPUs, edge devices, FPGAs, and ASICs. Conventional hardware lacks native mpGEMM support, which has motivated recent algorithmic, software, and hardware advances that eliminate dequantization overhead via lookup tables, fuse dequantization with accumulation, or exploit bitwise arithmetic and custom pipelines for extreme efficiency.

1. Mathematical Foundations and Formulations

MpGEMM computes

$C = W \cdot A$

where $W$ is typically an $M \times K$ low-bit quantized integer matrix and $A$ is a $K \times N$ higher-precision activation matrix (e.g., FP16 or INT8). Core schemes:

Bit-plane Decomposition: Each $k$-bit weight entry $W_{i,l}$ is represented as $\sum_{b=0}^{k-1} 2^b w^b_{i,l}$ with $w^b_{i,l} \in \{0,1\}$; the matrix product is rewritten as a sum across $k$ binary GEMMs, enabling bit-serial computation or LUT precomputation (Wei et al., 2024).
Ternary/Elementwise LUTs: For ternary or $q$-ary weights, accumulate precomputed dot-products over all $q^g$ possible length-$g$ patterns between grouped blocks of weights and activations, reducing the computation to table lookups plus accumulations (Wang et al., 17 Feb 2025).
Groupwise MpGEMM: Group columns (length $G$) of $W$ sharing a quantization scale/zero-point, perform the multiplication in the quantized domain, and postpone dequantization until after the sum, so that it is applied once per group (Zhang et al., 2024).
Non-uniform Quantization LUTs: Store per-row or per-group codebooks $T$, mapping $k$-bit quantized indices to real values, and fuse lookup plus MAC into a single kernel (Zhao et al., 22 Jan 2025).
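The bit-plane identity in the first item can be checked numerically. The following is a minimal NumPy sketch, assuming unsigned $k$-bit weights stored as ordinary integers; the function names `bit_planes` and `bitplane_gemm` are illustrative and do not come from any of the cited implementations.

```python
import numpy as np

def bit_planes(W_q, k):
    """Split unsigned k-bit integer weights into k binary matrices (bit-planes)."""
    return [((W_q >> b) & 1).astype(np.int64) for b in range(k)]

def bitplane_gemm(W_q, A, k):
    """Compute W_q @ A as a weighted sum of k binary GEMMs."""
    C = np.zeros((W_q.shape[0], A.shape[1]), dtype=np.float64)
    for b, W_b in enumerate(bit_planes(W_q, k)):
        # W_b has entries in {0, 1}, so W_b @ A only selects and adds rows of A.
        C += (1 << b) * (W_b @ A)
    return C

rng = np.random.default_rng(0)
k, M, K, N = 4, 8, 16, 5
W_q = rng.integers(0, 2**k, size=(M, K))   # unsigned 4-bit weights (assumed layout)
A = rng.standard_normal((K, N))            # high-precision activations
C_ref = W_q @ A                            # ordinary GEMM for comparison
C_bp = bitplane_gemm(W_q, A, k)
```

Signed or zero-point-shifted weights would need an additional correction term, which this sketch omits.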

These formulations remove the need for full-precision conversion of weights and often eliminate most multiplications in favor of cheaper additions, table lookups, or bit shifts.
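As an illustration of avoiding a separate dequantization pass, the sketch below gives reference semantics for the codebook scheme, with the lookup fused into the accumulation. It assumes one codebook per weight row and uses hypothetical names (`codebook_gemm`, `idx`, `codebook`); it is not the kernel from the cited work, and a real GPU kernel would keep the codebook in fast memory and vectorize the loops.

```python
import numpy as np

def codebook_gemm(idx, codebook, A):
    """Reference semantics: C[i, :] = sum_l codebook[i, idx[i, l]] * A[l, :].

    idx:      (M, K) integer indices, each in [0, 2**k)
    codebook: (M, 2**k) per-row table of real values (assumed per-row layout)
    A:        (K, N) activations
    """
    M, K = idx.shape
    C = np.zeros((M, A.shape[1]), dtype=A.dtype)
    for i in range(M):
        for l in range(K):
            # Look up the weight value and accumulate immediately (lookup + MAC),
            # without materializing a dequantized weight matrix.
            C[i] += codebook[i, idx[i, l]] * A[l]
    return C

rng = np.random.default_rng(1)
k, M, K, N = 3, 4, 12, 6
idx = rng.integers(0, 2**k, size=(M, K))
codebook = rng.standard_normal((M, 2**k))
A = rng.standard_normal((K, N))
C = codebook_gemm(idx, codebook, A)

# Cross-check: explicit dequantization followed by a dense GEMM.
W_hat = np.take_along_axis(codebook, idx, axis=1)
```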

2. Key Algorithmic Strategies

Several approaches are central to efficient mpGEMM:

LUT-based kernels: Construct small lookup tables (LUTs) encoding all possible inner-products for short-length chunks (bitwise or elementwise) between weight and activation blocks (Wei et al., 2024, Wang et al., 17 Feb 2025, Mo et al., 2024).
Bitwise/Bit-serial execution: Bit-plane arithmetic decomposes $k$-bit computation into $k$ binary GEMMs or chunked bitwise dot-products, often further optimized by pre-routing or suppressing symmetric calculations (Wei et al., 2024, Shan et al., 26 Nov 2025).
Ternary/ELUT optimization: For ternary weights, elementwise lookup avoids redundancy; mirror consolidation stores only half the potential output values due to sign symmetry (Wang et al., 17 Feb 2025, Shan et al., 26 Nov 2025).
Operator fusion and table symmetrization: Precompute and fuse LUTs with preceding operations (e.g., LayerNorm) in the compiler, further reducing memory movement and compute overhead (Mo et al., 2024).
Non-uniform quantization: Optimize quantization indices and codebooks to minimize output Frobenius norm error by alternating optimization, enabling LUTs for both uniform and highly non-uniform model regimes (Zhao et al., 22 Jan 2025).
Shift-add/Exponent-shift hardware primitives: Replace multipliers in mpGEMM processing elements (PEs) with efficient logic for shift-add (INT4×INT8) or exponent alignment (INT4×FP16) (Zhang et al., 2024).

The choice of grouping and LUT size, and the trade-off between bit-serial and elementwise approaches, are governed by hardware vector width, the target memory hierarchy, and the reuse ratio of LUTs versus inline computation.
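To make the interaction between LUT construction and reuse concrete, here is a small NumPy sketch of a bit-serial LUT kernel. It assumes unsigned $k$-bit weights, a group size `g` that divides $K$, and illustrative function names (`build_luts`, `lut_gemm`); it follows the general idea described above rather than the code of any cited system.

```python
import numpy as np

def build_luts(A, g):
    """For each group of g activation rows, tabulate all 2**g subset sums.

    Returns shape (K // g, 2**g, N); entry [c, p] is the sum of rows A[c*g + j]
    over every bit j that is set in the pattern p.
    """
    K, N = A.shape
    assert K % g == 0, "this sketch assumes g divides K"
    groups = A.reshape(K // g, g, N)
    patterns = np.arange(2**g)
    bits = (patterns[:, None] >> np.arange(g)) & 1   # bits[p, j] = bit j of p
    # einsum is used for brevity; a real kernel would build the table by additions.
    return np.einsum("pj,cjn->cpn", bits, groups)

def lut_gemm(W_q, A, k, g):
    """k-bit unsigned W_q (M, K) times A (K, N) via table lookups and additions."""
    M, K = W_q.shape
    luts = build_luts(A, g)                 # built once, shared by all rows
    pack = 1 << np.arange(g)                # packs g weight bits into a table index
    C = np.zeros((M, A.shape[1]))
    for b in range(k):
        plane = (W_q >> b) & 1              # (M, K) bit-plane
        idx = plane.reshape(M, K // g, g) @ pack     # (M, K // g) table indices
        for c in range(K // g):
            # Lookup plus add; the factor 2**b is a shift in hardware.
            C += (1 << b) * luts[c, idx[:, c]]
    return C

rng = np.random.default_rng(3)
k, g, M, K, N = 2, 4, 6, 16, 3
W_q = rng.integers(0, 2**k, size=(M, K))
A = rng.standard_normal((K, N))
C_lut = lut_gemm(W_q, A, k, g)
```

The tables depend only on the activations, so they are reused by all $M$ weight rows and all $k$ bit-planes; raising `g` reduces the number of lookups per row but doubles the table size with each increment.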

3. Hardware and Software Implementations

Modern mpGEMM implementations span CPUs, GPUs, FPGAs, and ASICs:

CPU (T-MAC): Using SIMD shuffle/lookup instructions, T-MAC organizes weights into bit-planes and builds LUTs for grouped columns; all mpGEMM operations reduce to native table lookups and additions, with multiplies eliminated (Wei et al., 2024).
GPU (GANQ, LUT Tensor Core): Codebooks reside in fast memory; inference kernels perform quantized index gather and accumulation without explicit dequantization. LUT Tensor Core integrates LUT-based mpGEMM into the tensor-core pipeline with custom instruction support in TVM (Zhao et al., 22 Jan 2025, Mo et al., 2024).
ASIC/FPGA (Platinum, MixPE, Open-Source GEMM Generator):
- Platinum: 4-stage pipeline for path-adapted LUT build, supporting both bit-serial and ternary LUTs. Chunk size and symmetry folding are tuned based on workload, with offline path optimization minimizing hardware additions (Shan et al., 26 Nov 2025).
- MixPE: Dot products are handled by shift-add (INT4×INT8) or exponent-shift (INT4×FP16) logic, with dequantization delayed until the end of each group (Zhang et al., 2024).
- Systolic Generator: Fully parameterized input, accumulator, and output precision; Fused Dot-Product microarchitecture supports arbitrary (even custom) number formats, exposed at the BLAS level (Ledoux et al., 2023).
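The deferred dequantization used by the groupwise designs above can be sketched in plain Python. This is an illustrative model, not MixPE's circuit: it assumes unsigned 4-bit weights with an affine scale/zero-point per group of `G` elements, and the helper names `shift_add_mul` and `groupwise_dot` are hypothetical.

```python
import numpy as np

def shift_add_mul(q, a):
    """Multiply an unsigned 4-bit integer q by an integer a using shifts and adds."""
    acc = 0
    for b in range(4):
        if (q >> b) & 1:
            acc += a << b
    return acc

def groupwise_dot(q, a, scales, zeros, G):
    """Dot product of one quantized weight row with one INT8 activation column.

    Assumed model: real weight l equals scales[g] * (q[l] - zeros[g]), g = l // G.
    """
    total = 0.0
    for g in range(len(q) // G):
        int_acc = 0   # integer accumulator for sum(q * a) within the group
        a_sum = 0     # integer sum of activations, needed for the zero-point term
        for l in range(g * G, (g + 1) * G):
            int_acc += shift_add_mul(int(q[l]), int(a[l]))
            a_sum += int(a[l])
        # Dequantize once per group instead of once per weight.
        total += float(scales[g]) * (int_acc - int(zeros[g]) * a_sum)
    return total

rng = np.random.default_rng(2)
K, G = 32, 8
q = rng.integers(0, 16, size=K)            # INT4 weights
a = rng.integers(-128, 128, size=K)        # INT8 activations
scales = rng.uniform(0.01, 0.1, size=K // G)
zeros = rng.integers(0, 16, size=K // G)
result = groupwise_dot(q, a, scales, zeros, G)

# Reference: dequantize every weight first, then take the dot product.
w_real = np.repeat(scales, G) * (q - np.repeat(zeros, G))
reference = float(w_real @ a)
```

The inner loop stays entirely in integer arithmetic; floating-point work is limited to one scale multiplication per group.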

Software APIs commonly expose a high-level, general mp_gemm interface, selecting optimized kernels via ISA or model bit-width.

4. Performance Benchmarks and Efficiency

Representative performance and efficiency gains from mpGEMM strategies:

Platform/Kernel	Throughput or compute density	Speedup / Energy Gain	Reference
T-MAC (CPU, 4b, Llama-2-7B)	7.3 tokens/s	1.3–4× over dequant-GEMM, 70% energy reduction	(Wei et al., 2024)
Bitnet.cpp (M2 Ultra, TL2_0, 100B)	7.45 tokens/s	6.25× over FP16, 2.32× over T-MAC/TQ1_0	(Wang et al., 17 Feb 2025)
LUT Tensor Core (INT1×FP16)	61.55 TFLOPS/mm² (tile level)	18.1× area, 15.5× power improvement (over MAC-TC)	(Mo et al., 2024)
Platinum ASIC (b1.58-3B, prefill)	--	73.6× SpikingEyeriss, 2.15× T-MAC (energy: 32.4×)	(Shan et al., 26 Nov 2025)
GANQ (GPU, 3b LLaMA-7B)	--	2.4× FP16 kernel, memory: 13→3 GB	(Zhao et al., 22 Jan 2025)
MixPE-A8 (FPGA, INT4×INT8)	--	4.42× INT8 PE, 2.78× lower energy	(Zhang et al., 2024)

Elementwise LUTs (ELUT) and operator fusion enable sub-2-bit-per-weight models to reach state-of-the-art perplexity for leading LLMs while matching or exceeding full-precision inference in speed.

5. Design Trade-Offs and Scalability

Design parameters strongly influence mpGEMM performance:

LUT size vs. reuse: Larger chunk sizes ($g$) exponentially increase the number of LUT entries (e.g., $2^g$ for binary chunks or $3^g$ for ternary chunks) but also increase reuse per lookup; symmetry folding can halve ternary LUT requirements (Shan et al., 26 Nov 2025, Wang et al., 17 Feb 2025).
Bit-serial vs. elementwise LUT: Bit-serial (via bit-plane decomposition) offers linear scaling in bit-width with minimal overhead but can be suboptimal for ternary or higher-cardinality weights. Elementwise LUT is more spatially/temporally efficient in these cases (Wang et al., 17 Feb 2025).
Hardware-software co-design: Systems such as LUT Tensor Core, Platinum, and MixPE demonstrate that optimizing memory layout, kernel vectorization, and compiler scheduling can bring practical realizations close to architectural peaks with minimal power and area (Mo et al., 2024, Shan et al., 26 Nov 2025, Zhang et al., 2024).
Quantization granularity: Finer granularity (e.g., per-group, per-channel) tightens representation error but increases table management and post-processing cost (Zhang et al., 2024).
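The table-size side of this trade-off is simple to quantify. The helper below is a hypothetical sketch (the name `lut_entries` is not from the cited papers) that counts entries for a chunk of `g` weights, with one possible way of counting the savings from folding sign-symmetric weight sets such as ternary $\{-1, 0, 1\}$.

```python
def lut_entries(cardinality, g, fold_symmetry=False):
    """Number of table entries for chunks of g weights with the given cardinality.

    fold_symmetry is only meaningful for sign-symmetric weight sets, where
    negating every weight in a pattern negates the looked-up result.
    """
    entries = cardinality ** g
    if fold_symmetry:
        # Keep one entry per +/- pair, plus the all-zero pattern,
        # which is its own mirror image (an assumed storage convention).
        entries = (entries + 1) // 2
    return entries

# Table sizes for a few chunk lengths: binary, ternary, ternary with folding.
sizes = {
    g: (lut_entries(2, g), lut_entries(3, g), lut_entries(3, g, fold_symmetry=True))
    for g in range(1, 6)
}
```

Whether a larger table pays off depends on whether it still fits in registers or on-chip memory on the target platform.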

A further consideration is the extension of mpGEMM idioms to convolutions, attention modules, or backpropagation, where lookup-table or shift-add primitives may need specialized adaptation.

6. Applications and Impact in Modern LLM Inference

LUT- and shift-add-optimized mpGEMM schemes enable on-device inference of models such as Llama and BitNet, including very large LLMs, on CPUs, edge SoCs, FPGAs, and ASICs, and substantially reduce inference cost, latency, and energy compared with legacy FP16/FP32 workflows. Open-source frameworks such as T-MAC, Bitnet.cpp, and the Fused Dot-Product generator provide mpGEMM integration with BLAS-level transparency and report speed and efficiency improvements without accuracy regression (Wei et al., 2024, Wang et al., 17 Feb 2025, Ledoux et al., 2023).

In summary, mpGEMM is a foundational technology for scalable AI inference. Through algorithmic innovation (LUTs, shift-add circuits), systematic hardware-software co-design, and quantization-aware dataflows, it delivers high efficiency for low-bit neural network workloads across diverse compute platforms.