EmuGEMM: High-Precision GEMM Emulation

Updated 26 June 2026

EmuGEMM is a precision-emulation framework for GEMM that employs Ozaki Schemes I and II to combine low-precision integer operations with scalable mathematical reconstructions for high-precision results.
It integrates hardware-level optimizations such as on-chip accumulation, interleaved data layouts, and pipeline fusion to minimize memory round-trips and maximize throughput on NVIDIA Hopper and Blackwell GPUs.
EmuGEMM achieves significant performance and energy efficiency gains, offering 1.4–5.5× speedups and 43–154% power efficiency improvements over traditional and native high-precision GEMM methods.

EmuGEMM is a precision-emulation framework for general matrix–matrix multiplication (GEMM) that exploits the throughput and power efficiency of low-precision (INT8) matrix engines—such as NVIDIA/AMD Tensor Cores—to achieve floating-point (FP32, FP64) or arbitrary-precision results. EmuGEMM systematically implements Ozaki Schemes I and II, fusing mathematical reconstruction strategies with hardware-level pipeline optimizations, and demonstrates substantial performance and energy efficiency improvements over existing software and hardware approaches on modern GPUs (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

1. Mathematical Foundations: Ozaki Schemes I and II

EmuGEMM builds on two modular emulation paradigms: mantissa splitting (Ozaki Scheme I) and integer modular arithmetic with the Chinese Remainder Theorem (Ozaki Scheme II).

Scheme I (Mantissa Splitting):

Matrices $A \in \mathbb{R}^{M \times K}$ , $B \in \mathbb{R}^{K \times N}$ are decomposed into $p$ INT8 slices per operand, quantized with power-of-two row/column scalings $\mu$ , $\nu$ . The approximation

$A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$

yields a blockwise multiplication

$C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$

requiring $p(p+1)/2$ exact INT8→INT32 GEMMs.

Scheme II (CRT-Based Modular Arithmetic):

The input matrices are scaled/truncated to $A', B' \in \mathbb{Z}$ , then, for each of $N$ pairwise-coprime moduli $B \in \mathbb{R}^{K \times N}$ 0, residue matrices $B \in \mathbb{R}^{K \times N}$ 1 are formed and multiplied: $B \in \mathbb{R}^{K \times N}$ 2 The outputs $B \in \mathbb{R}^{K \times N}$ 3 are then reconstructed with the Chinese Remainder Theorem: $B \in \mathbb{R}^{K \times N}$ 4 where $B \in \mathbb{R}^{K \times N}$ 5 is the inverse of $B \in \mathbb{R}^{K \times N}$ 6 modulo $B \in \mathbb{R}^{K \times N}$ 7, $B \in \mathbb{R}^{K \times N}$ 8, and scaling is inverted to recover high-precision $B \in \mathbb{R}^{K \times N}$ 9.

The theoretical precision of Scheme II is set by $p$ 0, controllable via $p$ 1 and the sizes of the $p$ 2 (Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

2. End-to-End Algorithmic Pipeline

The EmuGEMM workflow for high-precision emulation proceeds as follows (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025):

Moduli and Preprocessing:
- Select $p$ 3 coprime moduli $p$ 4 (for Scheme II), or $p$ 5 slices and shift width $p$ 6 (Scheme I).
- Compute necessary pre-factors and scaling vectors for diagonal alignment (row $p$ 7, column $p$ 8).
Integerization and Scaling:
- Scale and truncate $p$ 9, $\mu$ 0 to integer/INT8 range using powers of two for exponents, so as to guarantee uniqueness: $\mu$ 1.
- For each modulus (Scheme II), compute $\mu$ 2, $\mu$ 3 as INT8 matrices.
Low-Precision GEMMs:
- Execute $\mu$ 4 independent INT8 $\mu$ 5INT8 $\mu$ 6INT32 GEMMs per modulus (Scheme II) or $\mu$ 7 for mantissa slices (Scheme I).
Modular Reduction and Accumulation:
- Post-process INT32 accumulators to residues (e.g., modulo $\mu$ 8 for CRT).
- For Scheme II, accumulate partial results in higher-precision (FP64, double-double if necessary), handling rounding/carry as required for FP64 emulation.
CRT Reconstruction and Scaling:
- Apply CRT to combine the mod- $\mu$ 9 results.
- Re-invert the scaling: $\nu$ 0 (both schemes).
Result:
- Output matches floating-point accuracy (FP32/FP64) with errors determined only by the number of moduli (Scheme II) or slices (Scheme I), and the accuracy of truncation/scaling.

3. Hardware and Dataflow Optimizations

EmuGEMM achieves high efficiency on NVIDIA Hopper and Blackwell by fusing integer GEMM operations with on-chip accumulation, optimized tiling, and shared-memory pipelining (Lu et al., 24 Jun 2026).

Elimination of Global Memory Round-Trips:
- INT32 accumulators for all slice/modulus products are held on-chip (in RF or TMEM) throughout the $\nu$ 1-loop. No intermediate partial sums are written/read from global memory, relieving bandwidth bottlenecks.
Interleaved Data Layout (Scheme I):
- $\nu$ 2 INT8 slices are interleaved at the Tensor Core tile granularity along $\nu$ 3; all slices are loaded together using a single TMA descriptor, maximizing memory efficiency.
Persistent Kernel Fusion:
- Fusion of all $\nu$ 4 MMA evaluations ensures occupancy and maximizes throughput.
On-Chip Modular Reduction (Scheme II):
- Modulo reductions are performed on the accumulator registers or TMEM epilogue, and only final INT8 tiles are written back.
Pipeline Overlap:
- Data movement (TMA load), MMA compute, and epilogue (shift-reduce or modulo) are overlapped to hide memory and pipeline latencies, saturating INT8 hardware.
Blackwell Architecture Features:
- Dedicated TMEM expands the effective pipeline depth, enabling higher $\nu$ 5 and larger tile sizes.

4. Performance, Scalability, and Power Efficiency

The EmuGEMM implementations realize substantial speedups and energy improvements over native and legacy emulation methods (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026):

Architecture	Method	Precision	Matrix Size $\nu$ 6	Throughput (TFLOP/s)	Speedup vs Native	Power Efficiency Gain
GH200 Hopper	EmuGEMM-II	FP64	16,384	81.6	1.4×	43%
GH200 Hopper	EmuGEMM-II	FP32	16,384	160	3.0×	154%
Hopper	EmuGEMM-I, $\nu$ 7	FP32	16,384	400	1.7× (vs FP32)	–
Hopper	EmuGEMM-II, $\nu$ 8	FP64	16,384	94	1.6× (vs FP64)	–
Blackwell B200	EmuGEMM-II, $\nu$ 9	FP64	16,384	165	4.6× (vs FP64)	–

Key highlights:

EmuGEMM maintains 1.4–3.0× throughput advantage over native DGEMM/SGEMM at large $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 0.
Power efficiency improvements of 43–154% over FP64/FP32 are reported for large problems.
On Blackwell, EmuGEMM-II outperforms cuBLAS ZGEMM by up to 5.5× at FP64-level accuracy.

Overhead from scaling, residue conversion, and CRT reconstruction falls below $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 1 for $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 2, making EmuGEMM optimal for large-scale GEMM.

5. Precision–Throughput Trade-Offs and Parameterization

The emulation accuracy and cost are governed by the number of moduli (Scheme II) or split slices (Scheme I):

Scheme I precision is nearly linear in $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 3, with each slice width $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 4 yielding about $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 5 bits. For example, $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 6 slices and $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 7 offer $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 832 bits.
Scheme II achieves $A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)$ 9 bits, with $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 0 moduli sufficing for FP64, $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 1 for FP32 accuracy. Number and size of moduli adapt to application needs.

A smaller $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 2 reduces overhead and matches, or exceeds, intermediate precisions (e.g., between FP32 and TF32).

6. Limitations and Application Scope

EmuGEMM is most beneficial for large GEMM problems ( $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 3), where the compute dominates the overheads of scaling, conversion, and reconstruction (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).

For small matrices ( $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 4), overheads may reach $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 5– $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 6% of runtime.
Accuracy for highly ill-conditioned matrices may require “accurate” mode scaling (an extra INT8 GEMM) to properly bound dynamic range.
EmuGEMM can be tuned for various hardware (any with INT8 $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 7INT8 Tensor Core analogues), using standard CUDA WMMA/AMX APIs and on-chip optimized kernels; legacy hardware or those lacking sufficient on-chip storage may see reduced benefit.

7. Comparative Frameworks and Benchmarks

EmuGEMM substantially exceeds the performance of prior emulation schemes and vendor-mixed-precision libraries:

Compared to legacy Ozaki (Scheme I) emulation on CPU, CRT-based EmuGEMM (Scheme II) reduces GEMM calls by up to $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 8 and more than doubles throughput for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).
In single-precision, EmuGEMM achieves >2× higher performance and lower power than cuMpSGEMM and BF16×9 on large matrices (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).
Elimination of global-memory round-trips for INT32 partial sums is a primary source of speedup, raising arithmetic intensity by up to $C \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}$ 9 (Lu et al., 24 Jun 2026).

Summary

EmuGEMM synthesizes Ozaki Scheme I (mantissa splitting) and Scheme II (CRT-based modular arithmetic) into an optimized set of fused-kernel implementations targeting the highest-throughput, lowest-power matrix engines. Mathematical rigor is preserved up to and beyond FP64, with parameterizable accuracy–throughput trade-offs. Its principal innovations—strict on-chip execution, specialized data layouts, and pipeline fusion—allow modern architectures like Hopper and Blackwell to approach the intrinsic INT8 roofline even for full-precision GEMM. This positions EmuGEMM as a practical and high-performance tool for scalable mixed-precision and scientific compute workloads (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

Markdown Report Issue Upgrade to Chat

References (3)

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines (2025)

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique (2025)

EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to EmuGEMM.