Papers
Topics
Authors
Recent
Search
2000 character limit reached

EmuGEMM: High-Precision GEMM Emulation

Updated 26 June 2026
  • EmuGEMM is a precision-emulation framework for GEMM that employs Ozaki Schemes I and II to combine low-precision integer operations with scalable mathematical reconstructions for high-precision results.
  • It integrates hardware-level optimizations such as on-chip accumulation, interleaved data layouts, and pipeline fusion to minimize memory round-trips and maximize throughput on NVIDIA Hopper and Blackwell GPUs.
  • EmuGEMM achieves significant performance and energy efficiency gains, offering 1.4–5.5× speedups and 43–154% power efficiency improvements over traditional and native high-precision GEMM methods.

EmuGEMM is a precision-emulation framework for general matrix–matrix multiplication (GEMM) that exploits the throughput and power efficiency of low-precision (INT8) matrix engines—such as NVIDIA/AMD Tensor Cores—to achieve floating-point (FP32, FP64) or arbitrary-precision results. EmuGEMM systematically implements Ozaki Schemes I and II, fusing mathematical reconstruction strategies with hardware-level pipeline optimizations, and demonstrates substantial performance and energy efficiency improvements over existing software and hardware approaches on modern GPUs (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

1. Mathematical Foundations: Ozaki Schemes I and II

EmuGEMM builds on two modular emulation paradigms: mantissa splitting (Ozaki Scheme I) and integer modular arithmetic with the Chinese Remainder Theorem (Ozaki Scheme II).

Scheme I (Mantissa Splitting):

Matrices ARM×KA \in \mathbb{R}^{M \times K}, BRK×NB \in \mathbb{R}^{K \times N} are decomposed into pp INT8 slices per operand, quantized with power-of-two row/column scalings μ\mu, ν\nu. The approximation

Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)

yields a blockwise multiplication

Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}

requiring p(p+1)/2p(p+1)/2 exact INT8→INT32 GEMMs.

Scheme II (CRT-Based Modular Arithmetic):

The input matrices are scaled/truncated to A,BZA', B' \in \mathbb{Z}, then, for each of NN pairwise-coprime moduli BRK×NB \in \mathbb{R}^{K \times N}0, residue matrices BRK×NB \in \mathbb{R}^{K \times N}1 are formed and multiplied: BRK×NB \in \mathbb{R}^{K \times N}2 The outputs BRK×NB \in \mathbb{R}^{K \times N}3 are then reconstructed with the Chinese Remainder Theorem: BRK×NB \in \mathbb{R}^{K \times N}4 where BRK×NB \in \mathbb{R}^{K \times N}5 is the inverse of BRK×NB \in \mathbb{R}^{K \times N}6 modulo BRK×NB \in \mathbb{R}^{K \times N}7, BRK×NB \in \mathbb{R}^{K \times N}8, and scaling is inverted to recover high-precision BRK×NB \in \mathbb{R}^{K \times N}9.

The theoretical precision of Scheme II is set by pp0, controllable via pp1 and the sizes of the pp2 (Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

2. End-to-End Algorithmic Pipeline

The EmuGEMM workflow for high-precision emulation proceeds as follows (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025):

  1. Moduli and Preprocessing:
    • Select pp3 coprime moduli pp4 (for Scheme II), or pp5 slices and shift width pp6 (Scheme I).
    • Compute necessary pre-factors and scaling vectors for diagonal alignment (row pp7, column pp8).
  2. Integerization and Scaling:
    • Scale and truncate pp9, μ\mu0 to integer/INT8 range using powers of two for exponents, so as to guarantee uniqueness: μ\mu1.
    • For each modulus (Scheme II), compute μ\mu2, μ\mu3 as INT8 matrices.
  3. Low-Precision GEMMs:
    • Execute μ\mu4 independent INT8μ\mu5INT8μ\mu6INT32 GEMMs per modulus (Scheme II) or μ\mu7 for mantissa slices (Scheme I).
  4. Modular Reduction and Accumulation:
    • Post-process INT32 accumulators to residues (e.g., modulo μ\mu8 for CRT).
    • For Scheme II, accumulate partial results in higher-precision (FP64, double-double if necessary), handling rounding/carry as required for FP64 emulation.
  5. CRT Reconstruction and Scaling:
    • Apply CRT to combine the mod-μ\mu9 results.
    • Re-invert the scaling: ν\nu0 (both schemes).
  6. Result:
    • Output matches floating-point accuracy (FP32/FP64) with errors determined only by the number of moduli (Scheme II) or slices (Scheme I), and the accuracy of truncation/scaling.

3. Hardware and Dataflow Optimizations

EmuGEMM achieves high efficiency on NVIDIA Hopper and Blackwell by fusing integer GEMM operations with on-chip accumulation, optimized tiling, and shared-memory pipelining (Lu et al., 24 Jun 2026).

  • Elimination of Global Memory Round-Trips:
    • INT32 accumulators for all slice/modulus products are held on-chip (in RF or TMEM) throughout the ν\nu1-loop. No intermediate partial sums are written/read from global memory, relieving bandwidth bottlenecks.
  • Interleaved Data Layout (Scheme I):
    • ν\nu2 INT8 slices are interleaved at the Tensor Core tile granularity along ν\nu3; all slices are loaded together using a single TMA descriptor, maximizing memory efficiency.
  • Persistent Kernel Fusion:
    • Fusion of all ν\nu4 MMA evaluations ensures occupancy and maximizes throughput.
  • On-Chip Modular Reduction (Scheme II):
    • Modulo reductions are performed on the accumulator registers or TMEM epilogue, and only final INT8 tiles are written back.
  • Pipeline Overlap:
    • Data movement (TMA load), MMA compute, and epilogue (shift-reduce or modulo) are overlapped to hide memory and pipeline latencies, saturating INT8 hardware.
  • Blackwell Architecture Features:
    • Dedicated TMEM expands the effective pipeline depth, enabling higher ν\nu5 and larger tile sizes.

4. Performance, Scalability, and Power Efficiency

The EmuGEMM implementations realize substantial speedups and energy improvements over native and legacy emulation methods (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026):

Architecture Method Precision Matrix Size ν\nu6 Throughput (TFLOP/s) Speedup vs Native Power Efficiency Gain
GH200 Hopper EmuGEMM-II FP64 16,384 81.6 1.4× 43%
GH200 Hopper EmuGEMM-II FP32 16,384 160 3.0× 154%
Hopper EmuGEMM-I, ν\nu7 FP32 16,384 400 1.7× (vs FP32)
Hopper EmuGEMM-II, ν\nu8 FP64 16,384 94 1.6× (vs FP64)
Blackwell B200 EmuGEMM-II, ν\nu9 FP64 16,384 165 4.6× (vs FP64)

Key highlights:

  • EmuGEMM maintains 1.4–3.0× throughput advantage over native DGEMM/SGEMM at large Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)0.
  • Power efficiency improvements of 43–154% over FP64/FP32 are reported for large problems.
  • On Blackwell, EmuGEMM-II outperforms cuBLAS ZGEMM by up to 5.5× at FP64-level accuracy.

Overhead from scaling, residue conversion, and CRT reconstruction falls below Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)1 for Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)2, making EmuGEMM optimal for large-scale GEMM.

5. Precision–Throughput Trade-Offs and Parameterization

The emulation accuracy and cost are governed by the number of moduli (Scheme II) or split slices (Scheme I):

  • Scheme I precision is nearly linear in Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)3, with each slice width Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)4 yielding about Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)5 bits. For example, Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)6 slices and Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)7 offer Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)832 bits.
  • Scheme II achieves Adiag(μ)i=0p12βiAi,Bj=0p12βjBjdiag(ν)A \approx \operatorname{diag}(\mu)\sum_{i=0}^{p-1}2^{-\beta i}A'_i,\qquad B \approx \sum_{j=0}^{p-1}2^{-\beta j}B'_j\operatorname{diag}(\nu)9 bits, with Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}0 moduli sufficing for FP64, Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}1 for FP32 accuracy. Number and size of moduli adapt to application needs.

A smaller Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}2 reduces overhead and matches, or exceeds, intermediate precisions (e.g., between FP32 and TF32).

6. Limitations and Application Scope

EmuGEMM is most beneficial for large GEMM problems (Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}3), where the compute dominates the overheads of scaling, conversion, and reconstruction (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).

  • For small matrices (Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}4), overheads may reach Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}5–Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}6% of runtime.
  • Accuracy for highly ill-conditioned matrices may require “accurate” mode scaling (an extra INT8 GEMM) to properly bound dynamic range.
  • EmuGEMM can be tuned for various hardware (any with INT8Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}7INT8 Tensor Core analogues), using standard CUDA WMMA/AMX APIs and on-chip optimized kernels; legacy hardware or those lacking sufficient on-chip storage may see reduced benefit.

7. Comparative Frameworks and Benchmarks

EmuGEMM substantially exceeds the performance of prior emulation schemes and vendor-mixed-precision libraries:

  • Compared to legacy Ozaki (Scheme I) emulation on CPU, CRT-based EmuGEMM (Scheme II) reduces GEMM calls by up to Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}8 and more than doubles throughput for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).
  • In single-precision, EmuGEMM achieves >2× higher performance and lower power than cuMpSGEMM and BF16×9 on large matrices (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).
  • Elimination of global-memory round-trips for INT32 partial sums is a primary source of speedup, raising arithmetic intensity by up to Cdiag(μ)(s=0p12βsCs)diag(ν),Cs=i=0sAiBsiC \approx \operatorname{diag}(\mu)\left(\sum_{s=0}^{p-1}2^{-\beta s}C_s\right)\operatorname{diag}(\nu),\qquad C_s = \sum_{i=0}^{s}A'_i B'_{s-i}9 (Lu et al., 24 Jun 2026).

Summary

EmuGEMM synthesizes Ozaki Scheme I (mantissa splitting) and Scheme II (CRT-based modular arithmetic) into an optimized set of fused-kernel implementations targeting the highest-throughput, lowest-power matrix engines. Mathematical rigor is preserved up to and beyond FP64, with parameterizable accuracy–throughput trade-offs. Its principal innovations—strict on-chip execution, specialized data layouts, and pipeline fusion—allow modern architectures like Hopper and Blackwell to approach the intrinsic INT8 roofline even for full-precision GEMM. This positions EmuGEMM as a practical and high-performance tool for scalable mixed-precision and scientific compute workloads (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to EmuGEMM.