EmuGEMM: High-Precision GEMM Emulation
- EmuGEMM is a precision-emulation framework for GEMM that employs Ozaki Schemes I and II to combine low-precision integer operations with scalable mathematical reconstructions for high-precision results.
- It integrates hardware-level optimizations such as on-chip accumulation, interleaved data layouts, and pipeline fusion to minimize memory round-trips and maximize throughput on NVIDIA Hopper and Blackwell GPUs.
- EmuGEMM achieves significant performance and energy efficiency gains, offering 1.4–5.5× speedups and 43–154% power efficiency improvements over traditional and native high-precision GEMM methods.
EmuGEMM is a precision-emulation framework for general matrix–matrix multiplication (GEMM) that exploits the throughput and power efficiency of low-precision (INT8) matrix engines, such as NVIDIA Tensor Cores or AMD Matrix Cores, to achieve floating-point (FP32, FP64) or arbitrary-precision results. EmuGEMM systematically implements Ozaki Schemes I and II, fusing mathematical reconstruction strategies with hardware-level pipeline optimizations, and demonstrates substantial performance and energy efficiency improvements over existing software and hardware approaches on modern GPUs (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).
1. Mathematical Foundations: Ozaki Schemes I and II
EmuGEMM builds on two emulation paradigms: mantissa splitting (Ozaki Scheme I) and integer modular arithmetic with the Chinese Remainder Theorem (Ozaki Scheme II).
Scheme I (Mantissa Splitting):
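Matrices $A$, $B$ are decomposed into $s$ INT8 slices per operand, quantized with power-of-two row/column scalings $\mu$, $\nu$. With shift width $\beta$, the approximation

$$\mathrm{diag}(\mu)\,A \approx \sum_{p=1}^{s} 2^{-\beta p} A_p, \qquad B\,\mathrm{diag}(\nu) \approx \sum_{q=1}^{s} 2^{-\beta q} B_q$$

yields a blockwise multiplication

$$AB \approx \mathrm{diag}(\mu)^{-1} \left( \sum_{p=1}^{s} \sum_{q=1}^{s} 2^{-\beta (p+q)} A_p B_q \right) \mathrm{diag}(\nu)^{-1},$$

requiring up to $s^2$ exact INT8→INT32 GEMMs, one per slice pair.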
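The splitting idea can be illustrated with a minimal pure-Python sketch. It is not the EmuGEMM implementation: the function names `split_rows` and `emulated_matmul_scheme1`, the 6-bit slice width, and the use of Python integers in place of INT8→INT32 Tensor Core GEMMs are all assumptions made for illustration. Each row of $A$ (and each column of $B$) gets one power-of-two exponent, and every slice entry stays within $[-63, 63]$, so it would fit in INT8.

```python
import math

def split_rows(A, num_slices, beta=6):
    """Split each row of A into integer slices of `beta` bits (hypothetical helper).

    Returns (slices, exps) such that
    A[i][j] ~= 2**exps[i] * sum_p slices[p][i][j] * 2**(-beta * (p + 1)).
    """
    rows, cols = len(A), len(A[0])
    exps = []
    slices = [[[0] * cols for _ in range(rows)] for _ in range(num_slices)]
    for i, row in enumerate(A):
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax else 0  # amax < 2**e
        exps.append(e)
        for j, x in enumerate(row):
            r = math.ldexp(x, -e)               # |r| < 1, exact rescaling
            for p in range(num_slices):
                r = math.ldexp(r, beta)         # expose the next beta bits
                d = math.trunc(r)               # |d| <= 2**beta - 1
                slices[p][i][j] = d
                r -= d                          # exact remainder, |r| < 1
    return slices, exps

def emulated_matmul_scheme1(A, B, num_slices=4, beta=6):
    """Emulate A @ B from num_slices**2 exact integer slice products."""
    Bt = [list(col) for col in zip(*B)]         # split B by columns
    a_sl, a_exp = split_rows(A, num_slices, beta)
    b_sl, b_exp = split_rows(Bt, num_slices, beta)
    m, n, k = len(A), len(Bt), len(B)
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            total = 0                           # exact integer accumulator
            for p in range(num_slices):
                for q in range(num_slices):
                    # Stand-in for one INT8 x INT8 -> INT32 dot product.
                    dot = sum(a_sl[p][i][h] * b_sl[q][j][h] for h in range(k))
                    total += dot << (beta * (2 * num_slices - p - q - 2))
            shift = a_exp[i] + b_exp[j] - 2 * beta * num_slices
            C[i][j] = math.ldexp(float(total), shift)  # undo the scalings
    return C
```

In this sketch, doubling the slice count from 4 to 8 moves the operand precision from roughly 24 to roughly 48 bits, at the cost of four times as many slice products. Practical implementations typically skip low-order slice pairs; the sketch keeps all of them for clarity.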
Scheme II (CRT-Based Modular Arithmetic):
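The input matrices are scaled and truncated to integer matrices $A' = \mathrm{trunc}(\mathrm{diag}(\mu)\,A)$ and $B' = \mathrm{trunc}(B\,\mathrm{diag}(\nu))$. Then, for each of $N$ pairwise-coprime moduli $p_1, \dots, p_N$, residue matrices $A'_t = A' \bmod p_t$ and $B'_t = B' \bmod p_t$ are formed and multiplied:

$$C'_t = A'_t B'_t \bmod p_t.$$

The outputs $C'_t$ are then reconstructed with the Chinese Remainder Theorem:

$$C' \equiv \sum_{t=1}^{N} C'_t \, \frac{P}{p_t} \, q_t \pmod{P},$$

where $q_t$ is the inverse of $P/p_t$ modulo $p_t$, $P = \prod_{t=1}^{N} p_t$, and the scaling is inverted to recover the high-precision product $AB \approx \mathrm{diag}(\mu)^{-1}\, C'\, \mathrm{diag}(\nu)^{-1}$.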
The theoretical precision of Scheme II is set by $P = \prod_{t} p_t$, controllable via the number of moduli $N$ and the sizes of the $p_t$ (Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).
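As a small illustration of the CRT step alone, the following sketch multiplies two integer matrices through residues and reconstructs the exact product. The moduli in `MODULI` and the helper names are hypothetical choices for this example, not values from the cited papers; the only requirement used here is that the moduli are pairwise coprime and that every entry of the true product is smaller than half their product in magnitude.

```python
from math import prod

# Hypothetical pairwise-coprime moduli whose symmetric residues fit in INT8.
MODULI = (256, 255, 253, 251)

def sym_mod(x, p):
    """Residue of x modulo p in the symmetric range around zero."""
    r = x % p
    return r - p if 2 * r >= p else r

def matmul_int(A, B):
    """Exact integer matrix product (reference and per-modulus GEMM stand-in)."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def crt_matmul(A, B, moduli=MODULI):
    """Multiply integer matrices via one small-residue GEMM per modulus plus CRT."""
    P = prod(moduli)
    m, n = len(A), len(B[0])
    acc = [[0] * n for _ in range(m)]
    for p in moduli:
        Ap = [[sym_mod(x, p) for x in row] for row in A]
        Bp = [[sym_mod(x, p) for x in row] for row in B]
        Cp = matmul_int(Ap, Bp)                  # INT8 x INT8 -> INT32 stand-in
        Pp = P // p
        weight = Pp * pow(Pp, -1, p)             # (P / p_t) * q_t
        for i in range(m):
            for j in range(n):
                acc[i][j] += (Cp[i][j] % p) * weight
    # Valid as long as every |entry of A @ B| is below P / 2.
    return [[sym_mod(x, P) for x in row] for row in acc]
```

With the four moduli above, $P \approx 4.1 \times 10^9$, so the sketch is exact for products whose entries stay below about $2 \times 10^9$ in magnitude; more moduli extend that range.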
2. End-to-End Algorithmic Pipeline
The EmuGEMM workflow for high-precision emulation proceeds as follows (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025):
- Moduli and Preprocessing:
  - Select $N$ pairwise-coprime moduli $p_1, \dots, p_N$ (Scheme II), or the slice count $s$ and shift width $\beta$ (Scheme I).
  - Compute the necessary pre-factors and the scaling vectors for diagonal alignment (row $\mu$, column $\nu$).
- Integerization and Scaling:
  - Scale and truncate $A$, $B$ to the integer/INT8 range using power-of-two scale factors, chosen to guarantee a unique CRT reconstruction: $2 \max_{i,j} \sum_{h} |a'_{ih}|\,|b'_{hj}| < P$, where $A' = (a'_{ih})$ and $B' = (b'_{hj})$ are the scaled integer matrices and $P = \prod_t p_t$.
  - For each modulus (Scheme II), compute $A'_t = A' \bmod p_t$ and $B'_t = B' \bmod p_t$ as INT8 matrices.
- Low-Precision GEMMs:
  - Execute $N$ independent INT8×INT8→INT32 GEMMs, one per modulus (Scheme II), or up to $s^2$ GEMMs, one per slice pair (Scheme I).
- Modular Reduction and Accumulation:
  - Post-process INT32 accumulators to residues (e.g., modulo $p_t$ for CRT).
  - For Scheme II, accumulate partial results in higher precision (FP64, or double-double if necessary), handling rounding and carries as required for FP64 emulation.
- CRT Reconstruction and Scaling:
  - Apply CRT to combine the mod-$p_t$ results into the integer product $C'$.
  - Invert the scaling: $C \approx \mathrm{diag}(\mu)^{-1}\, C'\, \mathrm{diag}(\nu)^{-1}$ (both schemes).
- Result:
  - The output matches floating-point accuracy (FP32/FP64), with errors determined only by the number of moduli (Scheme II) or slices (Scheme I) and by the accuracy of truncation/scaling.
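The Scheme II path through these steps can be sketched end to end in pure Python. This is an illustrative model under stated assumptions, not the fused GPU kernels: `choose_moduli`, `pow2_scale`, and `emulated_matmul_scheme2` are hypothetical names, the greedy moduli selection is one simple choice among many, and the bit budget `bits` is derived here from the uniqueness condition $2k \cdot 2^{2\,\mathrm{bits}} < P$ for inner dimension $k$.

```python
import math

def choose_moduli(count, start=256):
    """Greedily pick pairwise-coprime moduli <= start (hypothetical selection rule)."""
    chosen, p = [], start
    while len(chosen) < count and p > 1:
        if all(math.gcd(p, q) == 1 for q in chosen):
            chosen.append(p)
        p -= 1
    return chosen

def sym_mod(x, p):
    """Residue of x modulo p in the symmetric range around zero."""
    r = x % p
    return r - p if 2 * r >= p else r

def pow2_scale(vec, bits):
    """Exponent e such that max|v| * 2**e < 2**bits."""
    amax = max(abs(v) for v in vec)
    return bits - math.frexp(amax)[1] if amax else 0

def emulated_matmul_scheme2(A, B, num_moduli=12):
    m, k, n = len(A), len(B), len(B[0])
    # 1. Moduli and preprocessing.
    moduli = choose_moduli(num_moduli)
    P = math.prod(moduli)
    # Bits per operand so that 2 * k * 2**(2 * bits) < P (unique reconstruction).
    bits = (P.bit_length() - 2 - k.bit_length()) // 2
    # 2. Integerization with power-of-two row/column scalings.
    row_e = [pow2_scale(row, bits) for row in A]
    col_e = [pow2_scale(col, bits) for col in zip(*B)]
    Ai = [[math.trunc(math.ldexp(x, row_e[i])) for x in row]
          for i, row in enumerate(A)]
    Bi = [[math.trunc(math.ldexp(x, col_e[j])) for j, x in enumerate(row)]
          for row in B]
    acc = [[0] * n for _ in range(m)]
    for p in moduli:
        Ap = [[sym_mod(x, p) for x in row] for row in Ai]
        Bp_t = [[sym_mod(x, p) for x in col] for col in zip(*Bi)]
        weight = (P // p) * pow(P // p, -1, p)
        for i in range(m):
            for j in range(n):
                # 3. One low-precision GEMM entry (INT8 x INT8 -> INT32 stand-in).
                c = sum(a * b for a, b in zip(Ap[i], Bp_t[j]))
                # 4. Modular reduction, then weighted accumulation.
                acc[i][j] += (c % p) * weight
    # 5. CRT reconstruction and inverse scaling.
    return [[math.ldexp(float(sym_mod(acc[i][j], P)), -(row_e[i] + col_e[j]))
             for j in range(n)] for i in range(m)]
```

Python's arbitrary-precision integers hide the carry handling that a real FP64 or double-double accumulation would need, which is why the reconstruction step looks simpler here than it would on a GPU.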
3. Hardware and Dataflow Optimizations
EmuGEMM achieves high efficiency on NVIDIA Hopper and Blackwell by fusing integer GEMM operations with on-chip accumulation, optimized tiling, and shared-memory pipelining (Lu et al., 24 Jun 2026).
- Elimination of Global Memory Round-Trips:
  - INT32 accumulators for all slice/modulus products are held on-chip (in the register file or TMEM) throughout the $K$-loop. No intermediate partial sums are written to or read from global memory, which relieves the bandwidth bottleneck.
- Interleaved Data Layout (Scheme I):
  - The INT8 slices of each operand are interleaved at Tensor Core tile granularity along $K$; all slices are loaded together using a single TMA descriptor, maximizing memory efficiency.
- Persistent Kernel Fusion:
  - Fusing all slice-pair or per-modulus MMA evaluations into one persistent kernel keeps occupancy high and maximizes throughput.
- On-Chip Modular Reduction (Scheme II):
  - Modulo reductions are performed on the accumulator registers or in the TMEM epilogue, and only final INT8 tiles are written back.
- Pipeline Overlap:
  - Data movement (TMA load), MMA compute, and the epilogue (shift-reduce or modulo) are overlapped to hide memory and pipeline latencies, saturating the INT8 hardware.
- Blackwell Architecture Features:
  - Dedicated TMEM expands the effective pipeline depth, enabling higher slice or modulus counts and larger tile sizes.
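The interleaved layout can be pictured with a host-side sketch. It only models the index arithmetic, under the assumption that tiles of width `tile_k` from every slice are stored back to back for each $K$-tile; the function names are hypothetical, and the actual TMA descriptor and Tensor Core tile shapes are not modeled.

```python
def interleave_slices(slices, tile_k):
    """Interleave same-shape slice matrices along K at tile granularity.

    For each K-tile, the corresponding tile of every slice is stored
    contiguously, so one contiguous load fetches all slices of that tile.
    """
    num_rows, K = len(slices[0]), len(slices[0][0])
    if K % tile_k:
        raise ValueError("K must be a multiple of tile_k")
    out = []
    for i in range(num_rows):
        row = []
        for t in range(0, K, tile_k):
            for sl in slices:
                row.extend(sl[i][t:t + tile_k])
        out.append(row)
    return out

def slice_tile(interleaved, p, t, tile_k, num_slices):
    """Recover K-tile t of slice p from the interleaved layout."""
    start = (t * num_slices + p) * tile_k
    return [row[start:start + tile_k] for row in interleaved]

# Two slices of a 1 x 4 operand, interleaved with K-tiles of width 2.
s0 = [[1, 2, 3, 4]]
s1 = [[5, 6, 7, 8]]
layout = interleave_slices([s0, s1], tile_k=2)
# layout == [[1, 2, 5, 6, 3, 4, 7, 8]]
```

The design point this illustrates is locality: everything the inner loop needs for one $K$-tile sits in a single contiguous span, instead of being spread across one buffer per slice.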
4. Performance, Scalability, and Power Efficiency
The EmuGEMM implementations realize substantial speedups and energy improvements over native and legacy emulation methods (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026):
| Architecture | Method | Precision | Matrix Size $n$ | Throughput (TFLOP/s) | Speedup vs Native | Power Efficiency Gain |
|---|---|---|---|---|---|---|
| GH200 Hopper | EmuGEMM-II | FP64 | 16,384 | 81.6 | 1.4× | 43% |
| GH200 Hopper | EmuGEMM-II | FP32 | 16,384 | 160 | 3.0× | 154% |
| Hopper | EmuGEMM-I | FP32 | 16,384 | 400 | 1.7× (vs FP32) | – |
| Hopper | EmuGEMM-II | FP64 | 16,384 | 94 | 1.6× (vs FP64) | – |
| Blackwell B200 | EmuGEMM-II | FP64 | 16,384 | 165 | 4.6× (vs FP64) | – |
Key highlights:
- EmuGEMM maintains a 1.4–3.0× throughput advantage over native DGEMM/SGEMM at large $n$.
- Power efficiency improvements of 43–154% over FP64/FP32 are reported for large problems.
- On Blackwell, EmuGEMM-II outperforms cuBLAS ZGEMM by up to 5.5× at FP64-level accuracy.
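Overhead from scaling, residue conversion, and CRT reconstruction shrinks relative to the GEMM work as $n$ grows, because these steps cost $O(n^2)$ per modulus or slice while the GEMMs cost $O(n^3)$. EmuGEMM is therefore best suited to large-scale GEMM.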
5. Precision–Throughput Trade-Offs and Parameterization
The emulation accuracy and cost are governed by the number of moduli (Scheme II) or split slices (Scheme I):
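- Scheme I precision is nearly linear in the slice count $s$: each slice of width $\beta$ contributes about $\beta$ bits, so $s$ slices offer about $s\beta$ bits (a configuration with $s\beta \approx 32$ offers $\approx 32$ bits).
- Scheme II precision is governed by $\log_2 P$ with $P = \prod_t p_t$; FP64-level accuracy requires more moduli than FP32-level accuracy, and the number and size of the moduli adapt to application needs.
Using fewer moduli or slices reduces overhead while still matching, or exceeding, intermediate precisions (e.g., between FP32 and TF32).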
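A rough bit-budget calculator makes the trade-off concrete. The sketch below rests on assumptions that are not taken from the cited papers: moduli are chosen greedily from 256 downward, and the usable bits per operand in Scheme II are estimated from the uniqueness condition $2k \cdot 2^{2b} < P$ for inner dimension $k$. All function names are hypothetical.

```python
import math

def coprime_moduli(limit=256):
    """Yield pairwise-coprime moduli <= limit, largest first (greedy choice)."""
    chosen = []
    for p in range(limit, 1, -1):
        if all(math.gcd(p, q) == 1 for q in chosen):
            chosen.append(p)
            yield p

def scheme1_bits(num_slices, beta):
    """Approximate operand bits for Scheme I: one slice contributes beta bits."""
    return num_slices * beta

def scheme2_operand_bits(moduli, k):
    """Largest b with 2 * k * 2**(2 * b) < prod(moduli), clamped at zero."""
    P = math.prod(moduli)
    return max(0, (P.bit_length() - 2 - k.bit_length()) // 2)

def moduli_for_bits(target_bits, k, limit=256):
    """Smallest greedy prefix of moduli giving at least target_bits per operand."""
    picked = []
    for p in coprime_moduli(limit):
        picked.append(p)
        if scheme2_operand_bits(picked, k) >= target_bits:
            return picked
    raise ValueError("not enough coprime moduli below limit")

# Compare the moduli needed for FP32-like (24-bit) and FP64-like (53-bit)
# operand precision at inner dimension k = 4096.
fp32_like = moduli_for_bits(24, 4096)
fp64_like = moduli_for_bits(53, 4096)
print(len(fp32_like), len(fp64_like))
```

Because each modulus adds just under 8 bits to $\log_2 P$ and the budget is split across two operands, every additional modulus buys roughly 4 bits per operand in this model, while larger $k$ consumes part of the budget.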
6. Limitations and Application Scope
EmuGEMM is most beneficial for large GEMM problems, where compute dominates the overheads of scaling, conversion, and reconstruction (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).
- For small matrices, these overheads take up a noticeably larger share of the runtime.
- Accuracy for highly ill-conditioned matrices may require “accurate” mode scaling (an extra INT8 GEMM) to properly bound dynamic range.
- EmuGEMM can be tuned for various hardware (any with INT8×INT8 Tensor Core analogues), using standard CUDA WMMA/AMX APIs and on-chip optimized kernels; legacy hardware or hardware lacking sufficient on-chip storage may see reduced benefit.
7. Comparative Frameworks and Benchmarks
EmuGEMM substantially exceeds the performance of prior emulation schemes and vendor-mixed-precision libraries:
- Compared to legacy Ozaki (Scheme I) emulation on CPU, CRT-based EmuGEMM (Scheme II) reduces the number of GEMM calls and more than doubles throughput for quadruple-precision emulation (Ozaki et al., 10 Apr 2025).
- In single-precision, EmuGEMM achieves >2× higher performance and lower power than cuMpSGEMM and BF16×9 on large matrices (Uchino et al., 6 Aug 2025, Lu et al., 24 Jun 2026).
- Elimination of global-memory round-trips for INT32 partial sums is a primary source of speedup, because it raises arithmetic intensity (Lu et al., 24 Jun 2026).
Summary
EmuGEMM synthesizes Ozaki Scheme I (mantissa splitting) and Scheme II (CRT-based modular arithmetic) into an optimized set of fused-kernel implementations targeting the highest-throughput, lowest-power matrix engines. Mathematical rigor is preserved up to and beyond FP64, with parameterizable accuracy–throughput trade-offs. Its principal innovations—strict on-chip execution, specialized data layouts, and pipeline fusion—allow modern architectures like Hopper and Blackwell to approach the intrinsic INT8 roofline even for full-precision GEMM. This positions EmuGEMM as a practical and high-performance tool for scalable mixed-precision and scientific compute workloads (Uchino et al., 6 Aug 2025, Ozaki et al., 10 Apr 2025, Lu et al., 24 Jun 2026).