Working-memory overhead in Ozaki-II INT8 complex GEMM emulation

Develop methods that reduce the substantial working-memory requirements of Ozaki-II-based emulation of single- and double-precision complex matrix multiplication (CGEMM and ZGEMM) on INT8 matrix engines, while preserving high performance and without falling back on FP32 or FP64 matrix multiplication.

Background

Uchino et al. extend the Ozaki-II scheme to complex matrix multiplication and implement a high-performance emulation on INT8 matrix engines that combines a Karatsuba formulation with reconstruction based on the Chinese Remainder Theorem (CRT). The approach delivers large speedups over both native FP32/FP64 complex GEMM and Ozaki-I-based emulation, but it must store multiple modular residues and intermediate INT32 results, which leads to substantial working-memory usage.
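The following is a minimal pure-Python sketch of the two ingredients named above: a Karatsuba-style complex product that uses three real products per modulus, followed by CRT reconstruction of the signed result. It is an illustration under stated assumptions, not the authors' implementation. The moduli, the function names, and the use of non-negative residues are all hypothetical; a real INT8 kernel would pick its own moduli and residue ranges, and would first scale floating-point inputs to integers, a step omitted here.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen only for illustration.
MODULI = (251, 253, 255, 256)

def reduce_mod(X, m):
    """Entrywise residue of an integer matrix (non-negative representative)."""
    return [[v % m for v in row] for row in X]

def matmul_mod(X, Y, m):
    """Integer matrix product reduced modulo m (stand-in for an INT8 GEMM)."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][t] * Y[t][j] for t in range(inner)) % m
             for j in range(cols)] for i in range(rows)]

def combine_mod(X, Y, m, sign=1):
    """Entrywise X + sign * Y, reduced modulo m."""
    return [[(x + sign * y) % m for x, y in zip(rx, ry)]
            for rx, ry in zip(X, Y)]

def karatsuba_mod(Ar, Ai, Br, Bi, m):
    """Complex product modulo m from three real products instead of four."""
    P1 = matmul_mod(Ar, Br, m)
    P2 = matmul_mod(Ai, Bi, m)
    P3 = matmul_mod(combine_mod(Ar, Ai, m), combine_mod(Br, Bi, m), m)
    Cr = combine_mod(P1, P2, m, sign=-1)                      # ArBr - AiBi
    Ci = combine_mod(combine_mod(P3, P1, m, sign=-1), P2, m, sign=-1)
    return Cr, Ci                                             # Ci = ArBi + AiBr

def crt_signed(residues, moduli):
    """Recover the integer in the symmetric range around 0 from its residues."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the modular inverse
    x %= M
    return x - M if x > M // 2 else x

def complex_matmul_crt(Ar, Ai, Br, Bi, moduli=MODULI):
    """Integer complex matrix product via per-modulus Karatsuba plus CRT."""
    # One (Cr, Ci) pair of residue matrices is kept for every modulus.
    per_modulus = [
        karatsuba_mod(reduce_mod(Ar, m), reduce_mod(Ai, m),
                      reduce_mod(Br, m), reduce_mod(Bi, m), m)
        for m in moduli
    ]
    rows, cols = len(Ar), len(Br[0])

    def rebuild(part):
        return [[crt_signed([res[part][i][j] for res in per_modulus], moduli)
                 for j in range(cols)] for i in range(rows)]

    return rebuild(0), rebuild(1)

# A = [[1+2i, 3-i], [4i, -2+5i]], B = [[2, -1+3i], [1-2i, i]]
Cr, Ci = complex_matmul_crt([[1, 3], [0, -2]], [[2, -1], [4, 5]],
                            [[2, -1], [1, 0]], [[0, 3], [-2, 1]])
print(Cr, Ci)  # [[3, -6], [8, -17]] [[-3, 4], [17, -6]]
```

Even in this toy version, per_modulus holds a full pair of result matrices for every modulus before reconstruction can begin, which is where the working-memory cost described above comes from.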

The authors state that this memory overhead is currently unavoidable when targeting high performance without relying on FP32 or FP64 matrix multiplication. They explicitly identify addressing this working-memory burden as an open challenge affecting both emulation techniques and the HPC applications that adopt them.
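To give a feel for the scale of this overhead, the sketch below is a rough accounting model, not the paper's actual buffer layout. It assumes, hypothetically, that every modulus keeps INT8 residues for the real part, the imaginary part, and their Karatsuba sum of both operands, plus three INT32 product matrices, all resident at once. The function names and the modulus count used in the example are illustrative assumptions.

```python
def working_memory_bytes(m, n, k, num_moduli):
    """Bytes of scratch space under the simplified model described above.

    A is m-by-k, B is k-by-n, C is m-by-n. All buffers are assumed to be
    resident simultaneously (an assumption, not a measured layout).
    """
    # INT8 residues: real, imaginary, and real+imaginary parts of A and B.
    int8_inputs = num_moduli * 3 * (m * k + k * n) * 1
    # INT32 results: three Karatsuba products per modulus, 4 bytes each.
    int32_products = num_moduli * 3 * (m * n) * 4
    return int8_inputs + int32_products

def native_zgemm_bytes(m, n, k):
    """Bytes held by A, B, and C in double-precision complex (16 bytes each)."""
    return 16 * (m * k + k * n + m * n)

# Example with a hypothetical count of 16 moduli on 1024 x 1024 matrices.
scratch = working_memory_bytes(1024, 1024, 1024, num_moduli=16)
native = native_zgemm_bytes(1024, 1024, 1024)
print(scratch / native)  # 6.0 under this model
```

Under these assumptions the scratch space grows linearly with the number of moduli, so any scheme that streams or reuses per-modulus buffers would change the estimate. The model is only meant to show which terms dominate.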

References

A major limitation of the proposed method, as well as emulation-based approaches in general, is the substantial working memory required. This overhead is currently unavoidable when aiming to achieve high performance without relying on FP32 or FP64 matrix multiplication. Addressing this issue remains an open challenge not only for emulation techniques but also for HPC applications that employ them.

Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem (2512.08321 - Uchino et al., 9 Dec 2025) in Section 5 (Conclusion)