Working-memory overhead in Ozaki-II INT8 complex GEMM emulation
Develop methods that reduce the substantial working memory requirements of Ozaki-II–based emulation of single- and double-precision complex matrix multiplication (CGEMM and ZGEMM) on INT8 matrix engines while still achieving high performance without resorting to FP32 or FP64 matrix multiplication.
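To get a feel for the scale of the overhead, the following is a rough back-of-envelope sketch, not the paper's accounting. It assumes (hypothetically) that the real and imaginary parts of both inputs are each stored as one INT8 residue matrix per modulus, and that INT32 products for the real and imaginary parts of the result are buffered per modulus. The function names, the `num_moduli` parameter, and the `keep_all_products` switch are illustrative assumptions; the actual number of moduli and the buffering strategy depend on the implementation and the target accuracy.

```python
def native_bytes(m, n, k, bytes_per_element=16):
    """Bytes held by A (m x k), B (k x n), and C (m x n) themselves.

    bytes_per_element=16 corresponds to double-precision complex (ZGEMM);
    use 8 for single-precision complex (CGEMM).
    """
    return bytes_per_element * (m * k + k * n + m * n)


def workspace_bytes(m, n, k, num_moduli, keep_all_products=True):
    """Rough workspace estimate under the assumptions stated in the lead-in.

    ASSUMPTION: one INT8 residue matrix per modulus for each of the real and
    imaginary parts of A and B (1 byte per entry).
    ASSUMPTION: INT32 products (4 bytes per entry) for the real and imaginary
    parts of C, buffered either for every modulus or for one modulus at a time.
    """
    int8_residues = num_moduli * 2 * (m * k + k * n)   # 1 byte per entry
    product_buffers = num_moduli if keep_all_products else 1
    int32_products = product_buffers * 2 * (m * n) * 4  # 4 bytes per entry
    return int8_residues + int32_products


if __name__ == "__main__":
    m = n = k = 4096
    num_moduli = 16  # hypothetical value, chosen only for illustration
    gib = 2 ** 30
    print("native   :", native_bytes(m, n, k) / gib, "GiB")
    print("workspace:", workspace_bytes(m, n, k, num_moduli) / gib, "GiB")
```

Under these illustrative assumptions, a 4096 x 4096 x 4096 ZGEMM with 16 moduli would need about 3 GiB of workspace against 0.75 GiB for the matrices themselves. Buffering INT32 products for one modulus at a time (`keep_all_products=False`) shrinks the product term but leaves the INT8 residue storage untouched, which hints at why the overhead is hard to remove without costing performance.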
References
A major limitation of the proposed method, as well as emulation-based approaches in general, is the substantial working memory required. This overhead is currently unavoidable when aiming to achieve high performance without relying on FP32 or FP64 matrix multiplication. Addressing this issue remains an open challenge not only for emulation techniques but also for HPC applications that employ them.
— Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
(2512.08321 - Uchino et al., 9 Dec 2025) in Section 5 (Conclusion)