DGEMM without FP64 Arithmetic -- using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme (2508.00441v1)
Abstract: AI workloads rely heavily on low-precision matrix multiplication, so processors offering enhanced performance for such operations are becoming increasingly common. However, these operations are difficult to use directly for scientific computations. The Ozaki scheme, an accurate matrix multiplication method proposed by Ozaki et al. in 2012, enables FP64 matrix multiplication (DGEMM) using low-precision floating-point operations such as FP16. The method was subsequently extended to utilize integer arithmetic, which reduces computational cost compared to the floating-point-based approach. It has also demonstrated higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores for AI workloads. However, the latest hardware tends to favor low-precision floating-point performance, such as FP8, over INT8. This study revisits the use of low-precision floating-point operations in the Ozaki scheme in light of the latest AI hardware. Specifically, we consider the use of FP6 and FP8 Tensor Cores. Moreover, for processors whose FP64 operations are very slow or unsupported, we consider FP64 emulation based on integer arithmetic. We also examine a new blocking strategy. We demonstrate the effectiveness of these methods by evaluating the performance of DGEMM using FP8 Tensor Cores and FP64 emulation on a Blackwell architecture GPU.
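The core idea of the Ozaki scheme referenced above is an error-free splitting: each FP64 matrix is decomposed into a sum of slices whose entries carry only a few significand bits, so that products of slices can be computed exactly in low-precision hardware and the FP64 result recovered by summation. The following is a minimal NumPy sketch of that splitting (the slice count, bit width, and all function names are illustrative assumptions; slice products are computed here in FP64 rather than on Tensor Cores):

```python
import numpy as np

def split(M, bits, num_slices, axis):
    # Error-free splitting (sketch): each slice keeps at most `bits`
    # significand bits relative to the row/column maximum, so products
    # of slices would be exact in sufficiently wide low-precision units.
    slices = []
    R = M.astype(np.float64).copy()
    for _ in range(num_slices):
        amax = np.max(np.abs(R), axis=axis, keepdims=True)
        amax = np.where(amax == 0, 1.0, amax)
        e = np.floor(np.log2(amax)) + 1      # |R| < 2**e per row/column
        sigma = 2.0 ** (e + 52 - bits)       # rounding "magic number"
        S = (R + sigma) - sigma              # keeps only the top `bits` bits
        slices.append(S)
        R = R - S                            # exact residual
    return slices, R

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

# Split A row-wise and B column-wise; 3 slices of 26 bits cover
# FP64's 53-bit significand with room to spare.
As, Ra = split(A, 26, 3, axis=1)
Bs, Rb = split(B, 26, 3, axis=0)

# The sum of all pairwise slice products reconstructs the FP64 product.
C = sum(Ai @ Bj for Ai in As for Bj in Bs)
print(np.max(np.abs(C - A @ B)))
```

In the actual scheme, narrower slices are chosen so that each slice product fits exactly in FP8/FP16/INT8 Tensor Core arithmetic, and more slice pairs are accumulated; the sketch only illustrates why the decomposition loses no information.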