High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines (2508.03984v1)

Published 6 Aug 2025 in cs.DC

Abstract: Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.
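The abstract states that SGEMM and DGEMM are emulated on INT8 matrix engines but does not spell out the algorithm. The NumPy sketch below illustrates only the general splitting idea behind this class of emulation: each floating-point matrix is decomposed into a few INT8 slices, the slices are multiplied with integer arithmetic, and the partial products are recombined with power-of-two scales. The function names (split_int8, gemm_emulated), the 7-bit-per-slice split, and the fixed slice count are illustrative assumptions, not the paper's method.

```python
import numpy as np

def split_int8(M, num_slices=3):
    # Decompose M as: M ~= scale * sum_t S_t * 2**(-7*(t+1)), with each S_t in int8.
    # Illustrative fixed-point splitting only; not the scheme proposed in the paper.
    scale = np.abs(M).max() * (128.0 / 127.0)  # headroom so the first slice fits in int8
    if scale == 0.0:
        scale = 1.0
    r = M / scale                              # normalised residual, |r| <= 127/128
    slices = []
    for t in range(num_slices):
        step = 2.0 ** (-7 * (t + 1))           # 7 bits of the mantissa per slice
        s = np.rint(r / step)
        slices.append(s.astype(np.int8))
        r = r - s * step                       # pass the rounding error to the next slice
    return scale, slices

def gemm_emulated(A, B, num_slices=3):
    # Emulate a real-valued GEMM using only int8 x int8 -> integer products.
    sa, A_slices = split_int8(A, num_slices)
    sb, B_slices = split_int8(B, num_slices)
    C = np.zeros((A.shape[0], B.shape[1]))
    for t, At in enumerate(A_slices):
        for u, Bu in enumerate(B_slices):
            # On hardware this product would run on the INT8 matrix engine with
            # INT32 accumulation; int64 is used here to rule out overflow in NumPy.
            P = At.astype(np.int64) @ Bu.astype(np.int64)
            C += P * 2.0 ** (-7 * (t + u + 2))
    return sa * sb * C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 64))
    B = rng.standard_normal((64, 64))
    err = np.abs(gemm_emulated(A, B) - A @ B).max()
    print(f"max abs error vs. float64 GEMM: {err:.3e}")
```

With three 7-bit slices per operand, the sketch recovers about 21 bits of each normalised mantissa; adding slices buys more accuracy at the cost of more INT8 products. Making this accuracy/throughput trade-off fast and power-efficient on real INT8 engines such as those in the GH200 is what the paper addresses, and the sketch does not attempt that.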
