A Pilot Study on Tunable Precision Emulation via Automatic BLAS Offloading (2503.22875v2)

Published 28 Mar 2025 in cs.DC and cs.PF

Abstract: This study explores the use of automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Through the use of low-bitwidth integer units and a cache-coherent Unified Memory Architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both the arithmetic precision and the properties of the operator, which can be addressed through tunable precision emulation. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization. We showcase the potential to improve accuracy and performance at the same time. This work highlights the potential of AI-driven hardware to transform HPC, advocating for adaptive precision strategies in future scientific computing.
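
To make the emulation idea concrete, the sketch below shows one common way an FP64 GEMM can be approximated with products of low-bitwidth integer slices (an Ozaki-style splitting). This is an illustrative assumption, not necessarily the scheme used in the paper: the function names (`split_int8_slices`, `emulated_gemm`) and parameters (`num_slices`, `bits`) are hypothetical, and plain NumPy FP64 stands in for the exact integer matrix products that would run on INT8 tensor cores under a real BLAS offload.

```python
import numpy as np

def split_int8_slices(A, num_slices=7, bits=7):
    """Split an FP64 matrix into integer-valued slices so that A ~= sum_k S_k * 2**e_k.

    Each slice peels off roughly `bits` bits of mantissa per row; its entries have
    magnitude at most 2**bits, small enough for narrow integer units. This mirrors
    Ozaki-style splitting, assumed here for illustration only.
    """
    slices, exps = [], []
    R = np.array(A, dtype=np.float64, copy=True)
    for _ in range(num_slices):
        # Per-row exponent so each slice fits a small signed integer range.
        amax = np.max(np.abs(R), axis=1, keepdims=True)
        amax = np.where(amax == 0.0, 1.0, amax)        # avoid log2(0) on zero rows
        e = np.floor(np.log2(amax)) - (bits - 1)
        S = np.round(R * 2.0 ** -e)                    # integer-valued slice
        R = R - S * 2.0 ** e                           # residual for the next slice
        slices.append(S)
        exps.append(e)
    return slices, exps

def emulated_gemm(A, B, num_slices=7):
    """Approximate C = A @ B by accumulating scaled products of integer slices."""
    As, ea = split_int8_slices(A, num_slices)
    Bs, eb = split_int8_slices(B.T, num_slices)        # split B column-wise via B^T
    C = np.zeros((A.shape[0], B.shape[1]))
    for Sa, sa in zip(As, ea):
        for Sb, sb in zip(Bs, eb):
            # On GPU hardware this slice product would use INT8 units; here FP64
            # stands in for the exact integer GEMM. Scales are per row of A and
            # per column of B, so they factor out of the matrix product.
            C += (Sa @ Sb.T) * 2.0 ** sa * 2.0 ** sb.T
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 64))
    B = rng.standard_normal((64, 64))
    err = np.max(np.abs(emulated_gemm(A, B) - A @ B))
    print(f"max abs error vs. native FP64 GEMM: {err:.2e}")
```

Increasing `num_slices` retains more mantissa bits of the inputs and so tightens the approximation, which is the sense in which the precision of the emulation is tunable: accuracy can be traded against the number of integer slice products performed.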
