Papers
Topics
Authors
Recent
2000 character limit reached

A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation (2512.00053v1)

Published 19 Nov 2025 in cs.AR

Abstract: Efficient mixed-precision MMA operations are critical for accelerating Deep Learning workloads on GPGPUs. However, existing open-source RTL implementations of inner dot products rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a singular fused architecture, implemented as part of the open-source RISC-V based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in (FP16/BF16/FP8/BF8/INT8/UINT4) formats and higher-precision accumulation in (FP32/INT32), with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled pipeline throughput of 9.812 GFLOPS in a 4-thread per warp configuration.

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.