Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
103 tokens/sec
GPT-4o
11 tokens/sec
Gemini 2.5 Pro Pro
50 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
3 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
2000 character limit reached

MXDOTP: A RISC-V ISA Extension for Enabling Microscaling (MX) Floating-Point Dot Products (2505.13159v1)

Published 19 May 2025 in cs.AR

Abstract: Fast and energy-efficient low-bitwidth floating-point (FP) arithmetic is essential for AI systems. Microscaling (MX) standardized formats have recently emerged as a promising alternative to baseline low-bitwidth FP formats, offering improved accuracy with a block-wise shared exponent scale combined with per-element values. However, efficiently executing the key linear algebra primitives for AI applications on MX formats requires specialized hardware support for the fundamental operators such as scaled dot product. In this work, we propose MXDOTP, the first RISC-V ISA extension for MX dot products, focusing on the 8-bit MXFP8 FP format. We extend the open-source Snitch RISC-V core with a dedicated MXFP8 dot product-accumulate unit, which fully consumes blocks of eight 8-bit operands packed into 64-bit inputs. To feed MXDOTP at full utilization with four operands per cycle, including block scales, we exploit Snitch's Stream Semantic Registers (SSRs), achieving up to 80% utilization with minimal impact on the Snitch core's architecture and no modification to the register file. Implemented in 12 nm FinFET, a cluster with eight MXDOTP-extended cores reaches up to 356 GFLOPS/W when computing MXFP8 matrix multiplications at 0.8 V, 1 GHz. Compared to a software baseline, where MX dot products are computed by type casting FP8 inputs to FP32 for higher accumulation precision and applying explicit block scaling, the cluster achieves 25x speedup and 12.5x better energy efficiency at a minimal 5.1% area increase.

Summary

We haven't generated a summary for this paper yet.