VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers (2504.11227v1)

Published 15 Apr 2025 in cs.AR and cs.LG

Abstract: While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster, achieving an 8.2$\times$ performance improvement and 4.1$\times$ higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8$\times$ and 3.6$\times$ reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.

Summary

  • The paper presents VEXP, a custom ISA extension that speeds up BF16 exponentiation using an enhanced Schraudolph method.
  • The methodology integrates the custom EXP block into RISC-V cores with minimal hardware overhead, achieving up to 162.7x lower latency and 74.3x lower energy.
  • The paper demonstrates a co-design approach with optimized Softmax kernels, boosting throughput by 8.2x and improving energy efficiency by 4.1x without affecting accuracy.

This paper addresses the performance bottleneck caused by the Softmax function in Transformer models, particularly after accelerating the dominant matrix multiplication (GEMM) operations. Standard Softmax implementations, especially the exponentiation step ($e^x$), become significant contributors to latency and energy consumption, hindering efficient deployment on resource-constrained platforms. Existing solutions often involve complex hardware, lack flexibility, or require model retraining, which is impractical for large models.
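
For reference, the numerically stable Softmax of a vector $x$ is $\mathrm{softmax}(x)_i = e^{x_i - \max_j x_j} / \sum_k e^{x_k - \max_j x_j}$: every element requires one exponentiation on top of the max and sum reductions, which is why the $e^x$ step dominates once the GEMMs are fast.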

To overcome this, the authors propose VEXP, a low-cost RISC-V ISA extension specifically designed to accelerate Bfloat16 (BF16) exponentiation, a common format in Transformers. The core contributions include:

  1. Custom Exponential Arithmetic Block: They designed a hardware block based on an enhanced version of Schraudolph's fast exponential approximation method. This method leverages the floating-point representation for a quick initial approximation ($2^{\mathrm{int}(x')} \cdot (1+\mathrm{frac}(x'))$) and refines it using a piecewise polynomial correction $P(x)$ applied to the fractional part for improved accuracy in BF16. The polynomial parameters were optimized heuristically. A numerical sketch of this approximation is given after this list.
  2. RISC-V FPU Integration and ISA Extension: This custom EXP block was integrated into the multi-format FPU of the RISC-V cores within the Snitch compute cluster architecture. The integration adds minimal hardware overhead. Two new instructions were added to the RISC-V ISA: FEXP for scalar BF16 exponentiation and VFEXP for packed-SIMD (4x BF16) exponentiation, allowing software to directly utilize the hardware accelerator.
  3. Hardware/Software Co-Design: Optimized software kernels for Softmax and the partial Softmax within FlashAttention-2 were developed. These kernels leverage the new VFEXP instruction together with existing Snitch extensions such as FREP (hardware loops), SSR (Stream Semantic Registers for memory access), and packed-SIMD instructions to maximize parallelism and minimize overhead across the three Softmax phases: finding the maximum value, summing exponentials, and normalization.
  4. Physical Implementation and Evaluation: The extended Snitch cluster was implemented using GlobalFoundries 12nm technology.
    • Cost: The VEXP extension adds only a 1.0% area overhead at the cluster level and a negligible 1.8% power overhead during workloads not using the EXP instruction.
    • Efficiency: The energy per exponentiation operation was reduced by over 500x (from 3433 pJ/Op to 6.39 pJ/Op).
    • Accuracy: Evaluated on pre-trained GPT-2 and ViT models, the VEXP approximation resulted in negligible accuracy degradation (<0.1%) compared to standard BF16, eliminating the need for model retraining.
    • Performance:
      • The optimized Softmax kernel achieved up to 162.7x lower latency and 74.3x lower energy compared to a baseline C implementation on the Snitch cluster.
      • Integrating the optimized Softmax into FlashAttention-2 yielded up to 8.2x higher throughput and 4.1x better energy efficiency.
      • On a scaled-up 16-cluster system, end-to-end inference for models like GPT-2, GPT-3, and ViT showed significant gains, with up to 5.8x latency reduction and 3.6x energy reduction compared to the baseline system without VEXP.
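
To make the approximation concrete, the following self-contained C sketch implements an enhanced-Schraudolph-style exponential following the formulas above. It uses a single illustrative polynomial piece with placeholder coefficients fitted at $f = 0, 0.5, 1$; the paper instead uses heuristically optimized piecewise $ax(x+b)$ coefficients for BF16, which are not reproduced here, so treat this as a numerical illustration rather than the authors' exact design.

```c
/*
 * Sketch of an enhanced-Schraudolph-style exp(x):
 *   x' = x / ln(2),  exp(x) ~ 2^floor(x') * (1 + P(x' - floor(x')))
 * with P(f) = a*f*(f + b) approximating 2^f - 1 on [0, 1).
 * The (a, b) pair below is a single illustrative piece; the hardware
 * block uses heuristically optimized piecewise coefficients for BF16.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float approx_exp(float x) {
    const float INV_LN2 = 1.4426950408889634f;  /* 1 / ln(2)                */
    float xp = x * INV_LN2;                     /* x' = x / ln(2)           */
    float k  = floorf(xp);                      /* integer part floor(x')   */
    float f  = xp - k;                          /* fractional part in [0,1) */

    /* Polynomial correction: 1 + P(f) ~ 2^f with P(f) = a*f*(f + b).
       Placeholder coefficients fitted at f = 0, 0.5, 1.                    */
    const float a = 0.343146f;
    const float b = 1.914214f;
    float p = a * f * (f + b);

    /* Build 2^floor(x') by writing the biased exponent of an IEEE-754 float. */
    uint32_t bits = (uint32_t)((int32_t)k + 127) << 23;
    float pow2k;
    memcpy(&pow2k, &bits, sizeof pow2k);

    return pow2k * (1.0f + p);
}

int main(void) {
    for (float x = -4.0f; x <= 4.0f; x += 1.0f)
        printf("x = %5.1f  approx = %10.6f  libm = %10.6f\n",
               x, approx_exp(x), expf(x));
    return 0;
}
```

Over the printed range this stays within a few tenths of a percent of libm's expf, i.e. around BF16's own relative precision, illustrating why a low-degree correction of the fractional part can suffice at this format.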

The VEXP approach demonstrates that targeted ISA extensions for critical non-linear functions like exponentiation can provide substantial performance and energy benefits for complex models like Transformers with minimal hardware cost and without compromising accuracy or requiring model retraining. This makes it a practical solution for accelerating AI workloads on RISC-V based systems, particularly in edge or energy-constrained scenarios.

Implementation Details Summary:

  • Algorithm: Enhanced Schraudolph method for BF16 $e^x$.
    • $x' = x / \ln(2)$
    • $\exp(x) \approx 2^{\lfloor x' \rfloor} \cdot (1 + P(x' - \lfloor x' \rfloor))$
    • $P(x)$ is a piecewise polynomial ($ax(x+b)$ form) optimized for BF16.
  • Hardware: Dedicated EXP block integrated into Snitch FPU.
    • Supports 4-way BF16 SIMD (VFEXP).
    • 2-cycle latency, 1-cycle throughput per VFEXP.
  • ISA: FEXP (scalar), VFEXP (vector) custom instructions.
  • Software: Kernels optimized using VFEXP, FREP, SSR, and other SIMD instructions (VFMAX, VFADD, VFMUL); a structural sketch of the Softmax passes follows this list.
  • Platform: Snitch RISC-V cluster (8 cores, 128KiB SPM, DMA).
  • Technology: GlobalFoundries 12nm.
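
As a structural illustration of how these pieces fit together, the sketch below writes the three Softmax passes (max reduction, exponentiation and sum, normalization) in plain scalar C, reusing the illustrative approx_exp from the earlier sketch. The per-pass comments about which instructions would be used on the extended cluster are an inference from the instruction list above; the actual vectorized, FREP/SSR-driven kernels are not shown.

```c
/* Plain scalar C sketch of the three-pass Softmax structure for one row.
 * approx_exp() is the illustrative routine from the earlier sketch; on the
 * extended Snitch cluster this call would map to FEXP/VFEXP instead. */
float approx_exp(float x);  /* assumed visible from the earlier sketch */

void softmax_row(const float *in, float *out, int n) {
    /* Pass 1: max reduction for numerical stability (VFMAX on hardware). */
    float max_val = in[0];
    for (int i = 1; i < n; i++)
        if (in[i] > max_val)
            max_val = in[i];

    /* Pass 2: exponentiate the shifted inputs and accumulate their sum
       (VFEXP and VFADD on hardware). */
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = approx_exp(in[i] - max_val);
        sum += out[i];
    }

    /* Pass 3: scale by the reciprocal of the sum (VFMUL on hardware). */
    float inv_sum = 1.0f / sum;
    for (int i = 0; i < n; i++)
        out[i] *= inv_sum;
}
```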

The paper provides a comprehensive hardware/software co-design methodology for accelerating a specific bottleneck in Transformers, showcasing the benefits of domain-specific ISA extensions on programmable architectures like RISC-V.
