- The paper presents VEXP, a custom ISA extension that speeds up BF16 exponentiation using an enhanced Schraudolph method.
- The methodology integrates the custom EXP block into RISC-V cores with minimal hardware overhead, achieving up to 162.7x lower latency and 74.3x lower energy for the optimized Softmax kernel.
- The paper demonstrates a co-design approach with optimized Softmax kernels, boosting FlashAttention-2 throughput by 8.2x and improving energy efficiency by 4.1x without affecting accuracy.
This paper addresses the performance bottleneck caused by the Softmax function in Transformer models, particularly after accelerating the dominant matrix multiplication (GEMM) operations. Standard Softmax implementations, especially the exponentiation step ($e^x$), become significant contributors to latency and energy consumption, hindering efficient deployment on resource-constrained platforms. Existing solutions often involve complex hardware, lack flexibility, or require model retraining, which is impractical for large models.
To overcome this, the authors propose VEXP, a low-cost RISC-V ISA extension specifically designed to accelerate Bfloat16 (BF16) exponentiation, a common format in Transformers. The core contributions include:
- Custom Exponential Arithmetic Block: They designed a hardware block based on an enhanced version of Schraudolph's fast exponential approximation method. This method leverages the floating-point representation for a quick initial approximation ($2^{\mathrm{int}(x')} \cdot (1 + \mathrm{frac}(x'))$) and refines it using a piecewise polynomial correction $P(x)$ applied to the fractional part for improved accuracy in BF16. The polynomial parameters were optimized heuristically; a plain-C sketch of the approximation follows this list.
- RISC-V FPU Integration and ISA Extension: This custom EXP block was integrated into the multi-format FPU of the RISC-V cores within the Snitch compute cluster architecture. The integration adds minimal hardware overhead. Two new instructions were added to the RISC-V ISA: `FEXP` for scalar BF16 exponentiation and `VFEXP` for packed-SIMD (4x BF16) exponentiation, allowing software to directly utilize the hardware accelerator.
- Hardware/Software Co-Design: Optimized software kernels for Softmax and the partial Softmax within FlashAttention-2 were developed. These kernels leverage the new `VFEXP` instruction along with existing Snitch extensions like `FREP` (hardware loops) and `SSR` (stream semantic registers for memory access) and SIMD instructions to maximize parallelism and minimize overhead for finding the maximum value, summing exponentials, and normalization.
- Physical Implementation and Evaluation: The extended Snitch cluster was implemented using GlobalFoundries 12nm technology.
- Cost: The VEXP extension adds only a 1.0% area overhead at the cluster level and a negligible 1.8% power overhead during workloads not using the EXP instruction.
- Efficiency: The energy per exponentiation operation was reduced by over 500x (from 3433 pJ/Op to 6.39 pJ/Op).
- Accuracy: Evaluated on pre-trained GPT-2 and ViT models, the VEXP approximation resulted in negligible accuracy degradation (<0.1%) compared to standard BF16, eliminating the need for model retraining.
- Performance:
- The optimized Softmax kernel achieved up to 162.7x lower latency and 74.3x lower energy compared to a baseline C implementation on the Snitch cluster.
- Integrating the optimized Softmax into FlashAttention-2 yielded up to 8.2x higher throughput and 4.1x better energy efficiency.
- On a scaled-up 16-cluster system, end-to-end inference for models like GPT-2, GPT-3, and ViT showed significant gains, with up to 5.8x latency reduction and 3.6x energy reduction compared to the baseline system without VEXP.
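To make the approximation concrete, below is a minimal scalar reference model in plain C. It follows the formulas summarized in these notes, but it is only a sketch: the correction coefficients `COEFF_A` and `COEFF_B` are illustrative placeholders from a rough quadratic fit of $2^f - 1$, not the paper's heuristically optimized piecewise parameters, FP32 arithmetic stands in for BF16, and range clamping of the exponent is omitted.

```c
/*
 * Scalar reference model of an enhanced Schraudolph-style exponential,
 * written against IEEE-754 float (BF16 is the upper 16 bits of the same
 * format). COEFF_A and COEFF_B are illustrative placeholders, not the
 * optimized piecewise values from the paper.
 */
#include <math.h>
#include <stdint.h>
#include <string.h>

#define LOG2E   1.44269504088896341f  /* 1 / ln(2) */
#define COEFF_A 0.34366f              /* placeholder: P(f) = A * f * (f + B) */
#define COEFF_B 1.90980f              /* placeholder, so that P(1) ~= 1      */

static float vexp_ref(float x)
{
    float xp = x * LOG2E;   /* x' = x / ln(2)           */
    float xi = floorf(xp);  /* integer part, floor(x')  */
    float xf = xp - xi;     /* fractional part in [0,1) */

    /* 2^floor(x') built by writing the biased integer part into the
     * exponent field of an IEEE-754 float (the bit trick at the heart
     * of Schraudolph-style methods). No clamping for out-of-range x. */
    uint32_t bits = (uint32_t)((int32_t)xi + 127) << 23;
    float pow2i;
    memcpy(&pow2i, &bits, sizeof(pow2i));

    /* Mantissa correction: 1 + P(f) approximates 2^f. */
    float mant = 1.0f + COEFF_A * xf * (xf + COEFF_B);

    return pow2i * mant;
}
```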
The VEXP approach demonstrates that targeted ISA extensions for critical non-linear functions like exponentiation can provide substantial performance and energy benefits for complex models like Transformers with minimal hardware cost and without compromising accuracy or requiring model retraining. This makes it a practical solution for accelerating AI workloads on RISC-V based systems, particularly in edge or energy-constrained scenarios.
Implementation Details Summary:
- Algorithm: Enhanced Schraudolph method for BF16 $e^x$.
- $x' = x / \ln(2)$
- $\exp(x) \approx 2^{\lfloor x' \rfloor} \cdot (1 + P(x' - \lfloor x' \rfloor))$
- $P(x)$ is a piecewise polynomial of the form $a \cdot x \cdot (x + b)$, optimized for BF16.
- Hardware: Dedicated EXP block integrated into Snitch FPU.
- Supports 4-way BF16 SIMD (`VFEXP`).
- 2-cycle latency, 1-cycle throughput per `VFEXP`.
- ISA: `FEXP` (scalar) and `VFEXP` (vector) custom instructions.
- Software: Kernels optimized using `VFEXP`, `FREP`, `SSR`, and other SIMD instructions (`VFMAX`, `VFADD`, `VFMUL`); a plain-C sketch of this kernel structure follows the summary.
- Platform: Snitch RISC-V cluster (8 cores, 128KiB SPM, DMA).
- Technology: GlobalFoundries 12nm.
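The software bullets above describe the kernel structure rather than give code, so the sketch below lays out the three-pass Softmax (maximum, shifted exponentiation plus sum, normalization) in plain C. It is a functional sketch only: the libm `expf()` call marks where the `FEXP`/`VFEXP` hardware path would be used, and the `FREP`/`SSR`/SIMD mapping from the notes is indicated only in comments, since plain C cannot express it directly.

```c
/*
 * Plain-C skeleton of the three-pass Softmax described in these notes.
 * expf() marks where the optimized kernel would issue VFEXP (4x BF16);
 * on the extended Snitch cluster the loops are additionally mapped onto
 * FREP hardware loops, SSR streams, and packed-SIMD ops (VFMAX, VFADD,
 * VFMUL). Assumes n >= 1.
 */
#include <math.h>
#include <stddef.h>

static void softmax_ref(const float *x, float *y, size_t n)
{
    /* Pass 1: running maximum (VFMAX reduction in the optimized kernel). */
    float max = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > max) max = x[i];

    /* Pass 2: exponentiate the shifted inputs and accumulate their sum
     * (VFEXP + VFADD under an FREP hardware loop, operands streamed via SSR). */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        y[i] = expf(x[i] - max);   /* hardware path: FEXP / VFEXP */
        sum += y[i];
    }

    /* Pass 3: normalize by the reciprocal of the sum (VFMUL). */
    float inv = 1.0f / sum;
    for (size_t i = 0; i < n; i++)
        y[i] *= inv;
}
```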
The paper provides a comprehensive hardware/software co-design methodology for accelerating a specific bottleneck in Transformers, showcasing the benefits of domain-specific ISA extensions on programmable architectures like RISC-V.