- The paper proposes a hardware-algorithm co-design that pairs 8-bit Hadamard-based quantization for linear layers with power-of-two (PoT) quantization for efficient SSM processing.
- It demonstrates up to a 68.80× speedup over a CPU and an 8.90× speedup over a GPU during prompt prefill, reducing latency for edge deployments.
- The accelerator employs first-order linear approximations for nonlinear functions, reducing DSP core usage while maintaining accuracy.
FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization
Introduction
The paper presents FastMamba, an FPGA-based accelerator designed for efficient deployment of Mamba2 models. State Space Models (SSMs) such as Mamba2 offer significant computational-efficiency advantages over traditional Transformer architectures, particularly on long sequences. The motivation for an FPGA-based solution is edge deployment with lower latency and stronger privacy than GPU-dependent implementations. FastMamba addresses the challenges of running Mamba2 on FPGAs through hardware-algorithm co-design, using quantization and linear-approximation strategies to cut computational load without significant accuracy loss.
Hardware-Algorithm Co-Design
Accurate Quantization
FastMamba introduces an 8-bit quantization method for linear layers based on the Hadamard transform. The rotation spreads outlier energy across channels, mitigating the impact of extreme activation and weight values on the quantization scale.
Figure 1: Distributions of activation values before and after the Hadamard transform.
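The paper's exact quantization formulation is not reproduced in this summary; the sketch below illustrates the underlying idea in NumPy, assuming a symmetric per-tensor int8 scale (an illustrative choice, not necessarily the paper's granularity).

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_int8_quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Rotate activations with an orthonormal Hadamard matrix, then
    quantize to int8 with a single symmetric per-tensor scale."""
    n = x.shape[-1]                      # feature dim, must be a power of two
    H = hadamard(n) / np.sqrt(n)         # orthonormal: H @ H.T = I
    x_rot = x @ H                        # rotation spreads outliers across channels
    scale = np.abs(x_rot).max() / 127.0  # flatter distribution -> tighter scale
    q = np.clip(np.round(x_rot / scale), -127, 127).astype(np.int8)
    return q, float(scale)

# Because H is orthogonal, folding H.T into the weights (W' = H.T @ W)
# leaves the layer output mathematically unchanged: (x H)(H.T W) = x W.
```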
For the convolutional layers and the SSM block, FastMamba employs a power-of-two (PoT) quantization framework. Because a PoT weight turns each multiplication into a bit shift, this scheme reduces hardware complexity while keeping accuracy degradation minimal.
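A minimal sketch of PoT rounding follows; the 4-bit default and the placement of the exponent window are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def pot_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round each nonzero weight to a signed power of two by rounding
    its exponent: w ~ sign(w) * 2^round(log2|w|). A product with 2^k is
    a k-bit shift in fixed-point hardware, so shifters replace the full
    multipliers that would otherwise occupy DSP cores.
    Assumes at least one nonzero weight."""
    out = np.zeros_like(w, dtype=np.float64)
    nz = w != 0
    exp = np.rint(np.log2(np.abs(w[nz])))
    # clamp exponents to a window representable with `bits` bits
    hi = exp.max()
    exp = np.clip(exp, hi - 2 ** (bits - 1) + 1, hi)
    out[nz] = np.sign(w[nz]) * np.exp2(exp)
    return out
```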
Nonlinear Function Approximation
To optimize hardware performance, FastMamba implements a first-order linear approximation for the nonlinear functions within the SSM block. By recasting Softplus and exponential operations into hardware-friendly forms, the architecture reduces demand for costly resources such as DSP cores; the approximation also remains compatible with PoT quantization, further improving computational efficiency.
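The paper's exact segmentation is not given in this summary. One common first-order construction that also meshes with power-of-two arithmetic writes exp(x) = 2^(x·log2 e), handles the integer part of the exponent as a bit shift, and replaces 2^f on the fractional part with the line 1 + f. Softplus can be handled analogously, since ln(1 + e^x) approaches x for large x and e^x for very negative x. A NumPy sketch:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e)

def exp_first_order(x: np.ndarray) -> np.ndarray:
    """First-order exponential: exp(x) = 2^(x * log2(e)) split into an
    integer exponent k (a pure shift in fixed-point hardware) and a
    fraction f, with 2^f approximated by the line 1 + f (exact at the
    segment ends f = 0 and f = 1)."""
    y = x * LOG2E
    k = np.floor(y)        # integer part -> handled as a bit shift
    f = y - k              # fractional part in [0, 1)
    return (1.0 + f) * np.exp2(k)

# Worst-case relative error of the single linear segment is ~6%:
x = np.linspace(-5.0, 5.0, 1001)
rel_err = np.abs(exp_first_order(x) - np.exp(x)) / np.exp(x)
print(rel_err.max())       # ~0.061
```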
Hardware Architecture
System Overview
FastMamba's architecture combines fixed-point computing units for Mamba2's dominant workloads, the linear layers and SSM computations, with floating-point units for lighter operations such as RMS normalization and SiLU activation.
Figure 2: Overall Architecture of FastMamba.
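For reference, the two operations kept on the floating-point path are standard kernels; their textbook definitions in NumPy are below (the eps value is illustrative).

```python
import numpy as np

def rmsnorm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """RMS normalization: x / sqrt(mean(x^2) + eps), scaled by a learned
    gain. The square root and division keep this on the float path."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```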
Vector Processing Units (VPUs)
A pivotal component of FastMamba's design is its array of parallel Vector Processing Units (VPUs), built from adders and multipliers. These scalable, modular units execute the key computational kernels; their configurations are detailed in Table 1 of the original paper.
Figure 3: The structure of the multipliers and adders in the VPUs.
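As a behavioral illustration only (not RTL, and with widths and lane counts that are assumptions rather than the paper's Table 1 values), one VPU dot-product lane under PoT weights can be modeled as a shift per lane followed by an adder-tree reduction:

```python
import numpy as np

def vpu_dot(a_q: np.ndarray, w_exp: np.ndarray, w_sign: np.ndarray) -> int:
    """Behavioral model of one VPU dot-product lane with PoT weights.

    Each "multiplier" is a left shift of an int8 activation by the
    weight's exponent; the "adders" form a reduction tree over the
    partial products."""
    partial = w_sign * (a_q.astype(np.int64) << w_exp)  # shift = x * 2^e
    return int(partial.sum())                           # adder-tree sum

# Example: 8 lanes of int8 activations against PoT weights +/-2^e.
rng = np.random.default_rng(0)
a_q = rng.integers(-128, 128, size=8, dtype=np.int8)
w_exp = rng.integers(0, 4, size=8)                  # exponents 0..3
w_sign = rng.choice(np.array([-1, 1]), size=8)
print(vpu_dot(a_q, w_exp, w_sign))
```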
Specialized Modules
FastMamba's Hadamard-based Linear Module and SSM Module build on the VPUs, providing specialized processing paths for their respective computations. Both exploit parallelism extensively, keeping computation fast and resource-efficient.
Experimental Results
The evaluation of FastMamba shows substantial gains in both speed and energy efficiency over CPU and GPU baselines. On Mamba2-130M, the benchmarks measured up to a 68.80× speedup over a CPU and an 8.90× speedup over a GPU during the prompt prefill stage.
Figure 4: Speedup over CPU and GPU on Mamba2-130M for different input sequence lengths during the prompt prefill stage.
Conclusion
FastMamba advances the deployment of Mamba2 models on FPGA platforms. By coupling accurate quantization with efficient linear approximation in a hardware-algorithm co-design, it achieves strong computational efficiency and speed, enabling real-time processing on edge devices with improved privacy and lower latency. The results position FastMamba as a robust solution for high-speed, low-power AI computing and a foundation for future FPGA-based accelerators in diverse applications.