Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition (2411.06968v1)

Published 11 Nov 2024 in cs.SD and eess.AS

Abstract: Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). Mamba has been applied to ASR task with the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as the decoder-only architecture in ASR task. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing that performs bidirectional processing on speech tokens, which enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech prefixing and the recently proposed Mamba-2 yields comparable performance to Transformer-based models on large datasets.

Mamba-Based Decoder-Only Architecture for Speech Recognition

The paper introduces a novel approach to automatic speech recognition (ASR) that leverages Mamba-based selective state space models (SSMs) in a decoder-only framework, termed MAmba-based DEcoder-ONly (MADEON). The architecture diverges from traditional encoder-decoder models by employing a single decoder that models the sequence from speech tokens to text tokens autoregressively. Removing the encoder and cross-attention reduces computational complexity, a frequent bottleneck in attention-based models, whose cost scales quadratically with sequence length.
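A minimal sketch of this formulation follows, assuming discrete speech tokens as input. The class names, layer counts, and the GRU stand-in for the Mamba block are illustrative choices, not the authors' implementation; the point is only that a single causal stack consumes the speech prefix and then predicts text tokens.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """Stand-in for one Mamba layer (an assumption; the paper stacks selective
    SSM blocks here). Any strictly causal sequence layer illustrates the idea."""
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

class DecoderOnlyASR(nn.Module):
    """Sketch of a MADEON-style decoder-only model: discrete speech tokens
    form the prefix and text tokens are predicted autoregressively."""
    def __init__(self, speech_vocab, text_vocab, d_model=512, n_layers=4):
        super().__init__()
        self.speech_emb = nn.Embedding(speech_vocab, d_model)
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.layers = nn.ModuleList([CausalBlock(d_model) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, text_vocab)

    def forward(self, speech_tokens, text_in):
        # One sequence: [speech prefix | text tokens]; no encoder, no cross-attention.
        x = torch.cat([self.speech_emb(speech_tokens), self.text_emb(text_in)], dim=1)
        for layer in self.layers:
            x = layer(x)
        # Positions from the last speech token onward predict the next text token
        # (teacher forcing); the training loss is computed only on these positions.
        return self.lm_head(x[:, speech_tokens.size(1) - 1 : -1, :])
```

A usage pass would feed batched speech-token and text-token IDs and compute cross-entropy between the returned logits and the shifted text targets.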

Central to the MADEON framework is the exploration of bidirectional speech modeling through a technique called speech prefixing, which enriches the contextual information derived from speech tokens. The experimental results underscore that MADEON outperforms non-selective SSMs and achieves performance on par with Transformer models on large datasets when combined with Mamba-2, an advanced version that allows more efficient handling of larger hidden states.
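The summary does not spell out how speech prefixing is realized beyond "bidirectional processing on speech tokens." One plausible reading, sketched below under that assumption, runs a second, reversed pass over the speech prefix and merges it with the causal pass, while the text span stays strictly left-to-right so autoregressive decoding is preserved. All module names are hypothetical, and the GRUs again stand in for Mamba blocks.

```python
import torch
import torch.nn as nn

class SpeechPrefixBiBlock(nn.Module):
    """Hypothetical speech-prefixing layer: the speech prefix is processed
    forward and backward, the text span only causally (left-to-right)."""
    def __init__(self, d_model):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)  # stand-ins for
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)  # selective SSM blocks

    def forward(self, x, prefix_len):
        # Causal pass over the whole sequence (speech prefix + text tokens).
        fwd_out, _ = self.fwd(x)
        # Reverse only the speech prefix and process it again, so each speech
        # position also sees its right context within the prefix.
        prefix = torch.flip(x[:, :prefix_len], dims=[1])
        bwd_out, _ = self.bwd(prefix)
        bwd_out = torch.flip(bwd_out, dims=[1])
        # Enrich the prefix hidden states with the backward stream; the text
        # span remains untouched, so autoregressive generation is still valid.
        out = fwd_out.clone()
        out[:, :prefix_len] = out[:, :prefix_len] + bwd_out
        return out
```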

Technical Contributions

  1. Decoder-Only Approach with Mamba: The paper implements a decoder-only ASR system with Mamba, showing that the selective SSM substantially reduces word error rate (WER) compared with non-selective models such as S4.
  2. Speech Prefixing: Bidirectional processing of the speech tokens enriches the contextual information in the hidden states, improving the model's ability to condition on the input speech and raising transcription accuracy.
  3. Integration with Mamba-2: Mamba-2's more efficient handling of larger SSM state sizes further reduces WER, particularly on large datasets (see the configuration sketch after this list).
  4. Efficiency and Scalability: The experiments verify that MADEON, especially in conjunction with Mamba-2, reduces computational and memory demands compared to Transformer-based solutions, making it a robust candidate for large-scale ASR tasks.
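As a hedged illustration of the Mamba-2 point in item 3, the snippet below shows how a larger SSM state might be configured if one used the mamba_ssm package; the package choice and the values are assumptions for illustration, not the paper's reported configuration.

```python
# Illustrative only: configuring the SSM state size with the mamba_ssm package
# (values are typical defaults, not the paper's settings).
from mamba_ssm import Mamba, Mamba2

mamba1_layer = Mamba(d_model=512, d_state=16)    # Mamba-1: small per-channel state
mamba2_layer = Mamba2(d_model=512, d_state=128)  # Mamba-2: much larger state at similar cost
```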

Experimental Results

The empirical evaluation spans a wide array of datasets, from English to non-English speech, showcasing the adaptability of the MADEON approach. Notably, on datasets such as LibriSpeech 960h and GigaSpeech, MADEON-2 with speech prefixing rivals or even surpasses Transformer-based architectures. The evaluation also shows that language-dependent SSL models used to derive discrete speech tokens significantly enhance ASR performance, emphasizing the importance of appropriate model selection.

Theoretical and Practical Implications

The introduction of Mamba in a decoder-only setup potentially reshapes ASR model design by highlighting the benefits of minimizing reliance on complex encoders and cross-attention mechanisms. The ability to reduce computational load without sacrificing performance makes MADEON particularly appealing for real-time applications and large-scale deployments where efficiency is critical. Future developments might concentrate on optimizing SSM training mechanisms and exploring further integration with different subword modeling techniques to push the performance boundaries.

Conclusion

The paper effectively broadens the scope of decoder-only architectures in ASR by harnessing the power of selective SSMs with innovations such as speech prefixing and Mamba-2. Its findings endorse a paradigm shift away from traditional encoder-decoder frameworks towards more computationally viable models that do not compromise on accuracy. This research paves the way for continued exploration into efficient sequence modeling techniques, potentially influencing broader applications in natural language processing and beyond.

Authors (3)
  1. Yoshiki Masuyama (30 papers)
  2. Koichi Miyazaki (6 papers)
  3. Masato Murata (4 papers)