Mamba-Based Decoder-Only Architecture for Speech Recognition
The paper introduces a novel approach to automatic speech recognition (ASR) that leverages Mamba-based selective state space models (SSMs) in a decoder-only framework, termed MAmba-based DEcoder-ONly (MADEON). The architecture departs from traditional encoder-decoder models by using a single decoder that models the sequence from speech tokens to text tokens autoregressively. Removing the encoder and cross-attention lowers computational complexity, avoiding the quadratic scaling with sequence length that burdens attention-based models.
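Conceptually, the decoder-only formulation treats the discrete speech tokens and the text tokens as one sequence and trains a single causal stack with next-token prediction on the text portion only. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the class name, dimensions, and the GRU stand-in for the Mamba blocks are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyASR(nn.Module):
    """Toy decoder-only ASR sketch: speech tokens and text tokens in one sequence."""
    def __init__(self, vocab_size, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Placeholder causal layers; in MADEON each layer would be a Mamba
        # (selective SSM) block rather than this generic recurrent stand-in.
        self.layers = nn.ModuleList(
            [nn.GRU(d_model, d_model, batch_first=True) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, speech_tokens, text_tokens):
        # One sequence: discrete speech tokens followed by text tokens.
        seq = torch.cat([speech_tokens, text_tokens], dim=1)
        x = self.embed(seq)
        for layer in self.layers:
            x, _ = layer(x)  # strictly left-to-right (causal) processing
        logits = self.lm_head(x)
        # Next-token prediction loss computed only over the text portion.
        n_speech = speech_tokens.size(1)
        pred = logits[:, n_speech - 1:-1, :]  # predictions aligned with text tokens
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               text_tokens.reshape(-1))

# Toy usage with random token ids (a shared vocabulary, for brevity).
model = DecoderOnlyASR(vocab_size=1000)
loss = model(torch.randint(0, 1000, (2, 50)), torch.randint(0, 1000, (2, 20)))
```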
Central to the MADEON framework is bidirectional speech modeling through a technique called speech prefixing, which enriches the contextual information derived from the speech tokens. The experimental results show that MADEON outperforms non-selective SSMs and, when combined with Mamba-2 (a successor to Mamba that handles larger hidden states more efficiently), performs on par with Transformer models on large datasets.
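A rough way to picture speech prefixing: the speech prefix receives an extra reverse-direction pass so each speech position also sees future speech context, while text positions remain strictly causal. The following is a hedged sketch under that interpretation; the layer name, shapes, and the GRU stand-ins for the forward and backward Mamba blocks are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class PrefixBidirectionalLayer(nn.Module):
    """Toy layer: bidirectional processing of the speech prefix, causal elsewhere."""
    def __init__(self, d_model=256):
        super().__init__()
        # Generic causal stand-ins for the forward and backward Mamba blocks.
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x, n_speech):
        # Standard causal pass over the full sequence (speech prefix + text).
        out, _ = self.fwd(x)
        # Extra reverse pass restricted to the speech prefix, then flipped back,
        # so every speech position also sees its right-hand (future) speech context.
        back, _ = self.bwd(torch.flip(x[:, :n_speech], dims=[1]))
        back = torch.flip(back, dims=[1])
        out = out.clone()
        out[:, :n_speech] = out[:, :n_speech] + back
        # Text positions are untouched, so generation stays autoregressive.
        return out

# Toy usage: batch of 2, 50 speech frames followed by 20 text embeddings.
layer = PrefixBidirectionalLayer()
y = layer(torch.randn(2, 70, 256), n_speech=50)
```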
Technical Contributions
- Decoder-Only Approach with Mamba: The paper implements a decoder-only ASR system with Mamba and shows that the selective SSM yields a markedly lower word error rate (WER) than non-selective SSMs such as S4.
- Speech Prefixing: This technique enables bidirectional processing of the speech tokens, strengthening the model's ability to contextualize the input speech sequence and improving transcription accuracy.
- Integration with Mamba-2: Mamba-2 makes larger state sizes practical at comparable cost, which further reduces WER, particularly on large datasets.
- Efficiency and Scalability: The experiments verify that MADEON, especially in conjunction with Mamba-2, reduces computational and memory demands compared with Transformer-based models, making it a strong candidate for large-scale ASR; a toy sketch of the constant-memory SSM decoding step follows this list.
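To make the efficiency argument concrete, the toy sketch below shows why SSM-style decoding keeps memory flat: each step updates a fixed-size state instead of growing a key/value cache the way attention does. The parameters are random stand-ins, and Mamba's input-dependent (selective) parameterization is omitted for brevity.

```python
import torch

d_model, d_state = 256, 16   # state size is fixed, independent of sequence length

# Random stand-ins for (discretized) SSM parameters; a real Mamba block would
# make B and C depend on the current input (the "selective" part).
A = torch.rand(d_state) * 0.9            # diagonal decay
B = torch.randn(d_state, d_model) * 0.01
C = torch.randn(d_model, d_state) * 0.01

def ssm_step(h, x_t):
    """One decoding step: constant-time update of a constant-size state."""
    h = A * h + B @ x_t      # recurrent state update
    y_t = C @ h              # output for this step
    return h, y_t

h = torch.zeros(d_state)
for t in range(1000):        # memory stays O(d_state) no matter how long we decode
    h, y = ssm_step(h, torch.randn(d_model))
```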
Experimental Results
The empirical evaluation covers a wide range of datasets, from English to non-English speech, showcasing the adaptability of the MADEON approach. On LibriSpeech 960h and GigaSpeech, MADEON-2 (the Mamba-2 variant) with speech prefixing rivals or even surpasses Transformer-based architectures. The evaluation also shows that language-dependent SSL models markedly improve ASR performance when used to produce the discrete speech tokens, emphasizing the importance of selecting an appropriate SSL model.
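For context, discrete speech tokens are commonly obtained by clustering frame-level SSL features (for example, k-means over HuBERT-style representations). The snippet below sketches that generic recipe; the cluster count, feature dimension, and function names are assumptions and not tied to the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(ssl_frames, n_clusters=100, seed=0):
    """Fit a k-means codebook on SSL frame features (n_frames x feat_dim)."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(ssl_frames)

def tokenize(codebook, utterance_frames):
    """Map each SSL frame to its nearest cluster id, i.e. a discrete speech token."""
    return codebook.predict(utterance_frames)

# Random stand-ins for SSL features (e.g., 768-dim frames from a HuBERT-style model).
codebook = build_codebook(np.random.randn(2000, 768).astype(np.float32))
speech_tokens = tokenize(codebook, np.random.randn(200, 768).astype(np.float32))
```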
Theoretical and Practical Implications
The introduction of Mamba in a decoder-only setup potentially reshapes ASR model design by highlighting the benefits of minimizing reliance on complex encoders and cross-attention mechanisms. The ability to reduce computational load without sacrificing performance makes MADEON particularly appealing for real-time applications and large-scale deployments where efficiency is critical. Future developments might concentrate on optimizing SSM training mechanisms and exploring further integration with different subword modeling techniques to push the performance boundaries.
Conclusion
The paper broadens the scope of decoder-only architectures in ASR by harnessing selective SSMs together with speech prefixing and Mamba-2. Its findings support a shift away from traditional encoder-decoder frameworks toward more computationally efficient models that do not compromise accuracy. This research paves the way for continued exploration of efficient sequence modeling techniques, with potential influence on broader applications in natural language processing and beyond.