SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model (2405.11831v1)
Abstract: Transformers have revolutionized deep learning across a range of tasks, including audio representation learning, thanks to their powerful modeling capabilities. However, their self-attention scales quadratically with input length, inflating both GPU memory usage and inference time. Recently, state space models (SSMs) such as Mamba have emerged as a promising alternative, scaling linearly with sequence length. Given these advantages, we explore the potential of SSM-based models for audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, SSM-based model for audio representation learning. SSAMBA leverages bidirectional Mamba blocks to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale unlabeled datasets. We evaluate SSAMBA on tasks such as audio classification, keyword spotting, and speaker identification. Our results show that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on most tasks. Notably, at the tiny model size with an input of 22k tokens, SSAMBA is approximately 92.7% faster in batch inference and 95.4% more memory-efficient than SSAST. These efficiency gains, combined with superior performance, underscore the effectiveness of SSAMBA's architectural innovation and make it a compelling choice for a wide range of audio processing applications.
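To make the abstract's two core ideas concrete, here is a minimal, hedged sketch: a bidirectional Mamba encoder over spectrogram patches, and a joint discriminative + generative masked-patch objective in the style of SSAST. All class names, dimensions, and the residual fusion scheme are illustrative assumptions rather than the authors' implementation; the sketch assumes the `mamba_ssm` package's `Mamba` block is available.

```python
# Minimal sketch (not the authors' code): a bidirectional Mamba encoder over
# spectrogram patches plus a joint discriminative + generative masked-patch
# objective, as described in the abstract. Assumes the `mamba_ssm` package
# (pip install mamba-ssm) provides the `Mamba` block; all names, dimensions,
# and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba


class BiMambaBlock(nn.Module):
    """Scan the patch sequence in both directions and fuse the results."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim)  # left-to-right scan
        self.bwd = Mamba(d_model=dim)  # right-to-left scan

    def forward(self, x):  # x: (batch, seq, dim)
        h = self.norm(x)
        out_fwd = self.fwd(h)
        out_bwd = self.bwd(h.flip(1)).flip(1)  # flip, scan, flip back
        return x + out_fwd + out_bwd  # residual fusion of both directions


class MaskedPatchPretrainer(nn.Module):
    """Joint discriminative + generative pretraining over masked patches."""

    def __init__(self, patch_dim: int = 256, dim: int = 192, depth: int = 4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([BiMambaBlock(dim) for _ in range(depth)])
        self.gen_head = nn.Linear(dim, patch_dim)   # reconstructs the patch
        self.disc_head = nn.Linear(dim, patch_dim)  # matches the patch

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim) flattened spectrogram patches
        # mask:    (B, N) bool, True where the patch is hidden from the model
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        for blk in self.blocks:
            x = blk(x)
        h = x[mask]             # (M, dim) encoder states at masked positions
        target = patches[mask]  # (M, patch_dim) ground-truth patches
        # Generative objective: directly reconstruct each masked patch.
        loss_gen = F.mse_loss(self.gen_head(h), target)
        # Discriminative objective: InfoNCE-style matching, where each masked
        # position must pick out its own patch among all masked patches.
        logits = self.disc_head(h) @ target.t()  # (M, M) similarity matrix
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_disc = F.cross_entropy(logits, labels)
        return loss_gen + loss_disc
```

In this sketch the discriminative head must identify the true patch among all masked patches in the batch (in-batch negatives), while the generative head reconstructs it directly; the actual loss weighting, masking strategy, and block design in SSAMBA follow the paper itself.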
- “Self-Supervised Speech Representation Learning: A Review,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, Oct. 2022.
- “SSAST: Self-supervised audio spectrogram transformer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 10699–10709, Jun. 2022.
- “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- “AST: Audio Spectrogram Transformer,” in Proc. Interspeech 2021, 2021, pp. 571–575.
- Rudolph Emil Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
- “Combining recurrent, convolutional, and continuous-time models with linear state-space layers,” arXiv preprint arXiv:2110.13985, 2021.
- “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
- “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, Dec. 2023.
- “Vision Mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, Feb. 2024.
- “Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining,” 2024.
- “U-Mamba: Enhancing long-range dependency for biomedical image segmentation,” 2024.
- “SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation,” 2024.
- “VideoMamba: State space model for efficient video understanding,” 2024.
- “Graph-Mamba: Towards long-range graph sequence modeling with selective state spaces,” 2024.
- “Multichannel long-term streaming neural speech enhancement for static and moving speakers,” arXiv preprint arXiv:2403.07675, 2024.
- “TRAMBA: A hybrid Transformer and Mamba architecture for practical audio and bone-conduction speech super-resolution and enhancement on mobile and wearable platforms,” 2024.
- “Dual-path Mamba: Short and long-term bidirectional selective structured state space models for speech separation,” arXiv preprint arXiv:2403.18257, Apr. 2024.
- “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- “Audio Set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
- “LibriSpeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
- “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015.
- Karol J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 1015–1018.
- Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
- “VoxCeleb: A large-scale speaker identification dataset,” in Proc. Interspeech 2017, 2017.
Authors: Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani