SPMamba: State-space model is all you need in speech separation (2404.02063v2)

Published 2 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Existing CNN-based speech separation models face local receptive field limitations and cannot effectively capture long-term dependencies. LSTM- and Transformer-based models avoid this problem, but their high complexity creates computational-resource and inference-efficiency challenges when dealing with long audio. To address this challenge, we introduce an innovative speech separation method called SPMamba. This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules. These modules effectively model the relationships across the time and frequency dimensions, allowing SPMamba to capture long-range dependencies with linear computational complexity. Specifically, the bidirectional processing within the Mamba modules enables the model to utilize both past and future contextual information, thereby enhancing separation performance. Extensive experiments on public datasets, including WSJ0-2Mix, WHAM!, and Libri2Mix, as well as the newly constructed Echo2Mix dataset, demonstrate that SPMamba significantly outperforms existing state-of-the-art models, achieving superior results while also reducing computational complexity. These findings highlight the effectiveness of SPMamba in tackling the intricate challenges of speech separation in complex environments.

Authors (4)
  1. Kai Li (313 papers)
  2. Guo Chen (107 papers)
  3. Runxuan Yang (6 papers)
  4. Xiaolin Hu (97 papers)

Summary

SPMamba: Advancing Speech Separation with State-Space Models

Introduction to SPMamba

Speech separation technology is essential for improving audio clarity in environments with overlapping speakers, facilitating advancements in audio analysis and clearer communication. Recent developments have leveraged CNNs, RNNs, and Transformer architectures, each presenting unique benefits and limitations in processing audio signals. Conventional CNN-based models, despite their robustness in handling various auditory tasks, struggle with limited receptive fields that hinder their performance in capturing the full context of long audio sequences. On the opposite end, Transformer-based methods excel in modeling long-range dependencies but suffer from high computational demands, rendering them less practical for real-time applications.

State-Space Models (SSMs) have emerged as a promising alternative, capturing long-range dependencies in long sequences while keeping the computational cost manageable (linear in sequence length). This paper introduces SPMamba, a novel architecture that integrates the State-Space Model approach into speech separation, improving both separation quality and computational efficiency.
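
As background (in generic notation, not the paper's own), a discretized linear SSM layer applies the recurrence below; each step has constant cost, so a pass over a length-L sequence is O(L):

```latex
% Discretized linear state-space recurrence (generic notation):
% x_k is the hidden state, u_k the input frame, y_k the output frame.
\begin{aligned}
x_k &= \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k,\\
\bar{A} &= \exp(\Delta A), \qquad \bar{B} \approx \Delta B,
\end{aligned}
```

where Δ is the discretization step. Selective SSMs such as Mamba additionally make B, C, and Δ functions of the current input.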

Background and Model Design

The Mamba Technique

Mamba, the sequence model on which SPMamba builds, is a selective State-Space Model: its state-space parameters are computed from the input, so the layer dynamically emphasizes the parts of the audio signal most relevant to separation while retaining linear-time processing. This design combines the parallel-friendly training of convolutional models with the efficient recurrent inference of RNNs, mitigating the limitations of both and making it well suited to the long sequences encountered in speech separation.
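
To make the selectivity concrete, here is a minimal, non-optimized sketch of a selective SSM scan in PyTorch. It illustrates the mechanism only: the tensor names (`u`, `A`, `B`, `C`, `delta`) and shapes are assumptions for this example rather than the authors' implementation, and a practical Mamba layer replaces the Python loop with a fused parallel scan.

```python
import torch

def selective_scan(u, A, B, C, delta):
    """Reference (slow) selective SSM scan.

    u:     (batch, length, d)   input sequence
    A:     (d, n)               state-transition parameters (kept negative in practice)
    B, C:  (batch, length, n)   input-dependent projections (the "selective" part)
    delta: (batch, length, d)   input-dependent step sizes
    Returns y with shape (batch, length, d).
    """
    batch, length, d = u.shape
    n = A.shape[-1]
    x = torch.zeros(batch, d, n, device=u.device)
    ys = []
    for t in range(length):
        # Discretize A and B with the per-step, input-dependent delta.
        dA = torch.exp(delta[:, t].unsqueeze(-1) * A)            # (batch, d, n)
        dB = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)    # (batch, d, n)
        x = dA * x + dB * u[:, t].unsqueeze(-1)                  # recurrent state update
        y = (x * C[:, t].unsqueeze(1)).sum(-1)                   # project state to output
        ys.append(y)
    return torch.stack(ys, dim=1)

# Example shapes: u (2, 100, 64), A (64, 16), B and C (2, 100, 16), delta (2, 100, 64).
```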

SPMamba Architecture

SPMamba builds upon the TF-GridNet model and replaces its BLSTM modules with bidirectional Mamba modules. This modification enhances the model's ability to capture broad contextual information within audio sequences while sidestepping the efficiency constraints of CNN- and RNN-based separation methods. The architecture of SPMamba features (a minimal code sketch of the bidirectional wrapping follows the list below):

  • A bidirectional Mamba layer as the core, enabling effective modeling of both forward and backward sequences in non-causal speech separation tasks.
  • Integration within the TF-GridNet framework, leveraging its strengths in handling time-frequency dimensions while improving efficiency through the Mamba module.
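
The bidirectional wrapping can be sketched as follows. This is a hedged illustration, not the authors' code: `make_ssm_layer` is a hypothetical factory for any causal sequence module mapping `(batch, length, dim)` to `(batch, length, dim)` (a Mamba block in SPMamba's case), and fusing the two directions by concatenation plus a linear projection is one common choice.

```python
import torch
import torch.nn as nn

class BidirectionalSSMBlock(nn.Module):
    """Run a causal sequence module in both time directions and fuse the results."""

    def __init__(self, dim, make_ssm_layer):
        super().__init__()
        self.fwd_ssm = make_ssm_layer(dim)   # left-to-right pass
        self.bwd_ssm = make_ssm_layer(dim)   # right-to-left pass (on the reversed sequence)
        self.proj = nn.Linear(2 * dim, dim)  # fuse both directions back to `dim`

    def forward(self, x):                    # x: (batch, length, dim)
        fwd = self.fwd_ssm(x)
        rev = torch.flip(x, dims=[1])        # reverse along the time axis
        bwd = torch.flip(self.bwd_ssm(rev), dims=[1])
        return self.proj(torch.cat([fwd, bwd], dim=-1))
```

In a TF-GridNet-style layer, such a module is typically applied along the time axis and along the frequency axis in turn, which is how SPMamba relates the two dimensions.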

Empirical Evaluation

The effectiveness of SPMamba was evaluated on the newly constructed Echo2Mix dataset, a challenging benchmark containing both noise and reverberation, alongside public benchmarks. The model outperformed existing speech separation models across the reported metrics. Notably, SPMamba achieved a 2.42 dB improvement in SI-SNRi over its baseline, TF-GridNet, while also reducing the parameter count and overall computational footprint. These results underscore the model's ability to deliver high-quality speech separation with improved efficiency.
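
For reference, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of an estimate over the unprocessed mixture; the commonly used definition (not reproduced from the paper) is:

```latex
% Scale-invariant SNR of an estimate \hat{s} against the reference s;
% SI-SNRi is its improvement over the unprocessed mixture x.
\mathrm{SI\text{-}SNR}(\hat{s}, s)
  = 10 \log_{10} \frac{\lVert \alpha s \rVert^{2}}{\lVert \hat{s} - \alpha s \rVert^{2}},
\qquad \alpha = \frac{\hat{s}^{\top} s}{\lVert s \rVert^{2}},
\qquad
\mathrm{SI\text{-}SNRi}
  = \mathrm{SI\text{-}SNR}(\hat{s}, s) - \mathrm{SI\text{-}SNR}(x, s).
```

The 2.42 dB figure above is the difference in this quantity between SPMamba and TF-GridNet.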

Conclusion and Future Directions

SPMamba sets a new benchmark in the field of speech separation by adeptly integrating the benefits of State-Space Models. The superior performance and efficiency of SPMamba not only address the current challenges in speech separation technology but also open up new avenues for future research. The scalability and adaptability of SPMamba suggest a broad potential for further advancements in audio processing tasks, challenging the research community to explore the integration of SSMs in other domains of AI.