Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation (2403.18257v2)

Published 27 Mar 2024 in eess.AS and cs.SD

Abstract: Transformers have been the most successful architecture for various speech modeling tasks, including speech separation. However, the self-attention mechanism in transformers with quadratic complexity is inefficient in computation and memory. Recent models incorporate new layers and modules along with transformers for better performance but also introduce extra model complexity. In this work, we replace transformers with Mamba, a selective state space model, for speech separation. We propose dual-path Mamba, which models short-term and long-term forward and backward dependency of speech signals using selective state spaces. Our experimental results on the WSJ0-2mix data show that our dual-path Mamba models of comparably smaller sizes outperform state-of-the-art RNN model DPRNN, CNN model WaveSplit, and transformer model Sepformer. Code: https://github.com/xi-j/Mamba-TasNet
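As a rough illustration of the dual-path idea described in the abstract, the sketch below wraps a bidirectional Mamba layer in the usual intra-chunk (short-term) and inter-chunk (long-term) processing pattern of dual-path models. This is not the authors' implementation (their released code is at the repository linked above); it assumes the `mamba_ssm` package exposes a `Mamba(d_model=...)` module operating on (batch, length, dim) tensors, and the chunking shape, block count, and fusion details are placeholders.

```python
# Minimal sketch of a dual-path bidirectional Mamba block (not the paper's code).
# Assumes the `mamba_ssm` package provides `Mamba(d_model=...)` on (B, L, D) tensors.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed API of the Mamba package


class BiMamba(nn.Module):
    """Bidirectional Mamba: one forward scan plus one time-reversed scan."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)
        self.bwd = Mamba(d_model=d_model)
        self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, length, d_model)
        y_fwd = self.fwd(x)                    # forward (left-to-right) dependency
        y_bwd = self.bwd(x.flip(1)).flip(1)    # backward (right-to-left) dependency
        return self.norm(x + self.proj(torch.cat([y_fwd, y_bwd], dim=-1)))


class DualPathMambaBlock(nn.Module):
    """Short-term (intra-chunk) then long-term (inter-chunk) modeling."""

    def __init__(self, d_model: int):
        super().__init__()
        self.intra = BiMamba(d_model)  # models samples inside each chunk
        self.inter = BiMamba(d_model)  # models the sequence of chunks

    def forward(self, x):              # x: (batch, n_chunks, chunk_len, d_model)
        b, s, k, d = x.shape
        # Short-term path: scan within every chunk independently.
        x = self.intra(x.reshape(b * s, k, d)).reshape(b, s, k, d)
        # Long-term path: scan across chunks at each within-chunk position.
        x = x.permute(0, 2, 1, 3).reshape(b * k, s, d)
        x = self.inter(x).reshape(b, k, s, d).permute(0, 2, 1, 3)
        return x
```

In a full separator along these lines, several such blocks would be stacked between a learned encoder and decoder (as in TasNet-style models), with the chunking of the encoded signal done beforehand; those pieces are omitted here.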

References (39)
  1. “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
  2. Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  3. “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  4. Yi Luo and Nima Mesgarani, “Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  5. “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 46–50.
  6. “Attention is all you need in speech separation,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 21–25.
  7. “Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs,” in Proceedings of the 1st Workshop on Representation Learning for NLP, Phil Blunsom, Kyunghyun Cho, Shay Cohen, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, and Scott Wen-tau Yih, Eds., Berlin, Germany, Aug. 2016, pp. 87–93, Association for Computational Linguistics.
  8. “Exploring self-attention mechanisms for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2169–2180, 2022.
  9. “QDPN - Quasi-dual-path Network for single-channel Speech Separation,” in Proc. Interspeech 2022, 2022, pp. 5353–5357.
  10. Rudolph Emil Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
  11. “Combining recurrent, convolutional, and continuous-time models with linear state space layers,” Advances in neural information processing systems, vol. 34, pp. 572–585, 2021.
  12. “Efficiently modeling long sequences with structured state spaces,” in The International Conference on Learning Representations (ICLR), 2022.
  13. “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  14. “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
  15. “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
  16. “Wavesplit: End-to-end speech separation by speaker clustering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2840–2849, 2021.
  17. “Voice separation with an unknown number of multiple speakers,” in International Conference on Machine Learning. PMLR, 2020, pp. 7164–7175.
  18. “Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation,” 2023.
  19. “End-to-end source separation with adaptive front-ends,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 684–688.
  20. Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
  21. “Sudo rm -rf: Efficient networks for universal audio source separation,” in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020, pp. 1–6.
  22. “Compute and memory efficient universal sound source separation,” Journal of Signal Processing Systems, vol. 94, no. 2, pp. 245–259, 2022.
  23. “Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation,” in Proc. Interspeech 2020, 2020, pp. 2642–2646.
  24. “On time domain conformer models for monaural speech separation in noisy reverberant acoustic environments,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–7.
  25. “Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  26. “Separate and diffuse: Using a pretrained diffusion model for better source separation,” in The Twelfth International Conference on Learning Representations, 2024.
  27. “U-mamba: Enhancing long-range dependency for biomedical image segmentation,” arXiv preprint arXiv:2401.04722, 2024.
  28. “Graph-mamba: Towards long-range graph sequence modeling with selective state spaces,” arXiv preprint arXiv:2402.00789, 2024.
  29. “Motion mamba: Efficient and long sequence motion generation with hierarchical and bidirectional selective ssm,” arXiv preprint arXiv:2403.07487, 2024.
  30. “Vivim: a video vision mamba for medical video object segmentation,” arXiv preprint arXiv:2401.14168, 2024.
  31. “Pointmamba: A simple state space model for point cloud analysis,” arXiv preprint arXiv:2402.10739, 2024.
  32. “Multichannel long-term streaming neural speech enhancement for static and moving speakers,” arXiv preprint arXiv:2403.07675, 2024.
  33. “A neural state-space model approach to efficient speech separation,” 2023.
  34. “Root mean square layer normalization,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  35. “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
  36. “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
  37. “SDR – half-baked or well done?,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
  38. “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015.
  39. “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
Authors (3)
  1. Xilin Jiang (17 papers)
  2. Cong Han (27 papers)
  3. Nima Mesgarani (45 papers)
Citations (27)