
On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments (2310.06125v1)

Published 9 Oct 2023 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: Speech separation remains an important topic for multi-speaker technology researchers. Convolution-augmented transformers (conformers) have performed well on many speech processing tasks but remain under-explored for speech separation. Most recent state-of-the-art (SOTA) separation models have been time-domain audio separation networks (TasNets). A number of successful models have made use of dual-path (DP) networks, which sequentially process local and global information. Time-domain conformers (TD-Conformers) are an analogue of the DP approach in that they also process local and global context sequentially, but with a different time complexity function. It is shown that for realistic shorter signal lengths, conformers are more efficient when controlling for feature dimension. Subsampling layers are proposed to further improve computational efficiency. The best TD-Conformer achieves 14.6 dB and 21.2 dB SI-SDR improvement on the WHAMR! and WSJ0-2Mix benchmarks, respectively.
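The conformer block at the heart of the TD-Conformer interleaves global self-attention with local depthwise convolution, which is where the abstract's complexity argument comes from: self-attention costs O(T²·d) in sequence length T, while the convolution costs O(T·d·k) for kernel size k, so for short or subsampled sequences the quadratic term stays manageable. The sketch below is a minimal standard conformer block in the style of Gulati et al. (Interspeech 2020), written in PyTorch. It illustrates the sequential local/global pattern only; it is not the authors' exact TD-Conformer, and the dimensions, kernel size, and layer choices here are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Local-context module: pointwise expansion + GLU, depthwise conv,
    BatchNorm, SiLU, pointwise projection, with a residual connection.
    Hyperparameters are illustrative, not taken from the paper."""
    def __init__(self, dim: int, kernel: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pw2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                        # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)         # -> (batch, dim, time)
        y = self.glu(self.pw1(y))                # gated pointwise expansion
        y = self.act(self.bn(self.dw(y)))        # depthwise conv: local context, O(T*d*k)
        y = self.pw2(y).transpose(1, 2)          # -> (batch, time, dim)
        return x + y

class ConformerBlock(nn.Module):
    """Standard conformer block (Gulati et al., 2020): half-step FFN,
    multi-head self-attention (global context), convolution module
    (local context), half-step FFN, final LayerNorm."""
    def __init__(self, dim: int = 256, heads: int = 4,
                 kernel: int = 31, ff_mult: int = 4):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context, O(T^2*d)
        x = self.conv(x)                         # local context (residual inside)
        x = x + 0.5 * self.ff2(x)                # second half-step feed-forward
        return self.out_norm(x)

# Illustrative usage: a batch of 2 sequences of 100 encoder frames, 256-dim.
x = torch.randn(2, 100, 256)
print(ConformerBlock(dim=256)(x).shape)          # torch.Size([2, 100, 256])
```

Stacking such blocks over (optionally subsampled) time-domain encoder features is the general recipe the abstract describes; chunk-based dual-path models instead alternate intra-chunk and inter-chunk processing to achieve a similar local/global split, which is why the two approaches have different time complexity functions.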

Authors (3)
  1. William Ravenscroft (8 papers)
  2. Stefan Goetze (20 papers)
  3. Thomas Hain (58 papers)
Citations (6)
