CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation (2403.03411v1)

Published 6 Mar 2024 in cs.SD and eess.AS

Abstract: We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
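The abstract describes the block composition only at a high level (encoder, global multi-head self-attention, cross-band module, narrow-band module, output layer, plus a random chunk positional encoding). The following is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together; the layer choices, dimensions, the `random_chunk_positions` helper, and the way positional indices are applied are assumptions made for illustration and are not the authors' implementation.

```python
# Hypothetical sketch of the block structure named in the abstract:
# encoder -> global self-attention -> cross-band -> narrow-band -> output.
# All shapes and layer choices below are illustrative assumptions.
import torch
import torch.nn as nn


def random_chunk_positions(seq_len: int, max_len: int = 4096) -> torch.Tensor:
    """One plausible reading of 'random chunk positional encoding':
    draw position indices from a random contiguous chunk of a larger
    range so training sees varied absolute positions (assumption)."""
    start = torch.randint(0, max_len - seq_len + 1, (1,)).item()
    return torch.arange(start, start + seq_len)


class CrossNetSketch(nn.Module):
    def __init__(self, n_freq: int = 257, d_model: int = 96, n_heads: int = 4, n_spk: int = 2):
        super().__init__()
        # Encoder: project stacked real/imaginary STFT channels to d_model features.
        self.encoder = nn.Conv2d(2, d_model, kernel_size=3, padding=1)
        self.pos_emb = nn.Embedding(4096, d_model)
        # Global module: multi-head self-attention along time.
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-band module: mixes information across frequency bins at each frame.
        self.cross_band = nn.Sequential(nn.Linear(n_freq, n_freq), nn.PReLU())
        # Narrow-band module: models each frequency band along time.
        self.narrow_band = nn.LSTM(d_model, d_model, batch_first=True, bidirectional=True)
        self.narrow_proj = nn.Linear(2 * d_model, d_model)
        # Output layer: complex spectral mapping to n_spk sources (real + imaginary).
        self.output = nn.Conv2d(d_model, 2 * n_spk, kernel_size=1)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 2, time, freq) -- stacked real/imaginary STFT of the mixture.
        b, _, t, f = spec.shape
        x = self.encoder(spec)                                   # (b, d, t, f)
        pos = random_chunk_positions(t).to(spec.device)
        x = x + self.pos_emb(pos)[None, :, :, None].permute(0, 2, 1, 3)
        # Global self-attention over time, frequencies folded into the batch.
        g = x.permute(0, 3, 2, 1).reshape(b * f, t, -1)          # (b*f, t, d)
        g, _ = self.global_attn(g, g, g)
        x = g.reshape(b, f, t, -1).permute(0, 3, 2, 1)           # (b, d, t, f)
        # Cross-band mixing along the frequency axis.
        x = self.cross_band(x)
        # Narrow-band modeling along time, one frequency band at a time.
        nb = x.permute(0, 3, 2, 1).reshape(b * f, t, -1)
        nb, _ = self.narrow_band(nb)
        nb = self.narrow_proj(nb)
        x = nb.reshape(b, f, t, -1).permute(0, 3, 2, 1)
        return self.output(x)                                    # (b, 2*n_spk, t, f)
```

Folding the frequency axis into the batch for the global and narrow-band stages is only one possible arrangement; the paper defines the actual ordering and internals of these modules.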

Authors (2)
  1. Vahid Ahmadi Kalkhorani (4 papers)
  2. DeLiang Wang (43 papers)
Citations (2)
