CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation (2403.03411v1)
Abstract: We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation on long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, which achieves state-of-the-art performance on tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet trains faster and more stably than recent baselines. Its high performance also extends to multi-microphone conditions, demonstrating its versatility across acoustic scenarios.
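The module pipeline and the random chunk positional encoding described above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' implementation: the class names, shapes, and hyperparameters (`RandomChunkPositionalEncoding`, `CrossNetBlockSketch`, `max_len`, `n_heads`, the kernel size) are all hypothetical, and the global, cross-band, and narrow-band modules are approximated with plausible stand-ins (time-axis self-attention, a frequency-axis convolution, and a per-frequency BLSTM, respectively). The positional encoding follows the general idea of randomized positional encodings: draw a contiguous chunk of positions at a random offset during training, so that short training utterances still expose the model to the large positional indices produced by long test utterances.

```python
import math

import torch
import torch.nn as nn


class RandomChunkPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding applied at a random offset (sketch).

    During training, a length-T input receives the contiguous positions
    [s, s + T) for a random start s in [0, max_len - T]. At inference,
    s is fixed to 0. The single offset per forward pass and the max_len
    default are illustrative choices, not the paper's exact recipe.
    """

    def __init__(self, d_model: int, max_len: int = 8000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)            # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2)
                             * (-math.log(10000.0) / d_model))   # (d_model/2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                           # (max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model), with T <= max_len.
        T = x.size(1)
        start = int(torch.randint(0, self.pe.size(0) - T + 1, (1,))) if self.training else 0
        return x + self.pe[start:start + T].unsqueeze(0)


class CrossNetBlockSketch(nn.Module):
    """One CrossNet-style block with plausible stand-in modules (sketch).

    The global module is multi-head self-attention over time, the
    cross-band module is a convolution across the frequency axis, and
    the narrow-band module is a BLSTM shared across frequency bins and
    run along time. These stand-ins convey the data flow only; they are
    not the authors' exact layer designs. Assumes d_model is even.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_band = nn.Conv1d(d_model, d_model, kernel_size=5, padding=2)
        self.narrow_band = nn.LSTM(d_model, d_model // 2,
                                   batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, F, D = x.shape
        # Global module: attention over time, frequency folded into batch.
        g = x.permute(0, 2, 1, 3).reshape(B * F, T, D)
        g, _ = self.global_attn(g, g, g)
        x = x + g.reshape(B, F, T, D).permute(0, 2, 1, 3)
        # Cross-band module: convolution along the frequency axis.
        c = x.reshape(B * T, F, D).transpose(1, 2)               # (B*T, D, F)
        c = self.cross_band(c).transpose(1, 2)
        x = x + c.reshape(B, T, F, D)
        # Narrow-band module: shared BLSTM along time, per frequency bin.
        n = x.permute(0, 2, 1, 3).reshape(B * F, T, D)
        n, _ = self.narrow_band(n)
        return x + n.reshape(B, F, T, D).permute(0, 2, 1, 3)


if __name__ == "__main__":
    B, T, F, D = 2, 100, 65, 64
    block = CrossNetBlockSketch(d_model=D)
    print(block(torch.randn(B, T, F, D)).shape)   # torch.Size([2, 100, 65, 64])
    rcpe = RandomChunkPositionalEncoding(d_model=D)
    print(rcpe(torch.randn(B, T, D)).shape)       # torch.Size([2, 100, 64])
```

Folding the frequency axis into the batch for the narrow-band module is the standard narrow-band trick: the same temporal model is shared across all frequency bins, which keeps the parameter count independent of the STFT size.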
- Vahid Ahmadi Kalkhorani
- DeLiang Wang