Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers (2403.07675v2)
Abstract: In this work, we extend our previously proposed offline SpatialNet to long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, to discriminate between target speech and interferences, and achieves outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamics of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t. signal length while maintaining the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of the different networks, i.e., testing on signals much longer than the training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced at https://github.com/Audio-WestlakeU/NBSS.
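The snippet below is a minimal PyTorch sketch of the masked self-attention idea mentioned in the abstract: a standard multi-head attention layer applied to each narrow-band (per-frequency) frame sequence, with an upper-triangular mask so every frame attends only to current and past frames, making online (causal) inference possible. The class name, feature dimension, and head count are illustrative assumptions, not the configuration of the released SpatialNet code.

```python
import torch
import torch.nn as nn

class CausalNarrowBandAttention(nn.Module):
    """Minimal sketch (not the authors' implementation) of a causally masked
    self-attention layer operating on one sequence per narrow-band frequency."""

    def __init__(self, dim: int = 96, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_freqs, num_frames, dim) -- one sequence per frequency bin
        T = x.shape[1]
        # Boolean upper-triangular mask: True entries are blocked, so frame t
        # can only attend to frames 0..t (no access to future frames).
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

# Toy usage: 2 utterances x 129 frequency bins, 100 STFT frames, 96-dim features
x = torch.randn(2 * 129, 100, 96)
y = CausalNarrowBandAttention()(x)
print(y.shape)  # torch.Size([258, 100, 96])
```

With this mask the layer can process an audio stream frame by frame by caching past keys and values, although the per-frame cost still grows with stream length; the Retention and Mamba variants described in the abstract instead maintain a recurrent state, giving constant per-frame cost for long streams.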
- Changsheng Quan
- Xiaofei Li