Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers (2403.07675v2)

Published 12 Mar 2024 in cs.SD and eess.AS

Abstract: In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, namely test on signals that are much longer than training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced in https://github.com/Audio-WestlakeU/NBSS.

References (37)
  1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in ICASSP, 2016, pp. 31–35.
  2. D. Yu, M. Kolbaek, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in ICASSP, Mar. 2017, pp. 241–245.
  3. H. Chen, Y. Yang, F. Dang, and P. Zhang, “Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output,” in Interspeech, 2022, pp. 866–870.
  4. Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023.
  5. T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
  6. S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, Apr. 2017.
  7. J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge,” in ASRU, 2015, pp. 444–451.
  8. T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer,” in ICASSP, May 2020, pp. 6384–6388.
  9. R. Gu, S.-X. Zhang, Y. Zou, and D. Yu, “Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 849–862, 2023.
  10. Y. Wang, A. Politis, and T. Virtanen, “Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios,” Dec. 2023. [Online]. Available: http://arxiv.org/abs/2312.10756
  11. T. Ochiai, M. Delcroix, T. Nakatani, and S. Araki, “Mask-Based Neural Beamforming for Moving Speakers With Self-Attention-Based Tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 835–848, 2023.
  12. K. Tesch and T. Gerkmann, “Insights Into Deep Non-Linear Filters for Improved Multi-Channel Speech Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 563–575, 2023.
  13. C. Quan and X. Li, “SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.
  14. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in ICASSP, 2020, pp. 46–50.
  15. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  16. Y. Yang, C. Quan, and X. Li, “MCNET: Fuse Multiple Cues for Multichannel Speech Enhancement,” in ICASSP, Jun. 2023, pp. 1–5.
  17. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.
  18. Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive Network: A Successor to Transformer for Large Language Models,” Aug. 2023. [Online]. Available: http://arxiv.org/abs/2307.08621
  19. A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” Dec. 2023. [Online]. Available: http://arxiv.org/abs/2312.00752
  20. X. Li, L. Girin, S. Gannot, and R. Horaud, “Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1365–1377, Sep. 2019.
  21. T. Yoshioka and T. Nakatani, “Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, Dec. 2012.
  22. T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903–907, Jun. 2019.
  23. S. Winter, W. Kellermann, H. Sawada, and S. Makino, “MAP-Based Underdetermined Blind Source Separation of Convolutive Mixtures by Hierarchical Clustering and l1-Norm Minimization,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, pp. 1–12, Dec. 2006.
  24. C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-end processing for the CHiME-5 dinner party scenario,” in CHiME, Sep. 2018, pp. 35–40.
  25. Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in ICASSP, 2020, pp. 7829–7833.
  26. N. Moritz, T. Hori, and J. Le Roux, “Streaming Automatic Speech Recognition with the Transformer Model,” in ICASSP, May 2020, pp. 6074–6078.
  27. J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with Rotary Position Embedding,” Neurocomputing, vol. 568, p. 127063, Feb. 2024.
  28. Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V. Chaudhary, X. Song, and F. Wei, “A Length-Extrapolatable Transformer,” in ACL, 2023, pp. 14590–14604.
  29. J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in ASRU, Dec. 2015, pp. 504–511.
  30. E. A. Lehmann and A. M. Johansson, “Diffuse Reverberation Model for Efficient Image-Source Simulation of Room Impulse Responses,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1429–1439, Aug. 2010.
  31. D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
  32. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International conference on learning representations, 2019.
  33. J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP, May 2019, pp. 626–630.
  34. A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and Beamforming: All-Neural Causal Beamformer for Multichannel Speech Enhancement,” in ICASSP, May 2022, pp. 6487–6491.
  35. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, 2001, pp. 749–752.
  36. J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
  37. E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
Authors (2)
  1. Changsheng Quan (7 papers)
  2. Xiaofei Li (71 papers)
Citations (18)

Summary

  • The paper introduces an online extension of SpatialNet that efficiently captures spatial cues for both static and moving speakers.
  • It replaces offline self-attention with three streaming variants—masked self-attention, Retention, and Mamba—for linear inference complexity.
  • Experimental results show improved speech enhancement and robust long-term performance through a short-signal training plus long-signal fine-tuning strategy.

Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers

This paper addresses the challenge of multichannel long-term streaming speech enhancement in both static and moving speaker scenarios. The research extends the previously proposed offline SpatialNet into an online model that remains computationally efficient on lengthy audio streams. The core innovation lies in leveraging spatial information to discriminate target speech from interference, for static as well as moving speakers.

SpatialNet utilizes a narrow-band self-attention module to learn the temporal dynamics of spatial vectors. Transitioning to a streaming model, however, necessitates modifications: the authors propose replacing the offline self-attention network with online networks that have linear inference complexity with respect to signal length while retaining the capacity to learn long-term information. Three variants are developed: masked self-attention (MSA), Retention (a self-attention variant with linear inference complexity), and Mamba (a structured state-space-based, RNN-like network).
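
To make the narrow-band idea concrete, the sketch below (illustrative only, not the authors' code; the class name, tensor layout, and hidden dimension are assumptions) shows how a per-frequency temporal module can be applied so that the offline self-attention block is swappable for any causal sequence model mapping (batch, time, feature) to the same shape.

```python
# Illustrative sketch (not the authors' code): a narrow-band temporal block
# that treats each STFT frequency band as an independent sequence over time,
# so the offline self-attention module can be swapped for any causal model.
import torch
import torch.nn as nn

class NarrowBandTemporal(nn.Module):
    """Applies a sequence model along time, independently per frequency band."""
    def __init__(self, dim: int, temporal: nn.Module):
        super().__init__()
        self.temporal = temporal          # e.g. masked self-attention, Retention, or Mamba
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim) -- one feature vector per time-frequency bin
        b, f, t, d = x.shape
        seq = x.reshape(b * f, t, d)      # each frequency band becomes one sequence
        seq = seq + self.temporal(self.norm(seq))  # residual temporal modelling
        return seq.reshape(b, f, t, d)

# Usage: any module mapping (B, T, D) -> (B, T, D) can be plugged in.
block = NarrowBandTemporal(dim=96, temporal=nn.Identity())
y = block(torch.randn(2, 129, 250, 96))  # (batch, freq bins, frames, feature dim)
```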

The methodology also investigates the networks' length extrapolation ability, i.e., their performance on signals much longer than those used for training. This is addressed with a training strategy of short-signal training plus long-signal fine-tuning (ST+LF), which substantially improves length extrapolation within limited training time.
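
As a rough illustration of how such a schedule might be organized, the following sketch outlines a two-stage training loop; the segment lengths, epoch counts, and the helper functions make_loader and train_one_epoch are hypothetical placeholders rather than the paper's actual configuration.

```python
# Illustrative sketch of a short-signal training plus long-signal fine-tuning
# (ST+LF) schedule. Segment lengths, epoch counts, and the helpers
# make_loader / train_one_epoch are hypothetical placeholders.
def train_st_lf(model, optimizer, dataset,
                short_len_s=4.0, long_len_s=60.0,
                short_epochs=100, finetune_epochs=5):
    # Stage 1: train on short segments -- cheap per step, many updates.
    short_loader = make_loader(dataset, segment_seconds=short_len_s)
    for _ in range(short_epochs):
        train_one_epoch(model, optimizer, short_loader)

    # Stage 2: briefly fine-tune on much longer segments so the online
    # network adapts to the statistics (state, positions) of long inputs.
    long_loader = make_loader(dataset, segment_seconds=long_len_s)
    for _ in range(finetune_epochs):
        train_one_epoch(model, optimizer, long_loader)
    return model
```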

Model Variants

  1. Masked Self-Attention (MSA):
    • Uses a time-restricted mask to enable streaming processing: each frame attends only to past frames within a fixed memory window.
  2. Retention:
    • A linearized self-attention variant that compresses past context into a state matrix, enabling efficient recurrent querying with linear inference complexity.
  3. Mamba:
    • An RNN-like network built on a structured, continuous-time state-space formulation with input-dependent (selective) parameters, allowing it to retain and selectively process long-term history compressed in its state (minimal sketches of all three streaming mechanisms follow this list).
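
The following minimal sketches illustrate, under simplifying assumptions, the streaming mechanisms behind the three variants: a time-restricted attention mask, a recurrent Retention-style state update, and a discretized state-space (SSM) step. Shapes are reduced to single vectors per frame, and none of this is the authors' implementation.

```python
# Minimal sketches (illustrative, not the authors' implementation) of the
# three streaming mechanisms; shapes are simplified to single vectors per frame.
import torch

def time_restricted_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean mask letting each frame attend only to the last `window` frames."""
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel <= 0) & (rel > -window)    # causal and inside the memory window

def retention_step(state, q_t, k_t, v_t, gamma=0.95):
    """Recurrent form of Retention: history is compressed into a state matrix."""
    state = gamma * state + k_t.unsqueeze(-1) @ v_t.unsqueeze(0)  # (d_k, d_v)
    out_t = q_t @ state                    # output for the current frame, (d_v,)
    return state, out_t

def ssm_step(state, x_t, A, B, C):
    """One step of a discretized, diagonal state-space recurrence (Mamba-like);
    in Mamba, the parameters additionally depend on the input (selective scan)."""
    state = A * state + B * x_t            # element-wise (diagonal) transition
    out_t = (C * state).sum(-1)            # read out the current output
    return state, out_t
```

At inference time, retention_step and ssm_step are applied frame by frame, so memory and per-frame cost stay constant regardless of stream length, which is what gives the online variants their linear overall complexity.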

Experimental Evaluation

The paper's experiments use simulated datasets featuring both static and moving speaker scenarios. Model performance is evaluated on long audio streams under these varied conditions, and the networks are trained with the ST+LF strategy, which yields strong length extrapolation at limited training cost.

Results show that the proposed online SpatialNet variants notably outperform existing online methods such as McNet and EaBNet by exploiting richer spatial information. Thanks to the improved network architectures and the ST+LF training procedure, the online variants maintain strong performance on inputs far longer than those seen during training.

Implications and Future Work

The implications of this research are pertinent for real-world speech enhancement applications where audio inputs can be lengthy and speakers may move within the acoustic environment. Beyond practical deployment, the work deepens the understanding of the role spatial information plays in speech enhancement and separation.

Future research directions include further optimizing the proposed networks for real-time applications and improving the adaptive mechanisms that govern input selection, especially in dynamic environments with multiple moving speakers. Broader evaluation across diverse acoustic scenarios could further validate and refine the models.

In conclusion, this work contributes to the ongoing development of speech enhancement technologies by advancing the adaptability and computational efficiency of neural networks in processing long-term multichannel audio input.