Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers (2403.07675v2)

Published 12 Mar 2024 in cs.SD and eess.AS

Abstract: In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, namely test on signals that are much longer than training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced in https://github.com/Audio-WestlakeU/NBSS.

References (37)
  1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in ICASSP, 2016, pp. 31–35.
  2. D. Yu, M. Kolbaek, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in ICASSP, Mar. 2017, pp. 241–245.
  3. H. Chen, Y. Yang, F. Dang, and P. Zhang, “Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output,” in Interspeech, 2022, pp. 866–870.
  4. Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “Tf-gridnet: Integrating full- and sub-band modeling for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3221–3236, 2023.
  5. T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
  6. S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, Apr. 2017.
  7. J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge,” in ASRU, 2015, pp. 444–451.
  8. T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer,” in ICASSP, May 2020, pp. 6384–6388.
  9. R. Gu, S.-X. Zhang, Y. Zou, and D. Yu, “Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 849–862, 2023.
  10. Y. Wang, A. Politis, and T. Virtanen, “Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios,” Dec. 2023. [Online]. Available: http://arxiv.org/abs/2312.10756
  11. T. Ochiai, M. Delcroix, T. Nakatani, and S. Araki, “Mask-Based Neural Beamforming for Moving Speakers With Self-Attention-Based Tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 835–848, 2023.
  12. K. Tesch and T. Gerkmann, “Insights Into Deep Non-Linear Filters for Improved Multi-Channel Speech Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 563–575, 2023.
  13. C. Quan and X. Li, “SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.
  14. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in ICASSP, 2020, pp. 46–50.
  15. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  16. Y. Yang, C. Quan, and X. Li, “MCNET: Fuse Multiple Cues for Multichannel Speech Enhancement,” in ICASSP, Jun. 2023, pp. 1–5.
  17. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008.
  18. Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive Network: A Successor to Transformer for Large Language Models,” Aug. 2023. [Online]. Available: http://arxiv.org/abs/2307.08621
  19. A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” Dec. 2023. [Online]. Available: http://arxiv.org/abs/2312.00752
  20. X. Li, L. Girin, S. Gannot, and R. Horaud, “Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1365–1377, Sep. 2019.
  21. T. Yoshioka and T. Nakatani, “Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, Dec. 2012.
  22. T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903–907, Jun. 2019.
  23. S. Winter, W. Kellermann, H. Sawada, and S. Makino, “MAP-Based Underdetermined Blind Source Separation of Convolutive Mixtures by Hierarchical Clustering and l1-Norm Minimization,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, pp. 1–12, Dec. 2006.
  24. C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-end processing for the CHiME-5 dinner party scenario,” in CHiME, Sep. 2018, pp. 35–40.
  25. Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in ICASSP, 2020, pp. 7829–7833.
  26. N. Moritz, T. Hori, and J. Le Roux, “Streaming Automatic Speech Recognition with the Transformer Model,” in ICASSP, May 2020, pp. 6074–6078.
  27. J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with Rotary Position Embedding,” Neurocomputing, vol. 568, p. 127063, Feb. 2024.
  28. Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V. Chaudhary, X. Song, and F. Wei, “A Length-Extrapolatable Transformer,” in ACL, 2023, pp. 14590–14604.
  29. J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in ASRU, Dec. 2015, pp. 504–511.
  30. E. A. Lehmann and A. M. Johansson, “Diffuse Reverberation Model for Efficient Image-Source Simulation of Room Impulse Responses,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1429–1439, Aug. 2010.
  31. D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
  32. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International conference on learning representations, 2019.
  33. J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP, May 2019, pp. 626–630.
  34. A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and Beamforming: All-Neural Causal Beamformer for Multichannel Speech Enhancement,” in ICASSP, May 2022, pp. 6487–6491.
  35. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, 2001, pp. 749–752.
  36. J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
  37. E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
Authors (2)
  1. Changsheng Quan (7 papers)
  2. Xiaofei Li (71 papers)
Citations (18)

Summary

  • The paper introduces an online extension of SpatialNet that efficiently captures spatial cues for both static and moving speakers.
  • It replaces offline self-attention with three streaming variants—masked self-attention, Retention, and Mamba—for linear inference complexity.
  • Experimental results show improved speech enhancement and robust long-term performance through a short-signal training plus long-signal fine-tuning strategy.

Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers

This paper addresses the challenge of multichannel long-term streaming speech enhancement in both static and moving speaker scenarios. The research extends the previously proposed offline SpatialNet into an online model that remains computationally efficient on lengthy audio streams. The core innovation lies in leveraging spatial information to discriminate target speech from interference, for static as well as moving speakers.

SpatialNet utilizes a narrow-band self-attention module to learn the temporal dynamics of spatial vectors. Transitioning to a streaming model, however, necessitates modifications: the authors propose replacing the offline self-attention network with online networks that have linear inference complexity with respect to signal length while retaining the capacity to learn long-term information. Three variants are developed: masked self-attention (MSA), Retention (a self-attention variant with linear inference complexity), and Mamba (a structured state-space-based, RNN-like network).
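
To make the narrow-band idea concrete, the sketch below (illustrative only, not the authors' code; the class name, tensor layout, and hidden dimension are assumptions) shows how a per-frequency temporal module can be applied so that the offline self-attention block is swappable for any causal sequence model mapping (batch, time, feature) to the same shape.

```python
# Illustrative sketch (not the authors' code): a narrow-band temporal block
# that treats each STFT frequency band as an independent sequence over time,
# so the offline self-attention module can be swapped for any causal model.
import torch
import torch.nn as nn

class NarrowBandTemporal(nn.Module):
    """Applies a sequence model along time, independently per frequency band."""
    def __init__(self, dim: int, temporal: nn.Module):
        super().__init__()
        self.temporal = temporal          # e.g. masked self-attention, Retention, or Mamba
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim) -- one feature vector per time-frequency bin
        b, f, t, d = x.shape
        seq = x.reshape(b * f, t, d)      # each frequency band becomes one sequence
        seq = seq + self.temporal(self.norm(seq))  # residual temporal modelling
        return seq.reshape(b, f, t, d)

# Usage: any module mapping (B, T, D) -> (B, T, D) can be plugged in.
block = NarrowBandTemporal(dim=96, temporal=nn.Identity())
y = block(torch.randn(2, 129, 250, 96))  # (batch, freq bins, frames, feature dim)
```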

The methodology also investigates the networks' length extrapolation ability, i.e., their performance on signals much longer than those used for training. This is addressed with a training strategy of short-signal training plus long-signal fine-tuning (ST+LF), which substantially improves length extrapolation within limited training time.
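
As a rough illustration of how such a schedule might be organized, the following sketch outlines a two-stage training loop; the segment lengths, epoch counts, and the helper functions make_loader and train_one_epoch are hypothetical placeholders rather than the paper's actual configuration.

```python
# Illustrative sketch of a short-signal training plus long-signal fine-tuning
# (ST+LF) schedule. Segment lengths, epoch counts, and the helpers
# make_loader / train_one_epoch are hypothetical placeholders.
def train_st_lf(model, optimizer, dataset,
                short_len_s=4.0, long_len_s=60.0,
                short_epochs=100, finetune_epochs=5):
    # Stage 1: train on short segments -- cheap per step, many updates.
    short_loader = make_loader(dataset, segment_seconds=short_len_s)
    for _ in range(short_epochs):
        train_one_epoch(model, optimizer, short_loader)

    # Stage 2: briefly fine-tune on much longer segments so the online
    # network adapts to the statistics (state, positions) of long inputs.
    long_loader = make_loader(dataset, segment_seconds=long_len_s)
    for _ in range(finetune_epochs):
        train_one_epoch(model, optimizer, long_loader)
    return model
```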

Model Variants

  1. Masked Self-Attention (MSA):
    • Uses a time-restricted mask to enable streaming processing: each frame attends only to past frames within a fixed memory window.
  2. Retention:
    • A linearized self-attention variant that compresses past context into a state matrix, enabling efficient recurrent querying with linear inference complexity.
  3. Mamba:
    • An RNN-like network built on a structured, continuous-time state-space formulation with input-dependent (selective) parameters, allowing it to retain and selectively process long-term history compressed in its state (minimal sketches of all three streaming mechanisms follow this list).
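
The following minimal sketches illustrate, under simplifying assumptions, the streaming mechanisms behind the three variants: a time-restricted attention mask, a recurrent Retention-style state update, and a discretized state-space (SSM) step. Shapes are reduced to single vectors per frame, and none of this is the authors' implementation.

```python
# Minimal sketches (illustrative, not the authors' implementation) of the
# three streaming mechanisms; shapes are simplified to single vectors per frame.
import torch

def time_restricted_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean mask letting each frame attend only to the last `window` frames."""
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel <= 0) & (rel > -window)    # causal and inside the memory window

def retention_step(state, q_t, k_t, v_t, gamma=0.95):
    """Recurrent form of Retention: history is compressed into a state matrix."""
    state = gamma * state + k_t.unsqueeze(-1) @ v_t.unsqueeze(0)  # (d_k, d_v)
    out_t = q_t @ state                    # output for the current frame, (d_v,)
    return state, out_t

def ssm_step(state, x_t, A, B, C):
    """One step of a discretized, diagonal state-space recurrence (Mamba-like);
    in Mamba, the parameters additionally depend on the input (selective scan)."""
    state = A * state + B * x_t            # element-wise (diagonal) transition
    out_t = (C * state).sum(-1)            # read out the current output
    return state, out_t
```

At inference time, retention_step and ssm_step are applied frame by frame, so memory and per-frame cost stay constant regardless of stream length, which is what gives the online variants their linear overall complexity.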

Experimental Evaluation

The paper's experiments use simulated datasets featuring both static and moving speaker scenarios. Model performance is evaluated on long audio streams under these varied conditions, and the networks are trained with the ST+LF strategy, which yields strong length extrapolation at limited training cost.

Results show that the proposed online SpatialNet variants notably outperform existing online methods such as McNet and EaBNet by exploiting richer spatial information. Thanks to the improved network architectures and the ST+LF training procedure, the online variants maintain strong performance on inputs far longer than those seen during training.

Implications and Future Work

The implications of this research are pertinent for real-world speech enhancement applications where audio inputs can be lengthy and speakers may move within the acoustic environment. Beyond practical deployment, the work deepens the understanding of the role spatial information plays in speech enhancement and separation.

Future research directions include further optimizing the proposed networks for real-time applications and improving the adaptive mechanisms that govern input selection, especially in dynamic environments with multiple moving speakers. Broader evaluation across diverse acoustic scenarios could further validate and refine the models.

In conclusion, this work contributes to the ongoing development of speech enhancement technologies by advancing the adaptability and computational efficiency of neural networks in processing long-term multichannel audio input.