End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation (1910.14104v3)

Published 30 Oct 2019 in eess.AS, cs.LG, and cs.SD

Abstract: An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, while for deep learning-based end-to-end systems those constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves the separation performance with fixed geometry array configuration, further proving the effectiveness of the proposed paradigm in the general problem of multi-microphone speech separation.

Summary

The paper presents transform-average-concatenate (TAC), a method that makes multi-channel speech separation invariant to microphone permutation and number.
It integrates TAC into the FaSNet system, significantly improving SI-SNRi across both ad-hoc and fixed microphone array configurations.
The approach suits dynamic acoustic settings where the number and placement of microphones change, and it could be combined with other deep learning models.

End-to-End Microphone Permutation and Number Invariant Multi-Channel Speech Separation

The paper addresses a critical challenge in the field of multi-channel speech separation: the need for systems to be invariant to both the permutation and number of microphones. This requirement is fundamental for ad-hoc microphone arrays where the spatial arrangement and number of microphones can vary significantly. Traditional optimization-based beamforming techniques inherently fulfill these constraints, but deep learning-based end-to-end systems often fall short. To bridge this gap, the researchers propose a novel method, transform-average-concatenate (TAC), which serves as a paradigm for channel permutation and number invariant multi-channel speech separation, specifically enhancing the previously developed filter-and-sum network (FaSNet).

Overview of TAC and FaSNet

The TAC method achieves invariance to microphone permutation and number in three steps:

Transform: Each channel's features are transformed by a module shared across all channels.
Average: The transformed features are averaged across channels, a pooling operation that does not depend on channel order or count.
Concatenate: The averaged result is concatenated with each channel's own features and processed again, producing one output per channel in the same order as the inputs.

Because the averaged summary is shared with every channel, each channel's processing can draw on information from all available microphones, and the same module applies unchanged to any number or ordering of microphones.
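The three steps above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`linear_relu`, `make_layer`, `tac`), the layer sizes, the ReLU nonlinearity, and the random weights are all hypothetical stand-ins for learned modules.

```python
import random

def linear_relu(x, weights, bias):
    """Apply y = relu(W x + b) to one feature vector stored as a list."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def make_layer(n_in, n_out, rng):
    """Random weights for illustration only; a real model would learn them."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias

def tac(channels, transform, average, concat):
    """channels: list of per-microphone feature vectors (any count >= 1)."""
    # Transform: the same layer is applied to every channel.
    hidden = [linear_relu(x, *transform) for x in channels]
    # Average: mean over channels, independent of channel order and count.
    mean = [sum(col) / len(hidden) for col in zip(*hidden)]
    pooled = linear_relu(mean, *average)
    # Concatenate: each channel gets its own features plus the shared summary.
    return [linear_relu(h + pooled, *concat) for h in hidden]

# Hypothetical sizes: 3 input features, 4 hidden features, 3 output features.
rng = random.Random(0)
transform = make_layer(3, 4, rng)
average = make_layer(4, 4, rng)
concat = make_layer(8, 3, rng)   # 8 = own hidden features + pooled summary

mics = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
outputs = tac(mics, transform, average, concat)   # one output per microphone
```

Reordering `mics` reorders `outputs` in the same way (up to floating-point rounding), and the same three layers accept any number of channels, which is the invariance property the text describes.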

Applied to FaSNet, an end-to-end time-domain beamforming system, TAC significantly boosts performance. FaSNet uses a two-stage design: the first stage estimates beamforming filters for a reference microphone, and the second stage estimates filters for the remaining channels. Integrating TAC into FaSNet lets every channel's processing use information from all microphones, which improves separation quality.
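For readers unfamiliar with the filter-and-sum operation that gives FaSNet its name, the following minimal sketch filters each channel with its own FIR filter and sums the results. It is a simplification under assumptions: the filter taps here are fixed numbers chosen by hand, whereas FaSNet estimates its filters with a neural network, and the function names are hypothetical.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering, truncated to the length of the input signal."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def filter_and_sum(signals, filters):
    """Filter each channel with its own taps, then sum across channels."""
    if len(signals) != len(filters):
        raise ValueError("need exactly one filter per channel")
    filtered = [fir_filter(x, h) for x, h in zip(signals, filters)]
    return [sum(samples) for samples in zip(*filtered)]

# Two channels: the first passes through, the second is delayed by one sample.
beamformed = filter_and_sum(
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[1.0], [0.0, 1.0]],
)
```

The sum over channels works for any channel count, but in a learned system the invariance depends on how the filters themselves are estimated, which is the part TAC addresses.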

Experimental Evaluation

The paper evaluates the approach in two settings:

Ad-hoc Array Configurations: Performance is assessed with arrays of 2 to 6 microphones, showcasing the versatility of the TAC-enhanced FaSNet. Notably, results indicate that TAC substantially increases the scale-invariant signal-to-noise ratio improvement (SI-SNRi) across all configurations, especially with higher overlap ratios.
Fixed Geometry Arrays: Even in structured setups like a 6-microphone circular array, incorporating TAC into FaSNet provides notable performance gains over both isolated and more traditional concatenated feature approaches. This suggests that TAC's design captures and uses spatial cues efficiently even without explicit geometrical features.
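The SI-SNRi metric mentioned above can be computed as in the sketch below, which follows the commonly used definition of scale-invariant SNR: project the estimate onto the reference, treat the remainder as noise, and report the gain over the unprocessed mixture. The function names and the small `eps` stabilizer are assumptions made for illustration, and the paper's exact evaluation code may differ.

```python
import math

def _zero_mean(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    # Whatever is left after removing the target counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Because the target is obtained by projection, rescaling the estimate leaves the score essentially unchanged, which is what makes the metric scale-invariant.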

Implications and Future Directions

This research advances microphone array processing with a model that is resilient by design to variations in microphone configuration. Introducing TAC into the FaSNet framework not only resolves issues related to channel permutation and number but also improves robustness in noisy, reverberant conditions.

Looking forward, this work suggests several avenues for exploration in audio processing:

Broad Application: The principles behind TAC can potentially extend to other domains requiring sensor permutation and number invariance, such as sensor fusion in robotics or environmental monitoring.
Real-time Implementations: Because TAC handles varying numbers of microphones, it is a natural candidate for real-time systems in dynamic environments such as conferences or smart homes, although latency and computational cost would still need to be assessed.
Integration with Other Deep Learning Models: Further exploration into combining TAC with other neural architectures could yield systems capable of even more sophisticated audio scene understanding and source separation.

In conclusion, the TAC paradigm is a substantial step towards versatile, robust multi-microphone speech separation. Its integration into FaSNet demonstrates its practical utility, and future research could build on these findings to advance both the understanding and the practice of microphone array signal processing.