- The paper introduces a novel dual-path parallel BiLSTM architecture that iteratively refines speech separation for enhanced robustness.
- The model leverages intra- and inter-parallel BiLSTM layers with global context awareness to effectively capture sequence information and reduce separation errors.
- Experiments on the WSJ0-2mix dataset show significant improvements in SDR, SI-SDR, PESQ, and ESTOI, demonstrating the method's effectiveness.
An Analysis of "LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM"
This paper presents an advancement in multi-talker monaural speech separation, a problem central to applications such as robust speech recognition in noisy environments. The authors tackle it with a deep-learning approach built around a dual-path parallel BiLSTM architecture named "LaFurca".
Key Contributions
The paper proposes several BiLSTM-based enhancements for tackling the challenges of monaural speech separation:
- Dual-Path Network Augmentation: A combination of intra-parallel and inter-parallel BiLSTM layers reduces performance variability across network branches, helping the model capture sequence information more robustly (a minimal sketch of this branch-parallel structure follows this list).
- Global Contextual Awareness: The use of global context-aware inter-intra cross-parallel BiLSTM allows the network to capture more extensive contextual information, which is crucial for distinguishing overlapping speech components from a single audio channel.
- Spiral Multi-Stage Refinement: By iteratively refining separation results over multiple stages, the proposed network effectively reduces errors that can occur in earlier stages, thus improving the final output quality.
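To make the branch-parallel idea more concrete, below is a minimal PyTorch sketch of one intra-chunk parallel BiLSTM block, assuming a chunked input of shape (chunks, frames, features). The module names, dimensions, and the residual/normalization details are illustrative assumptions rather than the authors' implementation; the point it shows is averaging several BiLSTM branches that read the same input, which is what dampens branch-to-branch variability.

```python
import torch
import torch.nn as nn


class ParallelBiLSTMBlock(nn.Module):
    """Sketch of one intra-chunk parallel BiLSTM block.

    Several BiLSTM branches process the same chunked features; their
    outputs are averaged to reduce branch-to-branch variability, then
    projected back to the input size and added residually.
    """

    def __init__(self, feat_dim=64, hidden=128, num_branches=2):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
             for _ in range(num_branches)]
        )
        self.proj = nn.Linear(2 * hidden, feat_dim)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, x):  # x: (batch * n_chunks, chunk_len, feat_dim)
        outs = [branch(x)[0] for branch in self.branches]
        fused = torch.stack(outs, dim=0).mean(dim=0)  # average parallel branches
        return self.norm(x + self.proj(fused))        # residual connection


if __name__ == "__main__":
    chunks = torch.randn(8, 100, 64)   # 8 chunks, 100 frames, 64 features
    block = ParallelBiLSTMBlock()
    print(block(chunks).shape)         # torch.Size([8, 100, 64])
```

In a dual-path arrangement, blocks of this kind are typically applied alternately along the intra-chunk and inter-chunk axes by transposing the chunked tensor between passes, which is how local and global sequence structure are both covered.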
Numerical Results
The experiments on the WSJ0-2mix dataset show clear gains for the proposed method: LaFurca achieves a 20.55 dB SDR improvement, a 20.35 dB SI-SDR improvement, a PESQ score of 3.69, and an ESTOI of 94.86%. These results indicate that LaFurca delivers superior separation performance compared with contemporary methods on this benchmark.
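For reference, SI-SDR (the scale-invariant variant of SDR reported above) is commonly computed by projecting the estimate onto the reference signal and comparing the energies of the target and residual components. The NumPy sketch below follows that standard definition; it is not taken from the paper's evaluation code.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for a single-channel signal pair."""
    # Remove DC offsets so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```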
Methodological Implications
The modifications implemented in LaFurca are primarily targeted at enhancing sequence modeling capabilities in the time domain. By integrating various parallel processing techniques and sequence-wise context assimilation, the network can potentially adapt better to dynamic and complex input scenarios. Additionally, these refinements highlight the potential for further exploring iterative improvement strategies in neural network-based sequence processing tasks beyond speech separation.
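As an illustration of how such a multi-stage refinement scheme can be organized, the sketch below chains a few stages so that each one re-estimates the sources from the mixture together with the previous stage's estimates. The toy convolutional stage, the stage count, and the tensor shapes are placeholders for illustration only; in the paper the stages are built from the dual-path parallel BiLSTM blocks described above.

```python
import torch
import torch.nn as nn


class ToyStageSeparator(nn.Module):
    """Stand-in for one separation stage (a small 1-D conv stack here;
    the paper's stages are dual-path parallel BiLSTM networks)."""

    def __init__(self, in_channels, out_channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class IterativeRefiner(nn.Module):
    """Multi-stage refinement: each stage re-estimates the sources from the
    mixture plus the previous estimates, so later stages can correct
    errors introduced earlier."""

    def __init__(self, num_stages=3, num_speakers=2):
        super().__init__()
        self.num_speakers = num_speakers
        self.stages = nn.ModuleList(
            [ToyStageSeparator(1 + num_speakers, num_speakers)
             for _ in range(num_stages)]
        )

    def forward(self, mixture):                              # (batch, 1, samples)
        estimates = mixture.repeat(1, self.num_speakers, 1)  # crude initial guess
        for stage in self.stages:
            estimates = stage(torch.cat([mixture, estimates], dim=1))
        return estimates                                     # (batch, speakers, samples)


if __name__ == "__main__":
    mix = torch.randn(4, 1, 16000)          # 1-second mixtures at 16 kHz
    print(IterativeRefiner()(mix).shape)    # torch.Size([4, 2, 16000])
```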
Practical Implications
In practical applications, the ability to extract clean signals from audio mixtures can improve downstream tasks such as speech recognition and speaker verification, especially in the challenging acoustic environments typical of real-world settings.
Speculation on Future Directions
The introduction of sophisticated architectures such as LaFurca may spur further work on multi-level, context-aware models in audio processing and beyond. Future research could scale the approach to more complex mixtures, involving more than two speakers or the background noise common in real-world audio scenes. Integration with other end-to-end learning pipelines also appears feasible, offering potential gains across audio-processing applications.
In conclusion, this paper presents robust enhancements to BiLSTM-based models for monaural speech separation, yielding measurable improvements on a benchmark dataset. By combining ensemble-style parallel branches with context-aware strategies, the work stands as a meaningful contribution to the evolution of neural network architectures for sequence learning tasks.