Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
The paper "Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation" introduces an innovative neural network architecture aimed at enhancing monaural speech separation. The authors propose a Dual-Path Transformer Network (DPTNet) that addresses a limitation of conventional recurrent and convolutional networks: RNNs must pass information through intermediate states, and CNNs are bounded by their receptive fields, so neither allows distant sequence elements to interact directly.
Core Contributions
- Direct Context-Aware Modeling: The primary innovation of the DPTNet lies in its ability to facilitate direct context-aware interactions between elements in speech sequences. This is achieved by leveraging an improved transformer architecture that integrates recurrent neural network components, thereby obviating the need for traditional positional encodings.
- Efficient Long Sequence Modeling: By employing a dual-path structure, DPTNet efficiently handles extremely long speech sequences, for which standard transformers are impractical because the cost of self-attention grows quadratically with sequence length.
- Empirical Validation: DPTNet demonstrates superior performance over state-of-the-art models, achieving a signal-to-distortion ratio (SDR) of 20.6 dB on the widely used WSJ0-2mix dataset. This represents a significant improvement over previous methods, highlighting the effectiveness of direct context-aware modeling.
Technical Approach
The technical contribution of DPTNet involves the incorporation of both local and global sequence modeling via a novel dual-path mechanism. The dual-path structure consists of intra-transformers for local chunk processing and inter-transformers for learning global dependencies across chunks. Alternating the two lets every element interact with every other, directly within a chunk and via inter-chunk attention across chunks, while each individual attention operation runs over a short sequence and so remains computationally efficient.
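The chunking behind the two paths can be illustrated with a small, self-contained sketch (plain Python with a hypothetical `segment` helper; the real model operates on learned encoder feature maps rather than raw sample lists):

```python
def segment(seq, K):
    """Split a sequence into 50%-overlapping chunks of length K, zero-padding the tail."""
    P = K // 2  # hop size
    seq = list(seq)
    while len(seq) < K or (len(seq) - K) % P != 0:
        seq.append(0)
    return [seq[i:i + K] for i in range(0, len(seq) - K + 1, P)]

chunks = segment(range(10), K=4)  # intra-transformer input: each chunk of length K
inter = list(zip(*chunks))        # inter-transformer input: one sequence per within-chunk position
# Attention inside one chunk costs O(K^2); attention across chunks costs O(S^2),
# where S is the number of chunks, so each path is far cheaper than O(L^2)
# self-attention over the full sequence of length L.
```

Stacking several such intra/inter pairs gives every element a short, direct path to every other element, which is the "direct context-aware modeling" the paper refers to.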
One intriguing aspect of this model is its deviation from conventional transformer input handling. By replacing the first linear layer of the transformer's position-wise feed-forward network with a recurrent layer, DPTNet effectively learns order information without relying on positional encodings, sidestepping the potential instability these encodings can introduce during model training.
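The intuition can be checked with a toy comparison (a pure-Python sketch, not the paper's actual recurrent layer): a recurrence distinguishes permutations of the same inputs, whereas a permutation-invariant pooling, standing in here for attention with no positional signal, cannot.

```python
import math

def rnn_encode(seq, w_h=0.5, w_x=0.8, b=0.1):
    """Tiny Elman-style recurrence: the final state depends on input order."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_h * h + w_x * x + b)
    return h

def pooled_encode(seq):
    """Permutation-invariant pooling: a stand-in for position-free attention."""
    return sum(seq) / len(seq)

fwd = [1.0, 2.0, 3.0]
rev = [3.0, 2.0, 1.0]
print(pooled_encode(fwd) == pooled_encode(rev))  # → True: order information lost
print(rnn_encode(fwd) == rnn_encode(rev))        # → False: order information preserved
```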
Experimental Findings
Experiments were conducted on the standard WSJ0-2mix and LS-2mix benchmarks, with DPTNet evaluated against established baselines such as Conv-TasNet and DPRNN. The results convincingly indicate that DPTNet outperforms these methods in terms of SI-SNR and SDR across both datasets while maintaining a compact model size. This success can be attributed to the enhanced modeling of sequential relationships within the input data, facilitated by the direct context-aware mechanism.
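For reference, the SI-SNR metric used in these evaluations projects the estimate onto the reference signal and measures the residual energy. A minimal pure-Python version (the `si_snr` name is ours; equal-length, nonsilent signals are assumed) might look like:

```python
import math

def si_snr(est, ref):
    """Scale-invariant signal-to-noise ratio in dB between estimate and reference."""
    # remove the mean from both signals
    est = [e - sum(est) / len(est) for e in est]
    ref = [r - sum(ref) / len(ref) for r in ref]
    # project the estimate onto the reference to isolate the "target" component
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    s_target = [scale * r for r in ref]
    e_noise = [e - t for e, t in zip(est, s_target)]
    return 10 * math.log10(
        sum(t * t for t in s_target) / sum(n * n for n in e_noise)
    )

ref = [1.0, -1.0, 1.0, -1.0]
est = [1.1, -0.9, 1.0, -1.0]  # slightly noisy estimate
print(round(si_snr(est, ref), 2))  # → 26.02
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant.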
Implications and Future Work
The practical implications of this research are significant for applications requiring robust speech separation in varied acoustic environments. The ability of the DPTNet to separate clean speech from complex acoustic mixtures enhances automatic speech recognition systems and other audio signal processing applications.
Future directions for this research may include exploring the removal of the dual-path structure to develop a framework capable of directly modeling long sequences without architectural partitioning. This could potentially lead to further improvements in separation efficiency and efficacy.
In conclusion, the Dual-Path Transformer Network presents a compelling advancement in the field of monaural speech separation, setting a new performance benchmark while paving the way for future explorations into more efficient and effective speech sequence processing methodologies.