
Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation (1910.06379v2)

Published 14 Oct 2019 in eess.AS, cs.LG, and cs.SD

Abstract: Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we propose the dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively, where the input length in each operation can be made proportional to the square root of the original sequence length. Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.

Citations (701)

Summary

  • The paper introduces DPRNN as a novel dual-path strategy that divides long sequences into chunks for effective intra- and inter-chunk processing.
  • It employs intra- and inter-chunk RNNs to capture local and global dependencies, reducing the effective input length of each RNN to be proportional to the square root of the original sequence length.
  • Experimental results on WSJ0-2mix show an 18.8 dB SI-SNR improvement with a model 20 times smaller than previous systems, setting a new state-of-the-art.

Dual-path RNN: Efficient Long Sequence Modeling for Time-domain Single-channel Speech Separation

The paper "Dual-path RNN: Efficient Long Sequence Modeling for Time-domain Single-channel Speech Separation" introduces a novel approach to address the challenges associated with modeling extremely long input sequences in time-domain speech separation. The authors propose the Dual-path Recurrent Neural Network (DPRNN), a structured integration of recurrent neural network (RNN) layers designed to enhance sequence modeling by utilizing both local and global dependencies within audio data.

Core Contributions

The primary contribution of this paper is the introduction of DPRNN, a new architectural framework that efficiently models long sequences by exploiting a dual-path strategy. This involves dividing the input sequence into smaller chunks and applying recurrent operations both within each chunk and across all chunks. The DPRNN architecture is specifically designed to overcome limitations of conventional RNNs and one-dimensional convolutional neural networks (1-D CNNs), which struggle with optimization and effective sequence-level dependency modeling when dealing with extensive temporal data.
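As a concrete illustration of the chunking step, a minimal PyTorch sketch is shown below. The function name `segment`, the tensor layout, and the 50% overlap follow the paper's description, but the code is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def segment(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Split a long sequence into 50%-overlapping chunks.

    x: [batch, feat, length]  ->  [batch, feat, chunk_size, n_chunks]
    """
    hop = chunk_size // 2
    length = x.shape[-1]
    # Zero-pad so that (length - chunk_size) is a multiple of the hop size.
    remainder = (length - chunk_size) % hop
    if remainder:
        x = F.pad(x, (0, hop - remainder))
    chunks = x.unfold(2, chunk_size, hop)           # [batch, feat, n_chunks, chunk_size]
    return chunks.permute(0, 1, 3, 2).contiguous()  # [batch, feat, chunk_size, n_chunks]
```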

Technical Details

DPRNN utilizes two types of RNNs: intra-chunk RNNs that process data within individual chunks, and inter-chunk RNNs that aggregate information globally across chunks. This reduces the effective input length of each RNN to a sublinear form: with 50% chunk overlap, a sequence of length L split into chunks of size K yields roughly S ≈ 2L/K chunks, so choosing K ≈ √(2L) makes both the intra-chunk length K and the number of chunks S proportional to √L, minimizing their sum and easing optimization. At the same time, the alternating intra- and inter-chunk passes allow the model to fully exploit the sequence-level dependencies that are critical for superior speech separation.
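A minimal PyTorch sketch of one dual-path block follows, assuming bidirectional LSTMs with linear projections, global layer normalization, and residual connections as described in the paper; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """One dual-path block: an intra-chunk BLSTM over the chunk axis, then an
    inter-chunk BLSTM over the chunk-index axis, each followed by a linear
    projection, normalization, and a residual connection."""

    def __init__(self, feat: int, hidden: int):
        super().__init__()
        self.intra_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat)
        self.intra_norm = nn.GroupNorm(1, feat)  # global layer norm over features
        self.inter_rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat)
        self.inter_norm = nn.GroupNorm(1, feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, feat, chunk_size K, n_chunks S]
        b, f, k, s = x.shape
        # Intra-chunk pass: each chunk is a short sequence of length K.
        intra = x.permute(0, 3, 2, 1).reshape(b * s, k, f)
        intra, _ = self.intra_rnn(intra)
        intra = self.intra_proj(intra)
        intra = intra.reshape(b, s, k, f).permute(0, 3, 2, 1)
        x = x + self.intra_norm(intra)  # residual connection
        # Inter-chunk pass: a sequence of length S across chunks, one per position.
        inter = x.permute(0, 2, 3, 1).reshape(b * k, s, f)
        inter, _ = self.inter_rnn(inter)
        inter = self.inter_proj(inter)
        inter = inter.reshape(b, k, s, f).permute(0, 3, 1, 2)
        return x + self.inter_norm(inter)
```

In the full model, several such blocks are stacked, alternating intra- and inter-chunk passes so that every output element can depend on the entire input sequence.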

By employing this dual-path strategy, DPRNN circumvents the limitations of conventional methods such as temporal convolutional networks (TCNs), whose fixed receptive fields restrict them to strictly local modeling. The intra-chunk operations model local dependencies effectively, while the inter-chunk operations capture global dependencies, substantially enhancing separation performance.
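After the final dual-path block, the chunked representation is converted back to a full-length sequence by overlap-add. A minimal sketch acting as the inverse of the `segment` function above (again illustrative, not the authors' code):

```python
import torch

def overlap_add(chunks: torch.Tensor, length: int) -> torch.Tensor:
    """Fold [batch, feat, chunk_size, n_chunks] back into [batch, feat, length]."""
    b, f, k, s = chunks.shape
    hop = k // 2
    out = chunks.new_zeros(b, f, (s - 1) * hop + k)
    for i in range(s):
        # Sum each chunk into its original (50%-overlapping) position.
        out[..., i * hop : i * hop + k] += chunks[..., i]
    return out[..., :length]
```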

Experimental Results

The authors present extensive evaluations on the WSJ0-2mix dataset, demonstrating that the DPRNN-TasNet configuration achieves a new state of the art with an SI-SNR improvement of 18.8 dB. Notably, this performance is achieved with a model 20 times smaller than the previous leading system, FurcaNeXt. Furthermore, performing separation at the sample level (with very small encoder windows) yields the best results, underscoring the practicality and efficiency of DPRNN in this demanding regime.

The paper also evaluates DPRNN in noisy, reverberant environments, showing that it maintains high signal fidelity and speech recognition accuracy. Compared to TCN-based models, DPRNN significantly improves SI-SNR and reduces the word error rate (WER) under these challenging conditions.
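For reference, SI-SNR is the standard scale-invariant signal-to-noise ratio used to score separation quality. A minimal implementation might look like the following sketch (the function name and argument conventions are illustrative, not tied to the authors' code):

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB over the last dimension (higher is better)."""
    # Remove the mean so the metric ignores DC offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scaled reference signal.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))
```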

Implications and Future Directions

The DPRNN architecture presents significant implications for the field of speech separation and potentially other domains requiring efficient modeling of long sequences. By demonstrating considerable gains in performance with diminished model complexities, DPRNN sets a precedent for future advancements in sequence modeling techniques.

The approach underscores the importance of hierarchical processing within networks for balancing local and global sequence information, suggesting similar enhancements for other RNN-based applications. Future research could adapt DPRNN to broader AI tasks, apply it to other time-series problems, and refine the model for on-device deployment given its reduced computational demands.

In conclusion, this paper contributes a valuable method for efficient long-sequence modeling, with substantial empirical evidence supporting its utility in audio-related tasks. The dual-path strategy warrants further exploration and could catalyze advances in recurrent network architectures and efficient sequence processing.