- The paper presents a novel deep complex convolution recurrent network (DCCRN) that integrates complex-valued layers for accurate modeling of both magnitude and phase.
- It combines convolutional encoder-decoders with complex LSTM layers to effectively capture temporal dependencies and separate speech from noise, achieving high PESQ scores.
- The model uses a complex ratio mask approach with SI-SNR loss, demonstrating competitive performance and low computational complexity in real-world noisy conditions.
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
The paper "DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement" introduces the Deep Complex Convolution Recurrent Network (DCCRN), specifically designed for single-channel speech enhancement. DCCRN aims to improve the perceptual quality and intelligibility of speech by leveraging complex-valued operations in both convolutional and recurrence-based architectures.
Introduction to Speech Enhancement Challenges
Speech enhancement is crucial for improving the quality and intelligibility of speech signals, especially when they are corrupted by noise. The task is particularly important for applications such as automatic speech recognition (ASR), where noise leads to performance degradation. Traditional time-frequency (TF) methods estimate TF masks or the speech spectrum but often neglect phase information, which limits the quality of the enhanced speech.
DCCRN Architecture
The proposed DCCRN combines the strengths of a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM) networks to process complex-valued spectrograms. Both the convolutional and recurrent layers use complex-valued operations, which are crucial for handling phase information in speech signals.
Figure 1: DCCRN network.
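The sketch below illustrates this data flow in PyTorch: the noisy waveform is transformed to a complex spectrogram, the real and imaginary parts are stacked as channels, an encoder-LSTM-decoder predicts a complex ratio mask, and the enhanced waveform is resynthesized. This is a minimal, hypothetical sketch: layer sizes are placeholders and plain real-valued layers stand in for the complex blocks described in the next sections, so it is not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyDCCRN(nn.Module):
    """Illustrative encoder-LSTM-decoder data flow; not the paper's configuration."""
    def __init__(self, n_fft=512, hop=128, hidden=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        freq = n_fft // 2 + 1
        # Real-valued stand-ins that keep the (F, T) shape; DCCRN uses complex blocks here.
        self.encoder = nn.Conv2d(2, 16, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(16 * freq, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 16 * freq)
        self.decoder = nn.Conv2d(16, 2, kernel_size=3, padding=1)

    def forward(self, wav):
        window = torch.hann_window(self.n_fft)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window,
                          return_complex=True)                     # (B, F, T) complex
        x = torch.stack([spec.real, spec.imag], dim=1)             # (B, 2, F, T)
        x = torch.relu(self.encoder(x))                            # (B, 16, F, T)
        b, c, f, t = x.shape
        y, _ = self.rnn(x.permute(0, 3, 1, 2).reshape(b, t, c * f))  # frame-by-frame LSTM
        x = self.proj(y).reshape(b, t, c, f).permute(0, 2, 3, 1)     # back to (B, 16, F, T)
        m = self.decoder(x)                                           # (B, 2, F, T): CRM estimate
        est = torch.complex(m[:, 0], m[:, 1]) * spec                  # apply the complex mask
        return torch.istft(est, self.n_fft, self.hop, window=window,
                           length=wav.shape[-1])

noisy = torch.randn(1, 16000)             # 1 s of noisy audio at 16 kHz
enhanced = TinyDCCRN()(noisy)             # enhanced waveform, same length as the input
```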
Complex Convolution and Recurrent Network
The DCCRN architecture employs complex convolutional layers as part of its encoder and decoder modules. These layers consist of complex-valued filters and use complex batch normalization, adhering to complex multiplication rules. This enables more accurate modeling of the correlation between the magnitude and phase of speech signals.

Figure 2: Complex module.
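Concretely, a complex convolution can be built from two real-valued convolutions applied to the real and imaginary parts and combined with the complex multiplication rule (Xr + jXi)(Wr + jWi) = (XrWr - XiWi) + j(XrWi + XiWr). Below is a minimal PyTorch-style sketch; the class name and layer sizes are illustrative, and complex batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution from two real convs, following
    (Xr + jXi)(Wr + jWi) = (Xr*Wr - Xi*Wi) + j(Xr*Wi + Xi*Wr)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # real-part filters
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # imaginary-part filters

    def forward(self, x_r, x_i):
        out_r = self.conv_r(x_r) - self.conv_i(x_i)   # real part of the complex product
        out_i = self.conv_r(x_i) + self.conv_i(x_r)   # imaginary part of the complex product
        return out_r, out_i

# Example: real and imaginary spectrogram channels (257 frequency bins, 100 frames).
xr, xi = torch.randn(1, 1, 257, 100), torch.randn(1, 1, 257, 100)
yr, yi = ComplexConv2d(1, 16, kernel_size=(5, 2), stride=(2, 1), padding=(2, 0))(xr, xi)
```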
Integrating LSTM
Additionally, DCCRN uses complex LSTM layers between the encoder and decoder to capture temporal dependencies, strengthening its ability to separate speech from noise. Mirroring the convolutions, the complex LSTM combines a real and an imaginary recurrent path according to the complex multiplication rule, so both magnitude and phase information flow through the recurrent layers (see the sketch below).
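Assuming the complex LSTM follows the same combination rule as the complex convolutions, it can be sketched with two real-valued LSTMs whose outputs are recombined into real and imaginary parts; the class name and sizes below are illustrative.

```python
import torch
import torch.nn as nn

class ComplexLSTM(nn.Module):
    """Complex LSTM: a real and an imaginary LSTM combined with the
    same complex multiplication rule used for the convolutions."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm_r = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.lstm_i = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, x_r, x_i):
        rr, _ = self.lstm_r(x_r)   # real path applied to the real part
        ir, _ = self.lstm_r(x_i)   # real path applied to the imaginary part
        ri, _ = self.lstm_i(x_r)   # imaginary path applied to the real part
        ii, _ = self.lstm_i(x_i)   # imaginary path applied to the imaginary part
        return rr - ii, ri + ir    # (real, imaginary) outputs

# Example: bottleneck features with 100 frames and 256 features per frame.
xr, xi = torch.randn(1, 100, 256), torch.randn(1, 100, 256)
yr, yi = ComplexLSTM(256, 128)(xr, xi)
```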
Training Targets and Loss Function
DCCRN is trained to estimate a complex ratio mask (CRM), which enhances the real and imaginary components of the noisy spectrogram jointly. The mask is learned through signal approximation (SA): the enhanced waveform is reconstructed and the network is optimized with a scale-invariant signal-to-noise ratio (SI-SNR) loss, which has largely replaced mean square error (MSE) as a training objective in speech enhancement and separation.
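SI-SNR is computed on the time-domain signals after resynthesis: the estimate is projected onto the clean target, and the ratio between the projected (target) energy and the residual (noise) energy is measured in decibels. A compact PyTorch version of this loss could look like the following (the function name and epsilon value are illustrative):

```python
import torch

def si_snr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR averaged over the batch; shapes (B, T)."""
    # Zero-mean both signals so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = <s_hat, s> * s / ||s||^2
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()   # minimize the negative SI-SNR
```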
Experimental Results
The paper reports comprehensive experiments conducted on simulated datasets from WSJ0 and real-world data from the Interspeech 2020 DNS Challenge. The results demonstrate that DCCRN achieves higher Perceptual Evaluation of Speech Quality (PESQ) scores compared to baseline models, including LSTM and CRN.
On the DNS Challenge dataset, DCCRN, with only 3.7 million parameters, performed competitively against other state-of-the-art networks while keeping computational complexity low. Notably, DCCRN achieved the best Mean Opinion Score (MOS) in the challenge's real-time track.
Conclusion
The DCCRN model represents a significant advancement in complex-valued speech enhancement networks. By efficiently integrating complex convolutional and recurrent layers, DCCRN addresses both magnitude and phase modeling, leading to improved performance. Future work may focus on deploying DCCRN in resource-constrained environments, enhancing speech quality in reverberant conditions, or further optimizing its structure for edge devices.