
FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement (2206.07293v3)

Published 15 Jun 2022 in cs.SD and eess.AS

Abstract: Convolutional recurrent networks (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure have achieved promising performance for monaural speech enhancement. However, feature representation across frequency context is highly constrained due to limited receptive fields in the convolutions of CED. In this paper, we propose a convolutional recurrent encoder-decoder (CRED) structure to boost feature representation along the frequency axis. The CRED applies frequency recurrence on 3D convolutional feature maps along the frequency axis following each convolution, therefore, it is capable of catching long-range frequency correlations and enhancing feature representations of speech inputs. The proposed frequency recurrence is realized efficiently using a feedforward sequential memory network (FSMN). Besides the CRED, we insert two stacked FSMN layers between the encoder and the decoder to model further temporal dynamics. We name the proposed framework as Frequency Recurrent CRN (FRCRN). We design FRCRN to predict complex Ideal Ratio Mask (cIRM) in complex-valued domain and optimize FRCRN using both time-frequency-domain and time-domain losses. Our proposed approach achieved state-of-the-art performance on wideband benchmark datasets and achieved 2nd place for the real-time fullband track in terms of Mean Opinion Score (MOS) and Word Accuracy (WAcc) in the ICASSP 2022 Deep Noise Suppression (DNS) challenge (https://github.com/modelscope/ClearerVoice-Studio).

Citations (68)

Summary

  • The paper introduces a new frequency recurrence mechanism to significantly improve feature representation in monaural speech enhancement.
  • It leverages 3D convolutional feature maps and FSMN layers to capture long-range frequency dependencies, resulting in enhanced PESQ, STOI, and SI-SNR scores.
  • The model shows state-of-the-art performance on DNS-2020, VoiceBank+Demand, and WSJ0 benchmarks, paving the way for robust speech recognition and noise suppression systems.

An Examination of FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

The paper "FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement" introduces a convolutional recurrent encoder-decoder (CRED) structure aimed at advancing monaural speech enhancement. To address the limited feature representation across frequency contexts in conventional convolutional recurrent networks (CRNs), the authors propose frequency recurrence, which enriches how the network represents speech inputs along the frequency axis.

Methodological Innovation

The core innovation of this paper is the Frequency Recurrent CRN (FRCRN), which applies frequency recurrence to the 3D convolutional feature maps along the frequency axis after each convolution in the CRED. This modification allows FRCRN to capture long-range frequency correlations, significantly improving the feature representations of monaural speech. The frequency recurrence is implemented efficiently using feedforward sequential memory networks (FSMNs), and the network operates in the complex-valued domain to predict complex Ideal Ratio Masks (cIRM).
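As a rough intuition for the mechanism, frequency recurrence can be pictured as an FSMN-style memory block sliding over the frequency bins of a convolutional feature map, so each bin absorbs context from distant bins. The sketch below is a simplified NumPy illustration, not the authors' implementation: the `taps` coefficients stand in for learned memory weights, and only lower-frequency (one-directional) context is aggregated for brevity.

```python
import numpy as np

def fsmn_frequency_memory(x, taps, stride=1):
    """FSMN-style memory block applied along the frequency axis.

    x:    (time, freq, channels) feature map from a conv layer.
    taps: (order, channels) memory coefficients; learned in a real
          model, supplied as placeholders here.

    Each frequency bin is augmented with a weighted sum of its
    lower-frequency context, extending the effective receptive
    field along frequency without extra convolutions.
    """
    order = taps.shape[0]
    out = x.copy()
    for i in range(1, order + 1):
        shifted = np.zeros_like(x)
        # shift features up by i*stride bins: bin f sees bin f - i*stride
        shifted[:, i * stride:, :] = x[:, :-i * stride, :]
        out += shifted * taps[i - 1]
    return out
```

With all taps at zero the block is the identity, which matches the residual "memory" reading of FSMN: the recurrence only adds context on top of the local convolutional features.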

Architectural Overview

The FRCRN consists of an encoder-decoder structure augmented by frequency recurrence and time-recurrent layers. Two stacked FSMN layers extend these capabilities by modeling temporal dynamics between the encoder and decoder. The encoder distills high-level features, while the decoder reconstructs the target speech signal. Together, these components let FRCRN predict the mask accurately and enhance speech quality by strengthening feature extraction along the frequency axis.
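To make the cIRM training target concrete, the following sketch (our own simplified illustration, not the paper's code) computes the ideal complex ratio mask from a clean/noisy STFT pair and applies it by complex multiplication. In FRCRN the decoder predicts such a mask from the noisy input alone; the clean reference is only used here to show what the ideal mask recovers.

```python
import numpy as np

def ideal_cirm(clean_stft, noisy_stft, eps=1e-12):
    """Ideal complex ratio mask M = S / Y in the complex domain.

    clean_stft, noisy_stft: complex (time, freq) spectrograms.
    Written as S * conj(Y) / |Y|^2 for numerical stability.
    """
    denom = np.abs(noisy_stft) ** 2 + eps
    return clean_stft * np.conj(noisy_stft) / denom

def apply_cirm(mask, noisy_stft):
    """Enhancement is complex multiplication: S_hat = M * Y."""
    return mask * noisy_stft
```

Because the mask is complex-valued, it corrects both magnitude and phase, which is why masking the noisy STFT with the ideal cIRM recovers the clean STFT essentially exactly.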

Empirical Validation and Results

In its empirical investigations, the FRCRN model demonstrated state-of-the-art results across established benchmarks. On the DNS-2020 and VoiceBank+Demand datasets, the FRCRN showed superior performance metrics compared to existing solutions. For example, in an ablation study on the WSJ0 dataset, several configurations were tested, and the FRCRN consistently outperformed the baseline models in PESQ, STOI, and SI-SNR scores. The model also placed second in the real-time fullband track of the ICASSP 2022 Deep Noise Suppression (DNS) challenge in terms of mean opinion score (MOS) and word accuracy (WAcc), confirming its robustness under realistic conditions.
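SI-SNR, one of the metrics above and a common time-domain training objective, has a standard formulation that can be sketched as follows; this is an illustrative implementation, not necessarily the exact loss used in the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Projects the estimate onto the target to get the target-aligned
    component, then measures the ratio of its energy to the residual.
    Zero-mean normalization makes the score offset-invariant.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # component of the estimate lying along the target direction
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because only the direction of the target matters, rescaling the estimate leaves the score unchanged, which is what makes the metric "scale-invariant" and suitable as a time-domain loss alongside time-frequency losses.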

Theoretical and Practical Implications

The introduction of frequency recurrence facilitates improved capture of long-range frequency dependencies, which could set a precedent for future speech enhancement models. By aligning complex domain operations with modern speech enhancement requirements, FRCRN overcomes limitations of previous architectures. This has practical implications for real-world applications, such as more accurate speech recognition systems and enhanced communication clarity in noisy environments.

Future Directions

Looking ahead, the methods outlined in this paper offer a promising framework for further research in AI-enhanced speech processing. The deployment of deeper and potentially even more complex networks using frequency recurrence mechanisms could open pathways to explore enhanced performance in more challenging audio environments. Additionally, further efforts in optimizing these architectures for computational efficiency might broaden their applicability in resource-constrained scenarios.

In summary, the paper presents significant advancements by refining convolutional recurrent networks with frequency recurrence, resulting in stronger feature representation and enhancement outcomes. It establishes a benchmark for both performance and methodological innovation in the area of monaural speech enhancement, with broad implications for both academic inquiry and practical applications in noise-suppressed communications and speech recognition technologies.
