FullSubNet: A Fusion Model for Real-Time Single-Channel Speech Enhancement
The paper introduces FullSubNet, a fusion model for real-time single-channel speech enhancement that combines the strengths of full-band and sub-band models to address the shortcomings of each approach used in isolation. Full-band models capture the global spectral context and long-distance cross-band dependencies, but are weaker at modeling signal stationarity and attending to local spectral patterns. Sub-band models handle these local cues well but are blind to full-band spectral information and cross-band dependencies. FullSubNet therefore connects the two models sequentially and trains them jointly, synthesizing their individual advantages.
Methodology Overview
FullSubNet's processing pipeline chains two primary components: a pure full-band model and a pure sub-band model. The full-band model takes the magnitude spectral features as input and outputs a spectral embedding, which is passed to the sub-band model together with sub-band units formed from each target frequency and its adjacent frequency bins. Supplying the full-band embedding alongside each sub-band unit gives the sub-band model the global spectral context it would otherwise lack, strengthening its ability to distinguish speech from stationary noise by exploiting signal stationarity. The complex Ideal Ratio Mask (cIRM) serves as the learning target, allowing the model to enhance both magnitude and phase while keeping the computational cost low enough for real-time applications.
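Two parts of this pipeline can be sketched concretely: how sub-band units are assembled around each target frequency, and how the cIRM target is computed per time-frequency bin. The sketch below is illustrative only; the function names, the edge handling (simple clamping rather than the paper's exact padding), and the compression constants K and C (taken from Williamson et al.'s compressed cIRM) are assumptions, not the authors' implementation.

```python
import math

def make_subband_units(spectrogram, n_neighbors):
    """Assemble one sub-band unit per frequency: the target bin's magnitude
    series plus n_neighbors bins on each side. Edge bins are clamped here;
    the paper's exact boundary handling may differ."""
    num_freqs = len(spectrogram)
    units = []
    for f in range(num_freqs):
        unit = []
        for offset in range(-n_neighbors, n_neighbors + 1):
            idx = min(max(f + offset, 0), num_freqs - 1)  # clamp at spectrum edges
            unit.append(spectrogram[idx])
        units.append(unit)
    return units

def cirm(noisy, clean, k=10.0, c=0.1, compressed=True):
    """complex Ideal Ratio Mask for one T-F bin: M = S / Y in complex
    arithmetic, optionally squashed by a hyperbolic-tangent-style
    compression (K=10, C=0.1 following Williamson et al.; assumed here)."""
    m = clean / noisy  # complex division captures both magnitude and phase
    if not compressed:
        return m
    squash = lambda x: k * (1.0 - math.exp(-c * x)) / (1.0 + math.exp(-c * x))
    return complex(squash(m.real), squash(m.imag))
```

Applying the uncompressed mask to the noisy bin recovers the clean bin exactly (`cirm(y, s, compressed=False) * y == s`), which is what makes the cIRM an "ideal" target; the compression merely bounds the regression range for training.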
Experimental Evaluation
Extensive evaluation on the INTERSPEECH 2020 DNS Challenge dataset confirms the efficacy of FullSubNet. The fusion model achieved superior scores across several speech quality measures, including WB-PESQ, NB-PESQ, STOI, and SI-SDR, and notably outperformed top-ranked methods from the DNS Challenge, demonstrating that it effectively integrates full-band and sub-band information. The results also affirm that the local spectral patterns and signal stationarity exploited by sub-band models provide complementary information that full-band models alone do not capture.
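Of the reported metrics, SI-SDR is straightforward to state precisely: the estimate is compared against an optimally scaled copy of the clean reference, so the score is invariant to the estimate's overall gain. A minimal reference implementation (the function name and plain-list signals are illustrative, not tied to the paper's code):

```python
import math

def si_sdr(estimate, target):
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy                # optimal scaling of the reference
    scaled = [alpha * t for t in target]       # projection of estimate onto target
    noise = [e - s for e, s in zip(estimate, scaled)]
    signal_energy = sum(x * x for x in scaled)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Because of the optimal scaling step, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.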
Implications and Future Directions
The introduction of FullSubNet represents a progressive step in speech enhancement models, proving the merit of hybrid approaches in consolidating the advantages of distinct model architectures. This foundational advancement offers valuable implications for real-world applications, potentially improving practical systems like hearing aids, telecommunications, and voice-activated technologies where real-time processing and high fidelity are critical.
Looking forward, the development of FullSubNet opens avenues for further research, especially concerning model temporal dynamics, scalability, and the exploration of parallel architectures to minimize latency. Future work could investigate whether the sub-band and full-band fusion strategy transfers to multi-channel speech processing, potentially benefiting microphone array applications or sound field reproduction systems.
In conclusion, while FullSubNet demonstrates strong speech enhancement performance, the search for further methodological improvements in noise suppression and signal processing continues, with FullSubNet serving as a notable reference point in this domain.