
DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering (2110.05588v2)

Published 11 Oct 2021 in eess.AS, cs.LG, and eess.SP

Abstract: Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This allows to incorporate information from previous and future time steps exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. Additionally to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low complexity architecture. We further show that our two stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies and demonstrate convincing performance compared to other state-of-the-art models.

Authors (4)
  1. Hendrik Schröter (9 papers)
  2. Alberto N. Escalante-B. (9 papers)
  3. Tobias Rosenkranz (8 papers)
  4. Andreas Maier (394 papers)
Citations (64)

Summary

Analysis of DeepFilterNet: A Low Complexity Speech Enhancement Framework

The paper presents a detailed exploration of DeepFilterNet, a novel speech enhancement framework leveraging the principles of deep filtering. The authors underscore the necessity of enhancing speech signals for various critical applications, like automatic speech recognition and assistive listening devices, thereby establishing the context of their research.

Technical Overview

DeepFilterNet builds on complex-valued processing, moving beyond the time-frequency (TF) mask-based approaches common in speech enhancement, where complex masks (CMs) are preferred for their ability to modify phase. Typically, such a CM is multiplied point-wise with the noisy spectrogram to suppress noise. DeepFilterNet instead applies complex filters rather than point-wise masks, incorporating temporal context from past and future time steps while exploiting local correlations within each frequency band. Enhancement proceeds in two stages: the first enhances the spectral envelope, and the second recovers the periodic components of speech.
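The core idea of deep filtering can be sketched as follows: instead of scaling each TF bin by a single mask value, each bin is estimated as a complex linear combination of neighboring time frames in the same frequency band. The sketch below assumes the filter coefficients are given (in practice they are predicted by the network); the function name and shapes are illustrative, not the paper's actual API.

```python
import numpy as np

def deep_filter(spec, coefs, look_ahead=0):
    """Apply per-bin complex FIR filters (deep filtering) to a noisy STFT.

    spec:       complex spectrogram of shape (T, F)
    coefs:      complex coefficients of shape (T, N, F) -- one length-N
                filter per time frame and frequency bin (network output)
    look_ahead: number of future frames each filter may access
    """
    T, F = spec.shape
    N = coefs.shape[1]
    out = np.zeros_like(spec)
    for t in range(T):
        for i in range(N):
            # tau indexes the (mostly past) frames combined into frame t
            tau = t - i + look_ahead
            if 0 <= tau < T:
                out[t] += coefs[t, i] * spec[tau]
    return out
```

With `N = 1` and `look_ahead = 0` this reduces to an ordinary complex mask, which makes the relationship between the two approaches explicit: deep filtering is a strict generalization of point-wise complex masking.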

Numerical Insights and Methodological Contributions

Key assertions include DeepFilterNet's superiority over CMs across varying frequency resolutions and latencies. The two-stage design uses ERB-scaled gains to enhance the spectral envelope and deep filtering to recover speech periodicity. Notably, the paper shows that DeepFilterNet maintains performance across FFT window sizes from 5 ms to 30 ms, whereas complex ratio masks degrade at smaller window sizes. The results are supported by improved SI-SDR values relative to existing methods across these window sizes, demonstrating robust enhancement even under low-latency and low-complexity constraints.
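The first stage can be illustrated with a small numpy sketch: linear STFT bins are grouped into bands that are roughly equally spaced on the ERB-rate scale (Glasberg and Moore), and a per-band gain is expanded back to the linear bins. The band-edge computation and function names here are assumptions for illustration, not the paper's implementation; the gains themselves would come from the first-stage network.

```python
import numpy as np

def erb_band_edges(sr, n_fft, n_bands):
    """Split linear STFT bins into n_bands bands of roughly equal width
    on the ERB-rate scale (Glasberg & Moore approximation)."""
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    erb = 21.4 * np.log10(1 + 0.00437 * freqs)  # ERB-rate of each bin
    edges = np.linspace(erb[0], erb[-1], n_bands + 1)
    # bin index at each interior band boundary
    return np.searchsorted(erb, edges[1:-1])

def apply_erb_gains(spec, band_edges, gains):
    """Expand per-band gains to linear bins and scale the spectrogram.
    spec: (T, F) complex STFT; gains: (T, n_bands) real-valued."""
    F = spec.shape[1]
    out = spec.copy()
    for b, bins in enumerate(np.split(np.arange(F), band_edges)):
        out[:, bins] *= gains[:, b : b + 1]
    return out
```

Because the gains are real-valued and coarse in frequency, this stage only shapes the spectral envelope; restoring fine structure and phase is left to the deep-filtering stage.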

Implications and Future Directions

Practically, the framework's low complexity and high efficiency suggest strong potential for real-time applications, particularly where computational resources are limited. The framework's open-source nature further encourages broader adoption and adaptation in relevant systems. Theoretically, DeepFilterNet provides a compelling argument for the broader implementation of deep filtering over conventional complex masks in speech enhancement tasks.

The paper also implies potential for future research in enhancing perceptual models through more refined applications, such as using correlation-based metrics for assessing voiced probability. Further optimization and exploration may yield even more efficient algorithms capable of handling diverse auditory conditions with minimal computational overhead.

DeepFilterNet stands as a promising contribution to the domain of speech enhancement, providing a solid methodological framework usable in various fields of voice processing, and offers a foundation upon which future advancements will likely build.
