DeepFilterNet: Real-Time Speech Enhancement
- DeepFilterNet is a two-stage speech enhancement framework that integrates multi-frame complex filtering with ERB gain estimation to deliver real-time, low-latency performance.
- It employs STFT analysis combined with psychoacoustically motivated ERB-domain processing and an efficient encoder-decoder architecture to enhance speech on embedded devices.
- The design leverages perceptual principles and SNR-gated decoding to optimize enhancement under varying noise conditions, validated through state-of-the-art benchmarks.
DeepFilterNet is a two-stage, real-time speech enhancement framework grounded in psychoacoustic and speech-production domain knowledge. It combines multi-frame complex filtering in the frequency domain (“deep filtering”) with perceptually motivated, coarse-resolution gain estimation in the equivalent rectangular bandwidth (ERB) domain. The system is engineered for low complexity and minimal latency, enabling deployment on embedded devices and hearing aids, while matching or exceeding state-of-the-art benchmarks in enhancement quality (Schröter et al., 2023).
1. System Architecture and Signal Flow
The DeepFilterNet pipeline consists of distinct analysis and enhancement stages:
- STFT Analysis: The input waveform is transformed to the complex short-time Fourier transform (STFT) domain $X(k, f)$ (frame index $k$, frequency bin $f$) using, at 48 kHz, 20 ms windows with 50% overlap and a two-frame look-ahead, yielding a total algorithmic latency of 40 ms.
- ERB-Domain Envelope Stage: The magnitude spectrum, comprising 481 bins (0–24 kHz), is projected via a fixed ERB filterbank onto 32 psychoacoustically spaced bands; log-compression then yields a feature vector that captures loudness in accordance with human perception (see the front-end sketch after this list).
- Encoder and Decoders: A compact encoder—utilizing 1×1 and depth-wise separable convolutions and a recurrent block—processes the ERB features; lightweight decoders then predict:
- 32 envelope gains for coarse enhancement.
- A scalar frame-level SNR estimate, used for decoder gating (Section 3).
- For the lowest 96 frequency bins ($f <$ 4.8 kHz), a complex multi-frame filter of length $N = 5$ (2 look-ahead taps, 3 causal taps).
- Deep Filtering: The enhanced spectrum for each low-frequency bin is calculated as
  $$Y(k, f) = \sum_{i=0}^{N-1} C(k, i, f) \cdot X(k - i + l, f),$$
  where $C(k, i, f)$ are the predicted complex filter taps, $N = 5$ is the filter length, and $l = 2$ is the look-ahead.
- Reconstruction: For bins below 4.8 kHz, the deep-filtered output $Y(k, f)$ is used; for higher bins, envelope enhancement is applied with the original phase. ISTFT reconstruction yields the enhanced time-domain output.
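To make the analysis front-end concrete, here is a minimal NumPy sketch of the STFT and ERB stages: a 20 ms / 50%-overlap Hann-window STFT at 48 kHz (481 bins) projected onto 32 log-compressed ERB bands. The rectangular band assignment and the Glasberg–Moore ERB formula are standard approximations for illustration, not the repository's exact implementation.

```python
import numpy as np

SR = 48_000   # sample rate (Hz)
WIN = 960     # 20 ms window
HOP = 480     # 50% overlap -> 10 ms hop
N_ERB = 32    # number of ERB bands

def stft(x):
    """Complex STFT: 20 ms Hann window, 50% overlap -> (frames, 481) bins."""
    win = np.hanning(WIN)
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP : i * HOP + WIN] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)   # 960-point FFT -> 481 bins covering 0-24 kHz

def erb_filterbank(n_freq=481, n_bands=N_ERB):
    """Rectangular ERB filterbank (n_bands, n_freq): each FFT bin is assigned to
    one band whose width follows the Glasberg-Moore ERB scale."""
    freqs = np.linspace(0, SR / 2, n_freq)
    erb = 21.4 * np.log10(1 + 0.00437 * freqs)           # Hz -> ERB-rate
    edges = np.linspace(erb[0], erb[-1], n_bands + 1)
    band = np.clip(np.digitize(erb, edges) - 1, 0, n_bands - 1)
    fb = np.zeros((n_bands, n_freq))
    fb[band, np.arange(n_freq)] = 1.0
    return fb / np.maximum(fb.sum(axis=1, keepdims=True), 1)  # average within bands

X = stft(np.random.randn(SR))                    # 1 s of noise as a placeholder input
feat = np.log10(erb_filterbank() @ (np.abs(X.T) ** 2) + 1e-10)  # (32, frames) log power
```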
2. Multi-Frame Complex Filter Estimation
DeepFilterNet directly regresses multi-frame complex filter taps for each low-frequency bin. The network learns to approximate the clean-speech STFT by filtering stacked noisy STFT frames with the predicted taps, rather than applying simple time-frequency masks. This approach exploits short-time spectral correlations—critical for restoring periodicity (both amplitude and phase) under low SNR.
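The operation itself reduces to a complex-valued, per-bin convolution over a short frame context. A sketch, assuming a `(frames, taps, bins)` layout for the predicted coefficients and the parameters from Section 1 ($N = 5$, look-ahead $l = 2$):

```python
import numpy as np

def deep_filter(X, C, lookahead=2):
    """Apply per-bin multi-frame complex filters.
    X: noisy STFT, shape (T, F) complex; C: filter taps, shape (T, N, F) complex.
    Computes Y[t, f] = sum_i C[t, i, f] * X[t - i + lookahead, f]."""
    T, N, F = C.shape
    # zero-pad so every tap has a frame to read: causal context in front, look-ahead behind
    Xp = np.pad(X, ((N - 1 - lookahead, lookahead), (0, 0)))
    Y = np.zeros((T, F), dtype=complex)
    for i in range(N):
        Y += C[:, i, :] * Xp[N - 1 - i : N - 1 - i + T, :]
    return Y

# toy shapes matching the paper's configuration: 5 taps over the 96 low-frequency bins
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 96)) + 1j * rng.standard_normal((100, 96))
C = rng.standard_normal((100, 5, 96)) + 1j * rng.standard_normal((100, 5, 96))
Y = deep_filter(X, C)   # (100, 96) enhanced low-frequency spectrum
```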
3. Psychoacoustic and Speech-Production Principles
The framework is engineered around several perceptual and structural properties:
- ERB Filterbank: Compresses 481 bins to 32 bands, mirroring human frequency resolution and reducing input/output dimensionality by more than 15×.
- Logarithmic Loudness Domain: Operations and loss on the ERB features are performed in the log domain to reflect human loudness sensitivity.
- Two-Component Enhancement: The system splits processing into coarse envelope enhancement via the ERB gains and periodicity (voiced-speech) enhancement via deep filtering below 4.8 kHz, matching the concentration of speech energy and ear sensitivity below roughly 5 kHz.
- SNR-Gated Decoding: A learned frame-level SNR estimate disables the enhancement decoders in frames of extreme noise or of already-clean speech, minimizing unnecessary compute (a gating sketch follows this list).
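A minimal sketch of the gating logic follows; the `snr_min`/`snr_max` thresholds and the fallback behaviors are illustrative placeholders, not the published values:

```python
def snr_gated_decode(lsnr_db, noisy_spec, run_decoders, snr_min=-10.0, snr_max=30.0):
    """Skip the enhancement decoders when a frame is hopelessly noisy or already clean.
    snr_min/snr_max are placeholder thresholds, not the paper's values."""
    if lsnr_db < snr_min:
        return 0.0 * noisy_spec        # extreme noise: attenuate only, decoders idle
    if lsnr_db > snr_max:
        return noisy_spec              # effectively clean: pass through unchanged
    return run_decoders(noisy_spec)    # normal two-stage enhancement
```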
4. Model Complexity and Real-Time Feasibility
Parameterization and compute design allow real-time execution:
- Parameter Count: The encoder dominates with roughly 200–300k parameters; each decoder adds about 100k.
- Compute Load: The architecture restricts deep filtering to 96 bins × 5 taps (480 complex outputs), supplemented with 32 envelope gains.
- Efficiency: On a single-threaded Intel i5-8250U CPU, the system achieves a real-time factor (RTF) of 0.19—well below the real-time threshold of 1.0. Deployment is feasible on hearing aids and embedded devices without model pruning or quantization.
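For context, an RTF figure like the reported 0.19 is simply processing time divided by audio duration; a minimal measurement harness (with a hypothetical `process` callable standing in for the enhancer):

```python
import time
import numpy as np

def real_time_factor(process, audio, sr=48_000):
    """RTF = processing time / audio duration; RTF < 1.0 means faster than real time."""
    t0 = time.perf_counter()
    process(audio)
    return (time.perf_counter() - t0) / (len(audio) / sr)

print(real_time_factor(lambda x: np.copy(x), np.zeros(10 * 48_000)))  # identity stand-in
```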
5. Training Strategies and Loss Functions
Training protocol leverages multi-lingual and high-quality corpora with domain-matched losses:
- Data: The DNS4 challenge corpus, with heavy oversampling of the high-quality PTDB and VCTK speakers.
- Loss Functions (sketched after this list):
- Mean squared error in the log domain for the 32 ERB gains.
- Complex-domain MSE on the deep filter output for periodicity (amplitude and phase) recovery.
- Optimization: AdamW optimizer with learning-rate scheduling and early stopping applied on validation data.
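A simplified PyTorch sketch of the two loss components (the exact published objective differs in weighting and compression details):

```python
import torch

def erb_gain_loss(g_pred, g_true, eps=1e-10):
    """MSE between predicted and target ERB gains in the log (loudness) domain."""
    return torch.mean((torch.log10(g_pred + eps) - torch.log10(g_true + eps)) ** 2)

def complex_spec_loss(Y_pred, Y_true):
    """Complex-domain MSE on the deep-filtered spectrum; penalizes amplitude and
    phase errors jointly."""
    return torch.mean(torch.abs(Y_pred - Y_true) ** 2)

# toy tensors with plausible shapes: (frames, 32 ERB bands) and (frames, 96 DF bins)
g_pred, g_true = torch.rand(100, 32), torch.rand(100, 32)
Y_pred = torch.randn(100, 96, dtype=torch.complex64)
Y_true = torch.randn(100, 96, dtype=torch.complex64)
loss = erb_gain_loss(g_pred, g_true) + complex_spec_loss(Y_pred, Y_true)
```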
6. Empirical Benchmarks
DeepFilterNet demonstrates strong performance on established test sets:
| Metric | DeepFilterNet3 | DeepFilterNet2 | RNNoise / PercepNet (typical) |
|---|---|---|---|
| WB-PESQ | 3.17 | 3.08 | 2.8–3.0 |
| CSIG (Signal Quality) | 4.34 | 4.30 | — |
| CBAK (Background Quality) | 3.61 | 3.40 | — |
| COVL (Overall Quality) | 3.77 | 3.70 | — |
| STOI (Intelligibility) | 0.944 | 0.943 | — |
These results on VCTK/DEMAND test sets show DeepFilterNet outperforming or matching prior real-time models, while maintaining minimal compute and latency (Schröter et al., 2023).
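Most of these metrics can be reproduced with standard Python packages; a sketch with placeholder file names (WB-PESQ requires 16 kHz input, and the composite CSIG/CBAK/COVL measures need a separate implementation such as `pysepm`):

```python
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

ref, sr = sf.read("clean.wav")      # placeholder paths: reference and enhanced audio
deg, _ = sf.read("enhanced.wav")

# WB-PESQ is defined at 16 kHz; resample both signals beforehand if needed
print("WB-PESQ:", pesq(16_000, ref, deg, "wb"))
print("STOI:   ", stoi(ref, deg, sr, extended=False))
```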
7. Open-Source Distribution and Deployment
- Implementation: The real-time inference pipeline is written in Rust, targeting the tract inference engine, with ready-to-use Python and Rust inference scripts (usage example below).
- Repository: Source code and pretrained weights for DeepFilterNet3 are available under a permissive open-source license at https://github.com/Rikorose/DeepFilterNet.
- Deployment: Includes a live Rust+LADSPA plugin for microphone noise suppression, comprehensive training instructions, and evaluation tools for all major speech enhancement metrics.
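For offline enhancement from Python, the repository README documents a `df.enhance` API along these lines (verify names against the current release):

```python
from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()                        # load the default pretrained model
audio, _ = load_audio("noisy.wav", sr=df_state.sr())  # resample to the model's rate
enhanced = enhance(model, df_state, audio)
save_audio("enhanced.wav", enhanced, df_state.sr())
```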
By integrating psychoacoustic filterbanks with multi-frame complex filtering, DeepFilterNet provides an efficient reference for full-band speech enhancement tasks in embedded, hearing-aid, and live communications settings.