DeepFilterNet: Real-Time Speech Enhancement
- DeepFilterNet is a two-stage speech enhancement framework that integrates multi-frame complex filtering with ERB gain estimation to deliver real-time, low-latency performance.
- It employs STFT analysis combined with psychoacoustically motivated ERB-domain processing and an efficient encoder-decoder architecture to enhance speech on embedded devices.
- The design leverages perceptual principles and SNR-gated decoding to optimize enhancement under varying noise conditions, validated through state-of-the-art benchmarks.
DeepFilterNet is a two-stage, real-time speech enhancement framework grounded in psychoacoustic and speech-production domain knowledge. It combines multi-frame complex filtering in the frequency domain (“deep filtering”) with perceptually motivated, coarse-resolution gain estimation in the equivalent rectangular bandwidth (ERB) domain. The system is engineered for low complexity and minimal latency, enabling deployment on embedded devices and hearing aids, while matching or exceeding state-of-the-art benchmarks in enhancement quality (Schröter et al., 2023).
1. System Architecture and Signal Flow
The DeepFilterNet pipeline consists of distinct analysis and enhancement stages:
- STFT Analysis: The input waveform is transformed to the complex short-time Fourier transform (STFT) domain $X(k, f)$ (frame index $k$, frequency bin $f$) using, at 48 kHz, 20 ms windows with 50% overlap and a two-frame look-ahead, yielding a total algorithmic latency of 40 ms.
- ERB-Domain Envelope Stage: The magnitude spectrum, comprising 481 bins (0–24 kHz), is projected via a fixed ERB filterbank onto 32 psychoacoustically spaced bands; log-compression then yields a feature vector that captures loudness in accordance with human perception (see the front-end sketch after this list).
- Encoder and Decoders: A compact encoder—utilizing 1×1 and depth-wise separable convolutions and a recurrent block—processes the ERB features; lightweight decoders then predict:
- 32 envelope gains for coarse enhancement.
- A scalar frame-level SNR estimate, used for decoder gating (Section 3).
- For the lowest 96 frequency bins ($f <$ 4.8 kHz), a complex multi-frame filter of length $N = 5$ (2 look-ahead taps, 3 causal taps).
- Deep Filtering: The enhanced spectrum for each low-frequency bin is calculated as
  $$Y(k, f) = \sum_{i=0}^{N-1} C(k, i, f) \cdot X(k - i + l, f),$$
  where $C(k, i, f)$ are the predicted complex filter taps, $N = 5$ is the filter length, and $l = 2$ is the look-ahead.
- Reconstruction: For bins below 4.8 kHz, the deep-filtered output $Y(k, f)$ is used; for higher bins, envelope enhancement is applied with the original phase. ISTFT reconstruction yields the enhanced time-domain output.
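To make the analysis front-end concrete, here is a minimal NumPy sketch of the STFT and ERB stages: a 20 ms / 50%-overlap Hann-window STFT at 48 kHz (481 bins) projected onto 32 log-compressed ERB bands. The rectangular band assignment and the Glasberg–Moore ERB formula are standard approximations for illustration, not the repository's exact implementation.

```python
import numpy as np

SR = 48_000   # sample rate (Hz)
WIN = 960     # 20 ms window
HOP = 480     # 50% overlap -> 10 ms hop
N_ERB = 32    # number of ERB bands

def stft(x):
    """Complex STFT: 20 ms Hann window, 50% overlap -> (frames, 481) bins."""
    win = np.hanning(WIN)
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP : i * HOP + WIN] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)   # 960-point FFT -> 481 bins covering 0-24 kHz

def erb_filterbank(n_freq=481, n_bands=N_ERB):
    """Rectangular ERB filterbank (n_bands, n_freq): each FFT bin is assigned to
    one band whose width follows the Glasberg-Moore ERB scale."""
    freqs = np.linspace(0, SR / 2, n_freq)
    erb = 21.4 * np.log10(1 + 0.00437 * freqs)           # Hz -> ERB-rate
    edges = np.linspace(erb[0], erb[-1], n_bands + 1)
    band = np.clip(np.digitize(erb, edges) - 1, 0, n_bands - 1)
    fb = np.zeros((n_bands, n_freq))
    fb[band, np.arange(n_freq)] = 1.0
    return fb / np.maximum(fb.sum(axis=1, keepdims=True), 1)  # average within bands

X = stft(np.random.randn(SR))                    # 1 s of noise as a placeholder input
feat = np.log10(erb_filterbank() @ (np.abs(X.T) ** 2) + 1e-10)  # (32, frames) log power
```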
2. Multi-Frame Complex Filter Estimation
DeepFilterNet directly regresses multi-frame complex filter taps for each low-frequency bin. The network learns to approximate the clean-speech STFT by filtering stacked noisy STFT frames with the predicted taps, rather than applying simple time-frequency masks. This approach exploits short-time spectral correlations—critical for restoring periodicity (both amplitude and phase) under low SNR.
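The operation itself reduces to a complex-valued, per-bin convolution over a short frame context. A sketch, assuming a `(frames, taps, bins)` layout for the predicted coefficients and the parameters from Section 1 ($N = 5$, look-ahead $l = 2$):

```python
import numpy as np

def deep_filter(X, C, lookahead=2):
    """Apply per-bin multi-frame complex filters.
    X: noisy STFT, shape (T, F) complex; C: filter taps, shape (T, N, F) complex.
    Computes Y[t, f] = sum_i C[t, i, f] * X[t - i + lookahead, f]."""
    T, N, F = C.shape
    # zero-pad so every tap has a frame to read: causal context in front, look-ahead behind
    Xp = np.pad(X, ((N - 1 - lookahead, lookahead), (0, 0)))
    Y = np.zeros((T, F), dtype=complex)
    for i in range(N):
        Y += C[:, i, :] * Xp[N - 1 - i : N - 1 - i + T, :]
    return Y

# toy shapes matching the paper's configuration: 5 taps over the 96 low-frequency bins
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 96)) + 1j * rng.standard_normal((100, 96))
C = rng.standard_normal((100, 5, 96)) + 1j * rng.standard_normal((100, 5, 96))
Y = deep_filter(X, C)   # (100, 96) enhanced low-frequency spectrum
```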
3. Psychoacoustic and Speech-Production Principles
The framework is engineered around several perceptual and structural properties:
- ERB Filterbank: Compresses 481 bins to 32 bands, mirroring human frequency resolution and reducing input/output dimensionality by more than 15×.
- Logarithmic Loudness Domain: Operations and loss on the ERB features are performed in the log domain to reflect human loudness sensitivity.
- Two-Component Enhancement: The system splits processing into coarse envelope enhancement via the ERB gains and periodicity (voiced-speech) enhancement via deep filtering below 4.8 kHz, matching the concentration of speech energy and ear sensitivity below roughly 5 kHz.
- SNR-Gated Decoding: A learned frame-level SNR estimate disables the enhancement decoders in frames of extreme noise or of already-clean speech, minimizing unnecessary compute (a gating sketch follows this list).
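A minimal sketch of the gating logic follows; the `snr_min`/`snr_max` thresholds and the fallback behaviors are illustrative placeholders, not the published values:

```python
def snr_gated_decode(lsnr_db, noisy_spec, run_decoders, snr_min=-10.0, snr_max=30.0):
    """Skip the enhancement decoders when a frame is hopelessly noisy or already clean.
    snr_min/snr_max are placeholder thresholds, not the paper's values."""
    if lsnr_db < snr_min:
        return 0.0 * noisy_spec        # extreme noise: attenuate only, decoders idle
    if lsnr_db > snr_max:
        return noisy_spec              # effectively clean: pass through unchanged
    return run_decoders(noisy_spec)    # normal two-stage enhancement
```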
4. Model Complexity and Real-Time Feasibility
Parameterization and compute design allow real-time execution:
- Parameter Count: The encoder dominates with roughly 200–300k parameters; each decoder adds about 100k.
- Compute Load: The architecture restricts deep filtering to 96 bins × 5 taps (480 complex outputs), supplemented with 32 envelope gains.
- Efficiency: On a single-threaded Intel i5-8250U CPU, the system achieves a real-time factor (RTF) of 0.19—well below the real-time threshold of 1.0. Deployment is feasible on hearing aids and embedded devices without model pruning or quantization.
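For context, an RTF figure like the reported 0.19 is simply processing time divided by audio duration; a minimal measurement harness (with a hypothetical `process` callable standing in for the enhancer):

```python
import time
import numpy as np

def real_time_factor(process, audio, sr=48_000):
    """RTF = processing time / audio duration; RTF < 1.0 means faster than real time."""
    t0 = time.perf_counter()
    process(audio)
    return (time.perf_counter() - t0) / (len(audio) / sr)

print(real_time_factor(lambda x: np.copy(x), np.zeros(10 * 48_000)))  # identity stand-in
```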
5. Training Strategies and Loss Functions
Training protocol leverages multi-lingual and high-quality corpora with domain-matched losses:
- Data: The DNS4 challenge corpus, with heavy oversampling of the high-quality PTDB and VCTK speakers.
- Loss Functions (sketched after this list):
- Mean squared error in the log domain for the 32 ERB gains.
- Complex-domain MSE on the deep filter output for periodicity (amplitude and phase) recovery.
- Optimization: AdamW optimizer with learning-rate scheduling and early stopping applied on validation data.
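A simplified PyTorch sketch of the two loss components (the exact published objective differs in weighting and compression details):

```python
import torch

def erb_gain_loss(g_pred, g_true, eps=1e-10):
    """MSE between predicted and target ERB gains in the log (loudness) domain."""
    return torch.mean((torch.log10(g_pred + eps) - torch.log10(g_true + eps)) ** 2)

def complex_spec_loss(Y_pred, Y_true):
    """Complex-domain MSE on the deep-filtered spectrum; penalizes amplitude and
    phase errors jointly."""
    return torch.mean(torch.abs(Y_pred - Y_true) ** 2)

# toy tensors with plausible shapes: (frames, 32 ERB bands) and (frames, 96 DF bins)
g_pred, g_true = torch.rand(100, 32), torch.rand(100, 32)
Y_pred = torch.randn(100, 96, dtype=torch.complex64)
Y_true = torch.randn(100, 96, dtype=torch.complex64)
loss = erb_gain_loss(g_pred, g_true) + complex_spec_loss(Y_pred, Y_true)
```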
6. Empirical Benchmarks
DeepFilterNet demonstrates strong performance on established test sets:
| Metric | DeepFilterNet3 | DeepFilterNet2 | RNNoise / PercepNet (typical) |
|---|---|---|---|
| WB-PESQ | 3.17 | 3.08 | 2.8–3.0 |
| CSIG (Signal Quality) | 4.34 | 4.30 | — |
| CBAK (Background Quality) | 3.61 | 3.40 | — |
| COVL (Overall Quality) | 3.77 | 3.70 | — |
| STOI (Intelligibility) | 0.944 | 0.943 | — |
These results on VCTK/DEMAND test sets show DeepFilterNet outperforming or matching prior real-time models, while maintaining minimal compute and latency (Schröter et al., 2023).
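Most of these metrics can be reproduced with standard Python packages; a sketch with placeholder file names (WB-PESQ requires 16 kHz input, and the composite CSIG/CBAK/COVL measures need a separate implementation such as `pysepm`):

```python
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

ref, sr = sf.read("clean.wav")      # placeholder paths: reference and enhanced audio
deg, _ = sf.read("enhanced.wav")

# WB-PESQ is defined at 16 kHz; resample both signals beforehand if needed
print("WB-PESQ:", pesq(16_000, ref, deg, "wb"))
print("STOI:   ", stoi(ref, deg, sr, extended=False))
```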
7. Open-Source Distribution and Deployment
- Implementation: The real-time inference pipeline is written in Rust, targeting the tract inference engine, with ready-to-use Python and Rust inference scripts (usage example below).
- Repository: Source code and pretrained weights for DeepFilterNet3 are available under a permissive open-source license at https://github.com/Rikorose/DeepFilterNet.
- Deployment: Includes a live Rust+LADSPA plugin for microphone noise suppression, comprehensive training instructions, and evaluation tools for all major speech enhancement metrics.
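For offline enhancement from Python, the repository README documents a `df.enhance` API along these lines (verify names against the current release):

```python
from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()                        # load the default pretrained model
audio, _ = load_audio("noisy.wav", sr=df_state.sr())  # resample to the model's rate
enhanced = enhance(model, df_state, audio)
save_audio("enhanced.wav", enhanced, df_state.sr())
```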
By integrating psychoacoustic filterbanks with multi-frame complex filtering, DeepFilterNet provides an efficient reference for full-band speech enhancement tasks in embedded, hearing-aid, and live communications settings.