Real-Time Speech Enhancement Framework
- Real-time speech enhancement frameworks are defined as causal, low-latency pipelines that improve speech intelligibility using both classical signal processing and deep neural methods.
- They employ multi-stage architectures—from signal acquisition and feature extraction to enhancement and reconstruction—to minimize latency and ensure high-quality output.
- Key innovations include adaptive algorithms, multi-modal fusion, and resource-efficient designs tailored for deployment on edge devices in diverse acoustic environments.
Real-time speech enhancement frameworks are algorithmic and neural-system pipelines that perform low-latency, causal processing to improve the perceptual quality and intelligibility of speech signals in the presence of background noise, reverberation, and competing speech. These frameworks are distinguished from offline enhancement methods by their strict real-time constraints: each output frame is generated using information only from past and, at most, a limited number of future input frames, usually with total algorithmic latency on the order of 1–40 ms. Modern real-time speech enhancement systems span classical spatially-informed models, deep neural architectures, adaptive/continual learning strategies, and hybrid approaches that exploit domain-specific priors, psychoacoustic models, or multi-modal cues.
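As a concrete illustration of the latency budget, a common first-order accounting for an STFT-based pipeline with window length N_win samples, hop size N_hop, L frames of look-ahead, and sampling rate f_s is latency ≈ (N_win + L · N_hop) / f_s. For example, a 20 ms window with a 10 ms hop and zero look-ahead yields roughly 20 ms of algorithmic latency, while asymmetric analysis/synthesis windowing (as in RT-GCC-NMF below) shortens the effective synthesis window and pushes this figure down to a few milliseconds.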
1. Algorithmic Pipelines and Architectural Paradigms
Most real-time speech enhancement systems share a causal, frame-synchronous signal-processing pipeline, divided into five canonical stages:
- Signal Acquisition and Framing: Audio (and optionally video) is sampled and organized into overlapping frames. Window sizes and hop sizes are typically chosen to balance spectral resolution and latency (e.g., 20–32 ms window, 8–16 ms hop).
- Front-End Feature Extraction: Signals undergo front-end transforms, commonly the short-time Fourier transform (STFT), perceptually-motivated filterbanks (e.g., ERB), or time-domain convolutional encoders. Some frameworks employ dual-path analysis (e.g., magnitude+phase or complex-valued features) or integrate raw waveform processing.
- Enhancement Core: The main enhancement is accomplished by a variety of mechanisms:
  - Classical models: Spatial filtering (e.g., GCC-NMF, delay-and-sum beamforming (Wood et al., 2019, Kealey et al., 2023)) and time-frequency masking.
  - Deep architectures: U-Nets, recurrent networks (GRU/LSTM), convolutional-recurrent networks (CRN), attention-based models (Transformers, MHSA), or hybrid networks (ViT-based fusion, Mamba blocks) (Lu et al., 1 Jun 2025, Schröter et al., 2023, Bahmei et al., 14 Nov 2025, Serbest et al., 29 May 2025).
  - Two-stage pipelines: Magnitude enhancement followed by complex or phase refinement, sometimes across different domains (e.g., STFT then STDCT) (Zhang et al., 2024).
  - Multi-modal integration: Audio-visual fusion using temporal alignment and multimodal LSTM/Transformer blocks (Zhu et al., 2023, Gogate et al., 2021, Chen et al., 2024).
- Reconstruction: Enhanced spectra are processed with inverse STFT or decoder blocks to synthesize the time-domain output.
- Streaming/Post-processing: Overlap-add synthesis, pitch-tracking + comb postfilters, and VAD-guided smoothing are employed to further reduce artifacts and cope with latency constraints (Valin et al., 2020).
Pipelines are constrained to strictly causal designs or minimal look-ahead (e.g., a bounded number of future frames for pitch tracking), yielding total algorithmic latency as low as 2–5 ms for classical spatial methods (Wood et al., 2019) and 8–40 ms for most neural models (Lu et al., 1 Jun 2025, Schröter et al., 2023, Serbest et al., 29 May 2025).
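To make the canonical pipeline above concrete, the following minimal sketch (NumPy only, with illustrative parameter choices and an identity gain standing in for any enhancement core) shows a strictly causal, frame-synchronous analysis, masking, and overlap-add loop of the kind these stages describe.

```python
# Minimal causal streaming loop (sketch): STFT analysis, a placeholder
# enhancement core, and overlap-add synthesis. Parameters are illustrative;
# any mask or filter predictor can replace enhance_frame().
import numpy as np

FS, WIN, HOP = 16000, 320, 160             # 20 ms window, 10 ms hop
window = np.sqrt(np.hanning(WIN))          # sqrt-Hann: ~perfect reconstruction at 50% overlap

def enhance_frame(spec):
    """Placeholder core: a real system predicts a gain mask or filter here."""
    gain = np.ones_like(np.abs(spec))      # identity gain = pass-through
    return gain * spec

def stream(chunks):
    """Consume HOP-sample chunks, yield HOP-sample enhanced chunks.
    Only past samples are used, so algorithmic latency is about WIN (20 ms)."""
    in_buf, ola_buf = np.zeros(WIN), np.zeros(WIN)
    for chunk in chunks:                                   # chunk: (HOP,) float array
        in_buf = np.concatenate([in_buf[HOP:], chunk])     # slide analysis buffer (causal)
        spec = np.fft.rfft(window * in_buf)
        frame = np.fft.irfft(enhance_frame(spec), n=WIN) * window
        ola_buf += frame                                   # overlap-add synthesis
        out = ola_buf[:HOP].copy()
        ola_buf = np.concatenate([ola_buf[HOP:], np.zeros(HOP)])
        yield out
```

Lower-latency variants replace the symmetric synthesis window with a much shorter asymmetric one, or operate directly on time-domain frames.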
2. Core Modeling Techniques and Domains
Real-time speech enhancement frameworks vary widely in their core modeling choices, which are shaped by latency, compute, and deployment targets:
- Spatial-Feature and NMF Hybrid Methods: RT-GCC-NMF combines two-channel generalized cross-correlation phase transform (GCC-PHAT) for TDOA estimation with a universal NMF magnitude dictionary, yielding an atom-to-source association mechanism and a per-frame soft mask for interference suppression. The method operates framewise and achieves latencies down to 2–3 ms through asymmetric STFT windowing (Wood et al., 2019); see the GCC-PHAT sketch after this list.
- Frequency-Temporal Decoupled Filtering: Hierarchical deep filtering approaches predict temporal and frequency filter coefficients in separate stages. HDF-Net, for instance, uses a two-stage pipeline where the first stage models coarse spectral/temporal periodicity, while the second offers fine spectral correction via frequency deep filtering, integrated with sub-band fusion and lightweight temporal-attention blocks (TAConv) (Lu et al., 1 Jun 2025).
- Direct Masking and Envelope Modeling: Many frameworks employ real/complex masking—e.g., DeepFilterNet applies multi-frame complex filters in low-frequency bins and ERB-band envelope gain in high-frequency bins, using psychoacoustic domain compression (Schröter et al., 2023). PercepNet targets the spectral envelope and periodicity, combining lightweight convolutional, recurrent, and pitch-tracking components for fullband, very low-compute enhancement (Valin et al., 2020).
- Two-Stage and Cross-Spectral Pipelines: Advanced models execute different stages in different spectral domains. FDFNet enhances magnitude in the STFT domain, then refines the output in the STDCT domain, leveraging easier phase recovery and strong noise suppression in the latter, leading to improved causal performance metrics (Zhang et al., 2024).
- GAN-Based Stochastic Regeneration and Hybrid Cascades: DeepFilterGAN combines a predictive, lightweight front-end with a GAN-based back-end “stochastic regenerator” that refines over-suppressed spectra and recovers detail, with direct conditioning on noisy + enhanced features; final outputs achieve strong NISQA-MOS with just 40 ms latency (Serbest et al., 29 May 2025). Gesper integrates a restoration-focused complex-spectral mapping GAN followed by parallel fullband/wideband enhancement networks (Liu et al., 2023).
- Edge-Optimized U-Nets and Attention Mechanisms: Models targeting real-time edge deployment use quantized, pruned U-Net variants with attention gates and novel “Reverse Attention” modules for resource efficiency and rapid inference (Ojha et al., 20 Sep 2025); others exploit kernel fusion and per-frame normalization to minimize latency and memory (Ahn et al., 26 Sep 2025).
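As an illustration of the spatial front-end in RT-GCC-NMF-style systems, the sketch below estimates a per-frame TDOA with GCC-PHAT from two microphone frames (NumPy; the NMF dictionary, atom-to-source association, and soft-mask construction of (Wood et al., 2019) are not reproduced).

```python
# GCC-PHAT time-difference-of-arrival estimate for one frame of a two-mic array.
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the relative delay (seconds) between two microphone frames
    via the generalized cross-correlation with phase transform."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                         # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate([cc[-(n // 2):], cc[:n // 2 + 1]])  # center zero lag at index n//2
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    lags = np.arange(-max_shift, max_shift + 1)
    window = cc[n // 2 - max_shift : n // 2 + max_shift + 1]
    return lags[np.argmax(np.abs(window))] / fs            # TDOA estimate in seconds
```

In the full method, individual NMF atoms are associated with the target or the interference according to their estimated TDOAs, which yields the per-frame soft mask described above.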
3. Latency, Efficiency, and Real-World Deployment
Latency and computational efficiency are primary concerns:
- Algorithmic Latency: Practical frameworks report end-to-end processing delays ranging from <3 ms (GCC-NMF in asymmetric STFT mode (Wood et al., 2019)) to 20–40 ms for most CNN/RNN and Transformer architectures (Schröter et al., 2023, Lu et al., 1 Jun 2025, Defossez et al., 2020). Audio-visual models (e.g., RT-LA-VocE) achieve the theoretical minimum frame-by-frame latency (40 ms) and actual per-frame processing times as low as 28 ms on consumer GPUs (Chen et al., 2024).
- Resource Requirements: State-of-the-art CPU-optimized systems achieve real-time factors (RTF) down to 0.04 (i.e., 25× faster than real time) on notebook CPUs for fullband audio (Schröter et al., 2022), and sub-5 ms/frame on embedded ARM/DSP for INT8-quantized U-Nets (Ojha et al., 20 Sep 2025). Memory footprints for streaming models can be <3 KB for adaptation parameters (Cheng et al., 8 Mar 2026) or <1.2 MB for INT8-quantized models suitable for hearables.
- Optimization Tactics: Techniques include separable and grouped convolutions, kernel/batchnorm fusion, fixed-point quantization, minimal look-ahead constraints, and ring-buffered streaming for minimum temporal delay and peak RAM utilization (Ahn et al., 26 Sep 2025, Ojha et al., 20 Sep 2025).
- Adaptation and Robustness: Fast on-device adaptation via low-rank adapters (LoRA) allows models to update <1% of parameters per acoustic scene, yielding +1.5 dB SI-SDR improvement in <20 updates and robust generalization across 111 real scenes (Cheng et al., 8 Mar 2026).
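The low-rank adaptation referenced above can be illustrated with a generic LoRA wrapper around a single linear layer (PyTorch sketch; the adapter placement, rank, and online update schedule of (Cheng et al., 8 Mar 2026) are not reproduced here).

```python
# Generic LoRA adapter: frozen base projection plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes base(x) + (alpha/r) * x @ A^T @ B^T with the base weights frozen."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as identity adapter
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B are updated on-device, so the trainable footprint is r·(in_features + out_features) parameters per wrapped layer, which is how sub-1% parameter and few-kilobyte adaptation budgets are reached.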
4. Learning, Loss Functions, and Training Regimes
Training protocols and objectives vary across frameworks but are tailored to reinforce high speech fidelity and artifact-free denoising under real-time constraints:
- Loss Functions: Hybrid objectives dominate, combining compressed-spectral MSE, multi-resolution STFT losses, time-domain L1, SI-SDR, adversarial objectives (LS-GAN, MelGAN), and psychoacoustic/entropy-weighted masking terms (Schröter et al., 2023, Serbest et al., 29 May 2025, Liu et al., 2023); two representative terms are sketched after this list. Several works introduce SNR-weighted MSE for an explicit trade-off between speech preservation and noise suppression (Xia et al., 2020).
- Speech/Noise Trade-off: Models expose hyperparameters (e.g., mask width α, floor η in RT-GCC-NMF; activity-weight μ in UPN) to permit explicit control over the aggressiveness of noise suppression versus speech fidelity (Wood et al., 2019, Wang et al., 2023).
- Personalization/Conditioning: Unified frameworks (UPN) support both general and personal (target-speaker) enhancement by framewise injection of speaker embeddings, with data augmentation enhancing embedding robustness (Wang et al., 2023).
- Domain Knowledge: Integrating pitch tracking, psychoacoustic compression (e.g., log-ERB, envelope masking), and periodicity-based comb postfilters are shown to substantially improve perceptual quality at low compute cost (Valin et al., 2020, Schröter et al., 2023).
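Two of the recurring loss terms listed above, SI-SDR and a power-law-compressed spectral MSE, can be written compactly as follows (NumPy sketch; actual systems combine several such terms, often with adversarial and multi-resolution variants, under tuned weights).

```python
# Sketch of two common training objectives for speech enhancement.
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better)."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)    # optimal scaling of the reference
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def compressed_spec_mse(est, ref, n_fft=512, hop=128, c=0.3):
    """MSE on power-law-compressed magnitude spectra |X|^c, a common perceptual proxy."""
    def spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** c
    return float(np.mean((spec(est) - spec(ref)) ** 2))
```

In training, SI-SDR (or its negative) acts as the time-domain term while the compressed spectral loss emphasizes perceptually relevant low-energy regions; the relative weighting is one of the hyperparameters that sets the speech/noise trade-off discussed above.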
5. Evaluation Metrics and Benchmark Results
Objective and subjective metrics, benchmark datasets, and ablations are standardly employed:
- Objective Metrics: OPS (overall perceptual score, from the PEASS toolkit), SI-SDR, PESQ, STOI, eSTOI, CSIG, CBAK, and COVL are standard; NISQA-MOS and DNSMOS increasingly serve as non-intrusive perceptual proxies. Many frameworks also report downstream WER/ASR impact (Lu et al., 1 Jun 2025, Serbest et al., 29 May 2025, Valin et al., 2020); a minimal evaluation harness is sketched after this list.
- Typical Results: RT-GCC-NMF achieves OPS ≈ 38, STOI ≈ 0.72, and up to +10 dB SDR improvement in low-SNR regimes (Wood et al., 2019). HDF-Net reaches WB-PESQ = 3.01 with only 0.2M parameters (Lu et al., 1 Jun 2025), while FDFNet achieves state-of-the-art WB-PESQ = 3.05 with 4.43M parameters on VoiceBank+DEMAND (Zhang et al., 2024). The FastEnhancer base model matches or exceeds other streaming baselines at an RTF of 0.022 and PESQ of 3.13 (Ahn et al., 26 Sep 2025).
- Ablations and Trade-offs: Several systems report quantitative ablations of model components (e.g., attention span, fusion mechanisms), confirming performance gains attributable to proposed mechanisms (Zhang et al., 2024, Serbest et al., 29 May 2025). Edge-optimized models show negligible WER/PESQ degradation under 20% pruning or INT8 quantization (Ojha et al., 20 Sep 2025).
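A minimal evaluation harness for a few of the intrusive metrics above, assuming the third-party pesq and pystoi packages are installed, might look as follows (sketch; intrusive metrics require a time-aligned clean reference).

```python
# Sketch of an objective-evaluation harness; assumes `pip install pesq pystoi`.
from pesq import pesq      # ITU-T P.862 implementation; mode 'wb' = wideband PESQ
from pystoi import stoi    # STOI / eSTOI

def evaluate(clean, enhanced, fs=16000):
    """Return intrusive metrics for one clean/enhanced utterance pair (1-D arrays)."""
    return {
        "WB-PESQ": pesq(fs, clean, enhanced, 'wb'),
        "STOI":    stoi(clean, enhanced, fs, extended=False),
        "eSTOI":   stoi(clean, enhanced, fs, extended=True),
    }
```

Non-intrusive proxies such as DNSMOS and NISQA-MOS instead score the enhanced signal alone, which is why they are favored for real recordings without clean references.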
6. Extensions: Audio-Visual, Multi-Modal, and Adaptive Systems
Modern frameworks leverage additional information for robust enhancement:
- Audio-Visual Systems: Multiple systems (e.g., AV-E3Net, RT-LA-VocE) tightly fuse audio and visual features (lips, mouth ROI) using multi-stage gating, Transformer-based fusion, and causal vocoder blocks. Visual context enables significant improvements in low SNR/overlap utterances, especially under multi-talker/reverberant conditions (Zhu et al., 2023, Chen et al., 2024, Gogate et al., 2021).
- Adaptive and Lightweight Adaptation: LoRA-based adaptation robustly updates only adapter parameters online, yielding monotonic improvement in unseen scenes without catastrophic forgetting (Cheng et al., 8 Mar 2026).
- Unified Personalization: Real-time frameworks exist for joint personalized and non-personalized enhancement; framewise control toggles between general and target-speaker output, optimizing a multitask loss with VAD-weighted supervision (Wang et al., 2023).
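A minimal sketch of the framewise conditioning idea shared by the audio-visual and personalized systems above: a per-frame side embedding (lip/ROI features, or a speaker embedding broadcast over time) is concatenated with the audio features ahead of a causal recurrent core (PyTorch sketch; the gating, Transformer fusion, and vocoder blocks of the cited systems are omitted, and all layer sizes are illustrative).

```python
# Framewise conditioning of a causal enhancement core on a side embedding.
import torch
import torch.nn as nn

class ConditionedEnhancer(nn.Module):
    """Causal GRU masking network conditioned on a per-frame embedding."""
    def __init__(self, n_freq=257, d_cond=128, d_hidden=256):
        super().__init__()
        self.gru = nn.GRU(n_freq + d_cond, d_hidden, batch_first=True)  # unidirectional = causal
        self.mask = nn.Sequential(nn.Linear(d_hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, cond):
        # noisy_mag: (B, T, n_freq) magnitude frames; cond: (B, T, d_cond) side embeddings
        h, _ = self.gru(torch.cat([noisy_mag, cond], dim=-1))
        return self.mask(h) * noisy_mag          # masked (enhanced) magnitudes
```

Swapping the conditioning stream between a zero vector and a target-speaker embedding is one simple way to realize the general-versus-personalized toggle described in (Wang et al., 2023).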
7. Summary Table: Core Properties of Representative Real-Time Frameworks
| Framework | Domain | Architecture | Latency | CPU/Edge RTF or per-frame time | Notable Metrics | Reference |
|---|---|---|---|---|---|---|
| RT-GCC-NMF | TF/multi-mic | NMF + GCC-PHAT | 2–3 ms | <1.0 | OPS=38, STOI=0.72 | (Wood et al., 2019) |
| HDF-Net | TF | Hier. Deep FilterNet | <20 ms | <0.05 | PESQ=3.01 (0.2M params) | (Lu et al., 1 Jun 2025) |
| DeepFilterNet/DFNet2 | TF | Dual-decoder RNN | 40 ms | 0.04 | PESQ=3.17, STOI=0.94 | (Schröter et al., 2022) |
| FastEnhancer | TF | RNNFormer (GRU+MHSA) | ~16 ms | 0.012–0.022 | PESQ=3.13 (Base) | (Ahn et al., 26 Sep 2025) |
| RT-LA-VocE | AV | 3D-ResNet+Emformer | 40 ms | 28 ms/frame† | PESQ=1.40, STOI=0.70 | (Chen et al., 2024) |
| DeepFilterGAN | TF | DFNet2+GAN (Mamba) | 40 ms | 0.8×RT | NISQA-MOS=3.12 | (Serbest et al., 29 May 2025) |
| FDFNet | TF&DCT | 2-stage CRN+TFSM | <40 ms | <1.0 | WB-PESQ=3.05 | (Zhang et al., 2024) |
| Reverse Attention U-Net | TF | U-Net+AG+RA, INT8 | <5 ms | ~3–5 ms‡ | PESQ up to 2.99 | (Ojha et al., 20 Sep 2025) |
†Measured on RTX 2080 Ti + i7-9700K; ‡on ARM Cortex-A55/Hexagon DSP; all results for streaming/causal inference.
References
- "Unsupervised Low Latency Speech Enhancement with RT-GCC-NMF" (Wood et al., 2019)
- "Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments" (Cheng et al., 8 Mar 2026)
- "A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement" (Lu et al., 1 Jun 2025)
- "DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement" (Schröter et al., 2023)
- "DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices for Full-Band Audio" (Schröter et al., 2022)
- "Reverse Attention for Lightweight Speech Enhancement on Edge Devices" (Ojha et al., 20 Sep 2025)
- "DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration" (Serbest et al., 29 May 2025)
- "FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement" (Ahn et al., 26 Sep 2025)
- "A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement" (Zhang et al., 2024)
- "RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement" (Chen et al., 2024)
- "A Framework for Unified Real-time Personalized and Non-Personalized Speech Enhancement" (Wang et al., 2023)
- "Gesper: A Restoration-Enhancement Framework for General Speech Reconstruction" (Liu et al., 2023)
- "Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion" (Bahmei et al., 14 Nov 2025)
- "Real-Time Speech Enhancement with Dynamic Attention Span" (Zheng et al., 2023)
- "A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech" (Valin et al., 2020)