- The paper introduces a deep learning model using sigmoid-driven ideal ratio masking to achieve sub-10 ms latency with improved PESQ scores in both stationary and nonstationary noise conditions.
- The methodology leverages a band-grouped encoder-decoder design with frequency attention while eliminating recurrent components for causal, low-latency processing.
- Extensive training on diverse datasets with targeted loss functions underscores the model’s potential for real-time voice communication and edge device deployment.
Real-Time Band-Grouped Vocal Denoising via Sigmoid-Driven Ideal Ratio Masking
Introduction
This paper introduces a real-time deep learning model for vocal denoising that achieves sub-10 ms total latency while improving objective speech quality metrics in both stationary and nonstationary noise environments. The architecture is specifically optimized for causal, low-latency deployment, notably omitting recurrent neural network components and extensive temporal lookahead, both of which are bottlenecks in standard approaches. The primary technical contributions are a sigmoid-driven ideal ratio mask (IRM) with loss functions focused on SNR and perceptual quality, and a frequency band-grouped encoder-decoder design with frequency attention.
Deep learning-based speech enhancement has surpassed traditional approaches in noise suppression, particularly under nonstationary noise. However, prominent models built on CRNNs, LSTMs, or GRUs, while effective at denoising, incur high latency because they rely on long context frames or recurrent processing. The choice of masking target, such as the ideal binary mask (IBM), ideal ratio mask (IRM), or complex ratio mask (CRM), involves a further trade-off between denoising efficacy and computational expense. The IBM's binary masking introduces musical artifacts; the CRM obtains improved perceptual quality through joint magnitude and phase processing, at significant computational cost and higher parameter counts. This work positions itself at the intersection of practical deployment and quality enhancement, favoring magnitude-only masking to meet real-time constraints.
Methodology
Signal Representation and Preprocessing
Input audio is sampled at 16 kHz. The Short-Time Fourier Transform (STFT) uses 512-sample frames with a 64-sample hop (4 ms per hop at 16 kHz), which keeps algorithmic latency minimal. Each frame is processed together with a rolling buffer of the past 8 frames, which supplies temporal context while preserving causality. Inputs consist of clipped, normalized magnitudes and the corresponding noisy phase.
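The framing and buffering described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the Hann window, the `RollingContext` class, and the `stream_magnitudes` helper are hypothetical, and the sketch assumes the 8-frame buffer includes the current frame. Clipping and normalization are omitted because the text does not specify them.

```python
import numpy as np

SAMPLE_RATE = 16_000     # Hz, from the text
N_FFT = 512              # samples per STFT frame, from the text
HOP = 64                 # samples between frames, from the text
CONTEXT_FRAMES = 8       # rolling buffer length, from the text
N_BINS = N_FFT // 2 + 1  # 257 one-sided frequency bins

hop_ms = 1000 * HOP / SAMPLE_RATE       # time between successive frames
window_ms = 1000 * N_FFT / SAMPLE_RATE  # span of one analysis frame


class RollingContext:
    """Holds the most recent magnitude frames; no lookahead, no learned state."""

    def __init__(self, n_frames=CONTEXT_FRAMES, n_bins=N_BINS):
        self.buf = np.zeros((n_frames, n_bins), dtype=np.float32)

    def push(self, mag_frame):
        # Drop the oldest frame and append the newest at the end.
        self.buf = np.roll(self.buf, -1, axis=0)
        self.buf[-1] = mag_frame
        return self.buf


def stream_magnitudes(audio):
    """Yield one (CONTEXT_FRAMES, N_BINS) context per hop, using past samples only."""
    window = np.hanning(N_FFT)  # assumed window; the text does not name one
    ctx = RollingContext()
    for start in range(0, len(audio) - N_FFT + 1, HOP):
        spec = np.fft.rfft(audio[start:start + N_FFT] * window)
        yield ctx.push(np.abs(spec))
```

Because the buffer is the only source of temporal context, the network itself can remain a pure function of its input.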
Architecture
The model is a low-parameter, stateless encoder-decoder (≈450k parameters) built around frequency band grouping: the 257 frequency bins are grouped into bands of 8, trading expressivity against computational cost. Dense layers capture local frequency correlations within each band, which reduces the parameter count relative to 2D convolutional approaches. The core consists of two encoder stages, a squeeze-excitation-based frequency attention bottleneck, and a single decoder with U-Net-style skip connections. A sigmoid activation in the terminal layer bounds the mask to (0, 1) and drives mask values toward the binary boundaries, reinforcing clear separation.
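A rough sketch of the band grouping and a squeeze-excitation style gate follows. It rests on assumptions the text does not settle: 257 is not divisible by 8, so the sketch zero-pads to 33 bands (the paper may handle the last bin differently), and the function names, the gate's layer shapes, and the ReLU in the gate are illustrative only.

```python
import numpy as np

N_BINS = 257  # from the text
BAND = 8      # bins per band, from the text


def group_bands(mag, band=BAND):
    """Reshape (..., n_bins) to (..., n_bands, band), zero-padding the tail (assumption)."""
    n_bins = mag.shape[-1]
    n_bands = -(-n_bins // band)  # ceiling division: 33 bands for 257 bins
    pad = n_bands * band - n_bins
    padded = np.pad(mag, [(0, 0)] * (mag.ndim - 1) + [(0, pad)])
    return padded.reshape(*mag.shape[:-1], n_bands, band)


def ungroup_bands(banded, n_bins=N_BINS):
    """Inverse of group_bands: flatten the bands and drop the padding."""
    flat = banded.reshape(*banded.shape[:-2], -1)
    return flat[..., :n_bins]


def se_gate(features, w1, w2):
    """Squeeze-excitation style gate over bands.

    features: (n_bands, channels); w1: (n_bands, r); w2: (r, n_bands).
    """
    squeezed = features.mean(axis=-1)                   # one summary value per band
    hidden = np.maximum(0.0, squeezed @ w1)             # small bottleneck with ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # per-band gate in (0, 1)
    return features * weights[:, None]                  # rescale each band
```

Grouping keeps each dense layer small, since it only sees 8 bins at a time, while the gate lets the bottleneck reweight whole bands using a global summary.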
No internal RNN memory is maintained; all temporal support derives from input frame buffering. This design simplifies real-time inference.
Objective and Training Regime
The model is trained to estimate an IRM:
$$M = \operatorname{clip}\left(\frac{|S|}{|Y| + \epsilon},\ 0,\ 1\right)$$
with a composite loss: direct mask MSE, a log-magnitude L1 term (weight 0.3), and a magnitude L1 term (weight 0.2). The latter two target perceptual objectives: preserving log-magnitude similarity and spectral peak sharpness.
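The mask target and composite loss can be written out as below. This is a sketch under stated assumptions, not the author's training code: it takes |S| to be the clean magnitude and |Y| the noisy magnitude (the usual IRM convention), gives the mask MSE term a weight of 1.0, and computes both L1 terms between the masked noisy magnitude and the clean magnitude. The value of `EPS` and the function names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # assumed stabilizer; the text gives no value for epsilon


def ideal_ratio_mask(clean_mag, noisy_mag, eps=EPS):
    """M = clip(|S| / (|Y| + eps), 0, 1)."""
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, 1.0)


def composite_loss(pred_mask, clean_mag, noisy_mag, w_log=0.3, w_mag=0.2, eps=EPS):
    """Mask MSE + 0.3 * log-magnitude L1 + 0.2 * magnitude L1 (weights from the text)."""
    target = ideal_ratio_mask(clean_mag, noisy_mag, eps)
    est_mag = pred_mask * noisy_mag  # enhanced magnitude (assumed comparison point)
    mask_mse = np.mean((pred_mask - target) ** 2)
    log_l1 = np.mean(np.abs(np.log(est_mag + eps) - np.log(clean_mag + eps)))
    mag_l1 = np.mean(np.abs(est_mag - clean_mag))
    return mask_mse + w_log * log_l1 + w_mag * mag_l1
```

The clip matters whenever noise and speech interfere destructively, since the clean magnitude can then exceed the noisy one and the raw ratio rises above 1.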
Training data spans seven languages and over 700 environments, incorporating datasets such as Saraga Carnatic Music, CommonVoice, Noisy Speech Database, GTSinger, SingingDatabase, VocalSet, and Acapella Mandarin. Randomized SNR mixing, pitch and gain augmentations, and additive Gaussian noise enforce robustness and prevent overfitting to narrow domains.
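Randomized SNR mixing is commonly implemented by scaling the noise so that the speech-to-noise power ratio hits a sampled target. The sketch below shows that standard recipe. The `mix_at_snr` helper and the sampled SNR range are hypothetical, since the text does not give the range used in training.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db, eps=1e-12):
    """Scale `noise` so that clean-to-noise power ratio equals `snr_db`, then add it."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + eps
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16_000) / 16_000)  # stand-in for a vocal clip
noise = rng.standard_normal(16_000)                           # stand-in for a noise clip
snr_db = rng.uniform(-5.0, 20.0)  # hypothetical range, not taken from the paper
noisy = mix_at_snr(clean, noise, snr_db)
```

Drawing a fresh SNR for every training example exposes the model to the full range of mixing conditions instead of a few fixed levels.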
Experimental Results
Evaluations employ objective metrics—wideband/narrowband PESQ, STOI, and ESTOI—across stationary and nonstationary noise conditions. The model yields substantial wideband PESQ improvements: +0.2111 (stationary) and +0.1225 (nonstationary). Notably, there is a marginal STOI decrease (≈–0.016), a characteristic trade-off of magnitude-only masking architectures prioritizing aggressive noise suppression at low latency over maximized intelligibility scores. ESTOI changes are negligible, indicating that intelligibility and temporal envelopes are well-preserved even as SNR and perceptual quality improve.
Performance on nonstationary noise conditions is competitive, albeit with smaller numerical gains than in the stationary case, demonstrating the model's capability in difficult environments without recurrent memory or attention over time.
Latency analysis confirms end-to-end system latency of just 6.214 ms (4 ms frame delay + 2.214 ms CPU inference), well under both the 10 ms real-time requirement and typical audio-visual sync thresholds (20–40 ms). Thus, the approach is suited to live voice communication settings (telephony, live streaming, monitoring), especially where phase-aware or high-latency architectures are infeasible.
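The latency budget is simple arithmetic and can be checked directly. The snippet below uses only figures from the text; the real-time factor is a derived quantity added here for illustration and is not reported in the source.

```python
SAMPLE_RATE = 16_000  # Hz, from the text
HOP = 64              # samples per hop, from the text
INFERENCE_MS = 2.214  # reported CPU inference time per frame
BUDGET_MS = 10.0      # real-time requirement cited in the text

frame_delay_ms = 1000 * HOP / SAMPLE_RATE  # time to collect one hop of audio
total_ms = frame_delay_ms + INFERENCE_MS   # end-to-end latency
headroom_ms = BUDGET_MS - total_ms         # slack left in the budget

# Inference must finish before the next hop arrives, so this ratio must stay below 1.
real_time_factor = INFERENCE_MS / frame_delay_ms

print(f"frame delay {frame_delay_ms:.3f} ms, total {total_ms:.3f} ms, "
      f"headroom {headroom_ms:.3f} ms, real-time factor {real_time_factor:.3f}")
```

A real-time factor below 1 means each frame is processed faster than frames arrive, so no backlog builds up during streaming.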
Theoretical and Practical Implications
The findings demonstrate that a frequency band-grouped, stateless architecture, when paired with a sigmoid-driven IRM and appropriate training objectives, can achieve real-time denoising with competitive speech quality improvement. The study challenges the prevailing emphasis on complex masking (joint magnitude and phase) and recurrent state for real-time applications, showing that magnitude masking over a short buffer of recent frames is a feasible and efficient alternative.
This approach enables deployment on edge devices and mobile platforms and generalizes across language and noise conditions, given the scale and diversity of training data. The stateless design is also advantageous for distributed processing pipelines with strict latency budgets.
Future Directions
Further research could integrate lightweight complex-domain processing to determine if incremental phase enhancement can be performed without violating sub-10 ms latency. Additionally, hybrid time-frequency approaches, adaptive band grouping mechanisms, or integration with attention-based architectures could be explored for robustness in highly dynamic vocal-noise mixtures. Expansion to multitask learning (e.g., simultaneous denoising and speaker identification) and evaluation on conversational quality metrics (e.g., ITU-T P.835) would additionally inform practical deployment.
Conclusion
The presented model achieves real-time vocal denoising with <10 ms total latency via a parameter- and compute-efficient band-grouped encoder-decoder. Significant PESQ improvements are reported for both stationary and nonstationary noise, with minimal trade-off on intelligibility, validating magnitude-only IRM enhancement in low-latency scenarios. The methodology and results underscore a viable path for deploying deep learning denoising in interactive audio applications where latency is constrained and recurrent or phase modeling is infeasible.
Reference:
Williams, D. "Real-Time Band-Grouped Vocal Denoising Using Sigmoid-Driven Ideal Ratio Masking" (2603.29326)