Real Time Speech Enhancement in the Waveform Domain (2006.12847v3)

Published 23 Jun 2020 in eess.AS, cs.LG, cs.SD, and stat.ML

Abstract: We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working directly on the raw waveform.

Citations (403)

Summary

  • The paper introduces a causal model that directly processes raw waveforms for real-time speech enhancement on CPU.
  • It employs a convolutional encoder-decoder with LSTM sequence modeling, integrating time- and frequency-domain loss functions for optimal audio quality.
  • Empirical evaluations demonstrate improved PESQ and STOI scores, underscoring its practical use in ASR systems, hearing aids, and communication devices.

Real Time Speech Enhancement in the Waveform Domain

Introduction

The paper introduces a causal speech enhancement model that operates directly on the raw audio waveform and runs in real time on a laptop CPU. This matters for applications such as audio and video calls, hearing aids, and automatic speech recognition (ASR) systems. The model combines time-domain and frequency-domain optimization through multiple loss functions with data augmentation techniques applied directly to the raw waveform.

Model Architecture

Encoder-Decoder Framework

The model uses a convolutional encoder-decoder architecture resembling a U-Net, augmented with LSTM-based sequence modeling. The encoder maps the raw waveform to a latent representation, a stack of LSTMs models temporal dependencies in that latent space, and the decoder reconstructs the denoised waveform from it.
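A minimal PyTorch sketch of this kind of architecture is shown below; the layer counts, kernel sizes, channel widths, and the `WaveformEnhancer` name are illustrative choices rather than the paper's exact Demucs configuration.

```python
import torch
import torch.nn as nn

class WaveformEnhancer(nn.Module):
    """Simplified causal encoder-LSTM-decoder over raw waveforms.
    Illustrative sketch, not the exact Demucs configuration from the paper."""

    def __init__(self, depth=4, base_channels=32, kernel=8, stride=4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch, ch = 1, base_channels
        for i in range(depth):
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
            ))
            # Decoder blocks are built innermost-last, so insert at the front.
            out_ch = 1 if i == 0 else in_ch
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(ch, out_ch, kernel, stride),
            ))
            in_ch, ch = ch, 2 * ch
        # A unidirectional LSTM keeps the sequence modeling causal.
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2, batch_first=True)

    def forward(self, x):  # x: (batch, 1, time)
        skips = []
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)
        x = self.lstm(x.transpose(1, 2))[0].transpose(1, 2)
        for decode in self.decoder:
            skip = skips.pop()
            x = decode(x + skip[..., :x.shape[-1]])
        return x  # denoised waveform, possibly a few samples shorter than the input
```

In practice the input would be padded so that each stride divides the signal length evenly; here the skip connections are simply trimmed to the running length.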

Skip Connections and Resampling

Skip connections bridge corresponding encoder and decoder layers, improving information flow and gradient propagation. Additionally, upsampling the audio by a factor U before encoding, and downsampling correspondingly after decoding, improves accuracy. This resampling is integrated into the model through a sinc interpolation filter.
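As a rough illustration of the resampling step, the sketch below upsamples a waveform by an integer factor with a Hann-windowed sinc filter; the filter length, window choice, and `sinc_upsample` name are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def sinc_upsample(x: torch.Tensor, factor: int = 2, zeros: int = 24) -> torch.Tensor:
    """Upsample a waveform of shape (batch, channels, time) by an integer factor
    using zero-stuffing followed by a Hann-windowed sinc low-pass filter.
    Illustrative sketch, not the paper's exact sinc interpolation filter."""
    batch, channels, time = x.shape
    half_width = zeros * factor
    t = torch.arange(-half_width, half_width + 1, device=x.device, dtype=x.dtype) / factor
    window = torch.hann_window(2 * half_width + 1, periodic=False,
                               device=x.device, dtype=x.dtype)
    kernel = (torch.sinc(t) * window).view(1, 1, -1).repeat(channels, 1, 1)
    # Insert (factor - 1) zeros between samples, then low-pass filter per channel.
    upsampled = x.new_zeros(batch, channels, time * factor)
    upsampled[..., ::factor] = x
    return F.conv1d(upsampled, kernel, padding=half_width, groups=channels)
```

Downsampling after decoding follows the mirror-image procedure: low-pass filter with the same kind of sinc kernel, then decimate by the same factor.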

Loss Functions

The training objective combines a time-domain component (an L1 loss on the waveform) with a frequency-domain component (a multi-resolution STFT loss). The STFT loss promotes spectral convergence and preserves the magnitude spectrum, benefiting both audio quality and intelligibility.
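A sketch of such a combined objective is given below, assuming (batch, time) waveform tensors; the STFT resolutions and the relative weighting are illustrative choices and may differ from the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def stft_loss(estimate, clean, n_fft, hop, win):
    """Spectral convergence + log-magnitude L1 loss at a single STFT resolution.
    Inputs are waveforms of shape (batch, time)."""
    window = torch.hann_window(win, device=estimate.device)
    mag_est = torch.stft(estimate, n_fft, hop, win, window=window, return_complex=True).abs()
    mag_ref = torch.stft(clean, n_fft, hop, win, window=window, return_complex=True).abs()
    sc = torch.norm(mag_ref - mag_est, p="fro") / (torch.norm(mag_ref, p="fro") + 1e-8)
    log_mag = F.l1_loss(torch.log(mag_est + 1e-8), torch.log(mag_ref + 1e-8))
    return sc + log_mag

def enhancement_loss(estimate, clean, stft_weight=1.0):
    """Time-domain L1 loss plus a multi-resolution STFT loss (illustrative resolutions)."""
    resolutions = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]
    mr_stft = sum(stft_loss(estimate, clean, *r) for r in resolutions) / len(resolutions)
    return F.l1_loss(estimate, clean) + stft_weight * mr_stft
```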

Implementation Details

Because the model is causal, it can process audio under real-time constraints. The causal Demucs configuration is tuned for a low computational footprint and runs on a single CPU core while maintaining a real-time factor (RTF) below one.
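A simple way to check this on a given machine is to time the model on a clip and divide by the clip's duration, as in the sketch below; note that this measures offline throughput over a whole utterance rather than the latency of a streaming, frame-by-frame implementation.

```python
import time
import torch

def real_time_factor(model, sample_rate=16_000, seconds=10.0, runs=5):
    """Estimate the real-time factor (processing time / audio duration) on a single
    CPU core. RTF < 1 means the model keeps up with the incoming audio stream."""
    torch.set_num_threads(1)          # emulate the single-core setting
    model.eval()
    x = torch.randn(1, 1, int(sample_rate * seconds))
    with torch.no_grad():
        model(x)                      # warm-up run
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = (time.perf_counter() - start) / runs
    return elapsed / seconds
```

For example, `real_time_factor(WaveformEnhancer())` would report the RTF of the architecture sketched earlier.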

Empirical Evaluation

The model is benchmarked against state-of-the-art methods on datasets such as Valentini and the DNS Challenge, using both objective measures (PESQ and STOI) and subjective evaluations (MOS scores). It demonstrates competitive performance, surpassing many earlier and contemporary models, both causal and non-causal.
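Computing these objective metrics for a clean/enhanced pair is straightforward with the `pesq` and `pystoi` Python packages, as sketched below; the snippet assumes 16 kHz mono NumPy arrays, and these packages are a convenient stand-in rather than necessarily the authors' evaluation code.

```python
import numpy as np
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

def evaluate_pair(clean: np.ndarray, enhanced: np.ndarray, sample_rate: int = 16_000) -> dict:
    """Wide-band PESQ (roughly 1.0-4.5) and STOI (0-1) for one clean/enhanced pair."""
    return {
        "pesq": pesq(sample_rate, clean, enhanced, "wb"),
        "stoi": stoi(clean, enhanced, sample_rate, extended=False),
    }
```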

Ablation Studies

A rigorous ablation study delineates the contributions of different model components, including data augmentations such as BandMask and reverberation-based augmentation, to the overall performance.
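To illustrate the flavor of these augmentations, the sketch below zeroes out a random frequency band of the signal in the STFT domain; the paper's BandMask operates over the mel scale with filtering on the waveform, so this is only a simplified linear-frequency approximation with an assumed `band_mask` helper.

```python
import random
import torch

def band_mask(wav: torch.Tensor, max_width: float = 0.2,
              n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Zero out a random contiguous frequency band covering up to `max_width`
    of the spectrum, then resynthesize the waveform. Input shape: (batch, time).
    Simplified stand-in for the paper's mel-scale BandMask augmentation."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
    n_bins = spec.shape[-2]
    width = int(random.uniform(0.0, max_width) * n_bins)
    start = random.randint(0, n_bins - width)
    spec[..., start:start + width, :] = 0
    return torch.istft(spec, n_fft, hop, window=window, length=wav.shape[-1])
```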

Practical Implications and Real-World Applications

The architecture's practical value lies in enhancing audio quality in real-world noisy environments. Its ability to substantially improve ASR performance under noisy conditions demonstrates utility that extends beyond human-perceived audio quality to downstream machine learning tasks.

Conclusion

The research demonstrates a robust method for real-time speech enhancement using waveform-domain processing. The model holds substantial potential for integration into consumer audio devices and communication systems owing to its low computational requirements and efficiency in real-world noisy conditions. The fusion of advanced convolutional structures with efficient sequence modeling establishes a new standard in causal speech enhancement methodologies. The availability of code and model parameters ensures its reproducibility and adaptability in related applications.
