- The paper presents an adapted, non-autoregressive Wavenet model for end-to-end speech denoising directly on raw audio, departing from traditional spectrogram methods.
- Key architectural changes include non-causal dilated convolutions for broader context, parallel prediction of many samples per pass, and real-valued outputs that avoid μ-law quantization artifacts.
- Experiments show the modified Wavenet delivers better perceptual quality than a Wiener-filtering baseline, while its non-autoregressive design makes inference far cheaper than the original Wavenet's.
A Wavenet for Speech Denoising: An In-depth Analysis
The paper "A Wavenet for Speech Denoising" by Dario Rethage, Jordi Pons, and Xavier Serra presents a novel end-to-end learning method for speech denoising employing an adapted Wavenet architecture. This approach challenges the prevailing reliance on magnitude spectrograms as a front-end by directly utilizing raw audio. By discarding the autoregressive nature of the original Wavenet, the model offers promising results in both computational efficiency and perceptual performance.
Key Methodological Contributions
The paper introduces several significant modifications to the conventional Wavenet framework to repurpose it for the task of speech denoising:
- Non-causal, Dilated Convolutions: The authors drop Wavenet's autoregressive, causal constraint, since future samples are available in non-real-time applications. Symmetric filters of length three (rather than the original causal length-two filters) give each layer access to both past and future context, and stacked dilations broaden the receptive field substantially (see the first sketch after this list).
- Target Field Prediction: Unlike its precursor, which predicts one sample per forward pass, this adaptation predicts an entire field of samples in parallel. Because the model is not autoregressive, each pass is fully parallelizable, which drastically reduces inference time (a sliding-window sketch follows the list).
- Real-valued Predictions: Moving away from the original Wavenet's 256-way softmax over μ-law-quantized samples, the model regresses raw audio values directly. Denoising is a discriminative task whose conditional output distribution is essentially unimodal, so real-valued regression fits the problem better than a rich categorical distribution and avoids μ-law quantization noise (illustrated below).
- Energy-Conserving Loss: A novel loss enforces energy conservation between the noisy input and the sum of the estimated speech and estimated background noise. Penalizing L1 error on both the speech and background estimates is shown to slightly outperform a plain L1 loss on speech alone while keeping the estimates consistent with the input's energy (see the loss sketch below).
- Conditional Input: The model can be conditioned on speaker identity, enabling it to adapt to the characteristics of individual speakers. This modification yields marginal, yet statistically significant, improvements in performance (a conditioning sketch closes the examples below).
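To make the first modification concrete, here is a minimal PyTorch sketch of a non-causal dilated convolution stack. The dilation pattern (three stacks of dilations 1, 2, ..., 512 with filter width 3, giving a receptive field of 6,139 samples) follows our reading of the paper's final model; the channel count is an illustrative assumption, and Wavenet's gated activations, residual, and skip connections are omitted for brevity.

```python
import torch
import torch.nn as nn

class NonCausalDilatedStack(nn.Module):
    """Minimal sketch of a non-causal dilated convolution stack.

    With kernel_size=3, setting padding equal to the dilation keeps the
    output length equal to the input length while each filter sees
    `dilation` samples of past AND future context, i.e. the convolution
    is centered rather than causal.
    """
    def __init__(self, channels=128, stacks=3, max_dilation_exp=9):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=2 ** e, padding=2 ** e)
            for _ in range(stacks) for e in range(max_dilation_exp + 1)
        ])

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.convs:
            x = torch.relu(conv(x))  # gated/residual units omitted
        return x

# Receptive field: 1 + 2 * stacks * sum(2**e for e in 0..9)
#                = 1 + 2 * 3 * 1023 = 6139 samples
```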
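Target-field prediction then amounts to a sliding-window loop over the signal: each forward pass consumes one input field and emits a whole target field of denoised samples at once. The field sizes below are the values we take from the paper; the `model` callable and the padding scheme are illustrative assumptions.

```python
import numpy as np

def denoise_signal(model, noisy, receptive_field=6139, target_field=1601):
    """Denoise a 1-D signal field by field (illustrative sketch).

    Each call to `model` maps an input field of
    receptive_field + target_field - 1 samples to target_field output
    samples, all predicted in parallel -- versus one sample per pass
    for the autoregressive Wavenet.
    """
    half_context = (receptive_field - 1) // 2
    n_fields = -(-len(noisy) // target_field)  # ceil division
    # Pad so every output sample has full context and the padded length
    # covers a whole number of target fields.
    padded = np.pad(noisy, (half_context,
                            half_context + n_fields * target_field - len(noisy)))
    out = []
    for i in range(n_fields):
        start = i * target_field
        field = padded[start : start + receptive_field + target_field - 1]
        out.append(model(field))  # -> target_field denoised samples
    return np.concatenate(out)[: len(noisy)]
```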
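To see what real-valued outputs avoid, the snippet below runs the standard 8-bit μ-law companding round trip used by the original generative Wavenet; the residual it prints is the quantization noise that a softmax over 256 classes bakes into every prediction. This illustrates the standard transform, not code from the paper.

```python
import numpy as np

def mu_law_roundtrip(x, mu=255):
    """Encode x in [-1, 1] to 256 mu-law levels, then decode back."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    bins = np.round((companded + 1) / 2 * mu)    # quantize to 256 levels
    decompanded = bins / mu * 2 - 1
    return np.sign(decompanded) * np.expm1(np.abs(decompanded) * np.log1p(mu)) / mu

x = np.linspace(-0.5, 0.5, 7)
print(np.max(np.abs(x - mu_law_roundtrip(x))))   # nonzero quantization error
```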
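The energy-conserving loss itself is compact: with the background estimate defined as the noisy mixture minus the speech estimate, the two estimates sum to the input by construction, and the loss is simply L1 on both. A minimal sketch of our reading of it (variable names are ours):

```python
import torch

def energy_conserving_l1(speech_est, noisy, clean_speech, noise):
    """L = |s - s_hat|_1 + |b - b_hat|_1, with b_hat = noisy - s_hat.

    Because speech_est + noise_est equals the noisy input by
    construction, the estimates conserve the mixture's energy.
    """
    noise_est = noisy - speech_est  # b_hat
    return (torch.mean(torch.abs(clean_speech - speech_est))
            + torch.mean(torch.abs(noise - noise_est)))
```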
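Finally, speaker conditioning in Wavenet-family models is typically implemented as global conditioning, where a learned embedding of the speaker identity is injected into each gated activation. The paper conditions on speaker identity; the layer below shows the standard Wavenet global-conditioning formulation as an assumed mechanism, not a verbatim reproduction of the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedConditionedLayer(nn.Module):
    """One dilated layer with Wavenet-style global conditioning:
    z = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h),
    where h is a learned speaker embedding broadcast over time.
    """
    def __init__(self, channels, dilation, num_speakers):
        super().__init__()
        self.filter = nn.Conv1d(channels, channels, 3,
                                dilation=dilation, padding=dilation)
        self.gate = nn.Conv1d(channels, channels, 3,
                              dilation=dilation, padding=dilation)
        self.embed = nn.Embedding(num_speakers, 2 * channels)

    def forward(self, x, speaker_id):  # x: (batch, channels, time)
        h = self.embed(speaker_id).unsqueeze(-1)  # (batch, 2C, 1)
        h_f, h_g = h.chunk(2, dim=1)              # split filter/gate terms
        return (torch.tanh(self.filter(x) + h_f)
                * torch.sigmoid(self.gate(x) + h_g))
```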
Experimental Validation
The authors conduct comprehensive experiments, comparing their Wavenet adaptation against Wiener filtering, a standard benchmark in speech denoising. On a dataset covering diverse noise conditions and speakers, the discriminative Wavenet consistently demonstrates superior denoising on common objective metrics (SIG, BAK, OVL) and in perceptual evaluations.
Subjective Mean Opinion Scores (MOS) show that listeners significantly prefer the Wavenet's denoised outputs over the Wiener-filtered signals, highlighting the model's ability to suppress noise while preserving speech quality.
- Computational Efficiency: The combination of non-causal dilated convolutions and field-wise prediction yields a substantial reduction in computation time and resources compared to the original autoregressive approach, which must generate one sample per forward pass.
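A back-of-the-envelope count makes the gain concrete. Assuming the paper's target field of 1,601 samples and 16 kHz audio (both as we read them from the paper), denoising ten seconds of audio needs roughly a hundred parallel passes instead of 160,000 sequential ones:

```python
# Forward passes needed to denoise 10 s of 16 kHz audio (160,000 samples):
samples, target_field = 10 * 16_000, 1601
autoregressive_passes = samples                  # one sample per pass
parallel_passes = -(-samples // target_field)    # ceil(160000 / 1601) = 100
print(autoregressive_passes, parallel_passes)
```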
Implications and Future Directions
This work provides a robust framework and demonstrates the promise of end-to-end learning models, such as the modified Wavenet, for audio processing tasks like speech denoising. By dropping the spectrogram front-end and operating directly on raw audio, it opens avenues for further research into time-domain audio processing, which may improve model adaptability and accuracy across varied acoustic environments.
The authors suggest that future work could refine the network architecture and conditioning strategies to better exploit speaker-specific features or to generalize to unseen noise conditions. Adapting the method to related signal processing problems, such as audio source separation, also remains a promising direction.
Overall, this paper advances the discourse on leveraging discriminative models in speech processing, promoting efficiency without compromising sound quality, and sets a precedent for future research endeavors in end-to-end audio processing technologies.