Speech Denoising by Accumulating Per-Frequency Modeling Fluctuations (1904.07612v3)

Published 16 Apr 2019 in cs.SD, cs.LG, eess.AS, and stat.ML

Abstract: We present a method for audio denoising that combines processing done in both the time domain and the time-frequency domain. Given a noisy audio clip, the method trains a deep neural network to fit this signal. Since the fitting is only partly successful and is able to better capture the underlying clean signal than the noise, the output of the network helps to disentangle the clean audio from the rest of the signal. This is done by accumulating a fitting score per time-frequency bin and applying the time-frequency domain filtering based on the obtained scores. The method is completely unsupervised and only trains on the specific audio clip that is being denoised. Our experiments demonstrate favorable performance in comparison to the literature methods. Our code and samples are available at github.com/mosheman5/DNP and as supplementary. Index Terms: Audio denoising; Unsupervised learning

Citations (16)

Summary

  • The paper introduces an unsupervised method that leverages fluctuations in neural outputs to generate masks for distinguishing clean speech from noise.
  • It employs a WaveUnet-based encoder-decoder architecture integrating time and time-frequency domain processes to enhance signal fidelity.
  • Experimental evaluation on the VoiceBank-DEMAND dataset demonstrates competitive performance against both unsupervised and state-of-the-art supervised techniques.

An Unsupervised Approach to Speech Denoising Using Per-Frequency Modeling Fluctuations

The paper introduces an unsupervised method for speech denoising that accumulates per-frequency modeling fluctuations. The approach combines processing in the time domain and the time-frequency domain to separate clean speech from noise. Rather than relying on supervised training data, it trains a neural network solely on the noisy audio clip being denoised.

Main Contributions

The method uses a convolutional neural network, specifically the WaveUnet architecture with an encoder-decoder structure. The premise is that the network captures the clean signal more readily than the noise, so the fluctuations of its outputs during training can be turned into a score per time-frequency bin. From these scores a mask is constructed that separates the likely noise-heavy portions of the signal from those more likely to be clean. Crucially, the technique requires no external training samples, a standing constraint of supervised approaches, which makes it applicable where matched training data is unavailable.
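
To make the architecture concrete, here is a minimal PyTorch sketch of a WaveUnet-style 1-D encoder-decoder with a skip connection. The name `TinyWaveUnet`, the depth, kernel sizes, and channel widths are placeholders for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class TinyWaveUnet(nn.Module):
    """Illustrative 1-D encoder-decoder in the spirit of WaveUnet.
    Layer counts and channel widths are placeholders, not the paper's.
    Input length must be divisible by 4 for the shapes to line up."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7)
        self.enc2 = nn.Conv1d(16, 32, kernel_size=15, stride=2, padding=7)
        self.dec2 = nn.ConvTranspose1d(32, 16, kernel_size=16, stride=2, padding=7)
        self.dec1 = nn.ConvTranspose1d(32, 1, kernel_size=16, stride=2, padding=7)

    def forward(self, x):                            # x: (batch, 1, samples)
        e1 = torch.relu(self.enc1(x))                # downsample by 2, keep for skip
        e2 = torch.relu(self.enc2(e1))               # downsample by 2 again
        d2 = torch.relu(self.dec2(e2))               # upsample back to e1's length
        return self.dec1(torch.cat([d2, e1], dim=1))  # skip connection, then output
```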

Key to the approach is tracking the stability of the network output: variations in the network's predictions from one training iteration to the next are recorded per time-frequency bin. Because the network fits the clean signal more consistently than the noise, large variations are read as indicators of noise. Accumulating these fluctuations over many training iterations yields a normalized mask that estimates the clean signal, as sketched below.
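
A minimal sketch of that accumulation loop might look as follows. The helper name `denoise_by_fluctuations`, the hyperparameter values (`n_iters`, `n_fft`, `hop`, `lr`), and the simple difference-based score and normalization are assumptions for illustration; the authors' exact scoring and filtering rules are in their released code at github.com/mosheman5/DNP.

```python
import torch
import torch.nn.functional as F

def denoise_by_fluctuations(noisy, net, n_iters=5000, n_fft=512, hop=128, lr=1e-4):
    """Fit `net` to the noisy clip itself and accumulate, per
    time-frequency bin, how much the output spectrogram changes between
    iterations. Hyperparameters here are illustrative, not the paper's."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    window = torch.hann_window(n_fft)
    noisy_spec = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                            return_complex=True)
    fluct = torch.zeros(noisy_spec.shape)            # accumulated fluctuation score
    prev = None
    for _ in range(n_iters):
        opt.zero_grad()
        out = net(noisy.view(1, 1, -1)).view(-1)     # network fits the noisy clip
        loss = F.mse_loss(out, noisy)
        loss.backward()
        opt.step()
        with torch.no_grad():
            mag = torch.stft(out, n_fft, hop_length=hop, window=window,
                             return_complex=True).abs()
            if prev is not None:
                fluct += (mag - prev).abs()          # iteration-to-iteration change
            prev = mag
    # Low accumulated fluctuation -> bin was fit consistently -> likely clean.
    # This crude normalization stands in for the paper's scoring and smoothing.
    mask = 1.0 - fluct / fluct.max()
    denoised_spec = noisy_spec * mask                # time-frequency domain filtering
    return torch.istft(denoised_spec, n_fft, hop_length=hop, window=window,
                       length=noisy.shape[-1])
```

A call such as `denoise_by_fluctuations(noisy_clip, TinyWaveUnet())`, with `noisy_clip` a 1-D waveform tensor whose length is divisible by 4, ties the two sketches together.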

Experimental Evaluation

The authors evaluated the method against both traditional unsupervised and state-of-the-art supervised denoising techniques on the VoiceBank-DEMAND dataset. The results show competitive performance, with the method excelling particularly where stationary-noise assumptions do not hold or the noise profile is unconventional. Despite its unsupervised nature, it compares favorably with SEGAN, a supervised architecture, suggesting that modeling fluctuations alone can approach the performance of sophisticated supervised techniques. It outperforms most unsupervised methods, except where dataset-specific assumptions let classic methods such as MMSE-LSA excel.

Implications and Future Directions

This unsupervised denoising method provides a compelling alternative for applications where curated datasets are unavailable or when computational resources for extensive model training are constrained. It opens pathways for future research into deep learning-based unsupervised methods that can leverage internal model dynamics rather than relying on external ground truth signals.

Given its effective handling of non-stationary noise, future work might incorporate explicit models of unvoiced speech segments or explore network architectures that better distinguish clean from noisy components dynamically. Extending the approach beyond speech to other audio domains, such as music or environmental sounds, may reveal further applications.

In summary, the paper introduces a sophisticated yet versatile approach to audio denoising and serves as a basis for further investigations into unsupervised deep learning methods in signal processing. The focus on per-frequency modeling fluctuations provides a fresh perspective on utilizing neural network uncertainty as a valuable tool for signal denoising tasks.
