Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models (2010.11860v1)

Published 22 Oct 2020 in eess.AS and cs.SD

Abstract: Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark called VCTK-DEMAND. Using auxiliary models one at a time, we find acoustic event and self-supervised model PASE+ to be most effective. Our best model (PERL-AE) only uses acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore if denoising can leverage full framework, we use all networks but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.

Authors (3)

Saurabh Kataria (23 papers)
Najim Dehak (71 papers)
Jesús Villalba (29 papers)

Citations (32)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models (2010.11860v1)

Summary

Related Papers