FINALLY: fast and universal speech enhancement with studio-like quality (2410.05920v3)

Published 8 Oct 2024 in cs.SD, cs.AI, and eess.AS

Abstract: In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate WavLM-based perceptual loss into MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. Demo page: https://samsunglabs.github.io/FINALLY-page

Summary

  • The paper theoretically demonstrates that Least Squares GAN training implicitly estimates the most probable clean speech given a degraded input, aligning with speech enhancement goals.
  • The novel FINALLY model enhances the HiFi++ architecture by integrating WavLM-based perceptual loss and MS-STFT adversarial training for high-quality 48 kHz universal speech enhancement.
  • FINALLY achieves competitive perceptual quality and lower phoneme error rates compared to diffusion and regression baselines in subjective and objective evaluations.

The study addresses universal speech enhancement for real-world audio, which is marked by noise, reverberation, and microphone artifacts. It posits that Generative Adversarial Networks (GANs) inherently seek the point of maximum density within the conditional clean-speech distribution, which aligns with the speech enhancement objective. The authors probe feature extractors for perceptual loss to stabilize adversarial training, incorporating a WavLM-based perceptual loss into an MS-STFT adversarial training pipeline.

Key contributions include:

  • Theoretical analysis demonstrating that Least Squares GAN (LS-GAN) training implicitly regresses toward the main mode of the conditional distribution $p_{\text{clean}}(y|x)$, where $y$ is clean speech and $x$ is its degraded version, which matches the objective of speech enhancement.
  • Criteria for feature extractor selection based on the structure of the feature space, validated through neural vocoding experiments, which single out WavLM's convolutional features for use in the perceptual loss.
  • A novel universal speech enhancement model, FINALLY, integrating perceptual loss with MS-STFT discriminator training, enhancing the HiFi++ generator architecture with a WavLM encoder, yielding high-quality speech at 48 kHz.

The paper frames the practical goal of speech enhancement as restoring an audio signal that preserves the speech characteristics of the original recording, and argues that this makes it a "refinement" task rather than a "generative" one. From a mathematical point of view, this means the speech enhancement model should retrieve the most probable reconstruction of the clean speech $y$ given the corrupted version $x$, i.e., $y^* = \mathrm{arg\,max}_y\ p_{\text{clean}}(y|x)$.

The study presents a theorem formalizing this notion:

Let $p_{\text{clean}}(y|x) > 0$ be a finite, Lipschitz continuous density function with a unique global maximum, and let $p^{\xi}_g(y|x) = \frac{\xi^n}{2^n} \cdot \mathbf{1}\!\left[\, y - g_\theta(x) \in [-1/\xi,\ 1/\xi]^n \right]$. Then

$$\lim_{\xi \rightarrow +\infty}\ \underset{g_\theta(x)}{\mathrm{arg\,min}}\ \chi^2_{\text{Pearson}}\!\left( p_g^\xi \,\Big\Vert\, \frac{p_{\text{clean}} + p_g^\xi}{2} \right) = \underset{y}{\mathrm{arg\,max}}\ p_{\text{clean}}(y|x),$$

where:

  • $p_{\text{clean}}(y|x)$ is the conditional probability density function of clean speech $y$ given degraded speech $x$
  • $p^{\xi}_g(y|x)$ is a family of waveform distributions produced by the generator $g_\theta(x)$
  • $\xi$ is a parameter controlling the width of the uniform density around the generator output
  • $n$ is the dimensionality of $y$
  • $g_\theta(x)$ is the generator with parameters $\theta$
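
To make the theorem concrete, here is a minimal 1-D ($n = 1$) numerical check; this is our own illustration with a hypothetical bimodal density standing in for $p_{\text{clean}}(y|x)$, not code from the paper. Minimizing the Pearson $\chi^2$ objective over the generator output $\mu$ with a large $\xi$ selects the global mode at $y = -1$ rather than the conditional mean (which lies between the two modes, at $-0.4$):

```python
import numpy as np

# Hypothetical bimodal stand-in for p_clean(y|x); global mode at y = -1.
def p_clean(y):
    g1 = np.exp(-0.5 * ((y + 1) / 0.3) ** 2) / (0.3 * np.sqrt(2 * np.pi))
    g2 = np.exp(-0.5 * ((y - 1) / 0.3) ** 2) / (0.3 * np.sqrt(2 * np.pi))
    return 0.7 * g1 + 0.3 * g2

def chi2_pearson(mu, xi, y, dy):
    # p_g^xi: uniform density of width 2/xi centered at the generator output mu
    p_g = np.where(np.abs(y - mu) <= 1.0 / xi, xi / 2.0, 0.0)
    q = (p_clean(y) + p_g) / 2.0
    return np.sum((p_g - q) ** 2 / q) * dy    # Riemann sum of the divergence

y = np.linspace(-4.0, 4.0, 80001)             # fine grid: the uniform is only 2/xi wide
dy = y[1] - y[0]
mus = np.linspace(-2.0, 2.0, 401)
losses = [chi2_pearson(mu, xi=100.0, y=y, dy=dy) for mu in mus]
print("argmin over mu:", mus[int(np.argmin(losses))])   # ~ -1.0, the global mode
```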

The paper introduces a clustering rule and a Signal-to-Noise Ratio (SNR) rule for assessing candidate feature spaces for regression. The clustering rule dictates that representations of identical sounds should form distinct clusters, while the SNR rule requires representations of noisy speech to deviate monotonically from the clean clusters as noise increases.
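
A minimal sketch of how these two probes might be implemented; the magnitude-spectrum `embed` below is a stand-in for a real candidate extractor such as WavLM's convolutional features, and all names are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x):
    # Stand-in feature extractor: low-frequency magnitude spectrum.
    return np.abs(np.fft.rfft(x))[:256]

def clustering_score(feats, labels):
    # Clustering rule: features of identical sounds should form tight,
    # well-separated clusters; returns a between/within distance ratio.
    cent = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}
    within = np.mean([np.linalg.norm(f - cent[c]) for f, c in zip(feats, labels)])
    cs = np.stack(list(cent.values()))
    between = np.linalg.norm(cs[:, None] - cs[None, :], axis=-1).sum() \
              / (len(cs) * (len(cs) - 1))
    return between / within

def snr_rule_holds(clean, noise, snrs_db=(20, 10, 0, -10)):
    # SNR rule: as SNR drops, features should move monotonically away
    # from the clean feature vector.
    dists = []
    for snr in snrs_db:
        gain = np.linalg.norm(clean) / (np.linalg.norm(noise) * 10 ** (snr / 20))
        dists.append(np.linalg.norm(embed(clean + gain * noise) - embed(clean)))
    return all(b > a for a, b in zip(dists, dists[1:]))

t = np.linspace(0, 1, 16000, endpoint=False)
tones = [220, 220, 440, 440]                   # two "sounds", two samples each
feats = np.stack([embed(np.sin(2 * np.pi * f * t)
                        + 0.05 * rng.standard_normal(16000)) for f in tones])
print(clustering_score(feats, np.array([0, 0, 1, 1])))  # >> 1: distinct clusters
print(snr_rule_holds(np.sin(2 * np.pi * 220 * t), rng.standard_normal(16000)))
```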

The FINALLY model builds on HiFi++, incorporating a WavLM-large model output as input to the Upsampler and introducing an Upsample WaveUNet for 48 kHz output from a 16 kHz input. Training occurs in three stages: content restoration, adversarial training with MS-STFT discriminators, and aesthetic enhancement using human feedback metrics like UTMOS and PESQ.
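
The staging can be summarized with a short sketch. This is a hedged outline, not the authors' code: `lmos`, `adv`, and `mos_proxy` are hypothetical placeholders for the LMOS perceptual loss, the MS-STFT adversarial term, and the UTMOS/PESQ-based human-feedback term, respectively:

```python
import numpy as np

def staged_loss(enhanced, clean, stage, lmos, adv=None, mos_proxy=None):
    """Illustrative combination of the three training stages described above."""
    loss = lmos(enhanced, clean)              # Stage 1: content restoration
    if stage >= 2 and adv is not None:        # Stage 2: + MS-STFT adversarial loss
        loss += adv(enhanced)
    if stage == 3 and mos_proxy is not None:  # Stage 3: + aesthetic term that
        loss -= mos_proxy(enhanced)           #   rewards predicted quality
    return loss

# Toy usage with stand-in callables (real training would use the actual losses):
lmos = lambda a, b: float(np.mean((a - b) ** 2))
adv = lambda a: 0.0                           # placeholder adversarial term
mos_proxy = lambda a: 0.0                     # placeholder UTMOS/PESQ proxy
x, y = np.random.randn(16000), np.random.randn(16000)
print(staged_loss(x, y, stage=3, lmos=lmos, adv=adv, mos_proxy=mos_proxy))
```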

The model was evaluated against diffusion models such as BBED, STORM, and UNIVERSE, as well as regression models such as Voicefixer and DEMUCS. FINALLY achieved competitive perceptual quality and a lower Phoneme Error Rate (PhER) than UNIVERSE. On the VCTK-DEMAND dataset, FINALLY outperformed the baselines in subjective evaluation. Ablation studies confirmed the effectiveness of the LMOS loss, the WavLM encoder, and the multi-stage training.
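
For reference, PhER is the Levenshtein (edit) distance between the phoneme sequence transcribed from the enhanced audio and that of the reference, normalized by the reference length. A minimal implementation follows; the phoneme strings below are illustrative, not data from the paper:

```python
def phoneme_error_rate(ref, hyp):
    # Standard dynamic-programming edit distance over phoneme tokens,
    # normalized by the reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(phoneme_error_rate("h ə l oʊ".split(), "h ə l l oʊ".split()))  # 0.25
```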
