- The paper theoretically demonstrates that Least Squares GAN training implicitly estimates the most probable clean speech given a degraded input, aligning with speech enhancement goals.
- The FINALLY model extends the HiFi++ architecture with a WavLM encoder and combines a WavLM-based perceptual loss with MS-STFT adversarial training for high-quality 48 kHz universal speech enhancement.
- FINALLY achieves competitive perceptual quality and lower phoneme error rates compared to diffusion and regression baselines in subjective and objective evaluations.
The study addresses universal speech enhancement for real-world audio degraded by noise, reverberation, and microphone artifacts. It argues that Generative Adversarial Network (GAN) training inherently seeks the point of maximum density in the conditional distribution of clean speech, which aligns with the speech enhancement objective. The authors then study which feature extractors provide a good space for a perceptual loss that stabilizes adversarial training, and incorporate a WavLM-based perceptual loss into an MS-STFT adversarial training pipeline.
Key contributions include:
- Theoretical analysis demonstrating that Least Squares GAN (LS-GAN) training implicitly regresses toward the main mode of the conditional distribution $p_{\text{clean}}(y \mid x)$, where $y$ is clean speech and $x$ is its degraded version, which suits speech enhancement.
- Criteria for feature extractor selection based on feature space structure, validated through neural vocoding, highlighting convolutional features of WavLM for perceptual loss.
- A novel universal speech enhancement model, FINALLY, that integrates the perceptual loss with MS-STFT discriminator training and augments the HiFi++ generator with a WavLM encoder, yielding high-quality 48 kHz speech.
The paper frames the practical goal of speech enhancement as restoring a clean audio signal that preserves the speech characteristics of the original recording. It argues that this is more of a "refinement" task than a "generative" one. Mathematically, the speech enhancement model should retrieve the most probable reconstruction of the clean speech $y$ given the corrupted version $x$, i.e., $\hat{y} = \underset{y}{\mathrm{arg\,max}}\ p_{\text{clean}}(y \mid x)$.
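To see why this differs from plain regression (an illustrative contrast, not a formula taken from the paper): an L2 objective is minimized by the conditional mean, which can average over several distinct plausible reconstructions, whereas the stated goal is the conditional mode:

$$\hat{y}_{\mathrm{MSE}}(x) = \underset{g}{\mathrm{arg\,min}}\ \mathbb{E}\,\lVert y - g(x)\rVert_2^2 = \mathbb{E}\left[\, y \mid x \,\right], \qquad \hat{y}_{\mathrm{mode}}(x) = \underset{y}{\mathrm{arg\,max}}\ p_{\text{clean}}(y \mid x)$$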
The study presents a theorem formalizing this notion:
Let $p_{\text{clean}}(y \mid x) > 0$ be a finite, Lipschitz-continuous density function with a unique global maximum, and let $p_g^{\xi}(y \mid x) = \frac{\xi^n}{2^n} \cdot \mathbb{1}\!\left[\, y - g_\theta(x) \in [-1/\xi,\, 1/\xi]^n \right]$. Then

$$\lim_{\xi \rightarrow +\infty} \underset{g_\theta(x)}{\mathrm{arg\,min}}\ \chi^2_{\mathrm{Pearson}}\!\left( p_g^{\xi} \,\Big\Vert\, \frac{p_{\text{clean}} + p_g^{\xi}}{2} \right) = \underset{y}{\mathrm{arg\,max}}\ p_{\text{clean}}(y \mid x)$$

where:
- $p_{\text{clean}}(y \mid x)$ is the conditional probability density of clean speech $y$ given degraded speech $x$
- $p_g^{\xi}(y \mid x)$ is a family of waveform distributions produced by the generator $g_\theta(x)$
- $\xi$ is a parameter controlling the width of the uniform box around the generator output
- $n$ is the dimensionality of $y$
- $g_\theta(x)$ is the generator with parameters $\theta$
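As a quick numerical check of the limit above (a self-contained sketch, not code from the paper; the one-dimensional mixture density, grid, and $\xi$ values are arbitrary choices), the snippet below places a uniform box of width $2/\xi$ at a candidate centre $c$, evaluates the Pearson $\chi^2$ divergence against the mixture $(p_{\text{clean}} + p_g^{\xi})/2$ on a grid, and shows the minimizing centre approaching the global mode of $p_{\text{clean}}$ as $\xi$ grows:

```python
import numpy as np

# Numerical illustration of the theorem in 1-D (n = 1): p_clean is a toy
# two-mode Gaussian mixture; the generator's output distribution is a
# uniform "box" of width 2/xi centred at c. Minimising the Pearson
# chi-square between the box and the mixture (p_clean + box)/2 over c
# approaches the global mode of p_clean as xi grows.

y = np.linspace(-6.0, 6.0, 20_001)   # integration grid
dy = y[1] - y[0]

def gauss(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_clean = 0.3 * gauss(y, -2.0, 0.5) + 0.7 * gauss(y, 1.5, 0.5)
true_mode = y[np.argmax(p_clean)]

def chi2_pearson(p, q):
    """Pearson chi-square divergence chi^2(P || Q) = integral of (p - q)^2 / q."""
    mask = q > 0
    return np.sum((p[mask] - q[mask]) ** 2 / q[mask]) * dy

for xi in (1.0, 4.0, 16.0, 64.0):
    centres = np.linspace(-4.0, 4.0, 801)
    losses = []
    for c in centres:
        box = np.where(np.abs(y - c) <= 1.0 / xi, xi / 2.0, 0.0)   # p_g^xi
        losses.append(chi2_pearson(box, 0.5 * (p_clean + box)))
    best = centres[int(np.argmin(losses))]
    print(f"xi = {xi:5.1f}  ->  argmin c = {best:+.3f}  (mode of p_clean = {true_mode:+.3f})")
```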
The paper introduces two rules for assessing candidate feature spaces for the perceptual (regression) loss: a clustering rule and a Signal-to-Noise Ratio (SNR) rule. The clustering rule requires that representations of identical sounds form distinct clusters, while the SNR rule requires that representations of noisy speech move monotonically away from the clean clusters as the noise level increases.
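Below is a minimal sketch of how the SNR rule could be checked, assuming a generic `embed` feature extractor; here `embed` is a fixed random linear projection standing in for, e.g., WavLM convolutional features, and the waveforms are synthetic, so the snippet runs without audio files or model weights. The paper's actual probing protocol may differ.

```python
import numpy as np

# Hypothetical check of the SNR rule. `embed` is a fixed random linear
# projection of 160-sample frames, standing in for a real feature
# extractor such as WavLM convolutional features.

rng = np.random.default_rng(0)
projection = rng.standard_normal((160, 32))

def embed(wav):
    """Frame the waveform into 160-sample frames and project each frame."""
    return wav.reshape(-1, 160) @ projection

def snr_rule_holds(clean_wav, noise, snrs_db=(20, 10, 0, -10)):
    """SNR rule: features of noisy speech should move monotonically away
    from the clean-speech features as the SNR decreases."""
    ref = embed(clean_wav).mean(axis=0)
    dists = []
    for snr in snrs_db:
        scale = np.linalg.norm(clean_wav) / (np.linalg.norm(noise) * 10 ** (snr / 20))
        noisy = clean_wav + scale * noise
        dists.append(np.linalg.norm(embed(noisy).mean(axis=0) - ref))
    return all(d2 >= d1 for d1, d2 in zip(dists, dists[1:]))

clean = rng.standard_normal(16_000)   # 1 s of placeholder "speech" at 16 kHz
noise = rng.standard_normal(16_000)
print("SNR rule satisfied:", snr_rule_holds(clean, noise))
```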
The FINALLY model builds on HiFi++, feeding the output of a WavLM-large encoder into the Upsampler and adding an Upsample WaveUNet to produce 48 kHz output from 16 kHz input. Training proceeds in three stages: content restoration, adversarial training with MS-STFT discriminators, and aesthetic enhancement guided by human-feedback metrics such as UTMOS and PESQ.
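The sketch below shows one way the staged objective could be assembled, assuming hypothetical callables `wavlm_conv_features` (a frozen WavLM feature extractor for the LMOS-style perceptual loss), `msstft_discriminators` (a list of MS-STFT discriminators), and `utmos_score` (a differentiable MOS predictor). Loss weights and the exact terms in the actual FINALLY recipe may differ.

```python
import torch.nn.functional as F

# Hedged sketch of a three-stage generator objective. All callables are
# hypothetical stand-ins; the real FINALLY training code may differ.

def lmos_loss(pred, target, wavlm_conv_features):
    # Perceptual (LMOS-style) L1 distance in WavLM convolutional feature space.
    return F.l1_loss(wavlm_conv_features(pred), wavlm_conv_features(target))

def generator_loss(stage, pred, target, wavlm_conv_features,
                   msstft_discriminators=(), utmos_score=None):
    loss = lmos_loss(pred, target, wavlm_conv_features)     # stage 1: content restoration
    if stage >= 2:                                          # stage 2: add LS-GAN adversarial term
        loss = loss + sum((d(pred) - 1.0).pow(2).mean()     # generator pushes D(pred) toward 1
                          for d in msstft_discriminators)
    if stage >= 3 and utmos_score is not None:              # stage 3: aesthetics via MOS feedback
        loss = loss - utmos_score(pred).mean()              # higher predicted MOS lowers the loss
    return loss
```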
The model was evaluated against diffusion models such as BBED, STORM, and UNIVERSE, as well as regression models such as Voicefixer and DEMUCS. FINALLY achieved competitive perceptual quality and lower Phoneme Error Rate (PhER) than UNIVERSE. On the VCTK-DEMAND dataset, FINALLY outperformed baselines in subjective evaluation. Ablation studies confirmed the effectiveness of LMOS loss, the WavLM encoder, and multi-stage training.