Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
153 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement (2010.15174v3)

Published 28 Oct 2020 in cs.SD, cs.LG, and eess.AS

Abstract: Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses. Moreover, the results showed that the PFPL can enable a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset.

Citations (16)

Summary

We haven't generated a summary for this paper yet.