- The paper presents a two-stage framework using ResUNet for analysis and TFGAN-based synthesis to effectively remove multiple speech distortions.
- Methodologically, VoiceFixer leverages a mel spectrogram representation and combined adversarial and reconstruction losses, improving MOS by 0.256 over the GSR baseline.
- The approach outperforms traditional single-task restoration models, paving the way for enhanced audio restoration in archival and telecommunications applications.
An Examination of VoiceFixer: General Speech Restoration via Neural Vocoder
The paper presents a novel approach to speech restoration, addressing the generalized problem of removing multiple speech distortions simultaneously. This task, termed General Speech Restoration (GSR), diverges from traditional Single Task Speech Restoration (SSR) methods, which each target a single distortion type. The shift reflects the reality that multiple distortions often coexist in recorded speech, rendering SSR methods insufficient. The proposed solution, VoiceFixer, is a two-stage generative framework that advances the field by using a neural vocoder as its synthesis stage.
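To make the GSR setting concrete, the sketch below composes several of the distortions the paper targets (noise, reverberation, low resolution, and clipping) into a single degraded signal. This is an illustrative approximation, not the authors' data pipeline; all function names and parameter values here are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve, resample_poly

def add_noise(s, noise, snr_db=10.0):
    """Mix in additive noise at a target signal-to-noise ratio (dB)."""
    noise = noise[: len(s)]
    gain = np.sqrt(np.sum(s**2) / (np.sum(noise**2) * 10 ** (snr_db / 10) + 1e-8))
    return s + gain * noise

def add_reverb(s, rir):
    """Simulate reverberation by convolving with a room impulse response."""
    return fftconvolve(s, rir, mode="full")[: len(s)]

def lower_resolution(s, sr=44100, low_sr=8000):
    """Band-limit the signal by downsampling, then upsampling back."""
    down = resample_poly(s, low_sr, sr)
    return resample_poly(down, sr, low_sr)[: len(s)]

def clip(s, threshold=0.25):
    """Hard-clip the waveform at a fixed amplitude threshold."""
    return np.clip(s, -threshold, threshold)

def degrade(s, noise, rir, sr=44100):
    """Compose distortions, as in the GSR setting where several coexist."""
    return clip(lower_resolution(add_noise(add_reverb(s, rir), noise), sr=sr))
```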
Methodological Advancements
VoiceFixer employs an architecture inspired by human auditory processing, comprising an analysis stage modeled by a ResUNet and a synthesis stage powered by a neural vocoder. The analysis stage maps distorted speech to a mel spectrogram representation, providing dimensionality reduction while retaining vital audio features; this stage is optimized with a mean absolute error (MAE) loss. In the synthesis stage, the vocoder, based on TFGAN, translates the predicted mel spectrogram back into a time-domain signal, leveraging adversarial and reconstruction losses to enhance perceptual quality.
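A minimal PyTorch-style sketch of how the two stages could fit together, assuming the analysis network consumes and predicts mel spectrograms; the module interfaces are hypothetical placeholders, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def train_step_analysis(analysis_model, mel_distorted, mel_clean, optimizer):
    """Stage 1: train the analysis (ResUNet-style) network with MAE loss."""
    mel_pred = analysis_model(mel_distorted)   # distorted mel -> clean mel estimate
    loss = F.l1_loss(mel_pred, mel_clean)      # mean absolute error (MAE)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def restore(analysis_model, vocoder, mel_distorted):
    """Inference: analysis predicts a clean mel; the vocoder synthesizes audio."""
    mel_clean_est = analysis_model(mel_distorted)
    return vocoder(mel_clean_est)              # time-domain waveform
```

Keeping the two stages separable in this way is what allows each to be optimized independently, as the paper emphasizes.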
Numerical Outcomes
When evaluated across metrics such as Mean Opinion Score (MOS), log-spectral distance (LSD), and scale-invariant spectrogram-to-noise ratio (SiSPNR), VoiceFixer demonstrates superior performance. The MOS results in particular show an increase of 0.256 over the GSR baseline, highlighting its efficacy. Additionally, the vocoder driven by oracle (ground-truth) mel spectrograms scores close to the target MOS, indicating the robustness of the synthesis stage.
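For reference, LSD admits a standard formulation that can be computed as below; the exact spectrogram configuration used in the paper's evaluation is an assumption here.

```python
import numpy as np

def lsd(spec_ref, spec_est, eps=1e-8):
    """Log-spectral distance between magnitude spectrograms of shape (frames, bins)."""
    log_ref = np.log10(spec_ref ** 2 + eps)    # log power spectrum, reference
    log_est = np.log10(spec_est ** 2 + eps)    # log power spectrum, estimate
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))
    return float(np.mean(per_frame))           # lower is better
```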
VoiceFixer's ability to handle noise, reverberation, low resolution, and clipping simultaneously sets it apart from existing SSR models. For super-resolution tasks, it significantly outperforms state-of-the-art models such as NuWave and SEANet in low-sampling-rate restoration scenarios. These results suggest that a two-stage framework with a neural vocoder provides a tangible improvement over previous one-stage systems.
Implications and Future Directions
VoiceFixer's performance suggests utility in applications where restoring archival and historically significant audio is crucial. By improving speech intelligibility and quality, this research opens pathways to better audio experiences in telecommunications and media restoration.
From a theoretical standpoint, the success of VoiceFixer underscores the potential of incorporating bio-inspired two-stage processing in AI-driven speech tasks. The distinct separation of analysis and synthesis offers flexibility, allowing independent optimization that could be expanded to broader audio processing tasks, including music restoration.
Looking forward, enhancing the generalization ability of such systems could extend coverage to more complex distortion types, further aligning restored speech quality with human auditory expectations across diverse environments. Exploring additional discriminators or losses during vocoder training would likely refine perceptual quality further, addressing the nuanced listener feedback identified in the user studies.