HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement (2203.13086v4)

Published 24 Mar 2022 in cs.SD, cs.LG, and eess.AS

Abstract: Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding outperforming best autoregressive and flow-based models. In this paper, we show that this success can be extended to other tasks of conditional audio generation. In particular, building upon HiFi vocoders, we propose a novel HiFi++ general framework for bandwidth extension and speech enhancement. We show that with the improved generator architecture, HiFi++ performs better or comparably with the state-of-the-art in these tasks while spending significantly less computational resources. The effectiveness of our approach is validated through a series of extensive experiments.

References (27)
  1. “MelGAN: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.
  2. “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” arXiv preprint arXiv:2010.05646, 2020.
  3. “VoiceFixer: A unified framework for high-fidelity speech restoration,” arXiv preprint arXiv:2204.05841, 2022.
  4. “A two-stage approach to speech bandwidth extension,” in Proc. Interspeech 2021, 2021, pp. 1689–1693.
  5. “Towards robust speech super-resolution,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
  6. “MetricGAN+: An improved version of MetricGAN for speech enhancement,” arXiv preprint arXiv:2104.03538, 2021.
  7. “SEANet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
  8. “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
  9. “Differentiable consistency constraints for improved deep speech enhancement,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 900–904.
  10. “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
  11. “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.
  12. Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and TTS models,” 2017.
  13. “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 749–752.
  14. “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  15. “SDR - half-baked or well done?,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
  16. “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 886–890.
  17. “Real-time speech frequency bandwidth extension,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 691–695.
  18. “Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations,” arXiv preprint arXiv:1909.06628, 2019.
  19. “Dual-branch attention-in-attention transformer for single-channel speech enhancement,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7847–7851.
  20. “MOSNet: Deep learning based objective assessment for voice conversion,” arXiv preprint arXiv:1904.08352, 2019.
  21. “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
  22. “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
  23. “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262, 2018.
  24. “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497.
  25. “SE-Conformer: Time-Domain Speech Enhancement Using Conformer,” in Proc. Interspeech 2021, 2021, pp. 2736–2740.
  26. “WaveGlow: A flow-based generative network for speech synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
  27. “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
Authors (4)
  1. Pavel Andreev (13 papers)
  2. Aibek Alanov (20 papers)
  3. Oleg Ivanov (5 papers)
  4. Dmitry Vetrov (84 papers)
Citations (36)

Summary

  • The paper introduces the HiFi++ framework, which unifies bandwidth extension and speech enhancement using a GAN-based architecture.
  • It employs SpectralUNet, WaveUNet, and SpectralMaskNet modules for efficient waveform reconstruction and noise reduction.
  • Extensive evaluations using PESQ, STOI, and MOS tests confirm its superior performance despite lower computational requirements.

Overview of "HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement"

The paper "HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement" presents an approach to conditional speech generation that builds on recent advances in generative adversarial networks (GANs). Starting from the HiFi-GAN framework, the authors propose HiFi++, an architecture designed to tackle bandwidth extension (BWE) and speech enhancement (SE) within a single model. A notable property of HiFi++ is that it matches or exceeds the performance of state-of-the-art models on both tasks while requiring significantly fewer computational resources.

Key Contributions

At the core of this research lies the novel architectural design of the HiFi++ generator, which is an adaptation of the HiFi-GAN framework. It incorporates the following modules:

  • SpectralUNet: Targets the initial mel-spectrogram input to ease the transformation into a waveform, utilizing a UNet-like architecture specifically designed for two-dimensional convolutional operations.
  • WaveUNet: Operates directly on the time domain, post-processing the output of the HiFi-GAN to merge the predicted and source waveforms seamlessly.
  • SpectralMaskNet: Applies frequency-domain post-processing, which utilizes learnable spectral masking techniques aimed at removing noise and artifacts from the generated waveform.

Integrating these modules allows HiFi++ to address the challenges of BWE and SE with notable computational efficiency.
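As a rough illustration of how the three modules compose, the sketch below chains hypothetical numpy stand-ins in the order described above (2-D spectral preprocessing, upsampling to waveform, time-domain fusion, spectral masking). The function bodies are placeholders for the learned networks, not the authors' implementation; only the data flow is meant to be faithful.

```python
import numpy as np

# Hypothetical stand-ins for the learned modules of the HiFi++ generator.
def spectral_unet(mel):
    # 2-D UNet-like refinement of the mel-spectrogram (stand-in: identity)
    return mel

def hifi_upsampler(mel, hop=256):
    # HiFi-GAN-style upsampling from spectrogram frames to a waveform
    # (stand-in: collapse mel bins and repeat each frame `hop` times)
    return np.repeat(mel.mean(axis=0), hop)

def wave_unet(generated, source):
    # Time-domain fusion of the generated and source waveforms
    # (stand-in: simple average over the overlapping length)
    n = min(len(generated), len(source))
    return 0.5 * (generated[:n] + source[:n])

def spectral_masknet(wave):
    # Frequency-domain post-processing via a multiplicative spectral mask
    spec = np.fft.rfft(wave)
    mask = np.ones_like(spec.real)  # stand-in for the predicted mask
    return np.fft.irfft(spec * mask, n=len(wave))

def hifi_plus_plus(mel, source_wave):
    x = spectral_unet(mel)
    x = hifi_upsampler(x)
    x = wave_unet(x, source_wave)
    return spectral_masknet(x)

mel = np.random.rand(80, 32)    # 80 mel bins, 32 frames
src = np.random.rand(32 * 256)  # source waveform matching 32 frames
out = hifi_plus_plus(mel, src)
print(out.shape)
```

The point of the sketch is the ordering: spectral preprocessing before the vocoder, then waveform-domain and frequency-domain post-processing after it, with the noisy or narrowband input reused by WaveUNet rather than discarded.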

Methodology and Experimental Analysis

The authors utilize a multi-discriminator adversarial training setup, which includes LS-GAN loss, feature matching loss, and mel-spectrogram loss to refine their models. This sophisticated loss combination supports learning in the complex tasks of bandwidth extension and speech enhancement by fostering higher-fidelity reconstructions in audio signal restoration tasks.
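The three generator-side terms can be sketched as follows. The loss weights here (`lambda_fm`, `lambda_mel`) follow common HiFi-GAN practice and are assumptions, not the paper's exact values; discriminator outputs and feature maps are represented as plain numpy arrays.

```python
import numpy as np

def lsgan_g_loss(d_fake):
    # Least-squares adversarial term: push D(G(x)) toward 1
    return np.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    # L1 distance between discriminator feature maps, summed over layers
    return sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake))

def mel_loss(mel_real, mel_fake):
    # L1 reconstruction loss in the mel-spectrogram domain
    return np.mean(np.abs(mel_real - mel_fake))

def generator_loss(d_fake, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_fm=2.0, lambda_mel=45.0):
    # Weighted sum of the adversarial, feature-matching, and mel terms
    return (lsgan_g_loss(d_fake)
            + lambda_fm * feature_matching_loss(feats_real, feats_fake)
            + lambda_mel * mel_loss(mel_real, mel_fake))

# Toy example with one mismatched feature layer and a small mel error
d_fake = np.array([0.9, 1.1])
feats_r = [np.zeros((4, 8)), np.zeros((2, 16))]
feats_f = [np.full((4, 8), 0.1), np.zeros((2, 16))]
mel_r = np.zeros((80, 32))
mel_f = np.full((80, 32), 0.01)
loss = generator_loss(d_fake, feats_r, feats_f, mel_r, mel_f)
print(loss)
```

In a multi-discriminator setup the adversarial and feature-matching terms are summed over all discriminators; the sketch shows a single one for clarity.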

Their model evaluation, conducted through a spectrum of objective and subjective metrics such as PESQ, STOI, and DNSMOS, alongside crowd-sourced mean opinion score (MOS) tests, provides ample evidence of HiFi++'s effectiveness. Notably, HiFi++ shows robust results across various bandwidth constraints (1 kHz, 2 kHz, and 4 kHz) and in speech enhancement scenarios, achieving favorable comparisons against competing models like VoiceFixer and SEANet.
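To make the bandwidth constraints concrete: a "4 kHz" input is a signal whose frequency content above 4 kHz has been removed, which the model must then reconstruct. The snippet below simulates such an input with an ideal FFT-domain low-pass filter; this is a generic illustration of the task setup, not the paper's data pipeline.

```python
import numpy as np

def bandlimit(wave, sr, cutoff_hz):
    # Zero out frequency bins above the cutoff to simulate a narrowband input
    spec = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(wave))

sr = 16000
t = np.arange(sr) / sr
# A 500 Hz tone plus a 6 kHz tone; the 6 kHz component lies above a 4 kHz cutoff
wave = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 6000 * t)
narrow = bandlimit(wave, sr, 4000)

spec = np.abs(np.fft.rfft(narrow))
freqs = np.fft.rfftfreq(len(narrow), d=1.0 / sr)
print(spec[freqs > 4000].max())  # residual above the cutoff is numerically negligible
```

A BWE model receives `narrow` and is trained to recover `wave`; lower cutoffs (1 kHz, 2 kHz) discard more content and make the task correspondingly harder.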

Computational efficiency is one of the most compelling outcomes: HiFi++ uses far fewer resources than its larger counterparts while producing similar or superior output quality.

Implications and Future Directions

The results underscore the potential of GAN-based frameworks in audio-signal processing applications, affording practical benefits across telecommunications and media broadcasts, among others. The HiFi++ design optimizes the trade-off between computational overhead and model performance, suggesting broader applicability in real-time systems where resource constraints are critical.

Theoretically, the modular structure of architectures like HiFi++ opens promising avenues for further research into conditional audio generation. The combination of learned preprocessing and postprocessing modules points toward more adaptive, context-aware networks, potentially broadening the range of tasks GANs can address.

Future work could expand these paradigms into lower-resource adaptation, cross-language applications, and integration with other forms of multimedia enhancement, ensuring that HiFi++ and its successors continue to push forward the boundaries of what is feasible in speech technology and AI-driven audio processing. Additionally, exploring more robust training frameworks and alternative GAN configurations may yield even more efficient models and state-of-the-art outcomes.
