HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement (2203.13086v4)

Published 24 Mar 2022 in cs.SD, cs.LG, and eess.AS

Abstract: Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding outperforming best autoregressive and flow-based models. In this paper, we show that this success can be extended to other tasks of conditional audio generation. In particular, building upon HiFi vocoders, we propose a novel HiFi++ general framework for bandwidth extension and speech enhancement. We show that with the improved generator architecture, HiFi++ performs better or comparably with the state-of-the-art in these tasks while spending significantly less computational resources. The effectiveness of our approach is validated through a series of extensive experiments.

References (27)
  1. “MelGAN: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.
  2. “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” arXiv preprint arXiv:2010.05646, 2020.
  3. “VoiceFixer: A unified framework for high-fidelity speech restoration,” arXiv preprint arXiv:2204.05841, 2022.
  4. “A two-stage approach to speech bandwidth extension,” in Proc. Interspeech 2021, 2021, pp. 1689–1693.
  5. “Towards robust speech super-resolution,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
  6. “MetricGAN+: An improved version of MetricGAN for speech enhancement,” arXiv preprint arXiv:2104.03538, 2021.
  7. “SEANet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
  8. “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
  9. “Differentiable consistency constraints for improved deep speech enhancement,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 900–904.
  10. “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
  11. “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.
  12. Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and TTS models,” 2017.
  13. “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 749–752.
  14. “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  15. “SDR - half-baked or well done?,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
  16. “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 886–890.
  17. “Real-time speech frequency bandwidth extension,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 691–695.
  18. “Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations,” arXiv preprint arXiv:1909.06628, 2019.
  19. “Dual-branch attention-in-attention transformer for single-channel speech enhancement,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7847–7851.
  20. “MOSNet: Deep learning based objective assessment for voice conversion,” arXiv preprint arXiv:1904.08352, 2019.
  21. “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
  22. “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
  23. “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262, 2018.
  24. “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497.
  25. “SE-Conformer: Time-Domain Speech Enhancement Using Conformer,” in Proc. Interspeech 2021, 2021, pp. 2736–2740.
  26. “WaveGlow: A flow-based generative network for speech synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
  27. “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
Authors (4)
  1. Pavel Andreev (13 papers)
  2. Aibek Alanov (20 papers)
  3. Oleg Ivanov (5 papers)
  4. Dmitry Vetrov (84 papers)
Citations (36)

Summary

  • The paper introduces the HiFi++ framework, which unifies bandwidth extension and speech enhancement using a GAN-based architecture.
  • It employs SpectralUNet, WaveUNet, and SpectralMaskNet modules for efficient waveform reconstruction and noise reduction.
  • Extensive evaluations using PESQ, STOI, and MOS tests confirm its superior performance despite lower computational requirements.

Overview of "HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement"

The paper "HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement" presents an approach to conditional speech generation that builds on recent advances in generative adversarial networks (GANs). Starting from the HiFi-GAN framework, the authors propose HiFi++, an architecture designed to tackle bandwidth extension (BWE) and speech enhancement (SE) within a single model. A notable property of HiFi++ is that it matches or exceeds the performance of state-of-the-art models on both tasks while requiring significantly fewer computational resources.

Key Contributions

At the core of this research lies the novel architectural design of the HiFi++ generator, which is an adaptation of the HiFi-GAN framework. It incorporates the following modules:

  • SpectralUNet: Targets the initial mel-spectrogram input to ease the transformation into a waveform, utilizing a UNet-like architecture specifically designed for two-dimensional convolutional operations.
  • WaveUNet: Operates directly on the time domain, post-processing the output of the HiFi-GAN to merge the predicted and source waveforms seamlessly.
  • SpectralMaskNet: Applies frequency-domain post-processing, which utilizes learnable spectral masking techniques aimed at removing noise and artifacts from the generated waveform.

Integrating these modules allows HiFi++ to address the challenges of BWE and SE with notable computational efficiency.
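As a rough illustration of how the three modules compose, the sketch below chains hypothetical numpy stand-ins in the order described above (2-D spectral preprocessing, upsampling to waveform, time-domain fusion, spectral masking). The function bodies are placeholders for the learned networks, not the authors' implementation; only the data flow is meant to be faithful.

```python
import numpy as np

# Hypothetical stand-ins for the learned modules of the HiFi++ generator.
def spectral_unet(mel):
    # 2-D UNet-like refinement of the mel-spectrogram (stand-in: identity)
    return mel

def hifi_upsampler(mel, hop=256):
    # HiFi-GAN-style upsampling from spectrogram frames to a waveform
    # (stand-in: collapse mel bins and repeat each frame `hop` times)
    return np.repeat(mel.mean(axis=0), hop)

def wave_unet(generated, source):
    # Time-domain fusion of the generated and source waveforms
    # (stand-in: simple average over the overlapping length)
    n = min(len(generated), len(source))
    return 0.5 * (generated[:n] + source[:n])

def spectral_masknet(wave):
    # Frequency-domain post-processing via a multiplicative spectral mask
    spec = np.fft.rfft(wave)
    mask = np.ones_like(spec.real)  # stand-in for the predicted mask
    return np.fft.irfft(spec * mask, n=len(wave))

def hifi_plus_plus(mel, source_wave):
    x = spectral_unet(mel)
    x = hifi_upsampler(x)
    x = wave_unet(x, source_wave)
    return spectral_masknet(x)

mel = np.random.rand(80, 32)    # 80 mel bins, 32 frames
src = np.random.rand(32 * 256)  # source waveform matching 32 frames
out = hifi_plus_plus(mel, src)
print(out.shape)
```

The point of the sketch is the ordering: spectral preprocessing before the vocoder, then waveform-domain and frequency-domain post-processing after it, with the noisy or narrowband input reused by WaveUNet rather than discarded.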

Methodology and Experimental Analysis

The authors utilize a multi-discriminator adversarial training setup, which includes LS-GAN loss, feature matching loss, and mel-spectrogram loss to refine their models. This sophisticated loss combination supports learning in the complex tasks of bandwidth extension and speech enhancement by fostering higher-fidelity reconstructions in audio signal restoration tasks.
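The three generator-side terms can be sketched as follows. The loss weights here (`lambda_fm`, `lambda_mel`) follow common HiFi-GAN practice and are assumptions, not the paper's exact values; discriminator outputs and feature maps are represented as plain numpy arrays.

```python
import numpy as np

def lsgan_g_loss(d_fake):
    # Least-squares adversarial term: push D(G(x)) toward 1
    return np.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    # L1 distance between discriminator feature maps, summed over layers
    return sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake))

def mel_loss(mel_real, mel_fake):
    # L1 reconstruction loss in the mel-spectrogram domain
    return np.mean(np.abs(mel_real - mel_fake))

def generator_loss(d_fake, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_fm=2.0, lambda_mel=45.0):
    # Weighted sum of the adversarial, feature-matching, and mel terms
    return (lsgan_g_loss(d_fake)
            + lambda_fm * feature_matching_loss(feats_real, feats_fake)
            + lambda_mel * mel_loss(mel_real, mel_fake))

# Toy example with one mismatched feature layer and a small mel error
d_fake = np.array([0.9, 1.1])
feats_r = [np.zeros((4, 8)), np.zeros((2, 16))]
feats_f = [np.full((4, 8), 0.1), np.zeros((2, 16))]
mel_r = np.zeros((80, 32))
mel_f = np.full((80, 32), 0.01)
loss = generator_loss(d_fake, feats_r, feats_f, mel_r, mel_f)
print(loss)
```

In a multi-discriminator setup the adversarial and feature-matching terms are summed over all discriminators; the sketch shows a single one for clarity.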

Their model evaluation, conducted through a spectrum of objective and subjective metrics such as PESQ, STOI, and DNSMOS, alongside crowd-sourced mean opinion score (MOS) tests, provides ample evidence of HiFi++'s effectiveness. Notably, HiFi++ shows robust results across various bandwidth constraints (1 kHz, 2 kHz, and 4 kHz) and in speech enhancement scenarios, achieving favorable comparisons against competing models like VoiceFixer and SEANet.
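To make the bandwidth constraints concrete: a "4 kHz" input is a signal whose frequency content above 4 kHz has been removed, which the model must then reconstruct. The snippet below simulates such an input with an ideal FFT-domain low-pass filter; this is a generic illustration of the task setup, not the paper's data pipeline.

```python
import numpy as np

def bandlimit(wave, sr, cutoff_hz):
    # Zero out frequency bins above the cutoff to simulate a narrowband input
    spec = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(wave))

sr = 16000
t = np.arange(sr) / sr
# A 500 Hz tone plus a 6 kHz tone; the 6 kHz component lies above a 4 kHz cutoff
wave = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 6000 * t)
narrow = bandlimit(wave, sr, 4000)

spec = np.abs(np.fft.rfft(narrow))
freqs = np.fft.rfftfreq(len(narrow), d=1.0 / sr)
print(spec[freqs > 4000].max())  # residual above the cutoff is numerically negligible
```

A BWE model receives `narrow` and is trained to recover `wave`; lower cutoffs (1 kHz, 2 kHz) discard more content and make the task correspondingly harder.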

Computational efficiency is one of the most compelling outcomes: HiFi++ uses far fewer resources than its larger counterparts while producing similar or superior output quality.

Implications and Future Directions

The results underscore the potential of GAN-based frameworks in audio-signal processing applications, affording practical benefits across telecommunications and media broadcasts, among others. The HiFi++ design optimizes the trade-off between computational overhead and model performance, suggesting broader applicability in real-time systems where resource constraints are critical.

Theoretically, the modular structure of architectures like HiFi++ opens promising avenues for further research into conditional audio generation. The combination of learned preprocessing and postprocessing modules points toward more adaptive, context-aware networks, potentially broadening the range of tasks GANs can address.

Future work could expand these paradigms into lower-resource adaptation, cross-language applications, and integration with other forms of multimedia enhancement, ensuring that HiFi++ and its successors continue to push forward the boundaries of what is feasible in speech technology and AI-driven audio processing. Additionally, exploring more robust training frameworks and alternative GAN configurations may yield even more efficient models and state-of-the-art outcomes.
