2000 character limit reached
HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement (2203.13086v4)
Published 24 Mar 2022 in cs.SD, cs.LG, and eess.AS
Abstract: Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding outperforming best autoregressive and flow-based models. In this paper, we show that this success can be extended to other tasks of conditional audio generation. In particular, building upon HiFi vocoders, we propose a novel HiFi++ general framework for bandwidth extension and speech enhancement. We show that with the improved generator architecture, HiFi++ performs better or comparably with the state-of-the-art in these tasks while spending significantly less computational resources. The effectiveness of our approach is validated through a series of extensive experiments.
- “Melgan: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.
- “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” arXiv preprint arXiv:2010.05646, 2020.
- “Voicefixer: A unified framework for high-fidelity speech restoration,” arXiv preprint arXiv:2204.05841, 2022.
- “A two-stage approach to speech bandwidth extension,” Proc. Interspeech 2021, pp. 1689–1693, 2021.
- “Towards robust speech super-resolution,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- “Metricgan+: An improved version of metricgan for speech enhancement,” arXiv preprint arXiv:2104.03538, 2021.
- “Seanet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
- “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
- “Differentiable consistency constraints for improved deep speech enhancement,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 900–904.
- “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802.
- “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” 2019.
- Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and tts models,” 2017.
- “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 749–752.
- “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
- “Sdr–half-baked or well done?,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
- “Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 886–890.
- “Real-time speech frequency bandwidth extension,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 691–695.
- “Temporal film: Capturing long-range sequence dependencies with feature-wise modulations,” arXiv preprint arXiv:1909.06628, 2019.
- “Dual-branch attention-in-attention transformer for single-channel speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7847–7851.
- “Mosnet: Deep learning based objective assessment for voice conversion,” arXiv preprint arXiv:1904.08352, 2019.
- “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
- “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262, 2018.
- “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497.
- “SE-Conformer: Time-Domain Speech Enhancement Using Conformer,” in Proc. Interspeech 2021, 2021, pp. 2736–2740.
- “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
- “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
- Pavel Andreev (13 papers)
- Aibek Alanov (20 papers)
- Oleg Ivanov (5 papers)
- Dmitry Vetrov (84 papers)