Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction (2401.06387v2)
Abstract: Speech bandwidth extension (BWE) refers to widening the frequency bandwidth of speech signals, enhancing speech quality so that it sounds brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, in which the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components of the input narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that the proposed AP-BWE achieves state-of-the-art speech quality on BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to its all-convolutional architecture and all-frame-level operations, AP-BWE generates 48 kHz waveform samples 292.3 times faster than real time on a single RTX 4090 GPU and 18.1 times faster than real time on a single CPU. Notably, to our knowledge, AP-BWE is the first model to directly extend the high-frequency phase spectrum, which also benefits the effectiveness of existing BWE methods.
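
To make the parallel amplitude-and-phase idea more concrete, below is a minimal PyTorch sketch of a dual-stream, all-frame-level convolutional generator with mutual interaction between the two streams, followed by waveform reconstruction via an inverse STFT. The layer sizes, the 1x1-convolution interaction, the atan2-based phase head, and names such as `APBWESketch` are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of a dual-stream amplitude/phase BWE generator (illustrative only).
# Layer sizes, the cross-stream interaction, and the atan2 phase parameterization
# are assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn


class StreamBlock(nn.Module):
    """One frame-level convolutional block operating on (batch, channels, frames)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x):
        y = self.conv(x)
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)  # normalize over channels
        return x + self.act(y)  # residual connection


class APBWESketch(nn.Module):
    """Parallel amplitude/phase extension from narrowband spectra (hypothetical sizes)."""

    def __init__(self, n_freq: int = 513, channels: int = 256, n_blocks: int = 4):
        super().__init__()
        self.amp_in = nn.Conv1d(n_freq, channels, kernel_size=7, padding=3)
        self.pha_in = nn.Conv1d(n_freq, channels, kernel_size=7, padding=3)
        self.amp_blocks = nn.ModuleList([StreamBlock(channels) for _ in range(n_blocks)])
        self.pha_blocks = nn.ModuleList([StreamBlock(channels) for _ in range(n_blocks)])
        # simple 1x1 convolutions exchange information between the two streams
        self.amp_to_pha = nn.Conv1d(channels, channels, kernel_size=1)
        self.pha_to_amp = nn.Conv1d(channels, channels, kernel_size=1)
        self.amp_out = nn.Conv1d(channels, n_freq, kernel_size=7, padding=3)
        # phase head: predict pseudo real/imaginary parts, then recover a wrapped phase
        self.pha_out_r = nn.Conv1d(channels, n_freq, kernel_size=7, padding=3)
        self.pha_out_i = nn.Conv1d(channels, n_freq, kernel_size=7, padding=3)

    def forward(self, log_amp_nb, phase_nb):
        # inputs: (batch, n_freq, frames) narrowband log-amplitude and phase spectra
        a = self.amp_in(log_amp_nb)
        p = self.pha_in(phase_nb)
        for amp_block, pha_block in zip(self.amp_blocks, self.pha_blocks):
            a, p = amp_block(a), pha_block(p)
            a, p = a + self.pha_to_amp(p), p + self.amp_to_pha(a)  # mutual interaction
        log_amp_wb = self.amp_out(a)                                    # extended log-amplitude
        phase_wb = torch.atan2(self.pha_out_i(p), self.pha_out_r(p))   # extended wrapped phase
        return log_amp_wb, phase_wb


if __name__ == "__main__":
    model = APBWESketch()
    log_amp = torch.randn(1, 513, 100)                            # dummy narrowband log-amplitude
    phase = torch.rand(1, 513, 100) * 2 * torch.pi - torch.pi     # dummy phase in [-pi, pi)
    la, ph = model(log_amp, phase)
    # Recover the wideband waveform from the predicted amplitude and phase spectra.
    spec = torch.polar(torch.exp(la), ph)                         # complex wideband spectrum
    wav = torch.istft(spec, n_fft=1024, hop_length=256,
                      window=torch.hann_window(1024))
    print(wav.shape)
```

Because every operation works at the frame level and the waveform is only produced by the final inverse STFT, a generator of this shape avoids sample-level autoregression, which is consistent with the efficiency argument in the abstract.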
Authors: Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling