
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction (2401.06387v2)

Published 12 Jan 2024 in eess.AS, cs.SD, and eess.SP

Abstract: Speech bandwidth extension (BWE) refers to widening the frequency bandwidth of speech signals, enhancing speech quality so that it sounds brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of amplitude and phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is based entirely on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, in which the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components of the input narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that the proposed AP-BWE achieves state-of-the-art speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to its all-convolutional architecture and all-frame-level operations, AP-BWE generates 48 kHz waveform samples 292.3 times faster than real time on a single RTX 4090 GPU and 18.1 times faster than real time on a single CPU. Notably, to our knowledge, AP-BWE is the first model to directly extend the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.
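As a rough illustration of the dual-stream idea described in the abstract, the PyTorch sketch below pairs an amplitude stream and a phase stream of frame-level 1-D convolutional blocks with a simple additive interaction between them, and recovers a waveform from the predicted spectra with a frame-level iSTFT. The layer widths, kernel sizes, interaction mechanism, and STFT settings are illustrative assumptions, not the configuration reported in the paper; the discriminators and training losses are omitted.

```python
# Minimal sketch of the dual-stream amplitude/phase idea (illustrative only).
# Layer widths, kernel sizes, the interaction mechanism, and STFT settings are
# assumptions for illustration, not the configuration reported in the paper.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """A simple residual 1-D convolutional block over spectral frames."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, channels, frames)
        y = self.conv(x)
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return x + self.act(y)


class APBWESketch(nn.Module):
    """Parallel amplitude and phase streams with a simple additive interaction."""
    def __init__(self, n_freq: int = 513, hidden: int = 256, n_blocks: int = 4):
        super().__init__()
        self.amp_in = nn.Conv1d(n_freq, hidden, kernel_size=1)
        self.pha_in = nn.Conv1d(n_freq, hidden, kernel_size=1)
        self.amp_blocks = nn.ModuleList(ConvBlock(hidden) for _ in range(n_blocks))
        self.pha_blocks = nn.ModuleList(ConvBlock(hidden) for _ in range(n_blocks))
        self.amp_out = nn.Conv1d(hidden, n_freq, kernel_size=1)
        # Predict pseudo real/imaginary parts so atan2 returns phase wrapped to (-pi, pi].
        self.pha_out_r = nn.Conv1d(hidden, n_freq, kernel_size=1)
        self.pha_out_i = nn.Conv1d(hidden, n_freq, kernel_size=1)

    def forward(self, log_amp_nb, phase_nb):  # both: (batch, n_freq, frames)
        a = self.amp_in(log_amp_nb)
        p = self.pha_in(phase_nb)
        for amp_block, pha_block in zip(self.amp_blocks, self.pha_blocks):
            a, p = amp_block(a + p), pha_block(p + a)  # streams exchange information
        log_amp_wb = self.amp_out(a)
        phase_wb = torch.atan2(self.pha_out_i(p), self.pha_out_r(p))
        return log_amp_wb, phase_wb


# Recover a waveform from the predicted spectra with a frame-level iSTFT.
model = APBWESketch()
log_amp_nb = torch.randn(1, 513, 100)                       # narrowband log-amplitude frames
phase_nb = torch.rand(1, 513, 100) * 2 * torch.pi - torch.pi
log_amp_wb, phase_wb = model(log_amp_nb, phase_nb)
spec_wb = torch.exp(log_amp_wb) * torch.exp(1j * phase_wb)  # complex wideband spectrum
wav_wb = torch.istft(spec_wb, n_fft=1024, hop_length=256, win_length=1024,
                     window=torch.hann_window(1024))
```

Because every operation here acts on frame-level spectra rather than individual waveform samples, with the waveform materialized only by the final iSTFT, this kind of structure is what allows the efficiency figures quoted in the abstract.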

Authors (4)
  1. Ye-Xin Lu (17 papers)
  2. Yang Ai (41 papers)
  3. Hui-Peng Du (15 papers)
  4. Zhen-Hua Ling (114 papers)
Citations (4)
