Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks (2403.17378v1)
Abstract: This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error-expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and the natural ones by activating the instantaneous phase error, group delay error, and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity, and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with a knowledge distillation training strategy. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural-network-based phase prediction methods in terms of phase prediction precision, efficiency, and robustness. Compared with a HiFi-GAN-based waveform reconstruction method, the proposed model also shows an outstanding efficiency advantage while preserving the quality of the synthesized speech. To the best of our knowledge, this is the first work to predict speech phase spectra directly from amplitude spectra alone via neural networks.
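To make the two mechanisms described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: the channel sizes, kernel width, tensor layout, and the specific anti-wrapping function f(x) = |x − 2π·round(x/2π)| (one simple function satisfying the stated parity, periodicity, and monotonicity requirements) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ParallelEstimationHead(nn.Module):
    """Parallel estimation architecture sketch: two parallel linear (activation-free)
    convolutional layers produce pseudo real and imaginary parts, and atan2 maps them
    to a wrapped phase strictly restricted to the principal value interval (-pi, pi]."""

    def __init__(self, channels: int, n_freq_bins: int, kernel_size: int = 7):
        super().__init__()
        pad = (kernel_size - 1) // 2
        self.real_conv = nn.Conv1d(channels, n_freq_bins, kernel_size, padding=pad)
        self.imag_conv = nn.Conv1d(channels, n_freq_bins, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) hidden features from the preceding residual conv network
        r = self.real_conv(x)      # pseudo real part
        i = self.imag_conv(x)      # pseudo imaginary part
        return torch.atan2(i, r)   # wrapped phase in (-pi, pi]


def anti_wrapping(x: torch.Tensor) -> torch.Tensor:
    """Example anti-wrapping function f(x) = |x - 2*pi*round(x / (2*pi))|:
    even (parity), 2*pi-periodic, and monotonic on [0, pi]."""
    return torch.abs(x - 2.0 * math.pi * torch.round(x / (2.0 * math.pi)))


def anti_wrapping_losses(pred_phase: torch.Tensor, true_phase: torch.Tensor) -> torch.Tensor:
    """Instantaneous phase (IP), group delay (GD, difference along frequency) and
    instantaneous angular frequency (IAF, difference along time) losses, each passed
    through the anti-wrapping function. Tensors assumed (batch, freq_bins, frames)."""
    ip_loss = anti_wrapping(pred_phase - true_phase).mean()
    gd_loss = anti_wrapping(
        torch.diff(pred_phase, dim=1) - torch.diff(true_phase, dim=1)
    ).mean()
    iaf_loss = anti_wrapping(
        torch.diff(pred_phase, dim=2) - torch.diff(true_phase, dim=2)
    ).mean()
    return ip_loss + gd_loss + iaf_loss
```

As a usage example under these assumptions: hidden features of shape (batch, channels, frames) from the residual convolutional network are passed through `ParallelEstimationHead` to obtain a (batch, freq_bins, frames) wrapped phase tensor, which is compared against the natural phase extracted by STFT via `anti_wrapping_losses`. Because atan2 can only return values in the principal interval, the restriction to wrapped phase is enforced by construction rather than by clipping.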
Authors: Yang Ai, Zhen-Hua Ling