Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder (2309.10439v1)
Abstract: In this paper, we address the unsupervised speech enhancement problem based on a recurrent variational autoencoder (RVAE). This approach offers promising generalization performance over its supervised counterpart. Nevertheless, the iterative variational expectation-maximization (VEM) process involved at test time, which relies on a variational inference method, results in high computational complexity. To tackle this issue, we present efficient sampling techniques based on Langevin dynamics and Metropolis-Hastings algorithms, adapted to EM-based speech enhancement with RVAE. By directly sampling from the intractable posterior distribution within the EM process, we circumvent the intricacies of variational inference. We conduct a series of experiments, comparing the proposed methods with VEM and a state-of-the-art supervised speech enhancement approach based on diffusion models. The results reveal that our sampling-based algorithms significantly outperform VEM, not only in terms of computational efficiency but also in overall performance. Furthermore, when compared to the supervised baseline, our methods showcase robust generalization performance in mismatched test conditions.
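To make the two sampling ideas named in the abstract concrete, the following is a minimal sketch of unadjusted Langevin dynamics and random-walk Metropolis-Hastings on a toy one-dimensional posterior. It is not the RVAE-based method of the paper: the toy model (z ~ N(0, 1), x | z ~ N(z, noise_var)), the function names, the step sizes, and the number of iterations are all assumptions chosen for illustration. The toy posterior is Gaussian with a known mean and variance, which makes it easy to check that both samplers behave sensibly.

```python
import math
import random


def log_post(z, x, noise_var=1.0):
    """Unnormalized log p(z | x) for the assumed toy model:
    z ~ N(0, 1) and x | z ~ N(z, noise_var)."""
    return -0.5 * z * z - 0.5 * (x - z) ** 2 / noise_var


def grad_log_post(z, x, noise_var=1.0):
    """Gradient of log_post with respect to z."""
    return -z + (x - z) / noise_var


def langevin(x, n_steps=20000, step=0.1, seed=0):
    """Unadjusted Langevin dynamics: a gradient step on the log posterior
    plus Gaussian noise. The chain is slightly biased for step > 0."""
    rng = random.Random(seed)
    z, samples = 0.0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z + 0.5 * step * grad_log_post(z, x) + math.sqrt(step) * noise
        samples.append(z)
    return samples


def metropolis_hastings(x, n_steps=20000, prop_std=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal,
    so the acceptance ratio reduces to a ratio of posterior densities."""
    rng = random.Random(seed)
    z, samples = 0.0, []
    for _ in range(n_steps):
        cand = z + prop_std * rng.gauss(0.0, 1.0)
        log_alpha = log_post(cand, x) - log_post(z, x)
        # Accept with probability min(1, exp(log_alpha)).
        if rng.random() < math.exp(min(0.0, log_alpha)):
            z = cand
        samples.append(z)
    return samples


def mean_and_var(samples, burn_in=1000):
    """Sample mean and variance after discarding an initial burn-in."""
    kept = samples[burn_in:]
    m = sum(kept) / len(kept)
    v = sum((s - m) ** 2 for s in kept) / len(kept)
    return m, v
```

For an observation x = 2.0 and noise_var = 1.0, the exact posterior of this toy model is Gaussian with mean 1.0 and variance 0.5, so `mean_and_var(langevin(2.0))` and `mean_and_var(metropolis_hastings(2.0))` should both land close to those values. In the setting described in the abstract, the log posterior and its gradient would instead come from the trained RVAE decoder and the noise model, and the samples would be used inside the EM iterations in place of a variational approximation.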