SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement (2312.01744v1)
Abstract: This paper proposes SEFGAN, a Deep Neural Network (DNN) that combines maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE). To this end, a DNN is trained to synthesize enhanced speech conditioned on the noisy speech, using a Normalizing Flow (NF) as the generator in a GAN framework. While combining likelihood-based models and GANs is not trivial, SEFGAN demonstrates that a hybrid adversarial and maximum likelihood training approach enables the model to maintain both high-quality audio generation and log-likelihood estimation. Our experiments indicate that this approach strongly outperforms the baseline NF-based model without adding complexity to the enhancement network. A comparison using computational metrics and a listening experiment shows that SEFGAN is competitive with other state-of-the-art models.
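To make the hybrid objective concrete, below is a minimal PyTorch sketch of how a maximum likelihood term (via the NF change-of-variables formula) and an adversarial term can be combined for a conditional flow generator. This is an illustrative sketch, not the paper's implementation: the names `flow`, `disc`, the `cond` keyword, and the weight `lam` are assumptions, and `disc` is assumed to return a raw logit per example.

```python
import torch
import torch.nn.functional as F

def hybrid_generator_loss(flow, disc, noisy, clean, lam=1.0):
    """Sketch of a hybrid NLL + adversarial generator loss.

    Assumes a hypothetical invertible `flow` exposing `inverse`/`forward`
    with a conditioning input, and a discriminator `disc` returning logits.
    """
    # Maximum-likelihood term: map the clean target through the inverse flow,
    # conditioned on the noisy input, and score the latent under a standard
    # Gaussian prior. NLL = 0.5*||z||^2 - log|det dz/dx| (up to a constant).
    z, log_det = flow.inverse(clean, cond=noisy)
    nll = 0.5 * (z ** 2).sum(dim=-1) - log_det

    # Adversarial term: synthesize enhanced speech from a latent sample and
    # ask the discriminator to rate it as real (non-saturating GAN loss,
    # -log D = softplus(-logit)).
    z_sample = torch.randn_like(clean)
    enhanced = flow.forward(z_sample, cond=noisy)
    adv = F.softplus(-disc(enhanced)).squeeze(-1)

    # Weighted sum of likelihood and adversarial objectives.
    return (nll + lam * adv).mean()
```

In this reading, the NLL term keeps the flow a valid density model (so log-likelihood estimation is preserved), while the adversarial term pushes synthesized enhanced speech toward the discriminator's notion of clean audio; `lam` trades off the two.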