SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement (2312.01744v1)

Published 4 Dec 2023 in eess.AS

Abstract: This paper proposes SEFGAN, a Deep Neural Network (DNN) combining maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE). For this, a DNN is trained to synthesize the enhanced speech conditioned on noisy speech, using a Normalizing Flow (NF) as the generator in a GAN framework. While the combination of likelihood models and GANs is not trivial, SEFGAN demonstrates that a hybrid adversarial and maximum likelihood training approach enables the model to maintain both high-quality audio generation and log-likelihood estimation. Our experiments indicate that this approach strongly outperforms the baseline NF-based model without introducing additional complexity to the enhancement network. A comparison using computational metrics and a listening experiment reveals that SEFGAN is competitive with other state-of-the-art models.
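The core idea of the hybrid objective can be sketched in a few lines: the flow provides an exact log-likelihood via the change-of-variables formula, and an adversarial term on the generated (enhanced) samples is added with a trade-off weight. The sketch below is a minimal, assumption-laden illustration using a single affine flow and a least-squares GAN generator loss; the function names and the weighting `lam` are illustrative and not the paper's exact formulation.

```python
import math

def flow_forward(x, scale, shift):
    """Invertible elementwise affine 'flow': z_i = (x_i - shift_i) / scale_i.
    Returns the latent vector z and log|det J| of the mapping."""
    z = [(xi - si) / sc for xi, si, sc in zip(x, shift, scale)]
    log_det = -sum(math.log(abs(sc)) for sc in scale)
    return z, log_det

def nll_loss(x, scale, shift):
    """Negative log-likelihood of one sample x under the flow with a
    standard-normal base distribution (the maximum-likelihood term)."""
    z, log_det = flow_forward(x, scale, shift)
    log_pz = -0.5 * sum(zi * zi + math.log(2 * math.pi) for zi in z)
    return -(log_pz + log_det)

def adversarial_loss(d_fake_scores):
    """Least-squares GAN generator loss on discriminator scores for
    generated (enhanced) samples; a perfect fool (score 1.0) costs 0."""
    return sum((d - 1.0) ** 2 for d in d_fake_scores) / len(d_fake_scores)

def hybrid_loss(x, scale, shift, d_fake_scores, lam=1.0):
    """Weighted sum of both terms; lam (assumed) trades off exact
    likelihood against adversarial realism."""
    return nll_loss(x, scale, shift) + lam * adversarial_loss(d_fake_scores)
```

In practice the generator would be a deep conditional flow over waveform frames and the discriminator a learned network, but the structure of the objective (exact NLL plus a weighted adversarial term) is the same.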
