Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues (2402.16394v1)

Published 26 Feb 2024 in eess.AS and cs.SD

Abstract: In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose a novel emotion-aware AVSE system that leverages both auditory and visual information. It extracts emotional features from the facial landmarks of the speaker and fuses them with corresponding audio and visual modalities. This enriched data serves as input to a deep UNet-based encoder-decoder network, specifically designed to orchestrate the fusion of multimodal information enhanced with emotion. The network iteratively refines the enhanced speech representation through an encoder-decoder architecture, guided by perceptually-inspired loss functions for joint learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of audio-visual recordings with annotated emotions. Our comprehensive evaluation demonstrates the effectiveness of emotion as a contextual cue for AVSE. By integrating emotional features, the proposed system achieves significant improvements in both objective and subjective assessments of speech quality and intelligibility, especially in challenging noise environments. Compared to baseline AVSE and audio-only speech enhancement systems, our approach exhibits a noticeable increase in PESQ and STOI, indicating higher perceptual quality and intelligibility. Large-scale listening tests corroborate these findings, suggesting improved human understanding of enhanced speech.
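
The abstract describes the system only at a high level (facial-landmark-derived emotion features fused with the audio and visual streams, a deep UNet-based encoder-decoder, perceptually-inspired losses), so the PyTorch sketch below is a minimal illustration under stated assumptions rather than the authors' implementation: the module sizes, the 512-dimensional visual embedding, the 8-dimensional emotion vector, the bottleneck-bias fusion, and the names EmotionAwareAVSEUNet and ConvBlock are all hypothetical.

```python
# Illustrative sketch only: the paper specifies a UNet-based encoder-decoder that fuses
# audio, visual, and emotion cues, but the layer sizes and fusion mechanism used here
# are assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and ReLU, as in a typical UNet stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class EmotionAwareAVSEUNet(nn.Module):
    """Hypothetical emotion-aware AVSE network: encodes a noisy magnitude spectrogram,
    injects visual and emotion embeddings at the bottleneck, and decodes an
    enhancement mask through UNet-style skip connections."""
    def __init__(self, vis_dim=512, emo_dim=8, base_ch=32):
        super().__init__()
        self.enc1, self.enc2 = ConvBlock(1, base_ch), ConvBlock(base_ch, base_ch * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(base_ch * 2, base_ch * 4)
        # Project concatenated visual + emotion cues to a bias added at every
        # bottleneck position (assumed fusion strategy).
        self.context = nn.Linear(vis_dim + emo_dim, base_ch * 4)
        self.up2 = nn.ConvTranspose2d(base_ch * 4, base_ch * 2, 2, stride=2)
        self.dec2 = ConvBlock(base_ch * 4, base_ch * 2)
        self.up1 = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        self.dec1 = ConvBlock(base_ch * 2, base_ch)
        self.mask = nn.Sequential(nn.Conv2d(base_ch, 1, 1), nn.Sigmoid())

    def forward(self, noisy_spec, visual_emb, emotion_emb):
        # noisy_spec: (B, 1, F, T) magnitude spectrogram; F and T assumed divisible by 4.
        e1 = self.enc1(noisy_spec)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        ctx = self.context(torch.cat([visual_emb, emotion_emb], dim=-1))  # (B, 4*base_ch)
        b = b + ctx[:, :, None, None]                                     # broadcast over (F/4, T/4)
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.mask(d1) * noisy_spec                                  # masked (enhanced) spectrogram


if __name__ == "__main__":
    net = EmotionAwareAVSEUNet()
    spec = torch.randn(2, 1, 64, 128).abs()      # dummy noisy magnitude spectrogram
    vis = torch.randn(2, 512)                    # e.g., pooled lip/face embedding
    emo = torch.softmax(torch.randn(2, 8), -1)   # e.g., 8-way emotion probabilities
    print(net(spec, vis, emo).shape)             # torch.Size([2, 1, 64, 128])
```

Injecting the fused visual/emotion context as a bias at the UNet bottleneck is just one common conditioning choice; concatenating the context along the channel axis at every decoder stage would be an equally plausible reading of "fusion of multimodal information enhanced with emotion."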

Authors (4)
  1. Tassadaq Hussain (9 papers)
  2. Kia Dashtipour (19 papers)
  3. Yu Tsao (200 papers)
  4. Amir Hussain (75 papers)
