LSTMSE-Net: Long Short-Term Speech Enhancement Network for Audio-visual Speech Enhancement (2409.02266v1)

Published 3 Sep 2024 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: In this paper, we propose the long short-term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. The method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates the visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE systems. LSTMSE-Net surpasses the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 1.32 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at https://github.com/mtanveer1/AVSEC-3-Challenge.
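
The pipeline the abstract outlines (encode the noisy waveform, project the visual features, interpolate them to the audio frame rate, concatenate the two streams, and separate with an LSTM) lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch illustration; every module name, layer choice, and dimension is an assumption made for clarity, not the authors' implementation, which lives in the linked repository.

```python
# Minimal, hypothetical sketch of the audio-visual fusion described in the
# abstract. All names and dimensions below are illustrative assumptions,
# not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusionSketch(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Audio encoder: strided 1-D convolution over the raw waveform.
        self.audio_enc = nn.Conv1d(1, audio_dim, kernel_size=16, stride=8)
        # Stand-in for VisualFeatNet: projects per-frame visual embeddings.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Separator over the fused sequence; a bidirectional LSTM matches
        # the "long short-term memory" component named in the title.
        self.separator = nn.LSTM(2 * audio_dim, hidden_dim,
                                 batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden_dim, audio_dim)
        # Audio decoder: transposed convolution back to a waveform.
        self.audio_dec = nn.ConvTranspose1d(audio_dim, 1,
                                            kernel_size=16, stride=8)

    def forward(self, noisy_wav, visual_feats):
        # noisy_wav: (B, 1, samples); visual_feats: (B, frames, visual_dim)
        a = self.audio_enc(noisy_wav)                       # (B, C, T)
        v = self.visual_proj(visual_feats).transpose(1, 2)  # (B, C, frames)
        # Scale (interpolate) the visual stream to the audio frame rate,
        # then concatenate the two modalities along the channel axis.
        v = F.interpolate(v, size=a.shape[-1], mode='linear',
                          align_corners=False)
        fused = torch.cat([a, v], dim=1).transpose(1, 2)    # (B, T, 2C)
        h, _ = self.separator(fused)
        mask = torch.sigmoid(self.mask_head(h)).transpose(1, 2)
        return self.audio_dec(a * mask)                     # enhanced waveform

# Example shapes: AVFusionSketch()(torch.randn(2, 1, 16000),
#                                  torch.randn(2, 25, 512))
```

For reference, the SISDR figure quoted in the results is the standard scale-invariant signal-to-distortion ratio: the estimate is projected onto the clean reference to factor out gain, and the ratio of projected to residual energy is reported in decibels. A minimal implementation:

```python
def si_sdr(estimate: torch.Tensor, reference: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB, computed over the last (time) axis."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = (estimate * reference).sum(dim=-1, keepdim=True) / (
        reference.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * torch.log10(target.pow(2).sum(dim=-1)
                            / (noise.pow(2).sum(dim=-1) + eps))
```

Higher values are better for SISDR, STOI, and PESQ alike, so positive margins on all three metrics indicate a consistent improvement over the challenge baseline.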
