
Robust Active Speaker Detection in Noisy Environments (2403.19002v2)

Published 27 Mar 2024 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract: This paper addresses the issue of active speaker detection (ASD) in noisy environments and formulates a robust active speaker detection (rASD) problem. Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance. To overcome this, we propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then utilized in an ASD model, and both tasks are jointly optimized in an end-to-end framework. Our proposed framework mitigates residual noise and audio quality reduction issues that can occur in a naive cascaded two-stage framework that directly uses separated speech for ASD, and enables the two tasks to be optimized simultaneously. To further enhance the robustness of the audio features and handle inherent speech noises, we propose a dynamic weighted loss approach to train the speech separator. We also collected a real-world noise audio dataset to facilitate investigations. Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments. The framework is general and can be applied to different ASD approaches to improve their robustness. Our code, models, and data will be released.
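The abstract describes a joint objective: an ASD loss combined with a speech-separation loss whose contribution is weighted dynamically to handle inherent noise in the reference speech. The sketch below illustrates one plausible form of such a dynamic weighting; the exact scheme, the SNR-based rule, and the function names here are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a dynamically weighted joint loss, as suggested by
# the abstract. The SNR-based weighting rule is an assumption: samples whose
# reference "clean" speech is itself noisy (low SNR) contribute less to the
# separation term, while the ASD term is always fully weighted.

def dynamic_weight(speech_snr_db, snr_floor=0.0, snr_ceiling=20.0):
    """Map an estimated SNR (dB) of the reference speech to a weight in [0, 1]."""
    clipped = max(snr_floor, min(speech_snr_db, snr_ceiling))
    return (clipped - snr_floor) / (snr_ceiling - snr_floor)

def joint_loss(asd_loss, separation_loss, speech_snr_db, lam=1.0):
    """End-to-end objective: ASD loss plus dynamically weighted separation loss."""
    w = dynamic_weight(speech_snr_db)
    return asd_loss + lam * w * separation_loss
```

In a naive cascaded two-stage setup the separator is trained in isolation and its output (with residual noise and quality loss) is fed to the ASD model; the joint objective above instead lets gradients from the ASD task shape the learned noise-free audio features directly.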

