Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training (2404.00861v1)

Published 1 Apr 2024 in eess.AS and eess.IV

Abstract: Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges of real-world AV-ASD scenarios. Because such scenarios often involve low-quality, noisy videos, AV-ASD systems without a selective listening ability fall short of effectively filtering out disruptive voice components from mixed audio inputs. In this paper, we propose a Multi-modal Speaker Extraction-to-Detection framework named `MuSED', which is first pre-trained on audio-visual target speaker extraction to learn a denoising ability and then fine-tuned on the AV-ASD task. Meanwhile, to better capture multi-modal information and handle real-world problems such as missing modalities, MuSED is modelled directly in the time domain and integrates a multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms state-of-the-art AV-ASD methods, achieving 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset. We will publicly release the code in due course.
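
To make the two-stage recipe in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: a shared time-domain audio-visual backbone is first trained with a target speaker extraction (denoising) head, and the same backbone is then fine-tuned with a detection head for frame-level speaking classification. This is not the authors' released MuSED code; all module names, feature shapes, and losses here are illustrative assumptions.

# Hypothetical sketch (not the authors' implementation): extraction pre-training
# followed by AV-ASD fine-tuning on a shared audio-visual backbone.
import torch
import torch.nn as nn

class AudioVisualBackbone(nn.Module):
    """Shared time-domain audio encoder + visual (lip) encoder + fusion."""
    def __init__(self, dim=256):
        super().__init__()
        # 1-D conv encoder over the raw waveform (time-domain, no spectrogram)
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        # toy visual encoder over per-frame lip embeddings
        self.visual_enc = nn.GRU(input_size=512, hidden_size=dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, wav, lips):
        # wav: (B, 1, T_samples), lips: (B, T_frames, 512)
        a = self.audio_enc(wav).mean(dim=-1)         # pooled audio embedding (B, dim)
        v, _ = self.visual_enc(lips)
        v = v.mean(dim=1)                            # pooled visual embedding (B, dim)
        return self.fuse(torch.cat([a, v], dim=-1))  # fused embedding (B, dim)

class ExtractionHead(nn.Module):
    """Stage 1: toy waveform decoder for target speaker extraction."""
    def __init__(self, dim=256, out_len=16000):
        super().__init__()
        self.decode = nn.Linear(dim, out_len)

    def forward(self, z):
        return self.decode(z)

class DetectionHead(nn.Module):
    """Stage 2: binary classifier, is the visible face speaking?"""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Linear(dim, 1)

    def forward(self, z):
        return self.cls(z)

backbone = AudioVisualBackbone()
mix_wav = torch.randn(2, 1, 16000)      # noisy mixture waveform (1 s at 16 kHz)
lips = torch.randn(2, 25, 512)          # 1 s of lip features at 25 fps
clean_wav = torch.randn(2, 16000)       # target speaker's clean speech

# Stage 1: extraction pre-training (a reconstruction loss stands in for SI-SDR)
ext_head = ExtractionHead()
est = ext_head(backbone(mix_wav, lips))
pretrain_loss = nn.functional.l1_loss(est, clean_wav)

# Stage 2: AV-ASD fine-tuning with speaking / non-speaking labels
det_head = DetectionHead()
labels = torch.tensor([[1.0], [0.0]])
logits = det_head(backbone(mix_wav, lips))
finetune_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

The point of the sketch is the shared backbone: because the same encoders are reused across both stages, whatever selective-listening ability the extraction objective induces is carried into the detection stage, which is the transfer the paper's pre-training strategy relies on.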

Authors (6)
  1. Ruijie Tao (25 papers)
  2. Xinyuan Qian (30 papers)
  3. Rohan Kumar Das (50 papers)
  4. Xiaoxue Gao (21 papers)
  5. Jiadong Wang (19 papers)
  6. Haizhou Li (286 papers)

