Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training (2404.00861v1)
Abstract: Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence, but there remains a noticeable gap in addressing the challenges of real-world AV-ASD scenarios. Because such scenarios involve low-quality, noisy videos, AV-ASD systems without a selective listening ability struggle to filter disruptive voice components out of mixed audio inputs. In this paper, we propose a Multi-modal Speaker Extraction-to-Detection framework named `MuSED', which is first pre-trained on audio-visual target speaker extraction to learn a denoising ability and then fine-tuned on the AV-ASD task. To better capture multi-modal information and handle real-world problems such as missing modalities, MuSED operates directly in the time domain and integrates a multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms state-of-the-art AV-ASD methods, achieving 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset. We will publicly release the code in due course.
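The abstract describes a two-stage recipe: an audio-visual backbone is first pre-trained on target speaker extraction (recovering the target's clean speech from a mixture using lip cues) and then reused with a detection head for frame-level AV-ASD. The PyTorch sketch below is purely illustrative of that idea; all module names, shapes, and hyper-parameters (e.g. `AudioVisualBackbone`, the 1-D conv waveform encoder, the lip-embedding dimension) are assumptions and do not reflect the authors' released implementation.

```python
# Minimal sketch of the extraction-to-detection recipe outlined in the abstract:
# pre-train an audio-visual backbone on target speaker extraction, then reuse it
# with a detection head for frame-level AV-ASD. Everything here is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualBackbone(nn.Module):
    """Time-domain audio encoder + lip-embedding encoder with a fusion layer."""

    def __init__(self, dim=256, lip_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8)   # raw waveform in
        self.visual_enc = nn.Linear(lip_dim, dim)                      # per-frame lip features in
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, wav, lips):
        a = self.audio_enc(wav.unsqueeze(1)).transpose(1, 2)           # (B, Ta, D)
        v = self.visual_enc(lips).transpose(1, 2)                      # (B, D, Tv)
        v = F.interpolate(v, size=a.size(1)).transpose(1, 2)           # align video to audio rate
        return self.fusion(a + v)                                      # fused features (B, Ta, D)


class ExtractionHead(nn.Module):
    """Stage 1: mask-and-decode back to a waveform for target speaker extraction."""

    def __init__(self, dim=256):
        super().__init__()
        self.mask = nn.Linear(dim, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, feats):
        masked = feats * torch.sigmoid(self.mask(feats))               # (B, Ta, D)
        return self.decoder(masked.transpose(1, 2)).squeeze(1)         # estimated clean speech


class DetectionHead(nn.Module):
    """Stage 2: frame-level speaking/non-speaking logits for AV-ASD."""

    def __init__(self, dim=256):
        super().__init__()
        self.classifier = nn.Linear(dim, 1)

    def forward(self, feats):
        return self.classifier(feats).squeeze(-1)                      # (B, Ta) logits


# Stage 1 (pre-training): optimise an extraction loss such as negative SI-SDR
# between the estimated and clean target speech. Stage 2 (fine-tuning): keep the
# backbone weights and train DetectionHead with binary cross-entropy on
# per-frame speaking labels.
backbone = AudioVisualBackbone()
wav, lips = torch.randn(2, 16000), torch.randn(2, 25, 512)             # toy mixture + lip features
feats = backbone(wav, lips)
est_speech = ExtractionHead()(feats)                                   # used during pre-training
speak_logits = DetectionHead()(feats)                                  # used during fine-tuning
```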
Authors: Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li