Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization
Abstract: Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. They therefore tend to fail whenever the speaker's face is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN), trained with spatial input features extracted from multichannel audio, can perform simultaneous horizontal active speaker detection and localization (ASDL) independently of the visual modality. To address the time and cost of generating ground-truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network that provides the positions of the speakers as pseudo-labels, and the multichannel audio "student" network is trained to reproduce them. At inference, the student network generalizes to locate occluded speakers that the teacher network cannot detect visually, yielding considerable improvements in recall. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with costly conventional supervised training. We further demonstrate that additional improvements can be achieved when minimal manual supervision is introduced into the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
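To make the described pipeline concrete, the sketch below illustrates one possible shape of the "student-teacher" training step: a frozen, pre-trained audio-visual teacher supplies horizontal speaker positions as pseudo-labels, and a multichannel-audio CRNN student is regressed onto them. This is a minimal illustration, not the authors' implementation; `SpatialCRNN`, `teacher_asd`, the feature dimensions, and the masked regression loss are all assumptions for the sake of the example.

```python
# Minimal PyTorch sketch of the self-supervised "student-teacher" loop
# described in the abstract (illustrative only, not the authors' code).
# Assumptions:
#   - `teacher_asd(video_frames)` is a frozen, pre-trained visual active
#     speaker detector returning per-frame horizontal positions and an
#     activity mask (the pseudo-labels);
#   - the student consumes multichannel spatial audio features of shape
#     (batch, channels, time, 64 frequency bins).
import torch
import torch.nn as nn


class SpatialCRNN(nn.Module):
    """Hypothetical CRNN mapping spatial audio features to horizontal positions."""

    def __init__(self, n_feat_ch=16, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_feat_ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),  # 64 freq bins -> 4 after two (1, 4) poolings
        )
        self.gru = nn.GRU(64 * 4, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # one horizontal position per frame

    def forward(self, x):                     # x: (B, C, T, 64)
        h = self.conv(x)                      # (B, 64, T, 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.gru(h)                    # (B, T, 2 * hidden)
        return self.head(h).squeeze(-1)       # (B, T) horizontal positions


def train_step(student, teacher_asd, optimizer, audio_feats, video_frames):
    """One self-supervised step: the frozen teacher provides the pseudo-labels."""
    with torch.no_grad():
        # Hypothetical teacher interface: positions and a float activity mask.
        pseudo_pos, active = teacher_asd(video_frames)   # (B, T), (B, T)
    pred = student(audio_feats)                          # (B, T)
    # Regress the horizontal position only on frames where the teacher
    # visually detected an active speaker.
    loss = ((pred - pseudo_pos) ** 2 * active).sum() / active.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only the audio branch is used, so the student can report speakers whose faces are occluded; that audio-only deployment is the key property the abstract highlights.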