TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning (2309.12306v1)

Published 21 Sep 2023 in cs.CV, cs.SD, and eess.AS

Abstract: The goal of this work is Active Speaker Detection (ASD), the task of determining whether a person is speaking in a series of video frames. Previous works have addressed the task by exploring network architectures, while learning effective representations has received less attention. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models, without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
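The abstract describes an InfoNCE-style contrastive loss computed only over frames where the on-screen person is actually speaking, with temporally aligned audio and visual embeddings as positive pairs. The paper does not give an implementation here, so the following is a minimal sketch of that idea under assumed inputs: per-frame audio and visual embeddings of shape (T, D), a boolean speaking mask of shape (T,), and a standard symmetric cross-entropy over a cosine-similarity matrix. Function and argument names are hypothetical.

```python
import numpy as np

def talknce_loss(audio_emb, visual_emb, speaking_mask, temperature=0.07):
    """Sketch of a talk-aware contrastive (InfoNCE-style) loss.

    audio_emb, visual_emb: (T, D) per-frame embeddings (assumed inputs).
    speaking_mask: (T,) bool array, True where the person is speaking.
    Only speaking frames contribute; embeddings from the same time step
    form positive pairs, all other speaking frames act as negatives.
    """
    # Restrict the loss to actively speaking frames, as the paper describes.
    a = audio_emb[speaking_mask]
    v = visual_emb[speaking_mask]

    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)

    # (T', T') similarity matrix; positives lie on the diagonal.
    logits = a @ v.T / temperature

    def cross_entropy_diag(lg):
        # Numerically stable log-softmax over each row,
        # then take the diagonal (matching-time-step) log-probabilities.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Symmetric loss: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In training, a loss like this would be added to the usual ASD classification objective (a weighted sum), which matches the abstract's claim that it can be jointly optimized with existing objectives without extra supervision.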

