Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion (2106.03821v2)

Published 7 Jun 2021 in cs.SD, cs.CL, cs.CV, and eess.AS

Abstract: It is now well established from a variety of studies that there is a significant benefit from combining video and audio data in detecting active speakers. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem to leverage the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. Results show that the proposed multi-objective learning architecture outperforms traditional approaches, improving both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves the state of the art on the AVA-ActiveSpeaker dataset.
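
The idea that an unreliable modality should count for less in the fused decision can be illustrated with a generic precision-weighted average. This is a minimal sketch under assumed inputs, not the paper's self-attention fusion scheme, which the abstract does not detail. The function name `fuse_by_uncertainty`, the per-modality speaking scores, and the log-variance uncertainty estimates are all hypothetical.

```python
import math


def fuse_by_uncertainty(audio_score, audio_log_var, video_score, video_log_var):
    """Precision-weighted average of two per-modality speaking scores.

    Hypothetical illustration: each modality supplies a score in [0, 1]
    and a log-variance expressing how uncertain that score is.
    """
    # Precision is the inverse of the variance: exp(-log_var) = 1 / var.
    w_audio = math.exp(-audio_log_var)
    w_video = math.exp(-video_log_var)
    # Normalize so the two weights sum to one.
    total = w_audio + w_video
    return (w_audio * audio_score + w_video * video_score) / total


# Equal uncertainty: the fused score is the plain average of the two scores.
fused_equal = fuse_by_uncertainty(0.9, 0.0, 0.1, 0.0)

# Higher audio uncertainty: the fused score leans toward the video score.
fused_noisy_audio = fuse_by_uncertainty(0.9, 2.0, 0.1, 0.0)
```

In this sketch the modality with the larger assumed variance contributes less, which is one simple way to keep a misleading audio or video stream from dominating the fused prediction.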

Authors (5)
  1. Baptiste Pouthier (1 paper)
  2. Laurent Pilati (1 paper)
  3. Leela K. Gudupudi (1 paper)
  4. Charles Bouveyron (20 papers)
  5. Frederic Precioso (30 papers)
Citations (11)
