Papers
Topics
Authors
Recent
2000 character limit reached

Audio-Visual Approach For Multimodal Concurrent Speaker Detection

Published 1 Jul 2024 in eess.AS and eess.IV | (2407.01774v2)

Abstract: Concurrent Speaker Detection (CSD), the task of identifying active speakers and their overlaps in an audio signal, is essential for various audio applications, including meeting transcription, speaker diarization, and speech separation. This study presents a multimodal deep learning approach that integrates audio and visual information. The proposed model utilizes an early fusion strategy, combining audio and visual features through cross-modal attention mechanisms with a learnable [CLS] token to capture key audio-visual relationships. The model is extensively evaluated on two real-world datasets, the established AMI dataset and the recently introduced EasyCom dataset. Experiments validate the effectiveness of the multimodal fusion strategy. An ablation study further supports the design choices and the model's training procedure. As this is the first work reporting CSD results on the challenging EasyCom dataset, the findings demonstrate the potential of the proposed multimodal approach for \ac{CSD} in real-world scenarios.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

Collections

Sign up for free to add this paper to one or more collections.