Target Active Speaker Detection with Audio-visual Cues (2305.12831v3)

Published 22 May 2023 in eess.AS and cs.SD

Abstract: In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, it is possible that we also have the reference speech of the on-screen speaker. To benefit from both the facial cue and the reference speech, we propose Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC, and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.

Authors (4)
  1. Yidi Jiang (18 papers)
  2. Ruijie Tao (25 papers)
  3. Zexu Pan (36 papers)
  4. Haizhou Li (286 papers)
Citations (14)

Summary

  • The paper introduces TS-TalkNet, which incorporates pre-enrolled speaker embeddings to improve active speaker detection.
  • It employs audio and visual temporal encoders with cross-attention and transformer-based self-attention to robustly synchronize multi-modal inputs.
  • Experimental results on the AVA-ActiveSpeaker and ASW datasets show consistent gains in mAP, AP, and AUC, along with a reduced EER, confirming its effectiveness in complex settings.

Target Active Speaker Detection with Audio-visual Cues: A Comprehensive Analysis

The paper introduces Target Speaker TalkNet (TS-TalkNet), an approach that enhances active speaker detection (ASD) by integrating audio-visual cues with pre-enrolled speaker embeddings. Traditional ASD methods focus predominantly on audio-visual synchronization and therefore rely heavily on the visual quality of the speaker's lip region. TS-TalkNet extends this paradigm by leveraging a pre-enrolled speaker embedding to improve detection accuracy, particularly in real-world scenarios where high-resolution video may not be available.

Methodology and Framework

TS-TalkNet integrates audio-visual cues with target speaker embeddings through a multi-stage process:

  1. Feature Representation Frontend:
    • Utilizes a visual temporal encoder to capture long-term facial dynamics.
    • Incorporates an audio temporal encoder based on the ResNet34 architecture with a squeeze-and-excitation (SE) module to derive audio embeddings.
    • Employs a speaker encoder, specifically a pre-trained ECAPA-TDNN model, to generate robust speaker embeddings from pre-enrolled speech data (see the embedding-extraction sketch after this list).
  2. Speaker Detection Backend:
    • Employs a cross-attention module to achieve effective audio-visual synchronization.
    • Merges speaker embeddings with the synchronized audio-visual embeddings using various fusion strategies, including direct concatenation and cross-attention (sketched after this list).
    • Utilizes a self-attention mechanism, modeled after the transformer architecture, to capture temporal dependencies and refine the ASD output.
  3. Training and Loss: A cross-entropy loss is applied to the frame-level ASD predictions, classifying each frame as speaking or non-speaking (the fusion sketch below includes a training step).
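
The speaker encoder consumes pre-enrolled speech and emits a fixed-dimensional embedding. As a minimal sketch, the publicly available SpeechBrain ECAPA-TDNN checkpoint trained on VoxCeleb can stand in for the paper's pre-trained speaker encoder; the checkpoint choice, file name, and preprocessing here are assumptions, and the official repo may ship its own weights:

```python
# Sketch: extracting a speaker embedding from enrollment audio with a
# public ECAPA-TDNN checkpoint. TS-TalkNet's exact encoder/preprocessing
# may differ; see the official repo for the authors' setup.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Public VoxCeleb-trained ECAPA-TDNN (assumption: a stand-in for the
# paper's pre-trained speaker encoder).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/spkrec-ecapa-voxceleb",
)

waveform, sr = torchaudio.load("enrollment.wav")  # hypothetical mono file
if sr != 16000:  # ECAPA-TDNN checkpoints expect 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.no_grad():
    # encode_batch returns (batch, 1, 192): one 192-d embedding per utterance
    speaker_embedding = encoder.encode_batch(waveform).squeeze(1)
print(speaker_embedding.shape)  # torch.Size([1, 192])
```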
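The backend fusion can likewise be illustrated with a minimal PyTorch sketch. The module names, dimensions (d_model=128, spk_dim=192), single cross-attention direction, and the choice of direct concatenation are illustrative assumptions rather than the authors' exact implementation:

```python
# Sketch of the speaker detection backend: cross-attention for
# audio-visual synchronization, per-frame concatenation of the speaker
# embedding, and transformer self-attention over time.
import torch
import torch.nn as nn

class FusionBackend(nn.Module):
    def __init__(self, d_model=128, spk_dim=192, n_heads=8):
        super().__init__()
        # Cross-attention: audio queries attend to visual keys/values
        # (the full model also attends in the reverse direction).
        self.av_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Project [audio-visual ; speaker] back to d_model after concatenation.
        self.fuse = nn.Linear(2 * d_model + spk_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(d_model, 2)  # speaking / not speaking

    def forward(self, audio, visual, spk_emb):
        # audio, visual: (B, T, d_model); spk_emb: (B, spk_dim)
        a_sync, _ = self.av_cross_attn(audio, visual, visual)
        spk = spk_emb.unsqueeze(1).expand(-1, audio.size(1), -1)  # repeat per frame
        fused = self.fuse(torch.cat([a_sync, visual, spk], dim=-1))
        h = self.self_attn(fused)
        return self.classifier(h)  # (B, T, 2) frame-level logits

# Frame-level cross-entropy training step on toy tensors
# (labels: 0 = not speaking, 1 = speaking).
model = FusionBackend()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 50, 128), torch.randn(2, 192))
labels = torch.randint(0, 2, (2, 50))
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
```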

Experimental Results

TS-TalkNet's empirical evaluation on the AVA-ActiveSpeaker and Active Speakers in the Wild (ASW) datasets demonstrates a clear gain in ASD performance. On the AVA-ActiveSpeaker validation set, TS-TalkNet achieves an mAP of 93.9%, an absolute improvement of 1.6% over the TalkNet baseline. On the ASW test set, it improves AP by 0.8% and AUC by 0.4%, and reduces EER by 0.8%, all in absolute terms, underscoring its effectiveness under complex acoustic conditions and varying video quality.
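
For reference, the ASW metrics (AP, AUC, EER) can all be derived from frame-level speaking scores. The paper does not specify its evaluation tooling, so the following is just a hedged sketch using scikit-learn on toy data:

```python
# Computing AP, AUC, and EER from frame-level speaking scores.
# Toy data for illustration; the authors' evaluation scripts may differ.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # toy ground truth
scores = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])  # toy model scores

ap = average_precision_score(labels, scores)
auc = roc_auc_score(labels, scores)

# EER: the operating point where the false positive rate equals
# the false negative rate on the ROC curve.
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]

print(f"AP={ap:.3f}  AUC={auc:.3f}  EER={eer:.3f}")
```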

Implications and Future Directions

The introduction of target speaker embeddings in ASD represents a significant methodological advancement that addresses the limitations of conventional audio-visual synchronization-dependent approaches. TS-TalkNet’s capability to integrate speaker characteristics provides a distinct advantage in environments with acoustic complexities and visual constraints. This development signals a shift towards more flexible, robust ASD frameworks capable of adapting to varying real-world conditions.

Moving forward, the speaker fusion mechanisms can be refined further to better integrate audio-visual and speaker cues. Additionally, extending TS-TalkNet to other speech-related tasks, such as speaker verification and diarization, could prove fruitful, leading to more comprehensive multi-modal speech processing systems.

In summary, TS-TalkNet not only advances the capabilities of ASD frameworks but also enriches the broader field of multi-modal audio-visual processing, offering researchers novel avenues for exploration and application.