
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech (2309.08408v1)

Published 15 Sep 2023 in cs.SD and eess.AS

Abstract: Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture, as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario accounts for only a small percentage of real-world conversations. In this paper, we target sparsely overlapped scenarios, in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization, which can be used for speaker disentanglement. Experimental results show that our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
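
The abstract reports its gains in SI-SNR (scale-invariant signal-to-noise ratio) but does not define the metric. The following is a minimal sketch of the commonly used definition, not the authors' evaluation code: both signals are mean-centered, the estimate is projected onto the reference, and the energy ratio of the projected part to the residual is expressed in dB. The function name `si_snr`, the `eps` constant, and the synthetic sine signals are illustrative assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Illustrative sketch of the standard definition; `eps` is an assumed
    small constant that guards against division by zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to isolate the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps

    return 10.0 * math.log10(target_energy / noise_energy + eps)


# Synthetic example (hypothetical signals, not data from the paper).
reference = [math.sin(0.1 * i) for i in range(200)]
interference = [math.sin(0.37 * i + 1.0) for i in range(200)]
noisy = [r + 0.5 * v for r, v in zip(reference, interference)]
cleaner = [r + 0.1 * v for r, v in zip(reference, interference)]

# The improvement in dB is the difference between two SI-SNR scores.
improvement_db = si_snr(cleaner, reference) - si_snr(noisy, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, so a model cannot raise its score simply by changing its output gain.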

Authors (6)
  1. Junjie Li (98 papers)
  2. Ruijie Tao (25 papers)
  3. Zexu Pan (36 papers)
  4. Meng Ge (29 papers)
  5. Shuai Wang (466 papers)
  6. Haizhou Li (286 papers)
Citations (4)
