MuSE: Multi-modal target speaker extraction with visual cues (2010.07775v3)

Published 15 Oct 2020 in eess.AS, cs.MM, cs.SD, and eess.IV

Abstract: A speaker extraction algorithm relies on a speech sample from the target speaker as a reference point to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the reference target speech directly from the mixture speech during inference, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms other competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.
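The abstract only sketches the approach at a high level. As a concrete illustration, below is a minimal PyTorch sketch of a lip-conditioned masking extractor: a visual encoder turns lip frames into a conditioning embedding, which modulates a mask applied to a learned encoding of the mixture waveform. All module names, layer sizes, and the mean-pooled fusion here are illustrative assumptions for this sketch, not the MuSE architecture itself.

```python
import torch
import torch.nn as nn

class LipConditionedExtractor(nn.Module):
    """Illustrative sketch of a lip-conditioned speaker extractor.
    Layer sizes, fusion strategy, and module names are assumptions,
    not the MuSE architecture from the paper."""

    def __init__(self, n_filters=256, emb_dim=256):
        super().__init__()
        # Learned encoder: raw mixture waveform -> latent features.
        self.speech_encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8)
        # Visual encoder: one embedding per 64x64 grayscale lip frame.
        self.visual_encoder = nn.Sequential(
            nn.Flatten(start_dim=2),       # (B, T_v, 64, 64) -> (B, T_v, 64*64)
            nn.Linear(64 * 64, emb_dim),
            nn.ReLU(),
        )
        # Mask estimator conditioned on the visual cue.
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters + emb_dim, n_filters, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=1),
            nn.Sigmoid(),
        )
        # Learned decoder: masked features -> target waveform estimate.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=16, stride=8)

    def forward(self, mixture, lip_frames):
        # mixture: (B, 1, T_audio); lip_frames: (B, T_v, 64, 64)
        feats = self.speech_encoder(mixture)              # (B, F, T)
        v = self.visual_encoder(lip_frames)               # (B, T_v, E)
        v = v.mean(dim=1, keepdim=True).transpose(1, 2)   # (B, E, 1), pooled cue
        v = v.expand(-1, -1, feats.size(-1))              # broadcast over time
        mask = self.mask_net(torch.cat([feats, v], dim=1))
        return self.decoder(feats * mask)                 # (B, 1, ~T_audio)


model = LipConditionedExtractor()
mix = torch.randn(2, 1, 16000)       # 1 s of 16 kHz mixture audio
lips = torch.randn(2, 25, 64, 64)    # 25 lip frames (25 fps video)
est = model(mix, lips)               # estimated target speech
```

The abstract also reports gains in SI-SDR. That metric has a standard closed-form definition, a self-contained version of which is sketched below (my own implementation of the usual formulation, not the authors' evaluation code).

```python
def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB over the last axis (standard definition)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to remove scale dependence.
    alpha = (estimate * target).sum(-1, keepdim=True) \
          / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = alpha * target
    noise = estimate - projection
    return 10 * torch.log10(
        projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    )
```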

Authors (4)
  1. Zexu Pan (36 papers)
  2. Ruijie Tao (25 papers)
  3. Chenglin Xu (14 papers)
  4. Haizhou Li (286 papers)
Citations (41)
