Audio-visual Speaker Recognition with a Cross-modal Discriminative Network (2008.03894v1)

Published 10 Aug 2020 in eess.AS

Abstract: Audio-visual speaker recognition is one of the tasks in the 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science indicate that visual and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, the voice-face discriminative network (VFNet), which establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker-discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.

Authors (3)
  1. Ruijie Tao (25 papers)
  2. Rohan Kumar Das (50 papers)
  3. Haizhou Li (286 papers)
Citations (37)
