
A Multi-View Approach To Audio-Visual Speaker Verification (2102.06291v1)

Published 11 Feb 2021 in cs.SD, cs.LG, eess.AS, and eess.IV

Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time. Specifically, we investigate unimodal and concatenation-based AV fusion and report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset using our best system. As these methods lack the ability to do cross-modal verification, we introduce a multi-view model which uses a shared classifier to map audio and video into the same space. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
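
For readers unfamiliar with the metric quoted in the abstract, the sketch below shows one common way to estimate an equal error rate from verification trial scores: sweep a decision threshold and find the point where the false acceptance rate and false rejection rate are closest. This is an illustrative assumption about how EER can be computed, not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are hypothetical.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER from trial scores (hypothetical helper).

    scores: similarity score per trial, higher means "same speaker".
    labels: 1/True for genuine (target) trials, 0/False for impostor trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    genuine = scores[labels]
    impostor = scores[~labels]

    # Candidate thresholds: every observed score, plus +inf ("reject all").
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # A trial is accepted when its score is >= the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects

    # Take the threshold where the two error rates are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0


# Toy trials (made-up numbers): two genuine and two impostor scores.
eer = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(f"EER = {eer:.1%}")
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits often interpolate between neighbouring thresholds; the nearest-point average used here is a simplification.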

Authors (6)
  1. Kritika Singh (9 papers)
  2. Jiatong Zhou (3 papers)
  3. Lorenzo Torresani (73 papers)
  4. Nayan Singhal (7 papers)
  5. Yatharth Saraf (21 papers)
  6. Leda Sarı (6 papers)
Citations (36)
