
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification (2302.11254v1)

Published 22 Feb 2023 in cs.SD, cs.CV, cs.LG, eess.AS, and eess.IV

Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge from the other. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
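
To make the abstract's architecture concrete, below is a minimal, hypothetical sketch of the two ingredients it names: a max-feature-map (MFM) activation, which splits features in half along the channel axis and keeps the element-wise maximum, and a cross-modal booster in which one modality's sequence attends to the other's. This is an illustrative reconstruction, not the authors' implementation; the class names, dimensions, and the use of standard cross-attention are assumptions.

```python
import torch
import torch.nn as nn


class MaxFeatureMap(nn.Module):
    """Max-Feature-Map (MFM) activation: split the feature dimension
    into two halves and keep the element-wise maximum, acting as a
    competitive feature selector."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=-1)
        return torch.max(a, b)


class CrossModalBooster(nn.Module):
    """Hypothetical booster: a Transformer-style block where the query
    modality cross-attends to the other modality, followed by an
    MFM-embedded feed-forward layer (stand-in for the paper's
    'max-feature-map embedded Transformer variant')."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Feed-forward expands to 2*dim so the MFM halves it back to dim.
        self.ff = nn.Sequential(nn.Linear(dim, 2 * dim), MaxFeatureMap())

    def forward(self, query_mod: torch.Tensor, other_mod: torch.Tensor):
        # The query modality attends to the other modality's frames.
        attended, _ = self.attn(query_mod, other_mod, other_mod)
        x = self.norm1(query_mod + attended)
        return self.norm2(x + self.ff(x))


# Pseudo-siamese usage: one booster per direction, audio <-> visual.
audio = torch.randn(8, 100, 256)   # (batch, audio frames, feature dim)
visual = torch.randn(8, 25, 256)   # (batch, video frames, feature dim)
audio_boosted = CrossModalBooster()(audio, visual)   # audio aided by lips
visual_boosted = CrossModalBooster()(visual, audio)  # lips aided by audio
print(audio_boosted.shape, visual_boosted.shape)
```

Instantiating a separate booster per direction mirrors the pseudo-siamese structure the abstract describes: the two branches share an architecture but not weights, so each learns its own modality-transformed correlation.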

Authors (6)
  1. Meng Liu (112 papers)
  2. Kong Aik Lee (77 papers)
  3. Longbiao Wang (46 papers)
  4. Hanyi Zhang (12 papers)
  5. Chang Zeng (18 papers)
  6. Jianwu Dang (41 papers)
Citations (9)
