
Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention (2403.04654v3)

Published 7 Mar 2024 in cs.CV, cs.SD, and eess.AS

Abstract: Person or identity verification has recently been gaining attention through audio-visual fusion, as faces and voices share close associations with each other. Conventional approaches based on audio-visual fusion rely on score-level or early feature-level fusion techniques. Although existing approaches show improvement over unimodal systems, the potential of audio-visual fusion for person verification has not been fully exploited. In this paper, we investigate the prospect of effectively capturing both the intra- and inter-modal relationships across audio and visual modalities, which can play a crucial role in significantly improving the fusion performance over unimodal systems. In particular, we introduce a recursive fusion of a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework in a recursive fashion to progressively refine the feature representations, efficiently capturing the intra- and inter-modal relationships. To further enhance the audio-visual feature representations, we also explore BLSTMs to improve the temporal modeling of the audio-visual features. Extensive experiments are conducted on the VoxCeleb1 dataset to evaluate the proposed model. Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships across audio and visual modalities.
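
The sketch below illustrates the general idea described in the abstract: a joint audio-visual representation serves as the key/value for cross-attention over each modality, the refined features are fed back recursively, and a BLSTM models the temporal dynamics of the fused sequence. It is a minimal, hedged illustration in PyTorch; the feature dimensions, number of attention heads, recursion depth, and pooling are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    """Minimal sketch of recursive joint cross-attentional fusion.

    Audio and visual feature sequences attend to a joint (concatenated)
    representation; the refined features are fed back for the next
    recursion step. Hyperparameters are illustrative assumptions.
    """

    def __init__(self, dim=512, num_steps=2):
        super().__init__()
        self.num_steps = num_steps
        # Cross-attention: each modality queries the joint representation.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Project the concatenated joint features back to the working dimension.
        self.joint_proj = nn.Linear(2 * dim, dim)
        # BLSTM for temporal modeling of the fused audio-visual sequence.
        self.blstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim) sequences from unimodal backbones
        # (e.g. a speaker encoder and a face encoder).
        a, v = audio, visual
        for _ in range(self.num_steps):
            # Joint representation built from both modalities.
            joint = self.joint_proj(torch.cat([a, v], dim=-1))
            # Each modality attends to the joint features (inter-modal),
            # while its own queries retain intra-modal information.
            a, _ = self.audio_attn(query=a, key=joint, value=joint)
            v, _ = self.visual_attn(query=v, key=joint, value=joint)
        fused, _ = self.blstm(torch.cat([a, v], dim=-1))
        # Temporal pooling to obtain a single audio-visual embedding.
        return fused.mean(dim=1)


# Usage: fuse dummy audio/visual sequences into one verification embedding.
model = JointCrossAttentionFusion(dim=512, num_steps=2)
audio_feats = torch.randn(4, 50, 512)
visual_feats = torch.randn(4, 50, 512)
embedding = model(audio_feats, visual_feats)  # shape: (4, 1024)
```

In practice, the resulting embeddings would be compared (e.g. with cosine similarity) for verification; this sketch only covers the fusion stage described in the abstract, not the full training setup.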

Authors (2)
  1. R. Gnana Praveen (15 papers)
  2. Jahangir Alam (16 papers)
Citations (2)