Dynamic Cross Attention for Audio-Visual Person Verification (2403.04661v3)

Published 7 Mar 2024 in cs.CV, cs.LG, cs.SD, and eess.AS

Abstract: Although person or identity verification has been explored predominantly with individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they may not always complement each other strongly; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose the cross-attended features only when they exhibit strong complementary relationships, and the unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance across multiple variants of cross-attention while outperforming state-of-the-art methods.
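As a rough sketch of the gating idea described in the abstract (an assumption-laden illustration, not the authors' implementation; the module names, dimensions, and the soft sigmoid blend are all hypothetical), a conditional gate can weigh cross-attended features against the original unattended ones for each modality:

import torch
import torch.nn as nn

class DynamicCrossAttentionSketch(nn.Module):
    """Minimal sketch of a dynamic cross-attention block (illustrative only).

    A conditional gating layer estimates, per time step, how useful the
    cross-attended features are relative to the unattended (original)
    features, and blends the two accordingly.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Cross-attention in both directions: audio queries attend to visual
        # keys/values, and vice versa.
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Conditional gates: score the contribution of the cross-attended features.
        self.gate_a = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, dim)
        audio_att, _ = self.a2v_attn(audio, visual, visual)   # audio attended by visual
        visual_att, _ = self.v2a_attn(visual, audio, audio)   # visual attended by audio

        # Gate in [0, 1]: values near 1 indicate strong complementarity,
        # so the cross-attended features dominate; near 0, the unattended
        # features are kept.
        g_a = self.gate_a(torch.cat([audio, audio_att], dim=-1))
        g_v = self.gate_v(torch.cat([visual, visual_att], dim=-1))

        audio_out = g_a * audio_att + (1.0 - g_a) * audio
        visual_out = g_v * visual_att + (1.0 - g_v) * visual
        return audio_out, visual_out

The paper describes the gate as choosing cross-attended features only under strong complementarity; the soft blend above is one plausible reading, and a hard (thresholded) selection of either branch would be an equally valid interpretation of "dynamically select on the fly".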

Authors (2)
  1. R. Gnana Praveen (15 papers)
  2. Jahangir Alam (16 papers)