
Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder (2404.09509v1)

Published 15 Apr 2024 in cs.CV

Abstract: Considerable progress has been made in learning the association between voices and faces. However, most previous models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces after contrastive learning, and the resulting scores are then applied to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses a small portion of the available information. This paper introduces a novel framework for learning voice-face associations in an unsupervised setting. By employing a multimodal encoder after contrastive learning and framing the problem as binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner. Furthermore, by introducing an effective pair selection method, we enhance the learning outcomes of both contrastive learning and the matching task. Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
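
The contrast the abstract draws, between scoring a voice-face pair with a fixed distance and scoring it with a learned binary classifier over the fused embeddings, can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the embeddings are random, the weights are untrained, and a small two-layer perceptron stands in for the paper's multimodal encoder, whose actual architecture is not described in the abstract. The names `cosine_similarity` and `fused_match_probability` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def fused_match_probability(voice, face, w1, b1, w2, b2):
    """Hypothetical stand-in for a multimodal encoder with a binary head.

    Concatenates the two embeddings, applies one ReLU hidden layer, and
    returns a sigmoid probability that the pair shares an identity.
    """
    fused = np.concatenate([voice, face], axis=-1)
    hidden = np.maximum(fused @ w1 + b1, 0.0)
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
dim, hidden_dim = 8, 16  # illustrative sizes, not taken from the paper

# Random placeholders for embeddings produced by contrastive pre-training.
voice = rng.normal(size=(4, dim))
face = rng.normal(size=(4, dim))

# Untrained weights; in practice these would be learned on labelled pairs.
w1 = rng.normal(scale=0.1, size=(2 * dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=hidden_dim)
b2 = 0.0

cos_scores = cosine_similarity(voice, face)  # fixed metric, in [-1, 1]
match_probs = fused_match_probability(voice, face, w1, b1, w2, b2)  # in (0, 1)
```

The fixed metric has no trainable parameters and sees only the angle between the two vectors, whereas the fused classifier can in principle weight individual embedding dimensions and their interactions, which is one reading of the "implicit information" the abstract refers to.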

