Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan (2404.09342v3)

Published 14 Apr 2024 in cs.CV, cs.SD, and eess.AS

Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications, among which audio-visual systems are some of the most widely used. In recent years, associating a person's face and voice has gained attention due to the unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual, and people most often communicate in multilingual settings. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines, and tasks for the FAME Challenge.
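To make the task concrete, below is a minimal sketch of how face-voice association is typically scored in a verification setup: face and voice embeddings from separate encoders are compared with cosine similarity, and a threshold decides whether the pair belongs to the same identity. The encoders, embedding dimensionality, and threshold here are illustrative assumptions, not details taken from the FAME evaluation plan.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_identity(face_emb: np.ndarray, voice_emb: np.ndarray,
                  threshold: float = 0.5) -> bool:
    """Accept the face-voice pair as one identity if similarity
    exceeds a threshold. Sweeping the threshold over a trial list
    yields the equal error rate (EER), the usual verification metric."""
    return cosine_similarity(face_emb, voice_emb) >= threshold

# Hypothetical usage: in practice the embeddings would come from
# pretrained face and speaker encoders projected into a shared space.
rng = np.random.default_rng(0)
face = rng.standard_normal(512)   # stand-in for a face embedding
voice = rng.standard_normal(512)  # stand-in for a voice embedding
print(same_identity(face, voice))
```

In the multilingual condition the challenge targets, the trial lists would pair faces with voices recorded in a language unseen during training, which is what makes the MAV-Celeb setting distinctive.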
