Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval (2310.13451v1)

Published 20 Oct 2023 in cs.SD, cs.CV, cs.IR, cs.MM, and eess.AS

Abstract: Cross-modal retrieval models leverage triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass, overlooking the distinction between semi-hard and hard triplets during optimization; failing to separate the two leads to suboptimal performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained on a set of semi-hard triplets, starting from a low-loss base. In the second stage, we augment the embeddings using an interpolation technique that synthesizes potential hard negatives, alleviating the high loss caused by a scarcity of hard triplets. Our approach then applies hard triplet mining in the augmented embedding space to further optimize the model. Extensive experiments on two audio-visual datasets show an improvement of approximately 9.8% in average Mean Average Precision (MAP) over the current state-of-the-art method, MSNSCA, for the Audio-Visual Cross-Modal Retrieval (AV-CMR) task on the AVE dataset, demonstrating the effectiveness of the proposed method.
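The two mining criteria and the interpolation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the margin value, the interpolation weights, and the function names are assumptions, and the standard definitions of semi-hard and hard negatives (as in FaceNet, ref. 14) and linear embedding interpolation (as in Embedding Expansion, ref. 9) are used.

```python
import numpy as np

def pairwise_sq_dist(X, Y):
    # Squared Euclidean distance between every row of X and every row of Y.
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def mine_triplets(anchors, positives, negatives, margin, stage="semi-hard"):
    """Return (anchor_idx, negative_idx) pairs selected by difficulty.

    semi-hard (stage 1): d(a,p) < d(a,n) < d(a,p) + margin
    hard      (stage 2): d(a,n) < d(a,p)
    """
    d_ap = ((anchors - positives) ** 2).sum(-1)   # shape (N,)
    d_an = pairwise_sq_dist(anchors, negatives)   # shape (N, M)
    if stage == "semi-hard":
        mask = (d_an > d_ap[:, None]) & (d_an < d_ap[:, None] + margin)
    else:  # "hard"
        mask = d_an < d_ap[:, None]
    return np.argwhere(mask)

def interpolate_negatives(positives, negatives, n_points=3):
    """Synthesize candidate hard negatives on the line segments between
    positive and negative embeddings (embedding-expansion style)."""
    ts = np.linspace(0.25, 0.75, n_points)[:, None, None]   # interior weights
    mixed = ts * positives[None, :, :] + (1.0 - ts) * negatives[None, :, :]
    return mixed.reshape(-1, positives.shape[1])
```

In stage 1 only semi-hard pairs feed the triplet loss; in stage 2 the pool of negatives is enlarged with `interpolate_negatives` before mining with `stage="hard"`.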

References (27)
  1. Deep canonical correlation analysis. In ICML, volume 28 of Proceedings of Machine Learning Research, pages 1247–1255, Atlanta, Georgia, USA, 17–19 Jun 2013.
  2. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  3. Contrastive curriculum learning for sequential user behavior modeling via data augmentation. In CIKM, pages 3737–3746, 2021.
  4. On the power of curriculum learning in training deep networks. In International conference on machine learning, pages 2535–2544. PMLR, 2019.
  5. Visual spatio-temporal relation-enhanced network for cross-modal text-video retrieval. arXiv preprint arXiv:2110.15609, 2021.
  6. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
  7. Curricularface: Adaptive curriculum learning loss for deep face recognition. In CVPR, June 2020.
  8. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  9. Embedding expansion: Augmentation in embedding space for deep metric learning. In CVPR, pages 7255–7264, 2020.
  10. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5):365–377, 2000.
  11. Competence-based multimodal curriculum learning for medical report generation. arXiv preprint arXiv:2206.14579, 2022.
  12. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.
  13. Cluster canonical correlation analysis. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pages 823–831. JMLR.org, 2014.
  14. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  15. Audio-visual event localization in unconstrained videos. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 252–268. Springer, 2018.
  16. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 154–162, New York, NY, USA, 2017. Association for Computing Machinery.
  17. Videoadviser: Video knowledge distillation for multimodal transfer learning. IEEE Access, 2023.
  18. Category-based deep cca for fine-grained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems, 30(4):1250–1258, 2018.
  19. Complete cross-triplet loss in label space for audio-visual cross-modal retrieval. In 2022 IEEE International Symposium on Multimedia (ISM), pages 1–9. IEEE, 2022.
  20. Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval. ACM Transactions on Multimedia Computing, Communications and Applications, 19(2s):1–23, 2023.
  21. Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA. In 2018 IEEE International Symposium on Multimedia, ISM 2018, Taichung, Taiwan, December 10-12, 2018, pages 143–150. IEEE Computer Society, 2018.
  22. Deep triplet neural networks with cluster-cca for audio-visual cross-modal retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(3):1–23, 2020.
  23. Musictm-dataset for joint representation learning among sheet music, lyrics, and musical audio. In National Conference on Sound and Music Technology, pages 78–89, Singapore, 2020. Springer.
  24. Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Computing and Applications, pages 1–15, 2023.
  25. Deep supervised cross-modal retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 10394–10403. Computer Vision Foundation / IEEE, 2019.
  26. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020.
  27. Visual to sound: Generating natural sound for videos in the wild. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3550–3558. Computer Vision Foundation / IEEE Computer Society, 2018.
Authors (2)
  1. Donghuo Zeng (22 papers)
  2. Kazushi Ikeda (19 papers)
Citations (1)