A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels (2403.13480v1)

Published 20 Mar 2024 in cs.CV, cs.IR, and cs.MM

Abstract: Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes even more challenging in multimodal scenarios. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces multimodal samples to align incorrect semantics and widens the heterogeneous gap, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport costs. Second, to narrow the discrepancy in multimodal data, an OT-based relation alignment is proposed to infer semantic-level cross-modal matching. Both components leverage the inherent correlation among multimodal data to construct effective cost functions. Experiments on three widely used cross-modal retrieval datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
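The label-correction component described above rests on solving an optimal transport problem between training samples and semantic classes. The sketch below is a minimal illustration, not the authors' implementation: it solves a standard entropic-regularized OT problem with Sinkhorn iterations, and the `sinkhorn` helper, the uniform marginals, and the confidence-based cost matrix are all illustrative assumptions. The paper's cross-modal consistent cost function and its partial-OT formulation are more involved.

```python
# Minimal sketch (assumed, not the paper's code): entropic-regularized
# optimal transport via Sinkhorn iterations, the generic building block
# for OT-based label reassignment. The cost matrix is a placeholder.
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Approximate min_P <P, cost> - eps*H(P), s.t. P @ 1 = row_marginal,
    P.T @ 1 = col_marginal, using alternating Sinkhorn scalings."""
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (K.T @ u)         # rescale columns
        u = row_marginal / (K @ v)           # rescale rows
    return u[:, None] * K * v[None, :]       # transport plan P = diag(u) K diag(v)

# Toy usage: couple N samples with C classes under uniform marginals.
rng = np.random.default_rng(0)
N, C = 8, 3
cost = 1.0 - rng.random((N, C))              # e.g., 1 - model confidence (assumed)
P = sinkhorn(cost, np.full(N, 1.0 / N), np.full(C, 1.0 / C))
corrected = P.argmax(axis=1)                 # hard pseudo-labels from the plan
```

The transport plan P softly assigns each sample to the classes; taking a row-wise argmax is one simple way to turn such a plan into corrected pseudo-labels, whereas the paper additionally constrains how much mass is transported (partial OT) to correct labels progressively.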

Authors (4)
  1. Haochen Han (6 papers)
  2. Minnan Luo (61 papers)
  3. Huan Liu (283 papers)
  4. Fang Nan (8 papers)
