
Cross-modal Active Complementary Learning with Self-refining Correspondence (2310.17468v2)

Published 26 Oct 2023 in cs.CV and cs.LG

Abstract: Image-text matching has recently attracted increasing attention from academia and industry, as it is fundamental to understanding the latent correspondence between the visual and textual modalities. However, most existing methods implicitly assume that training pairs are well aligned, ignoring ubiquitous annotation noise, a.k.a. noisy correspondence (NC), which inevitably leads to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorization/overfitting and unreliable correction of NC, especially under high noise. To address these two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. Extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, verify the superior robustness of CRCL against both synthetic and real-world noisy correspondences.
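The momentum correction at the heart of SCC can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's exact formulation: the function name, the fixed threshold, and the momentum value are all illustrative. The idea it captures is that each pair's soft correspondence label is updated as an exponential moving average of its history and the model's current match score, so a single noisy prediction cannot flip a correspondence on its own and error accumulation is damped.

```python
import numpy as np

def momentum_correct(prev_labels, similarities, threshold=0.5, momentum=0.9):
    """Momentum-style refinement of soft correspondence labels (sketch).

    prev_labels:  current soft correspondence labels in [0, 1], one per pair
    similarities: model-predicted match scores in [0, 1] for the same pairs
    Returns the refined labels and a boolean mask of pairs still treated
    as clean. Threshold and momentum are illustrative choices.
    """
    prev_labels = np.asarray(prev_labels, dtype=float)
    similarities = np.asarray(similarities, dtype=float)
    # Blend the previous label with the new evidence; high momentum means
    # one noisy prediction barely moves the label.
    updated = momentum * prev_labels + (1.0 - momentum) * similarities
    # Pairs whose refined label stays above the threshold are kept as
    # clean correspondences; the rest are flagged as likely noisy.
    clean_mask = updated >= threshold
    return updated, clean_mask
```

Repeating this update across training iterations (and, as in the paper, across multiple self-refining passes) gradually separates well-aligned pairs from mismatched ones without hard, irreversible relabeling.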
