REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence (2403.08224v1)
Abstract: The presence of noise in acquired data invariably leads to performance degradation in cross-modal matching. Unfortunately, obtaining precise annotations in the multimodal field is expensive, which has prompted some methods to tackle the mismatched data pair issue in cross-modal matching contexts, termed noisy correspondence. However, most existing noisy correspondence methods exhibit two limitations: a) self-reinforcing error accumulation, and b) improper handling of noisy data pairs. To tackle these two problems, we propose a generalized framework termed Rank corrElation and noisy Pair hAlf-replacing wIth memoRy (REPAIR), which benefits from maintaining a memory bank of features from matched pairs. Specifically, for each modality we calculate the distances between the features in the memory bank and those of the target pair, and use the rank correlation of these two sets of distances to estimate the soft correspondence label of the target pair. Estimating soft correspondence from memory-bank features, rather than from a similarity network, avoids the accumulation of errors caused by incorrect network predictions. For pairs that are completely mismatched, REPAIR searches the memory bank for the best-matching feature to replace the feature of one modality, instead of using the original pair directly or merely discarding it. We conduct experiments on three cross-modal datasets, i.e., Flickr30K, MSCOCO, and CC152K, demonstrating the effectiveness and robustness of REPAIR under both synthetic and real-world noise.
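To make the two mechanisms concrete, below is a minimal NumPy/SciPy sketch of what the abstract describes. It is a reading of the abstract, not the authors' released implementation: the function names, the rescaling of Spearman's rho to a [0, 1] soft label, the replacement threshold, and the assumption of a shared embedding space (so cross-modal distances are meaningful) are all illustrative assumptions.

```python
# Sketch of REPAIR's two ideas as described in the abstract, assuming a
# memory bank of K matched (image, text) feature pairs in a shared space.
# All names and the thresh=0.5 cutoff are hypothetical, for illustration.
import numpy as np
from scipy.stats import spearmanr

def estimate_soft_label(img_feat, txt_feat, mem_img, mem_txt):
    """Soft correspondence label for one (image, text) pair.

    Per modality, compute distances from the pair's features to the
    memory bank. If the pair truly matches, the two distance rankings
    over the bank should agree, so Spearman rank correlation gives a
    soft label without relying on a (possibly wrong) similarity network.
    """
    d_img = np.linalg.norm(mem_img - img_feat, axis=1)  # (K,) image-side distances
    d_txt = np.linalg.norm(mem_txt - txt_feat, axis=1)  # (K,) text-side distances
    rho, _ = spearmanr(d_img, d_txt)                    # rank correlation in [-1, 1]
    # Assumed rescaling of rho to a [0, 1] soft label; not specified in the abstract.
    return float(np.clip((rho + 1.0) / 2.0, 0.0, 1.0))

def half_replace(img_feat, txt_feat, mem_img, mem_txt, soft_label, thresh=0.5):
    """For a pair judged mismatched, keep the image feature and replace
    the text feature with the best-matching one from the memory bank,
    instead of using the noisy pair as-is or discarding it."""
    if soft_label >= thresh:
        return img_feat, txt_feat  # treated as matched; keep the pair
    # Cross-modal nearest-neighbor search, assuming a shared embedding space.
    nearest = np.argmin(np.linalg.norm(mem_txt - img_feat, axis=1))
    return img_feat, mem_txt[nearest]  # half-replaced pair

# Toy usage: a bank of 64 matched pairs, then a deliberately mismatched pair.
rng = np.random.default_rng(0)
mem_img = rng.normal(size=(64, 8))
mem_txt = mem_img + 0.05 * rng.normal(size=(64, 8))  # matched pairs lie close
img, txt = mem_img[0], mem_txt[3]                    # mismatched by construction
s = estimate_soft_label(img, txt, mem_img, mem_txt)
img_fixed, txt_fixed = half_replace(img, txt, mem_img, mem_txt, s)
```

The design point the abstract emphasizes is that the soft label depends only on distance rankings against stored matched features, so a miscalibrated similarity network cannot feed its own mistakes back into the labels, and fully mismatched pairs are repaired rather than wasted.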