Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation
Abstract: Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice. Recently, to alleviate expensive data collection, co-occurring pairs are automatically harvested from the Internet for training. However, such harvesting inevitably introduces mismatched pairs, \ie, noisy correspondences, which undermine supervision reliability and degrade performance. Current methods leverage the memorization effect of deep neural networks to address noisy correspondences, but they overconfidently focus on \emph{similarity-guided training with hard negatives} and thus suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely \textbf{S}elf-\textbf{R}einforcing \textbf{E}rrors \textbf{M}itigation (SREM). Specifically, by viewing sample matching as a classification task within the batch, we generate classification logits for each sample. Instead of relying on a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity to the selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage the negative matches overlooked in hard-negative training, further improving optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
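The energy-based filtration mentioned in the abstract can be illustrated with a small sketch. It treats each image's in-batch matching against all texts as a classification task, scores each row of logits with the free-energy function $E(x) = -T \cdot \mathrm{logsumexp}(x/T)$ (lower energy indicating a more confident prediction), and keeps the lowest-energy pairs as presumed clean. The function names, the temperature `tau`, and the `keep_ratio` heuristic are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def energy_scores(logits, T=1.0):
    """Free-energy score per row: E = -T * logsumexp(logits / T).
    Lower energy indicates a more confident (in-distribution) prediction."""
    z = logits / T
    m = z.max(axis=1, keepdims=True)                    # subtract max for numerical stability
    return -T * (m.squeeze(1) + np.log(np.exp(z - m).sum(axis=1)))

def filter_by_energy(img_emb, txt_emb, tau=0.05, keep_ratio=0.5):
    """View each image's matching against all in-batch texts as classification,
    score the resulting logits by energy, and keep the most confident pairs."""
    logits = (img_emb @ txt_emb.T) / tau                # (B, B) in-batch similarity logits
    e = energy_scores(logits)
    k = max(1, int(keep_ratio * len(e)))
    return np.argsort(e)[:k]                            # indices of presumed clean pairs
```

Because energy aggregates the whole logit row rather than a single similarity score, a pair whose caption is ambiguously close to several other in-batch texts yields a flatter row and a higher energy, and is filtered out.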