Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval (2403.05261v1)
Abstract: Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves the performance of image-text retrieval and achieves new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
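The soft-label supervision described above can be illustrated with a minimal sketch: a frozen uni-modal teacher produces a similarity distribution over the batch, and the retrieval model's similarity distribution is pulled toward it with a KL-divergence term. All names, temperatures, and the exact loss form below are illustrative assumptions, not the paper's verbatim implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_alignment_loss(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    """Row-wise KL(teacher || student), averaged over the batch.

    student_sim: (B, B) similarity matrix from the retrieval model being trained
                 (cross-modal for a CSA-style term, uni-modal for a USA-style term).
    teacher_sim: (B, B) similarity matrix from a frozen uni-modal encoder,
                 providing soft labels instead of hard one-hot targets.
    tau_s, tau_t: illustrative temperature hyperparameters.
    """
    p = softmax(teacher_sim / tau_t, axis=-1)  # soft labels from the teacher
    q = softmax(student_sim / tau_s, axis=-1)  # student distribution
    eps = 1e-12                                # avoid log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```

In this reading, a CSA-style term would use teacher text-text (or image-image) similarities as soft labels for the model's image-text similarity matrix, softening false negatives, while a USA-style term would align the model's own uni-modal similarities with the teacher's; the loss is zero when the two distributions match and positive otherwise.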
Authors: Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang