Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval (2403.05261v1)
Abstract: Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves the performance of image-text retrieval and achieves new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
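The soft-label supervision described above can be illustrated with a minimal sketch: a frozen uni-modal teacher produces a similarity distribution over the batch, and the retrieval model's similarity distribution is pulled toward it with a KL-divergence term. All names, temperatures, and the exact loss form below are illustrative assumptions, not the paper's verbatim implementation (see the linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_alignment_loss(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    """Row-wise KL(teacher || student), averaged over the batch.

    student_sim: (B, B) similarity matrix from the retrieval model being trained
                 (cross-modal for a CSA-style term, uni-modal for a USA-style term).
    teacher_sim: (B, B) similarity matrix from a frozen uni-modal encoder,
                 providing soft labels instead of hard one-hot targets.
    tau_s, tau_t: illustrative temperature hyperparameters.
    """
    p = softmax(teacher_sim / tau_t, axis=-1)  # soft labels from the teacher
    q = softmax(student_sim / tau_s, axis=-1)  # student distribution
    eps = 1e-12                                # avoid log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```

In this reading, a CSA-style term would use teacher text-text (or image-image) similarities as soft labels for the model's image-text similarity matrix, softening false negatives, while a USA-style term would align the model's own uni-modal similarities with the teacher's; the loss is zero when the two distributions match and positive otherwise.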
Authors: Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang