AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion (2405.10029v2)
Abstract: Image-text retrieval aims to retrieve the relevant text (or image) given a query image (or text). The main challenge is to build a unified multimodal representation while distinguishing fine-grained differences across modalities, so that similar content is retrieved and irrelevant content is filtered out. However, existing methods focus mainly on unified semantic representation and concept alignment across modalities, while fine-grained cross-modal differences have rarely been studied, making the information asymmetry problem difficult to solve. In this paper, we propose a novel asymmetry-sensitive contrastive learning method. By generating corresponding positive and negative samples for different asymmetry types, our method simultaneously ensures fine-grained semantic differentiation and unified semantic representation across modalities. In addition, we propose a hierarchical cross-modal fusion method that integrates global- and local-level features through a multimodal attention mechanism to achieve concept alignment. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness and superiority of the proposed method.
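The abstract describes a contrastive objective that pushes matched image-text pairs together while separating negatives generated for different asymmetry types. The paper's exact sample-generation and weighting scheme is not given here, so the sketch below only illustrates the general setup: a hinge-based cross-modal contrastive loss with hardest negatives (VSE++-style), where the optional `asym_weight` tensor is a hypothetical placeholder for asymmetry-sensitive per-pair weighting, not the authors' formulation.

```python
# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, margin=0.2, asym_weight=None):
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag().view(-1, 1)                # similarities of matched pairs

    # Hinge cost for every mismatched pair, in both retrieval directions.
    cost_i2t = (margin + sim - pos).clamp(min=0)       # image -> text
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)   # text -> image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)

    if asym_weight is not None:                 # hypothetical per-pair weights
        cost_i2t = cost_i2t * asym_weight       # for different asymmetry types
        cost_t2i = cost_t2i * asym_weight

    # Hardest negative per query in each direction.
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()


if __name__ == "__main__":
    img = torch.randn(8, 256)
    txt = torch.randn(8, 256)
    print(contrastive_loss(img, txt).item())
```

In practice, the image and text embeddings would come from the hierarchical cross-modal fusion module (global- and local-level features combined via multimodal attention); the loss above only covers the contrastive part of the objective.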
- Ziyu Gong
- Chengcheng Mai
- Yihua Huang