AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion (2405.10029v2)

Published 16 May 2024 in cs.MM

Abstract: The image-text retrieval task aims to retrieve relevant information given an image or a text query. The main challenge is to unify multimodal representations while distinguishing fine-grained differences across modalities, so that similar content is retrieved and irrelevant content is filtered out. However, existing methods mainly focus on unified semantic representation and concept alignment across modalities, while the fine-grained differences between modalities have rarely been studied, making the information asymmetry problem difficult to solve. In this paper, we propose a novel asymmetry-sensitive contrastive learning method. By generating corresponding positive and negative samples for different asymmetry types, our method simultaneously ensures fine-grained semantic differentiation and unified semantic representation across modalities. Additionally, we propose a hierarchical cross-modal fusion method that integrates global- and local-level features through a multimodal attention mechanism to achieve concept alignment. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness and superiority of the proposed method.
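The abstract describes two mechanisms: a contrastive loss whose negatives are generated per asymmetry type, and a hierarchical fusion of global and local features through multimodal attention. The PyTorch sketch below is an illustrative reconstruction under those assumptions, not the authors' released implementation; the module names, the shape of the generated negatives `txt_neg`, and the temperature value are all assumptions made here for concreteness.

```python
# Minimal sketch (not the paper's code): an InfoNCE-style image-text
# contrastive loss extended with generated "asymmetric" negatives, plus
# a simple global+local fusion via cross-modal attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Fuse a global embedding with local (region/token) features using
    multi-head attention; a stand-in for the paper's hierarchical fusion."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D); local_feats: (B, N, D)
        q = global_feat.unsqueeze(1)                      # (B, 1, D)
        attended, _ = self.attn(q, local_feats, local_feats)
        fused = self.norm(global_feat + attended.squeeze(1))
        return F.normalize(fused, dim=-1)

def asymmetry_sensitive_loss(img, txt, txt_neg, tau: float = 0.07):
    """img, txt: (B, D) L2-normalized embeddings of matched pairs.
    txt_neg: (B, K, D) generated negatives, e.g. one perturbed caption
    per asymmetry type (a modeling assumption here). The generated
    negatives enter the image-to-text InfoNCE denominator."""
    logits_i2t = img @ txt.t() / tau                      # (B, B)
    # similarity of each image to its own generated negatives: (B, K)
    neg_logits = torch.einsum("bd,bkd->bk", img, txt_neg) / tau
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(torch.cat([logits_i2t, neg_logits], dim=1), targets)
    loss_t2i = F.cross_entropy(logits_i2t.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random features.
B, K, D = 4, 3, 256
fuse = CrossModalFusion(D)
img = fuse(torch.randn(B, D), torch.randn(B, 36, D))     # 36 regions
txt = F.normalize(torch.randn(B, D), dim=-1)
txt_neg = F.normalize(torch.randn(B, K, D), dim=-1)
print(asymmetry_sensitive_loss(img, txt, txt_neg))
```

Appending generated negatives to the contrastive denominator is the standard way hard negatives enter an InfoNCE objective; the per-type generation strategy that makes them asymmetry-sensitive is what the paper itself contributes.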

Authors (3)
  1. Ziyu Gong
  2. Chengcheng Mai
  3. Yihua Huang