Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval (2109.05523v1)

Published 12 Sep 2021 in cs.CV and cs.CL

Abstract: Existing research on image-text retrieval relies mainly on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside SSAMT, we use different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models.
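
The abstract does not give the form of the multi-scale matching losses, so the following is only a minimal sketch of how a global (sentence-level) term and a local (phrase-level) term might be combined. It assumes a margin-based hinge ranking loss for sentences and a binary cross-entropy penalty for phrases; these choices, along with the function names, the `margin` value, and the `phrase_weight` parameter, are illustrative assumptions and not the paper's actual formulation.

```python
import math


def hinge_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Global term (assumed form): each mismatched sentence should score
    at least `margin` below the matched sentence for the same image."""
    return sum(max(0.0, margin + neg - pos_score) for neg in neg_scores)


def phrase_mismatch_loss(phrase_probs, phrase_labels, eps=1e-12):
    """Local term (assumed form): mean binary cross-entropy over phrases.

    phrase_probs  -- predicted probability that each phrase matches the image
    phrase_labels -- 1 if the phrase (entity or triple) is a label for the
                     image, 0 if it is a mismatched phrase
    """
    total = 0.0
    for p, y in zip(phrase_probs, phrase_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / max(len(phrase_labels), 1)


def multi_grained_loss(pos_score, neg_scores, phrase_probs, phrase_labels,
                       margin=0.2, phrase_weight=1.0):
    """Weighted sum of the sentence-level and phrase-level terms.
    `phrase_weight` is a hypothetical balancing hyperparameter."""
    global_term = hinge_ranking_loss(pos_score, neg_scores, margin)
    local_term = phrase_mismatch_loss(phrase_probs, phrase_labels)
    return global_term + phrase_weight * local_term
```

In this sketch, a mismatched sentence contributes to the global term only when it scores within the margin of the matched sentence, while the local term penalizes every phrase whose predicted match probability disagrees with its label.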

Authors (7)
  1. Zhihao Fan (28 papers)
  2. Zhongyu Wei (98 papers)
  3. Zejun Li (18 papers)
  4. Siyuan Wang (73 papers)
  5. Haijun Shan (8 papers)
  6. Xuanjing Huang (287 papers)
  7. Jianqing Fan (165 papers)
Citations (11)