Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations (2305.06152v3)

Published 6 May 2023 in cs.CL, cs.AI, and cs.MM

Abstract: Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. For example, such models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which places increased emphasis on learning structured representations. Moreover, a Knowledge-Enhanced Encoder (KEE) is proposed that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, exceeding the previous multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining the ability to learn general representations. Our code is available at https://github.com/zjukg/Structure-CLIP.
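To make the idea of scene-graph-guided semantic negatives concrete, below is a minimal sketch (not taken from the authors' repository) of the kind of hard negative the abstract describes: a caption's (subject, relation, object) triple is perturbed by swapping subject and object, so the negative contains the same words but a different structured meaning. The names Triple, swap_negative, and caption_from are illustrative assumptions, not part of Structure-CLIP's actual code.

```python
# Minimal sketch of structured (word-order-sensitive) negative construction,
# assuming negatives are built by swapping subject and object in a triple
# extracted from a caption's scene graph.
from dataclasses import dataclass


@dataclass
class Triple:
    subject: str
    relation: str
    obj: str


def caption_from(t: Triple) -> str:
    # Render a triple back into a simple caption string.
    return f"{t.subject} {t.relation} {t.obj}"


def swap_negative(t: Triple) -> Triple:
    # Exchanging subject and object keeps the bag of words identical but
    # changes the structured meaning, so a model that only matches word
    # co-occurrence cannot tell the negative from the positive.
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


if __name__ == "__main__":
    positive = Triple("an astronaut", "rides", "a horse")
    negative = swap_negative(positive)
    print("positive:", caption_from(positive))  # an astronaut rides a horse
    print("negative:", caption_from(negative))  # a horse rides an astronaut
```

In a contrastive training setup, such swapped captions would be paired with the original image as hard negatives, pushing the text encoder to attend to relation direction rather than word co-occurrence alone.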

Authors (11)
  1. Yufeng Huang (14 papers)
  2. Jiji Tang (7 papers)
  3. Zhuo Chen (319 papers)
  4. Rongsheng Zhang (36 papers)
  5. Xinfeng Zhang (44 papers)
  6. Weijie Chen (52 papers)
  7. Zeng Zhao (16 papers)
  8. Zhou Zhao (218 papers)
  9. Tangjie Lv (35 papers)
  10. Zhipeng Hu (38 papers)
  11. Wen Zhang (170 papers)
Citations (11)