VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts (2112.02399v3)

Published 4 Dec 2021 in cs.CV and cs.CL

Abstract: Contrastive Language-Image Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide the textual features of different categories to adaptively explore informative regions of the image and aggregate visual features via attention mechanisms. In this way, the texts become visual-guided, i.e., more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
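
The core mechanism described above, per-class text features attending over spatial image features and aggregating them, can be illustrated with a minimal sketch. This is a hypothetical PyTorch implementation inferred only from the abstract; the module name `VisualGuidedText`, the dimensions, the residual connection, and the logit scale are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGuidedText(nn.Module):
    """Sketch of the visual-guided text idea (assumed, not official):
    class text embeddings act as queries in a cross-attention over the
    image's spatial tokens, so each category's text aggregates the
    regions most informative for it."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: text features are queries, patch features are keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, patch_feats, image_feat):
        # text_feats:  (C, D) one CLIP text embedding per category
        # patch_feats: (N, D) spatial visual tokens from the image encoder
        # image_feat:  (D,)   global (pooled) image embedding
        q = text_feats.unsqueeze(0)       # (1, C, D)
        kv = patch_feats.unsqueeze(0)     # (1, N, D)
        guided, _ = self.attn(q, kv, kv)  # texts gather informative regions
        # Residual keeps the original semantics; normalize for cosine matching.
        guided = F.normalize(guided.squeeze(0) + text_feats, dim=-1)
        image_feat = F.normalize(image_feat, dim=-1)
        return 100.0 * guided @ image_feat  # per-class logits, CLIP-style scale


# Toy usage with random features standing in for CLIP outputs.
texts = torch.randn(10, 512)    # 10 categories
patches = torch.randn(49, 512)  # 7x7 spatial grid of visual tokens
image = torch.randn(512)
logits = VisualGuidedText()(texts, patches, image)  # shape: (10,)
```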

Authors (7)
  1. Longtian Qiu (9 papers)
  2. Renrui Zhang (100 papers)
  3. Ziyu Guo (49 papers)
  4. Ziyao Zeng (12 papers)
  5. Zilu Guo (9 papers)
  6. Yafeng Li (5 papers)
  7. Guangnan Zhang (1 paper)
Citations (40)