
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance (2407.05578v2)

Published 8 Jul 2024 in cs.CV

Abstract: CLIP has achieved impressive zero-shot performance after pre-training on a large-scale dataset consisting of paired image-text data. Previous works have utilized CLIP by incorporating manually designed visual prompts, such as colored circles and blur masks, into the images to guide the model's attention, showing enhanced zero-shot performance in downstream tasks. Although these methods have achieved promising results, they inevitably alter the original information of the images, which can lead to failure in specific tasks. We propose a training-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP's zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can augment current methods to enhance their performance.
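To make the mechanism concrete, here is a minimal sketch of the general idea: instead of modifying pixels, a foveal bias is added to the attention logits inside self-attention so that patch tokens near a region of interest dominate. This is an illustrative reconstruction, not the authors' code; the function names, the Gaussian falloff, and the `sigma` parameter are assumptions for the sketch.

```python
import torch

def foveal_attention_bias(grid_h, grid_w, center, sigma=2.0):
    """Additive attention bias over ViT patch tokens.

    Patches near `center` (row, col in the patch grid) get a bias near 0;
    distant patches get increasingly negative bias, mimicking a foveal
    falloff. The Gaussian shape and `sigma` are illustrative assumptions.
    """
    ys = torch.arange(grid_h).float()
    xs = torch.arange(grid_w).float()
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    d2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    falloff = torch.exp(-d2 / (2 * sigma ** 2))           # 1 at the fovea, -> 0 far away
    bias = torch.log(falloff.clamp_min(1e-6)).flatten()   # 0 at the fovea, very negative far away
    # Prepend a zero for the [CLS] token so attention to it is untouched.
    return torch.cat([torch.zeros(1), bias])              # shape: (1 + grid_h * grid_w,)

def attention_with_foveal_mask(q, k, v, bias):
    """Scaled dot-product attention with the foveal bias added to the
    logits before softmax; the input image itself is never modified."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale   # (..., tokens, tokens)
    logits = logits + bias                       # broadcast over the key dimension
    attn = logits.softmax(dim=-1)
    return attn @ v
```

Because the bias lives only inside the attention computation, the original image information is preserved, which is the paper's stated advantage over pixel-level visual prompts like drawn circles or blur masks.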

Authors (7)
  1. Jiedong Zhuang (9 papers)
  2. Jiaqi Hu (27 papers)
  3. Lianrui Mu (7 papers)
  4. Rui Hu (96 papers)
  5. Xiaoyu Liang (18 papers)
  6. Jiangnan Ye (8 papers)
  7. Haoji Hu (30 papers)
Citations (1)