Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models (2407.16526v1)

Published 23 Jul 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and computational efficiency characterize our approach.
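The abstract describes selectively and locally updating a frozen vision encoder (e.g., CLIP) inside a VLM on few-shot error cases while keeping the rest of the model frozen. The sketch below is a minimal illustration of that general idea, not the authors' exact method: the model name, the choice to unfreeze only the LayerNorm parameters of the last two transformer blocks, the feature-matching loss, and the `error_batches` iterable are all assumptions made for illustration.

```python
# Hypothetical sketch of selective, local vision-encoder tuning inside a VLM.
# Not the paper's exact procedure; parameter selection rule, loss, and
# hyperparameters are illustrative assumptions.
import torch
from transformers import CLIPVisionModel

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze the entire encoder first.
for p in vision_encoder.parameters():
    p.requires_grad = False

# Selectively unfreeze a small, local subset: here, the LayerNorm parameters
# of the last two transformer blocks (an assumption, not the paper's choice).
for block in vision_encoder.vision_model.encoder.layers[-2:]:
    for name, p in block.named_parameters():
        if "layer_norm" in name:
            p.requires_grad = True

trainable = [p for p in vision_encoder.parameters() if p.requires_grad]
print(f"trainable params: {sum(p.numel() for p in trainable):,}")

optimizer = torch.optim.AdamW(trainable, lr=1e-4)

def update_on_errors(error_batches, steps=10):
    """Few-shot update on examples where the frozen encoder previously erred.

    `error_batches` is a hypothetical iterable yielding
    (pixel_values, target_features) pairs.
    """
    vision_encoder.train()
    for _, (pixel_values, target_features) in zip(range(steps), error_batches):
        outputs = vision_encoder(pixel_values=pixel_values)
        # Simple feature-matching loss as a stand-in objective (an assumption).
        loss = torch.nn.functional.mse_loss(outputs.pooler_output, target_features)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because only a few normalization parameters receive gradients, the update stays cheap and leaves most of the pretrained encoder untouched, which is the intuition behind improving on previously misclassified data while preserving overall robustness.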

Authors (4)
  1. Aristeidis Panos (8 papers)
  2. Rahaf Aljundi (33 papers)
  3. Daniel Olmeda Reino (13 papers)
  4. Richard E. Turner (5 papers)