Words or Vision: Do Vision-Language Models Have Blind Faith in Text? (2503.02199v1)

Published 4 Mar 2025 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract: Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others, like token order, can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance between pure-text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.
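The evaluation idea described in the abstract (pairing an image-grounded question with textual context that matches, contradicts, or is irrelevant to the image, then checking which modality the model follows when the two conflict) can be sketched concretely. The snippet below is a minimal illustration under assumed names, not the authors' released code: `query_vlm`, `make_variations`, and `text_bias_rate` are hypothetical, and the metric shown (the fraction of text/image conflicts in which the model sides with the text) is one straightforward way to operationalize "blind faith in text".

```python
# Minimal sketch (assumed helper names, not the paper's released code) of the
# "blind faith in text" evaluation: each vision-centric example is paired with
# textual context that is consistent with, contradicts, or is irrelevant to
# the image; text bias is the rate at which the model follows the text when
# the two modalities conflict.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    image_path: str     # visual evidence
    question: str       # vision-centric question
    visual_answer: str  # answer supported by the image
    text_answer: str    # answer asserted by the accompanying text
    context: str        # textual input shown alongside the image

def make_variations(image_path: str, question: str,
                    visual_answer: str, wrong_answer: str) -> List[Example]:
    """Build consistent / corrupted / irrelevant textual contexts for one item."""
    return [
        Example(image_path, question, visual_answer, visual_answer,
                f"Note: the answer is {visual_answer}."),  # text matches image
        Example(image_path, question, visual_answer, wrong_answer,
                f"Note: the answer is {wrong_answer}."),   # text contradicts image
        Example(image_path, question, visual_answer, visual_answer,
                "Note: the weather is pleasant today."),   # irrelevant text
    ]

def text_bias_rate(examples: List[Example],
                   query_vlm: Callable[[str, str], str]) -> float:
    """Fraction of text/image conflicts in which the model sides with the text.

    `query_vlm(image_path, prompt)` is a placeholder for any VLM inference
    call that returns the model's answer as a string.
    """
    conflicts = [ex for ex in examples if ex.text_answer != ex.visual_answer]
    follows_text = sum(
        query_vlm(ex.image_path, f"{ex.context}\n{ex.question}").strip().lower()
        == ex.text_answer.lower()
        for ex in conflicts
    )
    return follows_text / max(len(conflicts), 1)
```

Under this framing, a high bias rate on the corrupted-text variation corresponds to the performance drops the abstract reports, and the same variation machinery could plausibly generate the text-augmented examples used in the supervised fine-tuning mitigation.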

Authors (4)
  1. Ailin Deng
  2. Tri Cao
  3. Zhirui Chen
  4. Bryan Hooi