Vision Language Models are Biased (2505.23941v1)

Published 29 May 2025 in cs.LG and cs.CV

Abstract: LLMs memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains ranging from animals, logos, chess, board games, and optical illusions to patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.

The paper "Vision LLMs are Biased" (Vo et al., 29 May 2025 ) investigates a significant failure mode in state-of-the-art Vision LLMs (VLMs): their tendency to rely heavily on memorized prior knowledge, often overriding contradictory visual information from images. The authors hypothesize that VLMs, trained on vast internet data, learn strong associations for popular subjects (like the number of legs a dog has or the stripes on an Adidas logo) which can lead to biased answers even on objective visual tasks like counting or identification when presented with subtly modified, counterfactual images.

To test this, the authors introduce VLMBias, a benchmark designed to systematically evaluate VLM bias on such tasks. VLMBias focuses on objective questions with neutral phrasing but uses counterfactual images where a well-known visual characteristic has been altered (e.g., adding a leg to an animal, changing the number of stripes on a logo). The benchmark covers seven diverse domains: animals, brand logos, national flags, chess pieces, board game grids, optical illusions, and patterned grids. These domains were chosen to represent subjects with decreasing levels of prior knowledge available on the internet.
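
For concreteness, a benchmark item in this setup can be pictured as a small record pairing a counterfactual image with a neutral question, the answer actually depicted in the image, and the answer that prior knowledge would suggest. The sketch below is an illustrative schema only; the field names and file paths are assumptions, not the authors' released format.

```python
from dataclasses import dataclass

@dataclass
class CounterfactualItem:
    """One VLMBias-style test item (illustrative schema, not the released format)."""
    domain: str          # one of the seven domains, e.g. "logos", "animals", "flags"
    subject: str         # well-known subject whose canonical form has been altered
    image_path: str      # path to the counterfactual image
    question: str        # neutral, objective question about the image
    true_answer: int     # what is actually depicted (e.g. 4 stripes)
    biased_answer: int   # what prior knowledge suggests (e.g. 3 stripes)

# Example mirroring the Adidas case discussed in the paper
adidas_item = CounterfactualItem(
    domain="logos",
    subject="Adidas-like logo",
    image_path="logos/adidas_4_stripes.png",   # hypothetical path
    question="How many stripes does this logo have? Answer with a number.",
    true_answer=4,
    biased_answer=3,
)
```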

A key aspect of VLMBias is its automated generation framework, which uses a combination of LLMs (for idea generation), Python scripts (for abstract images like grids and illusions), and state-of-the-art text-to-image models (Gemini and GPT) to create the counterfactual images. Human reviewers manually filter these generated images for quality and adherence to the counterfactual criteria. The dataset comprises 1,392 counterfactual images across the seven domains, plus additional images for sanity checks and control experiments.
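
For the abstract domains (board-game and patterned grids), the script-based part of such a pipeline can be sketched as below: render a canonical grid and a counterfactual variant that differs in one countable property. The drawing library, colors, and cell sizes here are illustrative choices, not taken from the paper's code.

```python
from PIL import Image, ImageDraw

def draw_board(cols: int, rows: int, cell: int = 60) -> Image.Image:
    """Render a chessboard-like grid with the given number of columns and rows."""
    img = Image.new("RGB", (cols * cell, rows * cell), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows):
        for c in range(cols):
            color = "#b58863" if (r + c) % 2 else "#f0d9b5"  # alternating squares
            draw.rectangle(
                [c * cell, r * cell, (c + 1) * cell - 1, (r + 1) * cell - 1],
                fill=color,
            )
    return img

draw_board(8, 8).save("board_original.png")        # canonical 8x8 board
draw_board(9, 8).save("board_counterfactual.png")  # counterfactual: one extra column
```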

The authors evaluate five prominent VLMs spanning the Gemini, Claude Sonnet, and GPT model families on VLMBias. A sanity check on original, unmodified images confirms that all tested VLMs correctly identify the subjects and their standard characteristics (achieving 100% accuracy). However, when presented with counterfactual images, the models' performance collapses dramatically.
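
The evaluation protocol can be pictured as a paired query: ask the same neutral question about the original image (sanity check) and about the counterfactual image, then record whether the reply matches what is depicted or the prior-knowledge answer. In this sketch, `query_vlm` is a stand-in for whichever VLM API is under test; it is not part of the paper's released code.

```python
import re
from typing import Callable, Optional

def parse_int(reply: str) -> Optional[int]:
    """Extract the first integer from a free-form model reply."""
    m = re.search(r"-?\d+", reply)
    return int(m.group()) if m else None

def evaluate_pair(
    query_vlm: Callable[[str, str], str],  # stand-in: (image_path, question) -> reply text
    question: str,
    original_path: str,
    counterfactual_path: str,
    canonical_answer: int,    # answer for the unmodified image (prior knowledge)
    modified_answer: int,     # answer actually depicted in the counterfactual image
) -> dict:
    sanity_ok = parse_int(query_vlm(original_path, question)) == canonical_answer
    prediction = parse_int(query_vlm(counterfactual_path, question))
    return {
        "sanity_ok": sanity_ok,                    # all tested VLMs pass this check
        "correct": prediction == modified_answer,  # counted what is in the image
        "biased": prediction == canonical_answer,  # fell back on memorized knowledge
    }
```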

Key findings include:

  • Overall Low Accuracy: Across the seven tasks on counterfactual images, the average accuracy of the five VLMs is a mere 17.05%.
  • Strong Bias Alignment: When VLMs provide incorrect answers, they align with the expected biased response (based on prior knowledge) 75.70% of the time on average. This suggests that errors are not random but are driven by memorized facts overriding the visual input (a small scoring sketch for these two metrics follows this list).
  • Task Variability: Performance varies by task and domain. For example, VLMs struggled more with counting elements on car logos (0.44% accuracy) where the logo is small relative to the image, compared to shoe logos (17.57%). They also performed poorly on counting elements in national flags (9.25% mean accuracy), especially stripes (4.52%). Counting on abstract grids like board games resulted in extremely low accuracy (2.26%).
  • Optical Illusions: On modified optical illusions, VLMs tend to respond with the answer that holds for the original illusion, scoring only around random chance (50.87% mean accuracy across both original and modified versions, but only 23.74% on modified versions). This supports the hypothesis that they have memorized common illusions rather than truly understanding the visual phenomena.
  • Patterned Grids: Even on patterned grids created from scratch (with no prior internet knowledge), VLMs exhibit bias towards the overall grid pattern when asked about an anomalous cell that breaks the pattern (22.44% accuracy, 43.45% bias alignment), indicating a bias towards global patterns over local details.
  • Failure on Identification: VLMs also largely failed on direct Yes/No identification questions about counterfactual images (e.g., "Is this an animal with 4 legs?" for a 5-legged animal), achieving only 25.11% accuracy and mostly answering "Yes," reinforcing the bias towards the common visual pattern.
  • Impact of In-image Text: Adversarially adding text stating the subject's common name (e.g., "Adidas") to the counterfactual images further decreases VLM accuracy by an average of 4.49 percentage points, confirming the influence of textual bias even within the image itself. Thinking models showed a larger performance drop.
  • Limited Effect of Helpful Prompts: Explicitly instructing VLMs to rely only on image details or to double-check their answers provided only marginal improvements (1.87 and 2.70 percentage points, respectively), demonstrating the severity of the bias.
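
As a worked example of the two headline numbers above, overall accuracy is the fraction of items answered correctly, while bias alignment is computed only over the wrong answers: the share of them that match the prior-knowledge answer. A minimal aggregation over the per-item records produced by the earlier evaluation sketch could look like this.

```python
def headline_metrics(results: list) -> tuple:
    """Aggregate per-item records (as produced by evaluate_pair above) into
    overall accuracy and bias alignment among the errors."""
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    wrong = [r for r in results if not r["correct"]]
    bias_alignment = sum(r["biased"] for r in wrong) / len(wrong) if wrong else 0.0
    return accuracy, bias_alignment

# The paper reports roughly accuracy ~= 0.17 and bias_alignment ~= 0.76
# when averaged over models and domains.
```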

The paper concludes that contemporary VLMs exhibit significant visual biases rooted in their training data, making them unreliable on objective visual tasks where prior knowledge conflicts with actual image content. This bias is deeply ingrained, minimally affected by thinking capabilities or simple prompt engineering, and influences responses across diverse visual domains, including those with newly created patterns. The release of the VLMBias benchmark and its automated generation scripts aims to facilitate further research into diagnosing and mitigating this critical failure mode in VLMs. The authors suggest future work could explore whether this bias stems primarily from visual encoding limitations or the LLM's reliance on learned knowledge, and potentially investigate tool-using VLMs.

Authors (6)
  1. An Vo (4 papers)
  2. Khai-Nguyen Nguyen (7 papers)
  3. Mohammad Reza Taesiri (17 papers)
  4. Vy Tuong Dang (1 paper)
  5. Anh Totti Nguyen (13 papers)
  6. Daeyoung Kim (41 papers)