Exploring the Texture vs. Shape Bias in Vision Language Models (VLMs)
Introduction
Vision language models (VLMs) have become a pivotal component at the intersection of computer vision and natural language processing, enabling applications that range from zero-shot image classification to comprehensive image captioning. A natural question that arises in this context is how well VLMs align with human visual perception, particularly in how they balance texture and shape cues. Historically, vision-only models displayed a pronounced preference for texture over shape, a pattern that diverges from human visual tendencies, which strongly favor shape. This paper examines the texture vs. shape bias across a range of VLMs and assesses whether that bias can be moderated or redirected through linguistic prompts, laying the groundwork for deeper inquiry into how these models perceive and interpret visual information.
Texture vs. Shape Bias in VLMs
A broad analysis of popular VLMs reveals a nuanced landscape: contrary to prior vision-only models, many VLMs lean more toward shape when processing visual information. This shift suggests that multimodal training on both text and images does not simply transplant the vision encoder's biases into the VLM but instead modulates them through linguistic integration. Crucially, while VLMs are more shape-oriented than their vision-only counterparts, they still fall well short of the strong human preference for shape. Notably, certain models adjust their bias depending on the task, displaying different levels of shape preference in visual question answering (VQA) than in image captioning.
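The summary does not spell out how shape bias is quantified, but the standard cue-conflict protocol from the vision literature (images whose shape comes from one class and whose texture from another, with bias measured over the predictions that match either cue) is the usual reference point. The sketch below is a minimal illustration of that metric, not the paper's evaluation code; the `predictions` format and class names are assumptions made for the example.

```python
from collections import Counter

def shape_bias(predictions):
    """Compute shape bias from cue-conflict classifications.

    `predictions` is an iterable of (predicted_label, shape_label, texture_label)
    triples, one per cue-conflict image. Only "cue decisions" -- predictions that
    match either the shape class or the texture class -- enter the metric.
    """
    counts = Counter()
    for pred, shape_label, texture_label in predictions:
        if pred == shape_label:
            counts["shape"] += 1
        elif pred == texture_label:
            counts["texture"] += 1
    total = counts["shape"] + counts["texture"]
    return counts["shape"] / total if total else float("nan")

# Hypothetical example: 3 shape decisions vs. 1 texture decision -> bias of 0.75
preds = [
    ("cat", "cat", "elephant"),
    ("dog", "dog", "clock"),
    ("car", "car", "bottle"),
    ("knife", "boat", "knife"),
]
print(shape_bias(preds))  # 0.75
```

Under this convention, a value of 1.0 means purely shape-driven decisions, 0.0 purely texture-driven, and human observers typically score far above the classic texture-biased CNN baselines.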
Investigation of Bias Modulation
The central question of whether, and how, the visual biases of VLMs can be influenced through language yields compelling results. By employing task-specific prompting and by altering the visual input (through pre-processing techniques such as patch shuffling and noise addition), the paper probes how malleable the shape and texture biases are. Text-based manipulations alone can steer these biases to a considerable degree, though not as strongly as direct visual alterations. This finding opens promising avenues for research into how textual and visual information jointly guide model perception.
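To make the visual manipulations concrete, the following is a minimal sketch of patch shuffling (which destroys global shape while preserving local texture statistics) and additive Gaussian noise. The patch size, noise level, and grid handling are illustrative assumptions; the paper's exact pre-processing parameters are not reproduced here.

```python
import numpy as np

def shuffle_patches(image, patch_size, rng=None):
    """Randomly permute non-overlapping square patches of an H x W x C image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop to a multiple of the patch size and split into a grid of patches.
    cropped = image[: gh * patch_size, : gw * patch_size]
    patches = cropped.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size, patch_size, c)
    # Permute the patches and reassemble the image.
    patches = patches[rng.permutation(gh * gw)]
    grid = patches.reshape(gh, gw, patch_size, patch_size, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * patch_size, gw * patch_size, c)

def add_gaussian_noise(image, sigma=25.0, rng=None):
    """Add i.i.d. Gaussian noise to an image with pixel values in [0, 255]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

On the language side, steering typically takes the form of instruction variants in the prompt (for instance, asking the model to attend to the outline of the object versus its surface pattern); the exact prompt wording used in the paper is not quoted here.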
Implications and Future Directions
The findings of this paper have broad implications, both theoretical and practical. On a theoretical level, the evidence that VLMs’ visual biases can be partially steered through linguistic inputs enriches our understanding of multimodal learning dynamics and the complex interplay between text and image processing. Practically, the ability to modulate visual biases in VLMs could enhance model performance across tasks that require nuanced visual understanding, from improved accessibility tools to more accurate visual search and annotation systems.
Looking ahead, this exploration sets the stage for further studies into the multimodal workings of VLMs, encouraging a deeper dive into the mechanisms that underpin bias modulation. Additionally, given the rapid evolution of VLM technologies, future work could extend beyond texture and shape bias to uncover other potential biases and the extent to which they can be shaped through multimodal interactions.
Conclusion
This paper provides a foundational exploration of the texture vs. shape bias in VLMs, revealing a marked departure from the tendencies observed in vision-only models. Through careful experimentation, it establishes that while VLMs naturally exhibit a stronger shape bias than their vision-only counterparts, this bias can also be steered through linguistic prompts, though less strongly than through visual perturbations. These insights not only deepen our understanding of how VLMs operate but also offer practical pathways to better align them with human visual perception, marking a meaningful step toward more intuitive and effective multimodal models.