- The paper reveals that VLMs experience performance degradation when processing conflicting image-caption pairs, showing inherent modality biases.
- The study employs probing and clustering analyses to show that these biases arise in deeper layers of the network rather than from failures to encode either modality.
- Manipulating specific attention heads, such as router and promotion heads, improves target-modal fidelity, and the effect generalizes across datasets.
Introduction
The integration of multimodal inputs has become essential for AI systems operating in varied and complex environments. The paper investigates how vision-language models (VLMs) handle inconsistent information across modalities, using inputs such as an image of a dog paired with the caption "A photo of a cat." It shows that VLMs treat conflicting modalities differently, with some models preferring image information and others text, and it further explores how specific attention heads shape or bias this preference.
The experimental setup creates inconsistent image-caption pairs and evaluates whether models can correctly report the information from a specified target modality, either visual or textual. All tested models degrade on these conflicting inputs, and some, such as InstructBLIP, show a pronounced bias toward visual information in contradictory settings.
Figure 1: Examples of inconsistent image and caption pairs. Models need to report image or caption information based on the target modality.
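To make the setup concrete, here is a minimal Python sketch of how such conflicting pairs could be constructed and scored against a target modality. The label set, prompt wording, and the `model.generate` interface are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative sketch: build conflicting image-caption pairs and query a VLM
# for a specified target modality. Class names, prompts, and the model interface
# are assumptions for illustration, not the paper's released evaluation code.
import random

CLASSES = ["cat", "dog", "horse", "car", "airplane"]  # hypothetical label set

def make_conflicting_pairs(samples):
    """samples: list of (image, true_label). Returns tuples whose caption label
    deliberately contradicts the image label."""
    pairs = []
    for image, true_label in samples:
        wrong = random.choice([c for c in CLASSES if c != true_label])
        pairs.append((image, f"A photo of a {wrong}.", true_label, wrong))
    return pairs

def build_prompt(caption, target_modality):
    """Ask the model to answer from one modality only."""
    if target_modality == "image":
        question = "Based on the image only, what object is shown?"
    else:
        question = "Based on the caption only, what object is mentioned?"
    return f"Caption: {caption}\n{question}"

def target_modal_accuracy(model, pairs, target_modality):
    """Fraction of answers matching the label of the requested modality."""
    correct = 0
    for image, caption, true_label, caption_label in pairs:
        answer = model.generate(image=image, prompt=build_prompt(caption, target_modality))
        gold = true_label if target_modality == "image" else caption_label
        correct += int(gold in answer.lower())
    return correct / len(pairs)
```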
Internal Representational Analysis
To understand the causes of this behavior, the researchers used probing techniques to test whether the models adequately encode each modality on its own. The probes show that VLMs encode unimodal information accurately, so the performance drop is not an encoding failure. Clustering analysis instead reveals that the bias toward one modality emerges in deeper layers of the network, where representations are shaped by the target-modality prompt.
Figure 2: Accuracy of unimodal and consistency probes indicates that VLMs effectively encode modality-specific information and can detect cross-modal consistency.
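As a rough illustration of the probing idea, the sketch below fits a linear probe on hidden states from one layer and reports held-out accuracy. The feature-extraction step, layer choice, and label definitions are assumptions about the model's internals, not the paper's exact method.

```python
# Illustrative linear-probe sketch: test whether hidden states at a given layer
# linearly encode the image label and the caption label.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(hidden_states, labels):
    """hidden_states: (n_samples, d_model) features from one layer;
    labels: per-sample labels for one modality (image class or caption class)."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# Run one probe per modality and per layer; high accuracy for both modalities
# suggests the failure on conflicting inputs is not an encoding failure.
# image_acc = probe_accuracy(layer_states, image_labels)
# caption_acc = probe_accuracy(layer_states, caption_labels)
```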
Attention Head Role in Modality Preference
The paper identifies specific attention heads that dynamically restructure representations according to the target modality: modality-agnostic "router heads" and modality-specific "promotion heads." By manipulating these heads, the researchers show that the models' output preference can be steered, improving target-modal fidelity on conflicting inputs.
Figure 3: Different attention head types in Qwen2.5-VL, showing their influence on modality-specific answers.
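One plausible way to manipulate individual heads is to rescale their contribution just before the attention output projection, as in this hedged PyTorch sketch. The module layout, head indices, and scaling factor are assumptions for illustration and may not match the paper's actual intervention.

```python
# Illustrative head-intervention sketch: scale selected attention heads by hooking
# the out-projection of one decoder layer. Module paths are assumed, not verified.
import torch

def scale_heads_hook(head_indices, n_heads, scale):
    """Pre-hook on the attention out-projection: rescale chosen heads before mixing."""
    def hook(module, inputs):
        (x,) = inputs                      # (batch, seq_len, n_heads * head_dim)
        b, s, d = x.shape
        head_dim = d // n_heads
        x = x.view(b, s, n_heads, head_dim).clone()
        x[:, :, head_indices, :] *= scale  # amplify (>1) or suppress (<1) these heads
        return (x.view(b, s, d),)
    return hook

def intervene(model, layer_idx, head_indices, scale=2.0, n_heads=32):
    """Attach the hook to one decoder layer's attention out-projection (assumed path)."""
    attn = model.model.layers[layer_idx].self_attn   # assumed module layout
    return attn.o_proj.register_forward_pre_hook(
        scale_heads_hook(head_indices, n_heads, scale)
    )

# handle = intervene(model, layer_idx=20, head_indices=[3, 17], scale=2.0)
# ... run generation on conflicting inputs, then handle.remove()
```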
Cross-Dataset Generalization and Intervention
Intervention studies show that the effects of manipulating these attention heads generalize to other datasets, suggesting that attention-head interventions could improve VLM accuracy in diverse real-world scenarios involving conflicting multimodal information, as sketched below.
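A simple way to check such generalization, reusing the hypothetical helpers sketched above, is to compare target-modal accuracy with and without the intervention on each evaluation set.

```python
# Illustrative cross-dataset check. Dataset handling and the helpers
# (make_conflicting_pairs, target_modal_accuracy, intervene) are the hypothetical
# sketches above, not the paper's released code.
def evaluate_generalization(model, datasets, layer_idx, head_indices, target_modality="image"):
    results = {}
    for name, samples in datasets.items():
        pairs = make_conflicting_pairs(samples)
        baseline = target_modal_accuracy(model, pairs, target_modality)
        handle = intervene(model, layer_idx, head_indices, scale=2.0)
        intervened = target_modal_accuracy(model, pairs, target_modality)
        handle.remove()
        results[name] = {"baseline": baseline, "intervened": intervened}
    return results
```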
Conclusion
The paper provides insights into how VLMs behave when cross-modal information conflicts. The identification of attention heads that modulate modality preference is a key step toward more reliable multimodal AI systems. Future research should test whether these findings scale to larger model architectures and broader modality datasets. By further exploring and manipulating attention structures, AI systems may resolve information conflicts more reliably, leading to more coherent and contextually aware multimodal interactions.