
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs (2506.09047v2)

Published 10 Jun 2025 in cs.CL

Abstract: Vision-Language Models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.

Disentangling Modality-Specific Mechanisms in Vision-Language Models

Vision-Language Models (VLMs) have demonstrated considerable skill in processing visual and textual data, yet they exhibit notable performance discrepancies between the two modalities. "Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs" investigates the underlying causes of this performance gap and proposes a method for narrowing it without additional model training.

The paper begins by quantifying the performance gap between analogous tasks posed on visual versus textual inputs, evaluating VLMs on object counting, arithmetic operations, spatial ordering, factual recall, and sentiment analysis. While the models are proficient at the textual versions of these tasks, they show reduced accuracy on the visual counterparts: for instance, they count words more accurately than objects in an image, and identify the winner of a board game more reliably when the board is presented textually rather than visually.
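
As a concrete illustration of this comparison, the sketch below measures the gap on paired textual and visual versions of the same task. The model_answer callable and the prompt pairing are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
# Hypothetical harness for measuring the modality gap on paired prompts.
# `model_answer` is an assumed callable that runs the VLM on a prompt
# (a text string or an image) and returns its answer as a string.
def modality_gap(model_answer, paired_examples):
    """paired_examples: list of (text_prompt, image_prompt, gold_answer)."""
    n = len(paired_examples)
    text_hits = sum(model_answer(t) == gold for t, _, gold in paired_examples)
    image_hits = sum(model_answer(i) == gold for _, i, gold in paired_examples)
    return {
        "text_acc": text_hits / n,
        "image_acc": image_hits / n,
        "gap": (text_hits - image_hits) / n,  # positive when text outperforms vision
    }
```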

To investigate this gap, the authors identify circuits: the task-specific computational sub-graphs responsible for solving each task in each modality. Circuit discovery techniques are used to determine which model components are active during inference, and the discovered components group into three sub-circuits aligned with prompt positions: data, query, and generation. A key finding is that, while the circuits for the text and image versions of a task are largely disjoint, they implement remarkably similar functionalities; the differences are concentrated in how the modality-specific data positions are processed.
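
The sketch below gives a rough idea of how such position-grouped importance scores can be obtained by mean-ablating per-layer attention and MLP outputs. It assumes a HuggingFace LLaMA-style backbone (model.model.layers[i].self_attn / .mlp, outputs exposing .logits) and a precomputed mean_acts cache; these names and the exact metric are illustrative, not the authors' precise procedure.

```python
import torch

def ablation_scores(model, inputs, answer_ids, positions, mean_acts):
    """Score each (layer, kind) component by how much replacing its output at
    the given prompt positions with a precomputed mean lowers the answer logit."""
    def answer_logit():
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]        # next-token logits
        return logits.gather(-1, answer_ids[:, None]).mean().item()

    base = answer_logit()
    scores = {}
    for l, layer in enumerate(model.model.layers):
        for kind, module in (("attn", layer.self_attn), ("mlp", layer.mlp)):
            def ablate(mod, args, output, l=l, kind=kind):
                out = output[0] if isinstance(output, tuple) else output
                out = out.clone()
                out[:, positions, :] = mean_acts[(l, kind)]  # broadcast mean activation
                return (out,) + output[1:] if isinstance(output, tuple) else out

            handle = module.register_forward_hook(ablate)
            scores[(l, kind)] = base - answer_logit()        # importance = logit drop
            handle.remove()
    return scores
```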

Subsequently, the paper investigates whether components in one modality can substitute for those in another, assessing functional interchangeability. While query and generation components are interchangeable, data processing sub-circuits exhibit stark disparities. This indicates that modality-specific processing of the initial data positions is a primary contributor to performance differences.
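
One simplified way to probe such interchangeability is sketched below: residual-stream states at the query positions of a run on the textual prompt are patched into the run on the analogous visual prompt. This is a position-level proxy for the paper's component-level swap; the module path and the assumption that query positions can be mapped between the two prompts are illustrative.

```python
import torch

def swap_query_states(model, text_inputs, image_inputs,
                      text_query_pos, image_query_pos, layers):
    """Run the textual prompt, cache its residual-stream states at the query
    positions, then inject them at the matching positions of the visual run."""
    cache = {}

    def grab(l):
        def hook(module, args, output):
            hs = output[0] if isinstance(output, tuple) else output
            cache[l] = hs[:, text_query_pos, :].detach()
        return hook

    def inject(l):
        def hook(module, args, output):
            hs = output[0] if isinstance(output, tuple) else output
            hs = hs.clone()
            hs[:, image_query_pos, :] = cache[l]
            return (hs,) + output[1:] if isinstance(output, tuple) else hs
        return hook

    handles = [model.model.layers[l].register_forward_hook(grab(l)) for l in layers]
    with torch.no_grad():
        model(**text_inputs)                     # donor run (textual prompt)
    for h in handles:
        h.remove()

    handles = [model.model.layers[l].register_forward_hook(inject(l)) for l in layers]
    with torch.no_grad():
        out = model(**image_inputs)              # recipient run (visual prompt)
    for h in handles:
        h.remove()
    return out
```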

Analyzing the data-token representations reveals that visual token embeddings become aligned with their analogous textual embeddings only in the deeper model layers, too late in processing to effectively influence subsequent positions. The authors propose back-patching as a remedy: during inference, later-layer representations of the visual data tokens are re-injected into earlier layers, so that query and generation positions can make use of the better-aligned representations. Applied systematically across tasks and models, this intervention closes approximately 32% of the identified performance gap on average, without requiring model retraining.
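
A minimal sketch of a back-patching pass is shown below, assuming a HuggingFace LLaMA-style decoder exposed at model.model.layers: the visual-token hidden states are cached after a later source layer and re-injected at the input of an earlier destination layer in a second forward pass. The layer indices and the image_positions set are placeholders; a single source/destination pair is shown for clarity, whereas the paper applies the idea systematically.

```python
import torch

def backpatch(model, inputs, image_positions, src_layer, dst_layer):
    """Two forward passes: cache the visual-token hidden states after
    `src_layer`, then overwrite those positions at the input of `dst_layer`
    (with dst_layer < src_layer) in a second pass."""
    cached = {}

    def grab(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        cached["h"] = hs[:, image_positions, :].detach()

    def inject(module, args):
        hidden = args[0].clone()
        hidden[:, image_positions, :] = cached["h"]
        return (hidden,) + args[1:]

    # Pass 1: record later-layer representations of the visual data tokens.
    h1 = model.model.layers[src_layer].register_forward_hook(grab)
    with torch.no_grad():
        model(**inputs)
    h1.remove()

    # Pass 2: re-inject them into an earlier layer so that query and
    # generation positions can attend to the better-aligned representations.
    h2 = model.model.layers[dst_layer].register_forward_pre_hook(inject)
    with torch.no_grad():
        out = model(**inputs)
    h2.remove()
    return out
```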

This research suggests that improving visual processing in VLMs hinges on achieving earlier and more effective alignment between visual and textual token representations. The findings carry both practical and theoretical implications for building more balanced multi-modal systems. Looking forward, incorporating modality-specific strategies into model architectures or training protocols may further improve cross-modal performance, and future work could explore flexible per-token allocation of computation, particularly in multi-modal scenarios where visual inputs demand more processing than text.

In summary, the paper offers a mechanistic account of the multi-modal performance gap in VLMs, tracing it to modality-specific processing of data tokens and showing that a simple, training-free back-patching intervention can substantially narrow it.

Authors (4)
  1. Yaniv Nikankin (5 papers)
  2. Dana Arad (5 papers)
  3. Yossi Gandelsman (28 papers)
  4. Yonatan Belinkov (111 papers)