Disentangling Modality-Specific Mechanisms in Vision-Language Models
Vision-language models (VLMs) handle both visual and textual inputs with considerable skill, yet their performance differs markedly between the two modalities. "Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs" investigates the underlying causes of this gap and proposes a way to narrow it without additional model training.
The paper begins by quantifying the gap between visual and textual versions of the same tasks, evaluating VLMs on object counting, arithmetic, spatial ordering, factual recall, and sentiment analysis. Although the models are proficient at the textual variants, their accuracy drops on the visual counterparts: they count words more accurately than objects in an image, for instance, and identify the winner of a board game more reliably from a textual description than from a picture of the board.
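To make the evaluation concrete, the following is a minimal sketch of the paired-prompt protocol described above: each example carries a textual and a visual version of the same question, and per-task accuracy is compared across the two. The `predict_fn`, example fields, and stubbed predictor are hypothetical placeholders rather than the paper's actual harness.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def modality_gap(
    examples: List[dict],
    predict_fn: Callable[[dict, str], str],
) -> Dict[str, Tuple[float, float, float]]:
    """Return {task: (text_acc, image_acc, text_acc - image_acc)} over paired examples."""
    counts = defaultdict(lambda: {"text": 0, "image": 0, "n": 0})
    for ex in examples:
        counts[ex["task"]]["n"] += 1
        for modality in ("text", "image"):
            # predict_fn runs the VLM on the textual or the visual version of the prompt.
            if predict_fn(ex, modality).strip() == ex["answer"]:
                counts[ex["task"]][modality] += 1
    return {
        task: (c["text"] / c["n"], c["image"] / c["n"], (c["text"] - c["image"]) / c["n"])
        for task, c in counts.items()
    }

# Usage with a stubbed predictor; a real run would call the VLM here.
examples = [{"task": "counting", "answer": "3", "text_prompt": "...", "image": None}]
print(modality_gap(examples, lambda ex, modality: "3" if modality == "text" else "2"))
```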
To explain this, the authors dissect VLMs into task-specific circuits: computational sub-graphs responsible for performing a task on a given input modality. Circuit discovery techniques identify which components are active during inference and reveal three main sub-circuits aligned with prompt positions: data, query, and generation. A key finding is that although the text and image circuits share few components, they implement remarkably similar functions, except at the modality-specific data positions.
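One way to picture such position-level circuit discovery is activation patching, sketched below on a toy transformer: layer outputs from a clean run are cached, then patched one (layer, position) at a time into a corrupted run while a scalar metric tracks how much behaviour is restored. The toy model, random inputs, and metric are stand-ins; the paper's actual discovery procedure and models differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 6
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3
)
model.eval()
layers = list(model.layers)

clean = torch.randn(1, seq_len, d_model)    # stand-in for the original prompt
corrupt = torch.randn(1, seq_len, d_model)  # stand-in for a corrupted prompt
metric = lambda out: out[0, -1, 0].item()   # stand-in for a logit-difference metric

def run_with_hooks(x, hooks):
    handles = [layer.register_forward_hook(fn) for layer, fn in hooks]
    try:
        with torch.no_grad():
            return model(x)
    finally:
        for h in handles:
            h.remove()

# 1) Cache every layer's output on the clean run.
cache = {}
def make_cache_hook(i):
    def hook(module, inputs, output):
        cache[i] = output.detach()
    return hook
run_with_hooks(clean, [(l, make_cache_hook(i)) for i, l in enumerate(layers)])

# 2) Patch each (layer, position) into the corrupted run and score the effect.
scores = torch.zeros(len(layers), seq_len)
for i, layer in enumerate(layers):
    for pos in range(seq_len):
        def patch_hook(module, inputs, output, i=i, pos=pos):
            patched = output.clone()
            patched[:, pos] = cache[i][:, pos]
            return patched  # returning a tensor replaces the layer's output
        out = run_with_hooks(corrupt, [(layer, patch_hook)])
        scores[i, pos] = metric(out)

print(scores)  # entries far from the corrupted baseline mark influential components
```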
The paper then tests functional interchangeability: can components discovered for one modality stand in for those of the other? Swapping them shows that the query and generation sub-circuits transfer across modalities, whereas the data-processing sub-circuits do not. This points to the modality-specific processing of the initial data positions as the primary source of the performance difference.
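The interchangeability test can be illustrated with the same hook machinery: activations at an assumed set of query/generation positions are copied from a text-prompt run into an image-prompt run at a few early layers, and the patched output is compared with the unpatched one. The toy model, position indices, and choice of layers below are illustrative assumptions; the paper substitutes the discovered circuit components rather than whole layer outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 6
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3
)
model.eval()

text_run = torch.randn(1, seq_len, d_model)   # stand-in for text-prompt embeddings
image_run = torch.randn(1, seq_len, d_model)  # stand-in for image-prompt embeddings
query_gen_positions = [3, 4, 5]               # assumed query + generation positions
swap_layers = [0, 1]                          # assumed layers whose outputs get swapped

# Cache the text-run activations of the swapped layers.
cache = {}
def make_cache_hook(i):
    def hook(module, inputs, output):
        cache[i] = output.detach()
    return hook
handles = [model.layers[i].register_forward_hook(make_cache_hook(i)) for i in swap_layers]
with torch.no_grad():
    model(text_run)
for h in handles:
    h.remove()

# Re-run on the image prompt, swapping in text-run activations at those positions.
def make_swap_hook(i):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, query_gen_positions] = cache[i][:, query_gen_positions]
        return patched
    return hook
handles = [model.layers[i].register_forward_hook(make_swap_hook(i)) for i in swap_layers]
with torch.no_grad():
    swapped_out = model(image_run)
for h in handles:
    h.remove()

with torch.no_grad():
    plain_out = model(image_run)
# Preserved task behaviour after such a swap is what indicates that the query and
# generation sub-circuits are functionally interchangeable across modalities.
print((swapped_out - plain_out).abs().max())
```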
Analyzing the data-token representations reveals that visual token embeddings align with their analogous textual embeddings only in the deeper layers of the model, suggesting that visual inputs adapt to the language model's representation space too late. As a remedy, the authors propose back-patching: later-layer visual representations are re-injected into earlier layers during inference, giving the model additional computation over the already-adapted visual tokens. Applied systematically, back-patching improves accuracy across tasks, closing approximately 32% of the measured performance gap without any retraining.
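Back-patching itself is straightforward to express with forward hooks, as in the sketch below: a first pass caches the image-token hidden states at a later layer, and a second pass overwrites an earlier layer's output at those positions with the cached states. The toy model, image-token positions, and source/target layer indices are assumptions; the paper applies the idea to real VLMs and searches over layer pairs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 8
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=6
)
model.eval()

x = torch.randn(1, seq_len, d_model)  # stand-in for a multimodal prompt
image_positions = [0, 1, 2, 3]        # assumed positions of the visual tokens
src_layer, dst_layer = 4, 1           # take from a later layer, inject into an earlier one

# Pass 1: cache the later-layer representations of the image tokens.
cached = {}
h = model.layers[src_layer].register_forward_hook(
    lambda module, inputs, output: cached.__setitem__("src", output.detach())
)
with torch.no_grad():
    model(x)
h.remove()

# Pass 2: overwrite the earlier layer's output at the image positions with the cache.
def back_patch(module, inputs, output):
    patched = output.clone()
    patched[:, image_positions] = cached["src"][:, image_positions]
    return patched  # downstream layers now see the "later" visual representations early

h = model.layers[dst_layer].register_forward_hook(back_patch)
with torch.no_grad():
    back_patched_out = model(x)
h.remove()
print(back_patched_out[0, -1, :4])
```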
This research suggests that improving visual processing in VLMs hinges on earlier, more effective alignment between visual and textual token representations. The findings have both practical and theoretical implications for building more balanced VLM systems: modality-aware choices in architecture or training could encourage this alignment directly, and future work could explore flexible per-token allocation of computation, particularly in multi-modal settings where visual inputs demand more processing than text.
In summary, the paper provides a mechanistic account of the multi-modal performance gap in VLMs, tracing it to modality-specific processing of the data tokens and showing that it can be partially closed by patching their representations.