
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding (2508.20279v1)

Published 27 Aug 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Multimodal LLMs (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.

Summary

  • The paper introduces a linear probing framework that analyzes how MLLMs process visual and textual data across model layers.
  • It finds early layers focus on visual grounding, middle layers on lexical integration and semantic reasoning, and later layers on answer decoding.
  • Results highlight a consistent stage-wise processing structure across models, with architectural differences influencing layer allocation for specific tasks.

Probing Multimodal LLMs: Understanding Layer-wise Dynamics

The paper "How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding" focuses on unraveling the internal processing dynamics of Multimodal LLMs (MLLMs). It proposes a probing framework for examining how these models handle visual and textual inputs across various layers. This approach sheds light on the distinct roles that different layers play in tasks such as image captioning and visual question answering.

Methodology and Probing Framework

The paper introduces a systematic framework that uses linear probing to discern how MLLMs process visual and textual inputs layer by layer. Specifically, linear classifiers are trained to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each transformer layer under a standardized anchor question. The probes are then evaluated under three types of controlled prompt variations: lexical variants alter surface-level wording, semantic negation variants invert the expected answer by modifying the visual concept in the prompt, and output format variants preserve the reasoning but change the answer format.
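
As a rough illustration of the probing step (not the authors' released code), the Python sketch below fits one linear classifier per decoder layer on pre-extracted hidden states and reports its held-out accuracy; the array shapes, the choice of token position, and the synthetic data are assumptions made for the example.

```python
# Minimal sketch: per-layer linear probes over pre-extracted token embeddings.
# Assumes embeddings[l] is an (n_samples, d_model) array of hidden states taken
# from decoder layer l at a fixed token position (e.g., the last prompt token)
# under the standardized anchor question; labels holds fine-grained class ids.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_per_layer(embeddings, labels, seed=0):
    """Fit one linear classifier per layer and return its held-out accuracy."""
    accuracies = []
    for layer_states in embeddings:  # iterate over decoder layers
        X_tr, X_te, y_tr, y_te = train_test_split(
            layer_states, labels, test_size=0.2, random_state=seed, stratify=labels
        )
        clf = LogisticRegression(max_iter=2000)
        clf.fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return accuracies

# Toy usage with synthetic data standing in for real hidden states.
rng = np.random.default_rng(0)
n_layers, n_samples, d_model, n_classes = 4, 200, 64, 5
labels = rng.integers(0, n_classes, size=n_samples)
embeddings = [rng.normal(size=(n_samples, d_model)) for _ in range(n_layers)]
print(probe_accuracy_per_layer(embeddings, labels))
```

The resulting curve of accuracies across layers is the raw signal from which the stage-wise structure described below is read off.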

Through layer-wise probing of LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, the research identifies a stage-wise structure that remains consistent despite differences in model architecture and training data: early layers focus on visual grounding, middle layers on lexical alignment and semantic reasoning, and final layers on task-specific decoding (Figure 1).

Figure 1: Layer-wise stage comparison between LLaVA-1.5 and Qwen2-VL.

Analysis of Layer-wise Processing

The paper's findings particularly highlight the structured progression of information through the model layers:

  • Visual Grounding: Early layers are dedicated to encoding visual inputs, which remain largely insensitive to textual prompts.
  • Lexical Integration: The middle layers show marked sensitivity to lexical variations, demonstrating where visual features begin aligning with linguistic cues.
  • Semantic Reasoning: Observations show middle to upper-middle layers commit to specific interpretations based on semantic changes, such as those introduced by negation.
  • Answer Decoding: The later layers convert the internal reasoning into coherent, task-specific output formats (Figure 2; a sketch of this train-on-anchor, test-on-variant evaluation follows the figure caption below).

Figure 2: Linear probing at decoder layer k, showing training and testing methodologies across prompt variants.
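
As a rough illustration of this protocol, the sketch below fits a per-layer probe on anchor-prompt embeddings and scores it on embeddings extracted under a prompt variant; layers where the two scores diverge suggest which processing stage the variant perturbs. The helper names and toy data are hypothetical, and details such as label handling for negation variants follow the paper rather than this sketch.

```python
# Hedged sketch of the cross-variant evaluation (hypothetical helper names, not
# the authors' code). A probe is fit on anchor-prompt embeddings at each layer
# and then scored on embeddings extracted under a prompt variant (lexical,
# semantic negation, or output format) for the same images.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_variant_accuracy(anchor_states, variant_states, labels):
    """anchor_states/variant_states: per-layer lists of (n_samples, d_model) arrays."""
    results = []
    for anchor_l, variant_l in zip(anchor_states, variant_states):
        clf = LogisticRegression(max_iter=2000).fit(anchor_l, labels)
        results.append({
            "anchor_acc": clf.score(anchor_l, labels),    # fit-set accuracy, sanity check
            "variant_acc": clf.score(variant_l, labels),  # transfer to the variant prompt
        })
    return results

# Toy usage with random arrays standing in for real hidden states.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=100)
anchor = [rng.normal(size=(100, 32)) for _ in range(3)]
variant = [a + rng.normal(scale=0.5, size=a.shape) for a in anchor]
print(cross_variant_accuracy(anchor, variant, labels))
```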

Determinants of Processing Dynamics

Despite differences in visual tokenization and instruction-tuning data among various models, fundamental processing dynamics are preserved. However, the distribution of layer depth allocated to specific stages varies substantially with the architecture of the base LLM. For instance, while LLaVA-1.5 and LLaVA-Next exhibit similar processing hierarchies, Qwen2-VL allocates fewer layers to visual grounding and extends reasoning across more layers, underscoring the architectural impact on information integration (Figure 3).

Figure 3: Probing accuracy across layers for lexical and semantic negation variants.
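
To compare stage allocation between models of different depths, one natural approach, shown below purely as an illustrative heuristic and not taken from the paper, is to normalize layer indices to [0, 1] and read off where the probing-accuracy curve saturates; here the end of visual grounding is marked, for the sake of the example, as the first normalized depth at which accuracy reaches 90% of its peak.

```python
# Hedged sketch: comparing stage allocation across models with different depths.
# Assumes each curve is a list of per-layer probe accuracies (e.g., from
# probe_accuracy_per_layer above); the 90%-of-peak threshold is an arbitrary
# illustrative choice, not the paper's criterion.
def normalized_grounding_end(per_layer_acc, frac=0.9):
    peak = max(per_layer_acc)
    for i, acc in enumerate(per_layer_acc):
        if acc >= frac * peak:
            return i / (len(per_layer_acc) - 1)  # normalized depth in [0, 1]
    return 1.0

acc = {
    "model_a": [0.2, 0.4, 0.7, 0.9, 0.92, 0.93],   # toy accuracy curves, not real data
    "model_b": [0.3, 0.8, 0.91, 0.92, 0.90, 0.93],
}
for name, curve in acc.items():
    print(name, round(normalized_grounding_end(curve), 2))
```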

Probing methods, previously employed to study LLM internals, are adapted here to the multimodal setting. Unlike model-specific instrumentation such as causal tracing, this framework provides architecture-agnostic insight into how the visual and textual modalities interact without modifying model behavior.

Conclusion

This research offers a compelling framework for interpreting the internal organization of MLLMs, revealing their stage-wise processing of visual and textual information. By elucidating the layer-wise division of labor within these models, the findings advance interpretability and may inform model optimization strategies. The stage-wise mapping described here could guide future work on more efficient and transparent multimodal AI systems.
