
A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Published 31 Mar 2026 in cs.LG, cs.CL, and cs.CV | (2603.29676v1)

Abstract: Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .

Summary

  • The paper introduces a PID framework that partitions LVLM predictions into redundancy, vision-uniqueness, language-uniqueness, and synergy.
  • It employs layer-wise and temporal analysis across 26 state-of-the-art models to reveal task-dependent multimodal fusion dynamics and accuracy correlations.
  • Results show that scaling and training protocols modulate the balance between emergent fusion and language-based generalization, informing LVLM optimization.

Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Introduction

The paper "A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models" (2603.29676) addresses the interpretability and internal information pathways of large vision-language models (LVLMs), a domain where accuracy-based metrics alone yield an incomplete understanding of the models' decision-making. The authors present a model-agnostic framework based on Partial Information Decomposition (PID) to dissect the modality contributions and their interactions in LVLMs, distinguishing between redundancy, vision-uniqueness, language-uniqueness, and synergy as sources of predictive information.

Their methodology is scalable to high-dimensional model outputs and enables comparative, layer-wise, and temporal analysis across 26 SOTA LVLMs, four tasks, and different training phases. By quantifying the information spectrum that drives model predictions, this approach reveals cross-family strategies, task-dependent information regimes, and the layerwise and temporal emergence of multimodal fusion.

Methodology

PID Framework and Adaptations

Central to the study is the formal integration of the PID framework into LVLM analysis. Classic mutual information fails to distinguish among overlapping and complementary informational components from inputs. PID, in contrast, partitions the mutual information I(X_1, X_2; Y) between multimodal sources (vision X_1, language X_2) and a target Y into:

  • R (Redundancy): information shared by both modalities,
  • U_1 (Vision-uniqueness): vision-exclusive information,
  • U_2 (Language-uniqueness): language-exclusive information,
  • S (Synergy): information emerging only from joint processing.
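As a minimal illustration of this accounting (not the paper's neural estimator), the sketch below computes the four terms for a toy XOR distribution using the common minimum-mutual-information (MMI) redundancy proxy; since the XOR target is predictable only from both inputs jointly, all information lands in S:

```python
import itertools, math
from collections import Counter

def mutual_info(samples, a_idx, b_idx):
    """I(A;B) in bits from a list of equally likely sample tuples."""
    n = len(samples)
    pa = Counter(tuple(s[i] for i in a_idx) for s in samples)
    pb = Counter(tuple(s[i] for i in b_idx) for s in samples)
    pab = Counter((tuple(s[i] for i in a_idx),
                   tuple(s[i] for i in b_idx)) for s in samples)
    return sum((c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# XOR world: Y = X1 xor X2 is knowable only from both inputs jointly.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in itertools.product([0, 1], repeat=2)]

i1  = mutual_info(samples, (0,), (2,))     # I(X1;Y) = 0
i2  = mutual_info(samples, (1,), (2,))     # I(X2;Y) = 0
i12 = mutual_info(samples, (0, 1), (2,))   # I(X1,X2;Y) = 1 bit

# MMI redundancy proxy -- one simple PID choice, not the paper's estimator.
R  = min(i1, i2)
U1 = i1 - R
U2 = i2 - R
S  = i12 - R - U1 - U2                     # identity: I = R + U1 + U2 + S
print(R, U1, U2, S)                        # 0.0 0.0 0.0 1.0 (pure synergy)
```

Swapping the XOR target for `Y = X1` would instead put the whole bit into U_1, which is the contrast the decomposition is designed to expose.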

A scalable BATCH estimator is adapted to support continuous, high-dimensional embeddings output by LVLMs. Unimodal probes are obtained by masking the counterpart modality at the embedding level with calibrated noise, ensuring in-distribution statistics and mitigating estimation artifacts.
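A hedged sketch of the masking idea follows; the function name and the exact calibration are assumptions, not the paper's procedure. The masked-out modality's embeddings are replaced with Gaussian noise matched to that modality's batch mean and standard deviation, so the unimodal probe stays in-distribution:

```python
import numpy as np

def mask_modality(emb, rng):
    """Replace one modality's embeddings with noise calibrated to that
    modality's own batch statistics (per-dimension mean and std), so the
    probe remains in-distribution. emb: (batch, dim) array to mask out."""
    mu = emb.mean(axis=0, keepdims=True)
    sigma = emb.std(axis=0, keepdims=True)
    return mu + sigma * rng.standard_normal(emb.shape)

rng = np.random.default_rng(0)
vision = rng.standard_normal((64, 768)) * 2.0 + 5.0  # toy vision embeddings
masked = mask_modality(vision, rng)

# Same shape, and first/second moments track the original batch,
# so downstream layers see statistically plausible (but uninformative) input.
print(masked.shape)
```

Feeding `masked` in place of the vision embeddings yields the language-only probe; the symmetric operation on text embeddings yields the vision-only probe.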

Experimental Design

Three main analytical dimensions are considered:

  1. Cross-model/cross-task: 26 LVLMs from 11 families are compared across MMBench, POPE, PMC-VQA, and Reefknot, covering general reasoning, hallucination, and domain-specific tasks.
  2. Layer-wise dynamics: PID is applied at each transformer block (via the logit lens), quantifying how each layer modulates information flow and fusion.
  3. Training trajectory: The PID spectrum is traced through pretraining and instruction-tuning phases of LLaVA-1.5 models, capturing the evolution of fusion capabilities.
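The logit-lens step in (2) can be sketched as follows; the layer count, hidden size, and four-candidate answer head are illustrative assumptions. Each layer's hidden state is projected through the final unembedding matrix, giving a per-layer distribution over answer candidates to which PID can then be applied:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, W_unembed):
    """Project every layer's hidden state through the model's final
    unembedding matrix, yielding one candidate distribution per layer.
    hidden_states: (n_layers, dim); W_unembed: (dim, n_candidates)."""
    return softmax(hidden_states @ W_unembed)

rng = np.random.default_rng(1)
hs = rng.standard_normal((32, 4096))   # toy per-layer hidden states
W  = rng.standard_normal((4096, 4))    # toy 4-candidate answer head
per_layer_probs = logit_lens(hs, W)
print(per_layer_probs.shape)           # (32, 4): one distribution per layer
```

Running the PID estimator on these per-layer distributions (for full, vision-masked, and language-masked inputs) is what produces the layer-wise information trajectories discussed below.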

Empirical Findings

Task Regimes, Model Families, and Scaling

Strong empirical dichotomies arise in both task and model behaviors:

  • Synergy-driven vs. Knowledge-driven regimes: Tasks can be partitioned by their demand for synergy (S) vs. language-side priors (U_2). On MMBench and POPE, high synergy dominates and correlates strongly with accuracy (high Spearman's rho), reflecting tasks where genuine multimodal fusion is essential. On PMC-VQA and Reefknot, where language-side knowledge is primary, U_2 is the main accuracy correlate, and synergy is limited by the need for internal knowledge not manifest in the image.
  • Fusion-centric vs. Language-centric families: Model families systematically cluster into fusion-centric (high S, low U_2) and language-centric (low S, high U_2) strategies. These tendencies persist across tasks and are robust to model scaling.
  • Scaling effects: Larger checkpoints in fusion-centric families achieve accuracy gains predominantly through increases in S, not U_2. This contradicts the hypothesis that scale primarily increases reliance on language priors; for tasks with strong multimodal dependencies, fusion scaling is the operative axis.
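The accuracy-correlation claims above reduce to a rank correlation between per-model PID scores and task accuracy. A self-contained sketch with hypothetical numbers (the data points are invented for illustration, not the paper's measurements):

```python
def rank(xs):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-model (synergy, accuracy) pairs on a synergy-driven task.
synergy  = [0.12, 0.30, 0.25, 0.41, 0.18, 0.35]
accuracy = [0.55, 0.71, 0.68, 0.80, 0.60, 0.74]
print(round(spearman(synergy, accuracy), 3))  # 1.0 (perfectly monotone toy data)
```

On a knowledge-driven task the same computation would be run with per-model U_2 in place of synergy.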

Layerwise and Temporal Patterns

  • Three-phase layerwise dynamics: Across representative models, layerwise PID profiling reveals a robust three-phase information trajectory: early layers carry negligible S; mid-to-late layers show the emergence and peak of U_2; and the final layer exhibits a synergy spike concurrent with a drop in U_2. This demarcates a decisive final fusion event distinct from gradual language-prior building.
  • Fusion emergence in training: During vision-language alignment pretraining, S and U_2 remain flat; fusion capacity is unlocked almost exclusively during visual instruction tuning. Notably, scale modulates the S vs. U_2 emphasis: smaller models (7B) see S increase most during fine-tuning, while larger models (13B) amplify U_2. Thus, scaling and training protocol jointly drive the balance between emergent fusion and language-based generalization.

Alignment with Intervention and Qualitative Case Studies

Ablation removing visual context validates the PID synergy term as a marker of true visual reliance and fusion. Case studies confirm that fusion-centric and language-centric models can arrive at correct answers via orthogonal strategies, either through emergent synergy or by correcting and overriding language priors with visual cues, as made explicit by the PID decomposition.

Implications and Future Directions

The PID-based diagnostic provides a principled, quantitative process-level metric for multimodal inference distinct from accuracy. Its major implications include:

  • Benchmark and objective construction: PID spectrum signals (especially S and U_2) can inform benchmark design (by controlling for synergy or language dependence) and could serve as auxiliary optimization targets during instruction tuning for desired fusion-generalization tradeoffs.
  • Scaling and architecture search: Family-level strategy is highly stable; manipulating training protocols, model composition, and scale alters the balance between S and U_2. PID could become a guidepost for architectural and curriculum design, especially in settings with specific multimodal integration desiderata.
  • Generative and open-ended extension: Current PID estimators require discrete output spaces; generalizations to structured or open-ended prediction settings are a key avenue for extending the framework's diagnostic coverage.

Limitations include the reliance on approximate unimodal probes, the restriction to discrete-candidate tasks, and the fundamentally correlational (not causal) nature of the method, as highlighted by the authors.

Conclusion

This work establishes a systematic, model-agnostic, and information-theoretically grounded approach for interrogating LVLMs, revealing latent strategies and task demands invisible to conventional aggregate metrics. The PID information spectrum (R, U_1, U_2, S) enables precise attribution of model decisions to unique, redundant, or synergistic modality contributions, facilitating targeted analysis and design improvements in LVLM development. The framework's extensibility positions it for a pivotal role in future research on multimodal foundation model transparency, interpretability, and controlled generalization (2603.29676).
