- The paper introduces a novel framework that improves interpretability by focusing on intra-modal token interactions.
- It employs Multi-Scale Explanation Aggregation (MSEA) to integrate spatial contexts in visual data for more coherent explanations.
- Activation Ranking Correlation (ARC) suppresses spurious textual activations, and the combined framework achieves improvements of up to 14.52% across benchmarks.
Explaining Multimodal LLMs via Intra-Modal Token Interactions
The work presented in this paper addresses a critical aspect of Multimodal LLMs (MLLMs): the interpretability of their internal decision-making processes. While these models excel at a wide range of vision-language tasks, the mechanisms behind their reasoning remain opaque. This work proposes techniques to enhance interpretability by focusing on intra-modal interactions, which existing methods that concentrate on cross-modal attributions have largely overlooked.
Introduction
MLLMs have achieved significant success across diverse tasks such as visual question answering and image captioning. However, their internal reasoning remains poorly understood, which limits our ability to diagnose errors, improve model design, and ensure safety and accountability. The existing focus on cross-modal attribution fails to adequately explain model predictions because it models intra-modal dependencies insufficiently. In particular, visual-modality explanations often miss spatial context, while textual-modality explanations are susceptible to spurious activations caused by reliance on preceding tokens.
Proposed Framework
The paper introduces a framework that explicitly models intra-modal interactions to yield more faithful and coherent explanations of MLLM behavior. Its two key components are:
Multi-Scale Explanation Aggregation (MSEA)
MSEA enhances visual interpretability by aggregating attributions computed over inputs at multiple scales. Dynamically adjusting the effective receptive field in this way produces holistic, spatially coherent visual explanations, addressing the fragmented explanations that arise when image patches are attributed in isolation.
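The paper's exact procedure is not reproduced here, but the following minimal sketch illustrates the multi-scale aggregation idea under stated assumptions: a base attribution method is supplied as the hypothetical callable `attribution_fn`, and the scale factors and uniform averaging are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def multi_scale_attribution(image, attribution_fn, scales=(0.75, 1.0, 1.25)):
    """Aggregate attribution maps computed at several input resolutions.

    image          : (1, 3, H, W) image tensor.
    attribution_fn : hypothetical callable that maps a (rescaled) image to a
                     2D patch-level relevance map for the target output token.
    scales         : illustrative scale factors; not the paper's exact values.
    """
    _, _, h, w = image.shape
    aggregated = torch.zeros(h, w)
    for s in scales:
        # Rescaling the input changes the effective receptive field covered
        # by each image patch, exposing coarser or finer spatial context.
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        attr = attribution_fn(scaled)  # 2D map at the rescaled resolution
        # Resample every map back to the original resolution and average,
        # yielding a single spatially coherent explanation.
        attr = F.interpolate(attr[None, None].float(), size=(h, w),
                             mode="bilinear", align_corners=False)[0, 0]
        aggregated += attr / len(scales)
    return aggregated
```

In practice `attribution_fn` would wrap the underlying MLLM and the output token being explained; the uniform average could also be replaced with weights favoring particular scales.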
Activation Ranking Correlation (ARC)
For the textual modality, ARC mitigates the influence of irrelevant preceding tokens. It measures the relevance of each preceding token by the alignment between top-k prediction rankings, suppressing spurious activations from irrelevant contexts while preserving semantically coherent interactions. This focus on relevant token interactions improves the fidelity of textual attributions.
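As a rough illustration of the ranking-alignment idea, the sketch below scores one preceding token by a Spearman-style correlation between the model's top-k prediction ranking and the ranking induced by that token's contribution. How the per-token logits are isolated, and the exact correlation and clipping used, are assumptions made for illustration rather than the paper's precise formulation.

```python
import torch

def arc_weight(logits_full, logits_token, k=10, eps=1e-8):
    """Agreement between the full model's top-k prediction ranking and the
    ranking induced by a single preceding token.

    logits_full  : (V,) next-token logits from the full context.
    logits_token : (V,) logits attributed to one preceding token (isolating
                   these is assumed to be handled elsewhere).
    Returns a weight in [0, 1]; low values flag spurious contexts.
    """
    topk = torch.topk(logits_full, k).indices
    # Rank of every vocabulary item under each distribution (0 = top logit).
    rank_full = torch.argsort(torch.argsort(logits_full, descending=True))
    rank_token = torch.argsort(torch.argsort(logits_token, descending=True))
    r_full = rank_full[topk].float()
    r_token = rank_token[topk].float()
    # Centered correlation over the top-k set; anti-correlated (spurious)
    # contexts are clamped to zero so their attributions are suppressed.
    r_full, r_token = r_full - r_full.mean(), r_token - r_token.mean()
    corr = (r_full * r_token).sum() / (r_full.norm() * r_token.norm() + eps)
    return corr.clamp(min=0.0)
```

The resulting weight can multiply the corresponding token's attribution score, so preceding tokens whose predictive behavior disagrees with the model's actual output ranking contribute little to the final explanation.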
Figure 1: Overview of our proposed framework.
Experiments and Results
The proposed methods were evaluated on multiple MLLMs, including LLaVA-1.5, Qwen2-VL, and InternVL2.5, across datasets such as COCO Caption, OpenPSG, and GranDf. The framework consistently outperformed existing interpretability methods on metrics such as Obj-IoU, Func-IoU, and F1-IoU (an IoU-style evaluation is sketched after the list below):
- Overall, improvements ranged from 3.69% to 14.52% across different models and datasets.
- Significant reductions in noise and false positives were observed, particularly on Func-IoU, indicating the framework's robustness in suppressing irrelevant activations.
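For intuition about how these IoU-style scores are computed, the sketch below binarizes an attribution map and compares it with a ground-truth region mask; the max-relative threshold is an assumption of this sketch, not the benchmarks' exact protocol.

```python
import torch

def attribution_iou(attr_map, gt_mask, threshold=0.5):
    """IoU between a binarized attribution map and a ground-truth mask.

    attr_map  : (H, W) float tensor of attribution scores.
    gt_mask   : (H, W) boolean tensor marking the ground-truth region
                (e.g. the mentioned object for an Obj-IoU-style score).
    threshold : fraction of the maximum score used for binarization;
                an illustrative choice, not the benchmarks' setting.
    """
    pred = attr_map >= threshold * attr_map.max()
    intersection = (pred & gt_mask).sum().item()
    union = (pred | gt_mask).sum().item()
    return intersection / union if union > 0 else 0.0
```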
Figure 2: Performance sensitivity to scale factors.
Discussion
The integration of MSEA and ARC not only improves attribution fidelity but also highlights the importance of modeling intra-modal dynamics within MLLMs. These enhancements allow for more precise diagnosis of model behavior, facilitating safer deployment in applications that demand high levels of trust. The results suggest that combining multi-scale spatial context with effective contextual suppression yields significant improvements in model interpretability.
Conclusion
By addressing the overlooked role of intra-modal interactions, this research advances the field of MLLM interpretability. The introduced methods, MSEA and ARC, demonstrate robust performance across models and tasks, improving the quality of explanations and enabling deeper insight into model behavior. Future work could explore further tuning of these methods and their integration into broader model architectures to improve transparency in more complex AI systems.
Figure 3: Visualization of attribution maps generated using the Qwen2-VL-2B model.
Figure 4: Visualization of attribution maps generated using the LLaVA-1.5-7B model.