MIS: Assessing Modality Contributions in Multimodal Systems
- Modality Importance Score (MIS) is a quantitative metric that rigorously evaluates the influence and necessity of each modality in complex multimodal systems.
- It employs diverse estimation techniques—including mutual information analysis, occlusion sensitivity, attention weight averaging, performance differentials, and KL divergence—to measure modality contributions.
- Empirical use cases demonstrate MIS's utility in diagnosing unimodal reliance, optimizing model architectures, and auditing datasets in applications such as medical imaging and video question answering.
The Modality Importance Score (MIS) is a class of quantitative metrics designed to assess the relative influence, utility, or necessity of each input modality in multimodal systems. MIS quantifies, in a rigorous and reproducible manner, how much each constituent modality (such as text, image, audio, sensor) contributes to downstream performance, information representation, inference, or model interpretability. Empirical MIS variants have proliferated in deep learning, medical imaging, human language understanding, video QA, and explainable AI. Despite diverse methodology, all MIS frameworks aim to allocate responsibility for either predictive performance or representational content to specific modalities within otherwise opaque multimodal pipelines.
1. Formal Definitions Across Methodological Families
MIS computation frameworks differ by modeling paradigm, system architecture, and the precise information of interest.
Information-Theoretic (MI-Based) MIS
"Mutual Information Analysis in Multimodal Learning Systems" defines MIS from pairwise mutual information among modalities (Hadizadeh et al., 2024):
For modalities $M_1, \dots, M_n$ with pairwise mutual information $I(M_i; M_j)$ (measured via compression-based entropy estimation), the MIS for modality $i$ is
$$\mathrm{MIS}_i = \frac{\sum_{j \neq i} I(M_i; M_j)}{\sum_{k=1}^{n} \sum_{j \neq k} I(M_k; M_j)}.$$
Here, a high $\mathrm{MIS}_i$ indicates that modality $i$ shares substantial information with the others; normalization ensures $\sum_{i=1}^{n} \mathrm{MIS}_i = 1$.
Ablation and Model Sensitivity-Based MIS
In deep learning medical applications, modality importance is measured by the proportional effect of selective input occlusion (Gapp et al., 28 Feb 2025):
$$m_i = \frac{\sum_{k=1}^{N} d_i^k}{\sum_{j=1}^{n} \sum_{k=1}^{N} d_j^k},$$
where $d_i^k$ is the model output change upon patch-wise masking of modality $i$ in sample $k$, and $m_i$ measures that modality's share of total model sensitivity, enforcing $\sum_i m_i = 1$.
Fusion-Weight and Attention-Based MIS
In interpretable fusion architectures (e.g., self-attention over modalities (Galland et al., 2023)), for $N$ samples, $d$-dimensional fusion coordinates, and $n$ modalities:
$$\mathrm{MIS}_m = \frac{1}{N d} \sum_{k=1}^{N} \sum_{p=1}^{d} \mathbb{1}\left[ m^{*}(k, p) = m \right],$$
where $m^{*}(k, p)$ is the modality with maximal fusion weight at position $p$ in sample $k$.
Test-Subset Performance Differential MIS
In video QA and related evaluation, MIS is defined as the performance gain when a modality is included in the set of provided modalities, relative to when it is excluded (Park et al., 2024):
$$\mathrm{MIS}_m(x) = \frac{1}{|\mathcal{S}_{+m}|} \sum_{S \in \mathcal{S}_{+m}} \mathrm{Acc}(x, S) \; - \; \frac{1}{|\mathcal{S}_{-m}|} \sum_{S \in \mathcal{S}_{-m}} \mathrm{Acc}(x, S),$$
where $\mathrm{Acc}(x, S)$ is the accuracy on sample $x$ given modality subset $S$, $\mathcal{S}_{+m}$ collects the subsets that include $m$, and $\mathcal{S}_{-m}$ those that exclude it.
KL-Divergence/Deviance-Based MIS
Statistical modeling approaches, e.g., high-dimensional GLMs, define MIS as the expected gain in relative entropy when adding a modality (Jin et al., 22 Jan 2026):
$$\mathrm{MIS}_m = \mathbb{E}\left[ \log \frac{f(Y \mid X, X_m)}{f(Y \mid X)} \right],$$
with sample estimator $\widehat{\mathrm{MIS}}_m = \frac{1}{n}\big( \ell_n(\hat{\beta}_{\text{full}}) - \ell_n(\hat{\beta}_{-m}) \big)$.
2. Algorithmic Estimation Procedures
A variety of estimation and inference procedures are used to compute MIS, tailored to modality type, dataset, and model class.
Compression-Based MI (InfoMeter):
- Feature maps are quantized and mapped to suitable latent spaces via invertible transforms.
- Neural entropy estimators (e.g., autoregressive iWave++ models) are trained to minimize the empirical entropy estimates $H(M_i)$, $H(M_j)$, and $H(M_i, M_j)$.
- Post-hoc MI estimation via $I(M_i; M_j) = H(M_i) + H(M_j) - H(M_i, M_j)$ yields $\mathrm{MIS}_i$ by pairwise MI summation and normalization (Hadizadeh et al., 2024).
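Given a precomputed matrix of pairwise MI estimates, the final summation-and-normalization step is straightforward. A minimal numpy sketch, assuming the entropy estimation has already been done by the trained neural estimators above (`mi_based_mis` is an illustrative name, not the InfoMeter API):

```python
import numpy as np

def mi_based_mis(mi: np.ndarray) -> np.ndarray:
    """Given a symmetric matrix mi[i, j] of pairwise mutual-information
    estimates (diagonal ignored), return normalized importance scores."""
    mi = mi.copy()
    np.fill_diagonal(mi, 0.0)          # exclude self-information I(M_i; M_i)
    row_sums = mi.sum(axis=1)          # sum_{j != i} I(M_i; M_j)
    return row_sums / row_sums.sum()   # normalize so scores sum to 1

# Toy example: three modalities; modality 0 shares the most information.
mi = np.array([[0.0, 0.8, 0.6],
               [0.8, 0.0, 0.2],
               [0.6, 0.2, 0.0]])
scores = mi_based_mis(mi)              # highest score for modality 0
```

Because the scores are normalized, they can be compared across models whose raw MI magnitudes differ.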
Occlusion/Perturbation:
- For each modality, segment input into patches; mask each in turn, measure change in output, sum changes per modality, and normalize (Gapp et al., 28 Feb 2025).
- Pseudocode is as follows (rendered here as runnable Python for clarity):

```python
import numpy as np

def occlusion_mis(f, samples, patch_masks):
    """patch_masks[i] is a list of functions, one per patch of modality i,
    each returning a copy of the sample with that patch masked."""
    n = len(patch_masks)
    D = np.zeros(n)                           # accumulated sensitivity per modality
    for i in range(n):                        # for modality i in 1..n
        for x in samples:                     #   for sample k in 1..N
            for mask in patch_masks[i]:       #     for patch l in 1..h_i
                x_occ = mask(x)               #       masked version of modality i, patch l
                D[i] += abs(f(x) - f(x_occ))  #       accumulate |f(x^k) - f(x_occ)|
    return D / D.sum()                        # m_i = D[i] / sum(D)
```
Attention-Weight Averaging:
- Stack modal embeddings; for each fusion dimension, identify argmax over learned importance weights; count per-modality selections and normalize (Galland et al., 2023).
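The argmax-counting step above can be sketched in a few lines of numpy. This is a minimal illustration, assuming fusion weights are already extracted as a `(samples, coordinates, modalities)` array; `attention_mis` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def attention_mis(weights: np.ndarray) -> np.ndarray:
    """weights: array of shape (N, d, n) holding fusion weights over n
    modalities at each of d fusion coordinates, per sample. Returns, for
    each modality, the fraction of (sample, coordinate) positions it wins."""
    N, d, n = weights.shape
    winners = weights.argmax(axis=-1)                   # (N, d) winning modality ids
    counts = np.bincount(winners.ravel(), minlength=n)  # selections per modality
    return counts / (N * d)                             # normalize to fractions

# Toy example: bias the fusion weights toward modality 0.
rng = np.random.default_rng(0)
w = rng.random((100, 16, 3))
w[..., 0] += 0.5
scores = attention_mis(w)   # modality 0 receives the largest share
```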
Performance Differential via Modality Subset Tests:
- Evaluate the model (or an oracle MLLM) on all combinations of modalities per sample; compute differences in accuracy when a modality is included vs. excluded (Park et al., 2024).
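The include-vs-exclude averaging can be sketched as follows, assuming per-subset accuracies have already been measured (the dictionary-based interface here is illustrative, not the evaluation harness of Park et al.):

```python
from itertools import combinations

def perf_diff_mis(acc, modalities, target):
    """acc: dict mapping a frozenset of provided modalities to accuracy on a
    sample (0/1 or a probability). Returns the mean accuracy over subsets
    that include `target` minus the mean over subsets that exclude it."""
    subsets = [frozenset(c)
               for r in range(len(modalities) + 1)
               for c in combinations(modalities, r)]
    incl = [acc[s] for s in subsets if target in s]
    excl = [acc[s] for s in subsets if target not in s]
    return sum(incl) / len(incl) - sum(excl) / len(excl)

# Toy example: the question is answerable only when video is provided.
acc = {frozenset(): 0.0,
       frozenset({"video"}): 1.0,
       frozenset({"audio"}): 0.0,
       frozenset({"video", "audio"}): 1.0}
video_mis = perf_diff_mis(acc, ["video", "audio"], "video")  # -> 1.0
audio_mis = perf_diff_mis(acc, ["video", "audio"], "audio")  # -> 0.0
```

A score near 1 flags the modality as necessary for the sample; a score near 0 flags it as irrelevant (or the question as modality-agnostic).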
Likelihood/Deviance-Based Inference:
- Fit full and reduced GLMs under penalization; compute log-likelihood difference scaled by sample size; MIS is normalized deviance.
- For $p \gg n$, two-step Sure Independence Screening plus penalized likelihood (e.g., SCAD) is used (Jin et al., 22 Jan 2026).
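The full-versus-reduced log-likelihood comparison can be illustrated with a deliberately simplified Gaussian (OLS) analogue; the paper works with penalized GLMs and high-dimensional screening, neither of which is shown here. A minimal numpy sketch under low-dimensional toy assumptions:

```python
import numpy as np

def gaussian_loglik(X, y):
    """Maximized Gaussian log-likelihood of an OLS fit y ~ X (with intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / len(y)   # MLE of the noise variance
    n = len(y)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def deviance_mis(X_base, X_mod, y):
    """Per-sample log-likelihood gain from adding the candidate modality."""
    ll_full = gaussian_loglik(np.hstack([X_base, X_mod]), y)
    ll_reduced = gaussian_loglik(X_base, y)
    return (ll_full - ll_reduced) / len(y)

# Toy data: the response depends on the candidate modality only.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(500, 3))
X_mod = rng.normal(size=(500, 2))
y = 2.0 * X_mod[:, 0] + 0.5 * rng.normal(size=500)
mis = deviance_mis(X_base, X_mod, y)   # clearly positive
```

In-sample, the full (nested) model can never score below the reduced one, so a sizable positive value signals that the modality carries predictive information; the paper's asymptotic calibration turns this into confidence intervals and tests.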
3. Interpretations and Theoretical Rationale
MIS serves different conceptual goals depending on its formal basis:
- Redundancy vs. Uniqueness: MI-based MIS quantifies shared information, with high scores suggesting redundancy. Alternatives subtract pairwise MI from individual entropies to extract modality "uniqueness" (Hadizadeh et al., 2024).
- Sensitivity: Occlusion-based MIS quantifies how model outputs shift when a modality is perturbed, measuring the causal responsibility a modality bears for performance (Gapp et al., 28 Feb 2025).
- Explanatory Power: Attention-based MIS interprets attention or fusion weights as attributions of decision influence, linking weight allocation to classifier behavior (Galland et al., 2023).
- Data-Driven Necessity: Performance-differential MIS explicitly quantifies whether a task truly depends on a given modality for specific samples, supporting the audit of dataset biases (Park et al., 2024).
- Statistical Significance: KL/Deviance-based MIS is justified by information-theoretic gain and is equipped with confidence intervals and $p$-values, supporting hypothesis tests on modality relevance (Jin et al., 22 Jan 2026).
4. Empirical Outcomes and Use Cases
MIS metrics yield actionable insight into the construction, evaluation, and optimization of multimodal models.
| Application Domain | MIS Estimator | Main Outcomes |
|---|---|---|
| Autonomous vehicle 3D detection | MI-sum | Lower inter-modality MI correlates with higher detection accuracy |
| Medical multimodal diagnosis | Occlusion | High-$m_i$ modalities match strong unimodal performance |
| Motivational interviewing (counseling) | Attention | Text > face > audio/context; distinct clusters in modality usage |
| Video QA dataset audit | Perf diff | Most questions are unimodal-biased or modality-agnostic |
| High-dimensional neuroimaging GLM | KL/Deviance | MIS with CIs; FDG-PET > Amyloid-PET for EF and DX |
Empirically, MIS often reveals "unimodal collapse"—the tendency for certain models or datasets to over-rely on a single modality, even when superficial fusion architectures are used (Gapp et al., 28 Feb 2025, Park et al., 2024). Normalized MIS values support cross-architecture, cross-dataset comparison.
5. Implementation Considerations and Best Practices
Best practices for reliable MIS estimation include:
- Use invertible transforms and suitable quantization to harmonize continuous-valued modality representations for entropy-based methods (Hadizadeh et al., 2024).
- For occlusion, choose a patch granularity that maximizes disruption without excessive compute, balancing resolution against attribution accuracy (Gapp et al., 28 Feb 2025).
- When using attention-weighted fusion, ensure embeddings are dimensionally matched and weights are properly regularized (Galland et al., 2023).
- For statistical models, penalized estimation (e.g., SIS + SCAD) is critical when $p \gg n$; CIs and $p$-values require proper asymptotic calibration (Jin et al., 22 Jan 2026).
- Public codebases are available for several methods, e.g., MC_MMD (PyTorch + MONAI) for occlusion-based MIS (Gapp et al., 28 Feb 2025). Because occlusion-based MIS depends only on model outputs rather than ground-truth labels, it applies universally and supports black-box deployment.
6. Limitations, Alternatives, and Guidance
Each MIS formalism has caveats:
- MI-based scores may conflate redundancy and informativeness; low shared MI can represent complementarity but also lack of informative signal (Hadizadeh et al., 2024).
- Occlusion scores are computationally intensive (the number of forward passes scales with samples times patches per modality) and sensitive to patch size (Gapp et al., 28 Feb 2025).
- Attention-based MIS is interpretable only to the extent that fusion weights reflect causal influence, which is not generally assured (Galland et al., 2023).
- Performance-difference MIS is discrete, and may be uninformative where models already succeed via modality-agnostic shortcuts (Park et al., 2024).
- KL/deviance-based MIS requires that the reduced model (excluding a modality) is well-specified and regularized for accurate log-likelihood estimation (Jin et al., 22 Jan 2026).
- All global MIS variants summarize at the dataset or corpus level unless extended with per-sample procedures.
No current methodology guarantees causal attribution of modality utility absent strong interventional experiments (e.g., randomized masking). Nevertheless, cross-validation with single-modality models, permutation studies, and human annotation confirm that properly applied MIS tracks the influence and necessity of modalities with practical fidelity.
7. Summary Table of Core MIS Formalisms
| Method Family | Key Equation(s) | Data/Model Prerequisites |
|---|---|---|
| MI/Entropy-Based | $\mathrm{MIS}_i = \sum_{j \neq i} I(M_i; M_j) \,/\, \sum_k \sum_{j \neq k} I(M_k; M_j)$ | Feature map quantization, learned invertible transforms, entropy estimators (Hadizadeh et al., 2024) |
| Occlusion-Based | $m_i = \sum_k d_i^k \,/\, \sum_j \sum_k d_j^k$ | Model w/ black-box access, patch masking over inputs (Gapp et al., 28 Feb 2025) |
| Attention-Based | $\mathrm{MIS}_m = \frac{1}{Nd} \sum_{k,p} \mathbb{1}[m^{*}(k,p) = m]$ | Fusion model w/ attention over modal embeddings (Galland et al., 2023) |
| Performance Differential | $\mathrm{MIS}_m(x) = \overline{\mathrm{Acc}}_{+m}(x) - \overline{\mathrm{Acc}}_{-m}(x)$ | Multimodal model testable on all modal input subsets (Park et al., 2024) |
| KL/Deviance-Based | $\mathrm{MIS}_m = \mathbb{E}[\log f(Y \mid X, X_m) / f(Y \mid X)]$; $\widehat{\mathrm{MIS}}_m = \frac{1}{n}(\ell_{\text{full}} - \ell_{-m})$ | Penalized GLM fits, log-likelihood computation (Jin et al., 22 Jan 2026) |
MIS has become a foundational tool for quantifying the influence of individual modalities in increasingly complex multimodal pipelines, supporting model diagnostics, dataset audit, architecture design, and statistical inference across domains.