
MIS: Assessing Modality Contributions in Multimodal Systems

Updated 9 February 2026
  • Modality Importance Score (MIS) is a quantitative metric that rigorously evaluates the influence and necessity of each modality in complex multimodal systems.
  • It employs diverse estimation techniques—including mutual information analysis, occlusion sensitivity, attention weight averaging, performance differentials, and KL divergence—to measure modality contributions.
  • Empirical use cases demonstrate MIS's utility in diagnosing unimodal reliance, optimizing model architectures, and auditing datasets in applications such as medical imaging and video question answering.

The Modality Importance Score (MIS) is a class of quantitative metrics designed to assess the relative influence, utility, or necessity of each input modality in multimodal systems. MIS quantifies, in a rigorous and reproducible manner, how much each constituent modality (such as text, images, audio, or sensor streams) contributes to downstream performance, information representation, inference, or model interpretability. Empirical MIS variants have proliferated in deep learning, medical imaging, human language understanding, video QA, and explainable AI. Despite diverse methodology, all MIS frameworks aim to allocate responsibility for either predictive performance or representational content to specific modalities within otherwise opaque multimodal pipelines.

1. Formal Definitions Across Methodological Families

MIS computation frameworks differ by modeling paradigm, system architecture, and the precise information of interest.

Information-Theoretic (MI-Based) MIS

"Mutual Information Analysis in Multimodal Learning Systems" defines MIS from pairwise mutual information among modalities (Hadizadeh et al., 2024):

For $M$ modalities $\{M_1,\ldots,M_M\}$, with pairwise mutual information $I(M_m; M_n)$ (measured via compression-based entropy estimation), the MIS for modality $m$ is

$$MIS_m = \frac{\sum_{n\neq m} I(M_m; M_n)}{\sum_{k=1}^M \sum_{l\neq k} I(M_k; M_l)}$$

Here, a high $MIS_m$ indicates that modality $m$ shares substantial information with others; normalization ensures $\sum_{m=1}^M MIS_m = 1$.
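As a minimal sketch, assuming the pairwise mutual-information matrix has already been estimated (the compression-based entropy estimation itself is out of scope here), the normalization step can be implemented directly. The function name `mi_based_mis` and the example MI values are illustrative, not taken from the paper:

```python
import numpy as np

def mi_based_mis(I):
    """Normalize a symmetric pairwise-MI matrix into per-modality scores:
    MIS_m = sum_{n != m} I(M_m; M_n) / sum_k sum_{l != k} I(M_k; M_l)."""
    I = np.asarray(I, dtype=float).copy()
    np.fill_diagonal(I, 0.0)            # exclude self-information terms
    row_sums = I.sum(axis=1)            # numerator for each modality m
    return row_sums / row_sums.sum()    # denominator normalizes scores to sum 1

# Hypothetical 3-modality MI matrix: modality 0 shares the most information.
I = [[0.0, 0.8, 0.6],
     [0.8, 0.0, 0.2],
     [0.6, 0.2, 0.0]]
scores = mi_based_mis(I)   # [1.4, 1.0, 0.8] / 3.2
```

Because the matrix is symmetric, the double-sum denominator equals the sum of all row sums, so the scores are simply normalized row sums with the diagonal excluded.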

Ablation and Model Sensitivity-Based MIS

In deep learning medical applications, modality importance is measured by the proportional effect of selective input occlusion (Gapp et al., 28 Feb 2025):

$$m_i = \frac{\mathbf{1}^\top \mathbf{d}_i}{\sum_{j=1}^n \mathbf{1}^\top \mathbf{d}_j}$$

where $\mathbf{d}_{i,l}^k = |\mathbf{p}_0^k - \mathbf{p}_{i,l}^k|$ is the model output change upon masking patch $l$ of modality $i$ in sample $k$, and $m_i$ measures that modality's share of total model sensitivity, enforcing $\sum_{i=1}^n m_i = 1$.

Fusion-Weight and Attention-Based MIS

In interpretable fusion architectures (e.g., self-attention over modalities (Galland et al., 2023)), for $N$ samples, $d$-dimensional fusion coordinates, and $M$ modalities:

$$MIS_m = \frac{1}{Nd} \sum_{s=1}^N \sum_{i=1}^d \mathbf{1}[m^*_{s,i}=m]$$

where $m^*_{s,i}$ is the modality with maximal fusion weight at position $i$ in sample $s$.
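A short sketch of this counting rule, assuming the learned fusion weights are available as an array of shape (N, d, M); the function name and toy weights are illustrative:

```python
import numpy as np

def attention_mis(W):
    """W: fusion/attention weights of shape (N, d, M). At each (sample,
    dimension) position the winning modality is the argmax over weights;
    MIS_m is the fraction of positions that modality m wins."""
    N, d, M = W.shape
    winners = W.argmax(axis=-1)                     # m*_{s,i}, shape (N, d)
    counts = np.bincount(winners.ravel(), minlength=M)
    return counts / (N * d)                         # scores sum to 1

# Hypothetical weights: 2 samples, 2 fusion dimensions, 3 modalities.
W = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]])
scores = attention_mis(W)   # modality 0 wins 3 of the 4 positions
```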

Test-Subset Performance Differential MIS

In video QA and related evaluation, MIS is defined as the performance gain when a modality is included in the set of provided modalities, relative to when it is excluded (Park et al., 2024):

$$MIS_{i, m_j} = \mathrm{perf}(q_i \mid M_j^+) - \mathrm{perf}(q_i \mid M_j^-)$$

where $\mathrm{perf}(q_i \mid M')$ is the accuracy on sample $q_i$ averaged over the modality subsets in $M'$, $M_j^+$ contains the subsets that include $m_j$ (with $|S| \geq 2$), and $M_j^-$ contains the subsets that exclude $m_j$.
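Under one plausible reading of this definition (averaging accuracy over the qualifying subsets), the per-sample score can be sketched as follows; `perf_diff_mis` and the toy accuracies are hypothetical, not from the paper:

```python
from itertools import combinations

def perf_diff_mis(perf, modalities, j):
    """Per-sample MIS of modality j: mean accuracy over subsets of size
    >= 2 that include j, minus mean accuracy over subsets excluding j.
    `perf` maps a frozenset of modalities to accuracy on this sample."""
    subsets = [frozenset(c)
               for r in range(1, len(modalities) + 1)
               for c in combinations(modalities, r)]
    incl = [perf[S] for S in subsets if j in S and len(S) >= 2]
    excl = [perf[S] for S in subsets if j not in S]
    return sum(incl) / len(incl) - sum(excl) / len(excl)

# Hypothetical 2-modality sample: the question is answerable only from video.
perf = {frozenset({"video"}): 1.0,
        frozenset({"audio"}): 0.0,
        frozenset({"video", "audio"}): 1.0}
mis_video = perf_diff_mis(perf, ["video", "audio"], "video")  # 1.0
mis_audio = perf_diff_mis(perf, ["video", "audio"], "audio")  # 0.0
```

A large positive score flags the sample as genuinely dependent on that modality, which is exactly the property used to audit unimodal bias in datasets.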

KL-Divergence/Deviance-Based MIS

Statistical modeling approaches, e.g., high-dimensional GLMs, define MIS as the expected gain in relative entropy when adding a modality (Jin et al., 22 Jan 2026):

$$MIS_j = \mathbb{E}_X \big[ D_{KL}\big(p(y \mid X_{-j}, X_j) \,\big\|\, p(y \mid X_{-j})\big) \big]$$

with sample estimator $\widehat{MIS}_j = \frac{1}{n} \big[\ell(\hat{\beta}) - \ell(\hat{\beta}_0)\big]$, the log-likelihood difference between the full and reduced fits scaled by sample size.
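As a low-dimensional, unpenalized stand-in for the penalized-GLM setting, a Gaussian linear model makes the plug-in estimator concrete, since its maximized log-likelihood is available in closed form via least squares; all names and data here are illustrative:

```python
import numpy as np

def gaussian_loglik(y, X):
    """Maximized Gaussian log-likelihood of y ~ X @ beta (MLE variance)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def deviance_mis(y, X_rest, X_j):
    """Plug-in MIS_j: per-sample log-likelihood gain from adding the
    features of modality j (X_j) to the remaining features (X_rest)."""
    X_full = np.hstack([X_rest, X_j])
    return (gaussian_loglik(y, X_full) - gaussian_loglik(y, X_rest)) / len(y)

# Synthetic data in which modality j carries a strong signal (coef 2.0).
rng = np.random.default_rng(0)
n = 500
X_rest = rng.normal(size=(n, 2))
X_j = rng.normal(size=(n, 1))
y = X_rest @ np.array([1.0, -0.5]) + 2.0 * X_j[:, 0] + rng.normal(size=n)
mis_j = deviance_mis(y, X_rest, X_j)   # clearly positive: X_j is informative
```

In the paper's high-dimensional regime the fits are penalized (SIS + SCAD), so this unpenalized sketch only illustrates the shape of the estimator, not the full procedure.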

2. Algorithmic Estimation Procedures

A variety of estimation and inference procedures are used to compute MIS, tailored to modality type, dataset, and model class.

Compression-Based MI (InfoMeter):

  • Feature maps are quantized and mapped to suitable latent spaces via invertible transforms.
  • Neural entropy estimators (e.g., autoregressive iWave++ models) are trained to minimize empirical entropies $h_X$, $h_Y$, $h_{X,Y}$.
  • Post-hoc MI estimation yields $MIS_m$ by pairwise MI summation and normalization (Hadizadeh et al., 2024).

Occlusion/Perturbation:

  • For each modality, segment input into patches; mask each in turn, measure change in output, sum changes per modality, and normalize (Gapp et al., 28 Feb 2025).
  • Pseudocode (Python-style; `f` is the model, `X` the samples, and `masker(x, i, l)` returns `x` with patch `l` of modality `i` masked):

    # Accumulate per-modality output changes under patch-wise occlusion.
    D = [0.0] * n_modalities
    for i in range(n_modalities):
        for x in X:
            p0 = f(x)                        # unperturbed prediction p_0
            for l in range(n_patches[i]):
                x_occ = masker(x, i, l)      # mask patch l of modality i
                D[i] += abs(p0 - f(x_occ)).sum()
    m = [D_i / sum(D) for D_i in D]          # normalize so sum(m) == 1

Attention-Weight Averaging:

  • Stack modal embeddings; for each fusion dimension, identify argmax over learned importance weights; count per-modality selections and normalize (Galland et al., 2023).

Performance Differential via Modality Subset Tests:

  • Evaluate the model (or an oracle MLLM) on all combinations of modalities per sample; compute differences in accuracy when a modality is included vs. excluded (Park et al., 2024).

Likelihood/Deviance-Based Inference:

  • Fit full and reduced GLMs under penalization; compute log-likelihood difference scaled by sample size; MIS is normalized deviance.
  • For $p \gg n$, two-step Sure Independence Screening plus penalized likelihood (e.g., SCAD) is used (Jin et al., 22 Jan 2026).

3. Interpretations and Theoretical Rationale

MIS serves different conceptual goals depending on its formal basis:

  • Redundancy vs. Uniqueness: MI-based MIS quantifies shared information, with high scores suggesting redundancy. Alternatives subtract pairwise MI from individual entropies to extract modality "uniqueness" (Hadizadeh et al., 2024).
  • Sensitivity: Occlusion-based MIS quantifies how model outputs shift when a modality is perturbed, measuring the causal responsibility a modality bears for performance (Gapp et al., 28 Feb 2025).
  • Explanatory Power: Attention-based MIS interprets attention or fusion weights as attributions of decision influence, linking weight allocation to classifier behavior (Galland et al., 2023).
  • Data-Driven Necessity: Performance-differential MIS explicitly quantifies whether a task truly depends on a given modality for specific samples, supporting the audit of dataset biases (Park et al., 2024).
  • Statistical Significance: KL/Deviance-based MIS is justified by information-theoretic gain and is equipped with confidence intervals and $p$-values, supporting hypothesis tests on modality relevance (Jin et al., 22 Jan 2026).

4. Empirical Outcomes and Use Cases

MIS metrics yield actionable insight into the construction, evaluation, and optimization of multimodal models.

| Application Domain | MIS Estimator | Main Outcomes |
| --- | --- | --- |
| Autonomous vehicle 3D detection | MI-sum | Lower MI → higher detection accuracy |
| Medical multimodal diagnosis | Occlusion | High-$m_i$ modalities match strong unimodal performance |
| Motivational interviewing (counseling) | Attention | Text ≈ face > audio/context; clusters in usage |
| Video QA dataset audit | Perf diff | Most questions are unimodal-biased or modality-agnostic |
| High-dimensional neuroimaging GLM | KL/Deviance | MIS with CIs; FDG-PET > Amyloid-PET for EF and DX |

Empirically, MIS often reveals "unimodal collapse"—the tendency for certain models or datasets to over-rely on a single modality, even when superficial fusion architectures are used (Gapp et al., 28 Feb 2025, Park et al., 2024). Normalized MIS values support cross-architecture, cross-dataset comparison.

5. Implementation Considerations and Best Practices

Best practices for reliable MIS estimation include:

  • Use invertible transforms and suitable quantization to harmonize continuous-valued modality representations for entropy-based methods (Hadizadeh et al., 2024).
  • For occlusion, set patch granularity to maximize disruption without excessive compute, balancing resolution against attribution accuracy (Gapp et al., 28 Feb 2025).
  • When using attention-weighted fusion, ensure embeddings are dimensionally matched and weights are properly regularized (Galland et al., 2023).
  • For statistical models, penalized estimation (e.g., SIS + SCAD) is critical under $p \gg n$; CIs and $p$-values require proper asymptotic calibration (Jin et al., 22 Jan 2026).
  • Public codebases are available for several methods, e.g., MC_MMD (PyTorch+MONAI) for occlusion-based MIS (Gapp et al., 28 Feb 2025). Because occlusion-based MIS depends only on model outputs rather than ground-truth labels, it applies universally and supports black-box deployment.

6. Limitations, Alternatives, and Guidance

Each MIS formalism has caveats:

  • MI-based scores may conflate redundancy and informativeness; low shared MI can reflect either complementarity or a lack of informative signal (Hadizadeh et al., 2024).
  • Occlusion scores are computationally intensive (scaling with $N \sum_i h_i$ forward passes) and sensitive to patch size (Gapp et al., 28 Feb 2025).
  • Attention-based MIS is interpretable only to the extent that fusion weights reflect causal influence, which is not generally assured (Galland et al., 2023).
  • Performance-difference MIS is discrete, and may be uninformative where models already succeed via modality-agnostic shortcuts (Park et al., 2024).
  • KL/deviance-based MIS requires that the reduced model (excluding a modality) is well-specified and regularized for accurate log-likelihood estimation (Jin et al., 22 Jan 2026).
  • All global MIS variants summarize at the dataset or corpus level unless extended with per-sample procedures.

No current methodology guarantees causal attribution of modality utility absent strong interventional experiments (e.g., randomized masking). Nevertheless, cross-validation with single-modality models, permutation studies, and human annotation confirm that properly applied MIS tracks the influence and necessity of modalities with practical fidelity.

7. Summary Table of Core MIS Formalisms

| Method Family | Key Equation(s) | Data/Model Prerequisites |
| --- | --- | --- |
| MI/Entropy-Based | $MIS_m = \frac{\sum_{n\neq m} I(M_m; M_n)}{\sum_{k=1}^M \sum_{l\neq k} I(M_k; M_l)}$ | Feature-map quantization, learned invertible transforms, entropy estimators (Hadizadeh et al., 2024) |
| Occlusion-Based | $m_i = \frac{\mathbf{1}^\top \mathbf{d}_i}{\sum_j \mathbf{1}^\top \mathbf{d}_j}$ | Model with black-box access, patch masking over inputs (Gapp et al., 28 Feb 2025) |
| Attention-Based | $MIS_m = \frac{1}{Nd} \sum_{s,i} \mathbf{1}[m^*_{s,i}=m]$ | Fusion model with attention over modal embeddings (Galland et al., 2023) |
| Performance Differential | $MIS_{i,m_j} = \mathrm{perf}(q_i \mid M_j^+) - \mathrm{perf}(q_i \mid M_j^-)$ | Multimodal model testable on all modal input subsets (Park et al., 2024) |
| KL/Deviance-Based | $MIS_j = \mathbb{E}_X D_{KL}(p(y \mid X_{-j}, X_j) \,\|\, p(y \mid X_{-j}))$; $\widehat{MIS}_j = [\ell(\hat{\beta}) - \ell(\hat{\beta}_0)]/n$ | Penalized GLM fits, log-likelihood computation (Jin et al., 22 Jan 2026) |

MIS has become a foundational tool for quantifying the influence of individual modalities in increasingly complex multimodal pipelines, supporting model diagnostics, dataset audit, architecture design, and statistical inference across domains.
