Modality Dominance Score (MDS)

Updated 4 May 2026

Modality Dominance Score (MDS) is a quantitative measure that assesses the relative influence of each modality through feature activations, attention, and gradient analysis.
It employs diverse techniques, including activation ratios, cross-modal attention statistics, and gradient conflict detection to diagnose and mitigate dominance bias.
MDS is applied across various domains such as vision-language models and multimodal sentiment analysis, enhancing fusion performance and reducing optimization bias.

The Modality Dominance Score (MDS) encompasses a class of quantitative measures developed to identify, characterize, and mitigate the phenomenon in which one data modality disproportionately influences multimodal representation learning or inference, often to the detriment of information fusion and task performance. MDS and closely related indices are central to understanding and controlling optimization bias in diverse multimodal settings, including vision-language models, embodied RGB-IR perception, video question answering, and multimodal sentiment analysis. Recent work introduces domain-specific and general-purpose mathematical formalisms for MDS that combine activation analysis, attention statistics, gradient-based criteria, and dynamic weighting schemes to provide diagnostics and rebalancing mechanisms for multimodal architectures (Yan et al., 16 Feb 2025, Liu et al., 2 Jan 2026, Wu et al., 14 Aug 2025, Wang et al., 2024, Park et al., 2024, Yang et al., 9 Nov 2025).

1. Mathematical Foundations of Modality Dominance Score

Core instantiations of MDS formalize dominance as a scalar (or vector) quantification of the relative influence each modality exerts on the model via its feature activations, attention allocation, or parameter updates.

1.1 Feature Activation-Based MDS

In vision-language models such as CLIP, MDS is defined per feature (neuron) $k$ as the average fraction of that neuron's activation attributable to the image encoder versus the text encoder. Explicitly, for a held-out set of $M$ image/text pairs and feature $k$, the MDS is:

$R(k) = \frac{1}{M}\sum_{m=1}^{M} \frac{|z_i^{(m, k)}|}{|z_i^{(m, k)}|+|z_t^{(m, k)}|}$

Here, $z_i^{(m, k)}$ and $z_t^{(m, k)}$ are the $k$-th feature activations for the image and text, respectively, in the $m$-th pair. $R(k)\approx 1$ indicates vision-dominance, $R(k)\approx 0$ language-dominance, and intermediate values indicate cross-modal features. Feature categories derive from thresholding $R(k)$ relative to its empirical mean and standard deviation (Yan et al., 16 Feb 2025).
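The ratio and the thresholding step can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names, the `eps` guard, and the one-standard-deviation `width` are assumptions made for the example, not values taken from the cited work.

```python
from statistics import mean, pstdev

def activation_mds(z_img, z_txt, eps=1e-12):
    """Return R(k) for every feature k.

    z_img, z_txt: lists of M rows, each row a list of K activations
    (image-side and text-side activations of the same K features).
    """
    num_pairs = len(z_img)
    num_feats = len(z_img[0])
    scores = []
    for k in range(num_feats):
        total = 0.0
        for m in range(num_pairs):
            a_img = abs(z_img[m][k])
            a_txt = abs(z_txt[m][k])
            # eps (assumed) avoids 0/0 when a feature is silent for both modalities
            total += a_img / (a_img + a_txt + eps)
        scores.append(total / num_pairs)
    return scores

def categorize(scores, width=1.0):
    """Label features by comparing R(k) with mean +/- width * std (width is hypothetical)."""
    mu, sigma = mean(scores), pstdev(scores)
    labels = []
    for r in scores:
        if r > mu + width * sigma:
            labels.append("vision")
        elif r < mu - width * sigma:
            labels.append("language")
        else:
            labels.append("cross-modal")
    return labels

# Toy data: M = 2 pairs, K = 3 features.
z_img = [[4.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
z_txt = [[0.0, 3.0, 1.0], [0.0, 5.0, 1.0]]
scores = activation_mds(z_img, z_txt)
labels = categorize(scores)
```

In this toy input the first feature fires only on images, the second only on text, and the third equally on both, so the three categories are each represented once.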

1.2 Attention Allocation and Efficiency

For models employing cross-modal attention (e.g., MLLMs), the Modality Dominance Index (MDI) compares mean per-token attention across modalities:

Let $A_T$ and $A_O$ be the total normalized cross-attention paid to text and to other modalities, respectively, and $|T|$, $|O|$ their token counts.
Calculate mean attention per token: $\bar{a}_T = A_T / |T|$, $\bar{a}_O = A_O / |O|$.
The MDI is defined as:

$\mathrm{MDI} = \frac{\bar{a}_T}{\bar{a}_O} = \frac{A_T / |T|}{A_O / |O|}$

$\mathrm{MDI} > 1$ reflects text dominance, $\mathrm{MDI} < 1$ non-text dominance, and $\mathrm{MDI} \approx 1$ balanced fusion. Generalization to $N$ modalities replaces $A_O$, $|O|$ with the corresponding per-modality quantities (Wu et al., 14 Aug 2025).
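A small numeric sketch shows why the index is computed per token rather than on raw attention totals. It assumes the ratio convention of text over non-text, so that values above 1 indicate text dominance. The function name and the example numbers are hypothetical.

```python
def modality_dominance_index(attn_text, attn_other, n_text, n_other):
    """Ratio of mean per-token attention: text over non-text (assumed convention)."""
    if n_text <= 0 or n_other <= 0:
        raise ValueError("token counts must be positive")
    if attn_other <= 0:
        raise ValueError("non-text attention mass must be positive")
    per_token_text = attn_text / n_text
    per_token_other = attn_other / n_other
    return per_token_text / per_token_other

# Hypothetical layer: 20 text tokens receive 40% of the attention mass,
# 580 image tokens receive the remaining 60%.
mdi = modality_dominance_index(0.4, 0.6, 20, 580)

# Equal mass on equal token counts gives a balanced index.
balanced = modality_dominance_index(0.5, 0.5, 10, 10)
```

Although the image tokens collect more attention in total, each text token receives roughly 19 times more attention than each image token, which is the situation the per-token normalization is meant to expose.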

1.3 Gradient and Optimization Dynamic-Based Measures

For detecting optimization bias, MDS-type indicators use gradient magnitudes and directions. For modality subsets $S_a$, $S_b$ and gradients $g_a$, $g_b$ of their losses with respect to shared parameters, dominance is assessed via:

Cosine similarity $\cos(g_a, g_b) = \frac{g_a^\top g_b}{\|g_a\|\,\|g_b\|}$
If $\cos(g_a, g_b) < 0$ and $\|g_a\| > \|g_b\|$, $S_a$ dominates and may suppress $S_b$. This gradient-based indicator is key to algorithms such as Gradient-guided Modality Decoupling (GMD), which actively removes conflicting update components to debias training (Wang et al., 2024, Liu et al., 2 Jan 2026).
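The conflict test and one way of removing a conflicting component can be sketched as below. This is a generic projection-based illustration and is not claimed to be the exact GMD procedure. The names `g_a`, `g_b`, `dominance_check`, and `remove_conflict` are hypothetical, and gradients are represented as flat lists of floats.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v, eps=1e-12):
    # eps (assumed) avoids division by zero for vanishing gradients
    return dot(u, v) / (norm(u) * norm(v) + eps)

def dominance_check(g_a, g_b):
    """Return (conflict, dominant): gradients conflict when their cosine is
    negative, and the larger-norm gradient is flagged as dominant
    (ties go to "b")."""
    if cosine(g_a, g_b) >= 0.0:
        return False, None
    return True, ("a" if norm(g_a) > norm(g_b) else "b")

def remove_conflict(g, g_ref, eps=1e-12):
    """Subtract from g its projection onto g_ref when the two oppose each other."""
    d = dot(g, g_ref)
    if d >= 0.0:
        return list(g)
    scale = d / (dot(g_ref, g_ref) + eps)
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]

# Toy gradients on two shared parameters.
g_a = [3.0, 0.0]
g_b = [-1.0, 1.0]
conflict, dominant = dominance_check(g_a, g_b)
adjusted = remove_conflict(g_a, g_b)
```

After the projection, the adjusted dominant gradient is orthogonal to the weaker one, so the update no longer works directly against the suppressed modality subset.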

1.4 Dynamic Sample-Specific Dominance

In adaptive frameworks, MDS is defined per input as a softmaxed vector $w = (w_l, w_a, w_v)$ over modalities, learned from the concatenated unimodal representations. For $m \in \{l, a, v\}$ (language, acoustic, visual), $w_m \ge 0$ and $\sum_m w_m = 1$. The dominant modality per sample is $m^* = \arg\max_m w_m$ (Yang et al., 9 Nov 2025).
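A forward pass of such a per-sample scorer might look like the sketch below. The two-layer structure with a ReLU, the tiny feature sizes, and the hand-set weights are all assumptions for illustration. In an actual system the weights would be learned jointly with the rest of the model.

```python
import math

MODALITIES = ["language", "acoustic", "visual"]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def dominance_weights(h_l, h_a, h_v, w1, b1, w2, b2):
    """Concatenate unimodal features, apply a shallow MLP, and softmax."""
    x = h_l + h_a + h_v                                 # list concatenation
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]   # ReLU
    return softmax(linear(hidden, w2, b2))

# Hypothetical unimodal features (2 + 1 + 1 = 4 dimensions) and fixed weights.
h_l, h_a, h_v = [1.0, 0.0], [0.0], [0.5]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
b2 = [0.0, 0.0, 0.0]

weights = dominance_weights(h_l, h_a, h_v, w1, b1, w2, b2)
dominant = MODALITIES[max(range(len(weights)), key=weights.__getitem__)]
```

The softmax guarantees non-negative weights that sum to one, so the output can be used directly both to pick a primary modality and to rescale features.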

2. Computation and Implementation Protocols

MDS computations are standardized in recent literature to ensure applicability across tasks and architectures.

2.1 Algorithmic Recipes

Activation-based MDS: Evaluate forward passes for each input with isolated modalities to extract activations or attentions, then aggregate and normalize as per the metric’s definition (Yan et al., 16 Feb 2025, Wu et al., 14 Aug 2025).
Gradient-based MDS: For each modality subset, record backpropagated gradients w.r.t. shared parameters, compute pairwise cosine similarity and projection, and decouple or reweight conflicting components (Wang et al., 2024, Liu et al., 2 Jan 2026).
Cross-modal attention MDI/MDS: Aggregate cross-modal attention weights over all outputs, layers, and heads, then normalize and compare on a per-token or per-feature basis (Wu et al., 14 Aug 2025).
Sample-specific MDS: Aggregate unimodal features, concatenate, project through a shallow MLP, and softmax to yield per-sample dominance weights (Yang et al., 9 Nov 2025).

2.2 Normalization and Thresholding

Linear min–max normalization (for feature entropy, gradient norms) as applied in RGB-IR MDI (Liu et al., 2 Jan 2026).
Softmax or convex combination for multi-modality dominance vector.
Empirical mean and deviation-based category thresholds for neuron-level attribution.
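As a minimal sketch of the first option, linear min-max normalization maps a set of raw statistics (for example, per-modality gradient norms) onto the unit interval. The fallback value for a constant input is an assumption of this example.

```python
def min_max(values, eps=1e-12):
    """Linearly rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi - lo < eps:
        # Assumed fallback: a constant input carries no ranking information.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical gradient norms recorded for three modalities.
normalized = min_max([2.0, 4.0, 6.0])
constant = min_max([3.0, 3.0])
```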

3. Integration into Multimodal Learning Frameworks

The MDS is exploited for diagnostic, debiasing, and adaptive fusion purposes.

Framework/Module	Use of MDS/MDI	Reference
MDACL (RGB-IR detection)	Guides hierarchical teacher–student selection, activation weighting, and minimal inverse weighting fusion	(Liu et al., 2 Jan 2026)
MODS (multimodal sentiment analysis)	Selects primary modality and re-scales features per sample	(Yang et al., 9 Nov 2025)
GMD (missing-modality robustness)	Monitors gradient conflict; prunes dominant update directions	(Wang et al., 2024)
Multi-modal CLIP analysis	Categorizes features as vision, language, cross-modal	(Yan et al., 16 Feb 2025)

Within these frameworks, MDS supports:

Hierarchical Cross-modal Guidance: selecting dominant feature maps as teacher/mentor for distillation and spatial reprojection (Liu et al., 2 Jan 2026).
Adversarial Equilibrium Regularization: dynamically down-weighting the dominant branch and up-weighting the weaker at fusion time (Liu et al., 2 Jan 2026).
Task-specific fusion: adapting to sample-level or task-level modality requirements (Yang et al., 9 Nov 2025).
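The idea of down-weighting the dominant branch at fusion time can be illustrated with a simple inverse-weighting rule. This is a sketch under the assumption that fusion weights are proportional to the reciprocal of each branch's dominance score. It is not the specific formulation used in MDACL, and the function names are hypothetical.

```python
def inverse_dominance_weights(dominance, eps=1e-6):
    """Fusion weights proportional to 1 / dominance, normalized to sum to 1."""
    inverse = [1.0 / (d + eps) for d in dominance]
    total = sum(inverse)
    return [w / total for w in inverse]

def fuse(features, weights):
    """Weighted sum of equally sized feature vectors, one per modality."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Hypothetical dominance scores for an RGB branch and an IR branch.
fusion_weights = inverse_dominance_weights([0.8, 0.2])
fused = fuse([[1.0, 0.0], [0.0, 1.0]], fusion_weights)
```

With these scores the dominant RGB branch receives a weight near 0.2 and the weaker IR branch a weight near 0.8.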

4. Empirical Findings and Benchmarks

Systematic evaluation demonstrates the utility of MDS-based control in mitigating optimization bias, balancing representation, and improving task metrics.

On RGB-IR detection, using MDI for fusion achieves higher mAP and lower gradient bias compared to forward-weighted or uniform baselines. Ablation studies reveal cumulative performance gains for low-level, high-level, and composite HCG and MIW modules (Liu et al., 2 Jan 2026).
In sentiment analysis, dynamic sample-wise MDS outperforms fixed-modality strategies by 2–4 percentage points across multiple datasets, confirming that modality dominance is both variable and actionable at the individual instance level (Yang et al., 9 Nov 2025).
MDIs measured from attention in MLLMs reveal severe text dominance in late layers, often with per-token attention more than an order of magnitude higher than for visual or other non-text modalities; token compression methods substantially rebalance MDI toward unity (MDI $\approx$ 0.86) (Wu et al., 14 Aug 2025).
In vision-language neural analysis, MDS-based partitioning generates feature categories that align with semantically interpretable axes, facilitates bias audits, and enables modally targeted generative control (Yan et al., 16 Feb 2025).

5. Extensions and Generalizations

Recent formulations expand MDS beyond simple dual-modality regimes, encompassing diverse phenomena and providing new interpretability axes.

Generalization to $N$ modalities with dominance vectors $d = (d_1, \dots, d_N)$ allows for entropy- or variance-based meta-scores capturing cross-modal balance or imbalance (Wu et al., 14 Aug 2025).
Efficiency-corrected variants (e.g., the Attention Efficiency Index) account for disproportionate tokenization, redundancy, or semantic density.
Task- and layer-adaptive normalization aligns MDS with expected contribution priors for each modality and measurement depth (Wu et al., 14 Aug 2025).
Dynamic MDS in fusion architectures enables per-sample adaptation rather than fixed fusion rules (Yang et al., 9 Nov 2025).
Information-theoretic and mutual information-based MDS extensions are proposed for finer-grained modality sensitivity detection beyond ratio statistics (Yan et al., 16 Feb 2025).
The MDS principle underlies ablation protocols (e.g., the Modality Importance Score, MIS, computed via accuracy under input permutation) that validate importance assignments and inform dataset curation (Park et al., 2024).
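One possible entropy-based meta-score over a dominance vector is the normalized Shannon entropy sketched below. The choice of this particular score and the function name are assumptions made for illustration. The score is 1 when all modalities contribute equally and 0 when a single modality carries all the weight.

```python
import math

def balance_entropy(dominance):
    """Normalized Shannon entropy of a non-negative dominance vector (N >= 2)."""
    if len(dominance) < 2:
        raise ValueError("need at least two modalities")
    total = sum(dominance)
    probs = [d / total for d in dominance]
    # Terms with zero probability contribute nothing to the entropy.
    entropy = sum(-p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

balanced_score = balance_entropy([1.0, 1.0, 1.0])
collapsed_score = balance_entropy([1.0, 0.0, 0.0])
skewed_score = balance_entropy([0.8, 0.1, 0.1])
```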

6. Limitations, Challenges, and Future Directions

MDS and related indices present several caveats and open areas:

Reliance on empirical statistics and thresholding for category assignment can be arbitrary; clustering or probabilistic criteria may yield improved robustness (Yan et al., 16 Feb 2025).
In architectures with complex or asynchronous cross-modal entanglement, attribution of dominance can be nontrivial, especially for highly polysemantic representations.
Domain transferability of MDS operationalizations is currently limited; further work on speech, graph, and egocentric sensor data is ongoing.
MDS provides a statistical, not causal, account: high dominance may not reflect true irreducible modality utility unless verified by intervention (permutation, dropout) (Park et al., 2024).
Human-in-the-loop or calibrated synthetic benchmarks are advocated for validating automated MDS assignments.
Extensions to multi-scale, temporally resolved, or semantically conditioned dominance scores are proposed to bridge local/global, instant/sequential modality interplay (Wu et al., 14 Aug 2025, Yang et al., 9 Nov 2025).
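The intervention check mentioned among the caveats can be sketched as a permutation test: shuffle one modality's inputs across samples and measure the change in accuracy. This is a generic illustration rather than the exact MIS definition of Park et al. The sample layout, the `predict` callable, and the function names are hypothetical.

```python
import random

def accuracy(predict, samples):
    correct = sum(1 for s in samples if predict(s["inputs"]) == s["label"])
    return correct / len(samples)

def permutation_importance(predict, samples, modality, seed=0):
    """Accuracy drop after shuffling one modality's inputs across samples."""
    rng = random.Random(seed)
    shuffled = [s["inputs"][modality] for s in samples]
    rng.shuffle(shuffled)
    permuted = []
    for sample, value in zip(samples, shuffled):
        inputs = dict(sample["inputs"])   # copy so the original data is untouched
        inputs[modality] = value
        permuted.append({"inputs": inputs, "label": sample["label"]})
    return accuracy(predict, samples) - accuracy(predict, permuted)

# Toy setup: the label equals the video signal, and text is irrelevant.
samples = [
    {"inputs": {"video": i % 2, "text": "t%d" % i}, "label": i % 2}
    for i in range(8)
]

def video_only_model(inputs):
    return inputs["video"]

text_importance = permutation_importance(video_only_model, samples, "text")
video_importance = permutation_importance(video_only_model, samples, "video")
```

A modality whose permutation leaves accuracy unchanged, like the text input here, is not being used by the model regardless of how large its activation- or attention-based dominance score may be.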

7. Comparative Summary of MDS Instantiations

Metric	Domain	Definition Basis	Dominance Detection Mechanism	Reference
Feature Activation MDS	Vision-language	Ratio of feature activations	Feature-level task/semantic analysis	(Yan et al., 16 Feb 2025)
Attention-based MDI/MDS	MLLMs	Mean attention per token	Cross-modal attention statistics	(Wu et al., 14 Aug 2025)
Gradient-based Score	Multi-modal fusion	Cosine/projection of gradients	Conflict removal, decoupling	(Wang et al., 2024, Liu et al., 2 Jan 2026)
Dynamic Dominance Vector	Multimodal sentiment	Softmax over fused unimodal features	Sample-level fusion and weighting	(Yang et al., 9 Nov 2025)
MIS (accuracy-based)	Video QA	Performance difference on ablated subsets	Paired accuracy/permutation	(Park et al., 2024)

These instantiations reflect a converging consensus: robust multimodal learning demands quantification and management of modality dominance at multiple algorithmic levels. The MDS framework, through its diverse embodiments, provides both theoretical insight and practical tools for building balanced, interpretable, and better-performing multimodal AI systems.