Per-Instance Modality Weighting

Updated 1 April 2026

Per-instance modality weighting is a technique that dynamically assigns importance to each modality per sample based on reliability, information content, or uncertainty.
It improves multimodal fusion by addressing modality imbalance and optimizing performance under noisy or incomplete data conditions.
Mathematical frameworks like uncertainty-based gates and logit decomposition enable precise, instance-specific calibration across diverse applications.

Per-instance modality weighting is a class of techniques in multimodal learning where the contribution of each input modality (e.g., text, image, audio, tabular data) is adaptively determined at the granularity of each individual sample or decision. Unlike global or dataset-level weighting—which may bias toward a dominant modality regardless of instance context—per-instance weighting dynamically allocates importance according to sample-specific information content, reliability, contribution, or uncertainty. This approach addresses modality imbalance, enhances robustness, supports interpretability, and enables calibrated operation even under partial observability or noisy conditions.

1. Motivation and Conceptual Foundations

Early multimodal models often naively aggregated modalities, producing fixed or heuristically balanced combinations, and were prone to dominance by the highest-capacity or least-noisy branch. This dominance leads to suboptimal fusion, where weaker but informative modalities can be ignored, and strong modalities can mislead under specific failure modes. Per-instance modality weighting addresses these challenges by:

Quantifying, for each example, the salient contribution, reliability, or information gain from each modality branch.
Enabling sample-specific calibration tailored to missing data, uncertain predictions, or distributional shifts.
Providing interpretability into which modality drove the final decision for a given case, thus facilitating auditability and clinical or domain trust (Bakumenko et al., 19 Nov 2025).

A crucial insight driving this research direction is that multimodal systems face finite and variable “information budgets” and must allocate fusion weights adaptively rather than statically (Xiong et al., 18 Mar 2026, Huang et al., 13 Feb 2026).

2. Mathematical Frameworks and Algorithms

Multiple formal mechanisms for per-instance modality weighting have been introduced:

2.1. Linear Decomposition and Attribution

In transparent ensemble architectures, such as late-fusion logistic regression, instance-wise fusion weights are obtained by decomposing the decision logit as a linear sum of standardized modality logits, scaled by learned fusion coefficients. For each instance $i$, the logistic regression fuses standardized branch logits $\tilde z^v_i, \tilde z^n_i$ by

$\ell_i = b + w_v \tilde z^v_i + w_n \tilde z^n_i$

Decomposing $\ell_i$ gives absolute logit contributions $c^v_i$, $c^n_i$ and normalized shares $\tilde{s}^v_i$, $\tilde{s}^n_i$ per instance, directly quantifying the relative impact of each modality (Bakumenko et al., 19 Nov 2025).
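The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the cited implementation: the function name `decompose_logit` is hypothetical, and the definition of the normalized shares (absolute contributions divided by their sum, with an even split when both are zero) is an assumption, since the text does not spell out the normalization.

```python
import numpy as np

def decompose_logit(z_v, z_n, w_v, w_n, b=0.0):
    """Split a two-branch late-fusion logit into contributions and shares."""
    z_v = np.asarray(z_v, dtype=float)
    z_n = np.asarray(z_n, dtype=float)
    c_v = w_v * z_v                      # logit contribution of branch v
    c_n = w_n * z_n                      # logit contribution of branch n
    logit = b + c_v + c_n                # fused decision logit per instance
    total = np.abs(c_v) + np.abs(c_n)
    # Assumption: shares are absolute contributions normalised to sum to 1;
    # instances where both contributions are zero get an even split.
    safe_total = np.where(total > 0, total, 1.0)
    s_v = np.where(total > 0, np.abs(c_v) / safe_total, 0.5)
    s_n = 1.0 - s_v
    return logit, c_v, c_n, s_v, s_n

# Three illustrative instances with made-up standardized branch logits.
logit, c_v, c_n, s_v, s_n = decompose_logit(
    z_v=[1.0, -0.5, 0.0], z_n=[0.5, 2.0, 0.0], w_v=2.0, w_n=1.0, b=-0.25
)
```

Because the fusion layer is linear, the contributions are exact rather than approximate attributions: adding the bias to `c_v + c_n` reproduces the logit for every instance.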

2.2. Information-Theoretic and Uncertainty-Based Gates

The IIBalance framework introduces Intrinsic Information Budgets (IIB) at the dataset level, then adjusts fusion weights at the instance level by combining the global information budgets $\beta_m$ with the per-sample uncertainty $u_m^{(i)}$. The gate input encodes the per-sample uncertainty and pooled embeddings, and a shallow network calibrates the weights. The result is a probabilistic, Bayesian-inspired weighting that respects both global information capacity and local uncertainty (Xiong et al., 18 Mar 2026).
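To make the interaction between a global budget and a per-sample uncertainty concrete, the sketch below uses a closed-form softmax gate. It is not the IIBalance formula: the source describes a shallow network over uncertainty and pooled embeddings, whereas this illustration assumes the simplest possible stand-in, a score of log-budget minus uncertainty. The function name `budget_uncertainty_weights` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = np.asarray(x, dtype=float)
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def budget_uncertainty_weights(beta, u, temperature=1.0):
    """beta: (M,) positive global budgets; u: (N, M) per-sample uncertainties.

    Assumed gate: softmax over log(beta_m) - u_m^(i). A modality gains weight
    from a larger global budget and loses weight when it is uncertain on
    this particular sample.
    """
    beta = np.asarray(beta, dtype=float)
    u = np.asarray(u, dtype=float)
    scores = (np.log(beta)[None, :] - u) / temperature
    return softmax(scores, axis=1)

# Two modalities, two samples: equal uncertainty, then modality 0 unreliable.
weights = budget_uncertainty_weights(
    beta=[0.6, 0.4], u=[[0.1, 0.1], [2.0, 0.1]]
)
```

With equal uncertainties the weights fall back to the normalized budgets, and a sample on which one modality is highly uncertain shifts weight toward the other, which is the qualitative behaviour the text describes.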

2.3. Proxy-Task Reliability and Confidence

In MARGO for recommendation, the model extracts a per-modality reliability vector for each training triple (user, positive item, negative item), reflecting whether each modality helps to rank the positive item above the negative one. It then supervises the learning of fusion weights via a confidence-weighted KL-divergence loss. A confidence scalar based on the final BPR margin modulates this loss, so that unreliable or ambiguous instances provide weaker supervision (Dong et al., 23 Apr 2025).
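A confidence-weighted KL loss of this kind can be sketched as follows. The exact construction in MARGO is not given here, so every specific is an assumption made for illustration: reliability targets are a softmax over hypothetical per-modality margins, the confidence is a sigmoid of the final margin, and the KL direction is target-to-prediction. The name `confidence_weighted_kl` is likewise hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = np.asarray(x, dtype=float)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_kl(reliability, fusion_weights, confidence, eps=1e-12):
    """Mean over triples of confidence * KL(reliability || fusion_weights)."""
    r = np.clip(np.asarray(reliability, dtype=float), eps, 1.0)
    w = np.clip(np.asarray(fusion_weights, dtype=float), eps, 1.0)
    c = np.asarray(confidence, dtype=float)
    kl = np.sum(r * (np.log(r) - np.log(w)), axis=1)   # one value per triple
    return float(np.mean(c * kl))

# Hypothetical per-modality margins: positive-item score minus negative-item
# score under each modality alone, for two training triples.
modality_margins = np.array([[1.0, -1.0], [0.2, 0.1]])
reliability = softmax(modality_margins, axis=1)        # assumed soft targets

# Assumed confidence: sigmoid of the fused model's margin for each triple.
final_margin = np.array([2.0, 0.05])
confidence = 1.0 / (1.0 + np.exp(-final_margin))

uniform = np.full_like(reliability, 0.5)
loss_uniform = confidence_weighted_kl(reliability, uniform, confidence)
loss_matched = confidence_weighted_kl(reliability, reliability, confidence)
```

The loss vanishes when the predicted fusion weights match the reliability targets, and a triple with a near-zero final margin contributes less than a confidently ranked one.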

2.4. Score Calibration and Penalty Fusion

In training-free composed retrieval (FreeDom), instance-level image and text similarities are min-normalized and then combined multiplicatively, with an added penalty term. The multiplicative “AND” rewards balanced, high-confidence cues, while the penalty suppresses one-sided matches (Psomas et al., 29 Oct 2025). This implements instance-dependent balancing during retrieval.
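The effect of a multiplicative combination with a one-sidedness penalty can be illustrated with a small sketch. The precise normalization and penalty used by FreeDom are not reproduced here; this example assumes that min-normalization means subtracting the per-query minimum and that the penalty is proportional to the absolute gap between the two normalized scores. The function name `and_fusion` and the parameter `lam` are hypothetical.

```python
import numpy as np

def and_fusion(sim_image, sim_text, lam=0.1):
    """Fuse per-candidate image and text similarities for one query."""
    s_img = np.asarray(sim_image, dtype=float)
    s_txt = np.asarray(sim_text, dtype=float)
    # Assumed min-normalisation: shift each score list so its minimum is 0.
    s_img = s_img - s_img.min()
    s_txt = s_txt - s_txt.min()
    # Product acts as a soft AND: both cues must be high for a high score.
    product = s_img * s_txt
    # Assumed penalty: candidates matching only one cue are pushed down.
    penalty = lam * np.abs(s_img - s_txt)
    return product - penalty

# Candidate 0 matches both cues, candidate 1 matches the image only,
# candidate 2 matches neither.
fused = and_fusion(sim_image=[0.9, 0.9, 0.1], sim_text=[0.8, 0.1, 0.1])
ranking = np.argsort(-fused)
```

Under these assumptions the candidate that agrees with both the image and the text ranks first, and the image-only match ends up below even the candidate that matches neither cue.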

2.5. KL-Divergence–Driven and MI-Adjusted Weights

BTW computes, for each instance $i$ and modality $m$, the KL divergence between the unimodal and joint predictions, then forms either normalized instance weights or bi-level weights that additionally use a global mutual-information estimate for each modality. This two-level scheme stabilizes variance and promotes global alignment (Hou et al., 25 Aug 2025).
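The instance-level part of such a scheme can be sketched as below. This is an illustration only: the KL direction (unimodal relative to joint), the normalization across modalities, and the names `kl_rows` and `kl_instance_weights` are assumptions, and the global mutual-information level is omitted because its form is not given here.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """Row-wise KL(p || q) for arrays of categorical distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def kl_instance_weights(unimodal_probs, joint_probs):
    """unimodal_probs: (M, N, C); joint_probs: (N, C).

    Returns the (N, M) divergences and their per-instance normalisation.
    """
    div = np.stack([kl_rows(p_m, joint_probs) for p_m in unimodal_probs],
                   axis=1)
    total = div.sum(axis=1, keepdims=True)
    n_modalities = div.shape[1]
    # Assumption: weights are divergences normalised across modalities;
    # if all divergences are zero, fall back to uniform weights.
    safe_total = np.where(total > 0, total, 1.0)
    weights = np.where(total > 0, div / safe_total, 1.0 / n_modalities)
    return div, weights

# One instance, two classes: modality 0 agrees with the joint prediction,
# modality 1 is far more confident than the joint model.
joint = np.array([[0.5, 0.5]])
unimodal = np.array([[[0.5, 0.5]], [[0.9, 0.1]]])
div, weights = kl_instance_weights(unimodal, joint)
```

Whether a large divergence should raise or lower a modality's weight depends on the method's objective; the sketch only shows how the per-instance signal is computed and normalized.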

3. Modalities, Domains, and Architectures

Per-instance weighting mechanisms have been instantiated across diverse architectures and domains, including:

Clinical prediction: Joint time-series (LSTM) and text (Transformer) models for ICU mortality (Bakumenko et al., 19 Nov 2025).
Visual-linguistic retrieval: CLIP-based image-text composition with posthoc score calibration (Psomas et al., 29 Oct 2025).
Recommender systems: BPR-trained visual + textual joint recommenders (Dong et al., 23 Apr 2025).
Multitask learning: Per-instance, per-task weights in human pose/shape and semantic segmentation (Vasu et al., 2021).
Multimodal transformers and CNNs: MWAM dynamically gates modalities by frequency-domain metrics (Lu et al., 26 Feb 2026).
MoE-based fusion: Instance-level variance stabilization for sentiment and clinical prediction (Hou et al., 25 Aug 2025).
Multimodal LLMs: Adaptive preference steering by entropy-based diagnostics (Huang et al., 13 Feb 2026).

The majority of mechanisms operate in a late-fusion or modular-branch setting, where each modality’s specialized encoder produces a partially independent prediction or representation.

4. Empirical Effects and Diagnostics

Extensive empirical investigations demonstrate the benefits of per-instance weighting across tasks:

Domain	Main Metric(s)	Static Fusion	Per-instance Weighting	Gain
ICU Mortality	AUROC, AUPRC	0.876/0.526	0.891/0.565	+0.015/+0.039 AUROC/AUPRC over best single modality
Multimodal Recommendation	Recall@20	baseline	+3.3%	+3.3% Recall@20 (Amazon datasets)
Sentiment Regression	MAE, Acc	0.735/52.28%	0.714/54.28% (BTW)	Lower MAE, +2% accuracy
Segmentation	Dice, PCR	85.07/5.62	85.93/5.32 (MWAM)	+0.86 Dice, −0.3 PCR
Composed Retrieval	macro-mAP	28.5%	31.6% (FreeDom)	+3 pp macro-mAP (i-CIR)

In addition to discrimination gains, several frameworks achieve:

Calibration: Expected calibration error (ECE) as low as 0.133 (Bakumenko et al., 19 Nov 2025).
Robustness to missing modalities: Calibrated fallback preserves accuracy under ablation (Bakumenko et al., 19 Nov 2025).
Fine-grained diagnostics: Per-instance weighting exposes which modality dominates, supports conflict inspection and audit (Bakumenko et al., 19 Nov 2025, Dong et al., 23 Apr 2025).
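For readers unfamiliar with the calibration metric, ECE is commonly computed by binning predictions and averaging the gap between predicted probability and observed frequency. The sketch below is a generic binary-classification version with equal-width bins; the bin count and the function name are assumptions, and it is not claimed to match the evaluation protocol of the cited work.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted mean |predicted prob - observed positive rate|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in [0, n_bins - 1].
    bin_idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by fraction of samples
    return float(ece)

# Overconfident toy predictions: 0.15 for negatives, 0.85 for positives.
ece_toy = expected_calibration_error([0.15, 0.15, 0.85, 0.85], [0, 0, 1, 1])
# Predicting 0.5 on a balanced pair is perfectly calibrated in this sense.
ece_balanced = expected_calibration_error([0.5, 0.5], [0, 1])
```

Lower values indicate that predicted probabilities track observed outcome rates more closely.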

Ablations consistently show degradation when removing adaptive weighting, demonstrating that fine-grained weighting is essential to exploit weak or redundant modalities, suppress noise, and stabilize training (Xiong et al., 18 Mar 2026, Hou et al., 25 Aug 2025, Lu et al., 26 Feb 2026).

5. Implementation Strategies and Algorithmic Patterns

While implementation details vary, several common strategies recur:

Logit Decomposition: Linear fusion layer admitting post-hoc analytic attribution (Bakumenko et al., 19 Nov 2025).
Auxiliary Networks or Gating: Small neural modules predict modality weights from instance-level statistics (Xiong et al., 18 Mar 2026, Huang et al., 13 Feb 2026).
Proxy Task–Driven Labels: Use margin-based reliability from BPR, contrastive, or reconstruction objectives to define weak supervision for modality weights (Dong et al., 23 Apr 2025).
Information-Theoretic Metrics: Entropy, mutual information, KL divergence as alignment/gating signals (Xiong et al., 18 Mar 2026, Hou et al., 25 Aug 2025).
Frequency-Domain Preference: DCT-based frequency energy ratios for low-cost, instance-aware gating (Lu et al., 26 Feb 2026).
Self-calibrated Combination: Multiplicative/penalized fusion schemes rewarding balanced agreement (Psomas et al., 29 Oct 2025).
Sparsity and Efficiency: Intermediate representations (weights, gates) are computed on-device with low runtime or memory overhead, and can be deployed at inference time with or without trainable weights (Lu et al., 26 Feb 2026, Psomas et al., 29 Oct 2025).

6. Challenges, Limitations, and Ongoing Directions

Several open questions and limitations persist:

Most current methods assume the availability of all modalities during training; handling consistently missing or corrupted modalities in production is still an active area (Bakumenko et al., 19 Nov 2025).
Theoretical justifications vary; some techniques rely on information theory, others are empirical or heuristic (Xiong et al., 18 Mar 2026, Hou et al., 25 Aug 2025).
Interpretability is higher for linear-fusion and explicit diagnostic mechanisms, while deep gating mechanisms may require post-hoc analysis for transparency (Bakumenko et al., 19 Nov 2025, Xiong et al., 18 Mar 2026).
Scalability to high numbers of modalities can pose challenges for both parameterization and marginal calibration (Hou et al., 25 Aug 2025).
Decision-layer bias remains a concern when per-instance weights are not available; future work advocates capability-aware allocation even at the output stage (Ma et al., 16 Oct 2025).

7. Connections and Generalization

Per-instance modality weighting extends naturally to multitask and multiobjective learning, where learned per-instance, per-task weights control the impact of each task's loss on each sample (Vasu et al., 2021). The common thread is that per-sample reliability and informativeness are not uniform across the dataset or across modalities; adaptive allocation both boosts performance and enhances interpretability. Unifying themes include:

Reliance on instance-level diagnostic proxies (uncertainty, margin, influence) as core signals.
Soft supervision for weight learning, often via calibration or regularization losses.
Modular architectures enabling decomposable attributions and plug-and-play interventions (Bakumenko et al., 19 Nov 2025, Lu et al., 26 Feb 2026).
Empirically demonstrated robustness to noise, missingness, and imbalance across tasks and architectures.

A plausible implication is that as multimodal systems further scale in capacity and complexity, per-instance weighting mechanisms will become integral for achieving not only state-of-the-art accuracy but also reliable, interpretable, and robust operation.