Integrity-guided Adaptive Fusion
- The paper introduces a novel mechanism that dynamically selects the dominant modality for fusion using learned integrity scores and multi-scale reconstruction losses.
- Integrity-guided Adaptive Fusion is a framework that quantifies the completeness of input modalities and adaptively incorporates auxiliary data when primary signals are deficient.
- It leverages cross-modal attention and a two-stage training strategy to maintain robustness and fine-grained predictions even under extreme missing data conditions.
Integrity-guided Adaptive Fusion (IF) is a mechanism within multimodal learning pipelines that dynamically selects and fuses information from multiple modalities based on their estimated integrity and reconstructed quality. In contemporary Multimodal Sentiment Analysis (MSA), where input data may suffer from uncertain or severe modality missingness, integrity-guided fusion prioritizes the most complete and trustworthy modalities during inference and enables the model to adaptively exploit auxiliary signals when the dominant input is noisy or deficient. This approach has been operationalized in the Senti-iFusion framework, which targets both inter- and intra-modality missingness and achieves state-of-the-art results in fine-grained sentiment analysis tasks (Li et al., 21 Nov 2025).
1. Design Objectives and Core Principles
Integrity-guided Adaptive Fusion is constructed to address three central challenges in multimodal fusion under missing data: (i) selecting modalities for fusion on-the-fly, based on completeness and reconstruction quality; (ii) leveraging fallback strategies that incorporate auxiliary modalities when the dominant one is compromised; and (iii) fusing representations via cross-modal attention mechanisms rather than naive concatenation or averaging. The integrity of each modality is quantified as a learned integrity score, with higher scores signifying more complete and reliable input. Quality, while not explicitly represented as a scalar, is enforced during training via multi-scale reconstruction and mutual-information losses, which collectively compel the recovery of semantic content consistent with the original input. This design enables the fusion process to privilege the most reliable modality for each sample or batch, while adaptively attending to auxiliary modalities in proportion to their semantic utility.
2. Mathematical Formulation and Training Objectives
Let the three input modalities (language, acoustic, visual) be subjected to random inter- and intra-modality masking, yielding incomplete inputs. Feature extraction and integrity estimation proceed as follows:
- Per-modality embedding encoders map the masked inputs to incomplete embeddings.
- A 2-layer Transformer-based integrity estimator, applied to each incomplete embedding with a prepended integrity token, produces a per-modality integrity score; it is trained with a dedicated integrity-estimation loss (a sketch of such an estimator follows this list).
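The estimator's exact parameterization is not reproduced above; the following is a minimal PyTorch sketch, assuming a learnable integrity token is prepended to the incomplete embedding (as in the pseudocode of Section 4) and a sigmoid head maps the encoded token to a score in $[0,1]$. All module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class IntegrityEstimator(nn.Module):
    """Illustrative 2-layer Transformer that maps an incomplete embedding
    sequence to a scalar integrity score in [0, 1]."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.integrity_token = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, incomplete_embed: torch.Tensor) -> torch.Tensor:
        # incomplete_embed: (batch, seq_len, dim)
        tok = self.integrity_token.expand(incomplete_embed.size(0), -1, -1)
        h = self.encoder(torch.cat([tok, incomplete_embed], dim=1))
        return torch.sigmoid(self.head(h[:, 0]))  # (batch, 1) integrity scores

scores = IntegrityEstimator()(torch.randn(8, 50, 128))  # one score per sample
```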
Reconstruction quality is enforced using two sets of loss terms:
- Feature-level MSE and mutual-information (MI) losses compare the completed features to their original, unmasked counterparts.
- Semantic-level losses compare the disentangled shared features against re-encoded reconstructions (a sketch of the feature-level terms follows this list).
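The exact mutual-information estimator is not reproduced above; the sketch below pairs the feature-level MSE term with an InfoNCE-style contrastive loss, a common stand-in for an MI lower bound. Function and variable names are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def feature_level_losses(completed: torch.Tensor, original: torch.Tensor,
                         temperature: float = 0.1):
    """MSE plus an InfoNCE-style surrogate for the mutual-information term.
    completed, original: (batch, dim) pooled feature vectors."""
    mse = F.mse_loss(completed, original)

    # InfoNCE: each completed feature should identify its own original
    # among the other samples in the batch.
    z1 = F.normalize(completed, dim=-1)
    z2 = F.normalize(original, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (batch, batch)
    targets = torch.arange(z1.size(0), device=z1.device)
    mi_loss = F.cross_entropy(logits, targets)

    return mse, mi_loss
```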
Dominant-modality selection is batch-based: the modality with the highest integrity score averaged over the batch is chosen as dominant, and the remaining modalities serve as auxiliaries (see the sketch below).
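A minimal sketch of this batch-based rule, mirroring the dominant-modality selection in the pseudocode of Section 4 (dictionary layout and names are illustrative):

```python
import torch

def pick_dominant(integrity: dict[str, torch.Tensor]) -> tuple[str, list[str]]:
    """Return the modality with the highest batch-mean integrity score
    and the remaining modalities as auxiliaries."""
    dominant = max(integrity, key=lambda m: integrity[m].mean().item())
    auxiliaries = [m for m in integrity if m != dominant]
    return dominant, auxiliaries

scores = {"language": torch.rand(64), "acoustic": torch.rand(64), "visual": torch.rand(64)}
dominant, auxiliaries = pick_dominant(scores)
```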
3. Architecture and Data Flow
The architecture comprises several sequential modules:
| Stage | Function | Output |
|---|---|---|
| Input masking | Simulate real-world missingness | Masked multimodal features |
| Embedding | Encode masked modalities | Incomplete embeddings |
| Integrity estimation | Assess completeness of each modality | Integrity scores |
| Integrity-weighted completion | Disentangle and reconstruct features | Surrogate and recovered features |
| Adaptive fusion | Select dominant modality, cross-attend | Fused representation |
| Prediction | Final Transformer+linear head | Sentiment prediction |
The completion module builds surrogate features as an integrity-weighted blend: each incomplete embedding is scaled by its integrity score and combined with the complementary-weighted sum of the shared features from all modalities, mirroring the surrogate construction in the pseudocode of Section 4. It then proceeds by decoding, re-encoding, and enforcing dual-depth losses; a sketch of the construction follows.
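A minimal sketch of the surrogate construction, following the integrity-weighted blend in the pseudocode of Section 4 (variable names and tensor shapes are illustrative):

```python
import torch

def build_surrogate(integrity_score: torch.Tensor,
                    incomplete_embed: torch.Tensor,
                    shared_feats: list[torch.Tensor]) -> torch.Tensor:
    """Blend an incomplete embedding with the summed shared features of all
    modalities, weighted by the estimated integrity score.
    integrity_score: (batch, 1); incomplete_embed, shared_feats[i]: (batch, seq, dim)."""
    shared_sum = torch.stack(shared_feats, dim=0).sum(dim=0)
    w = integrity_score.unsqueeze(-1)  # (batch, 1, 1) for broadcasting
    return w * incomplete_embed + (1.0 - w) * shared_sum

surrogate = build_surrogate(torch.rand(8, 1),
                            torch.randn(8, 50, 128),
                            [torch.randn(8, 50, 128) for _ in range(3)])
```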
Adaptive fusion then processes the dominant modality through Transformer layers and fuses it with the auxiliary modalities using attention:
- Queries are projected from the processed dominant representation, while keys and values are projected from each auxiliary surrogate; attention weights are computed by scaled dot-product attention.
- The fused representation is updated recursively across fusion layers by accumulating the attention-weighted auxiliary values (a sketch of a single fusion layer follows this list).
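A minimal sketch of a single fusion layer, assuming single-head scaled dot-product attention with one projection per role (class name, dimensions, and layer structure are illustrative):

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion layer: the dominant representation queries each
    auxiliary surrogate and accumulates the attention-weighted values."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.dim = dim

    def forward(self, dominant: torch.Tensor, auxiliaries: list[torch.Tensor]) -> torch.Tensor:
        # dominant: (batch, seq, dim); each auxiliary surrogate: (batch, seq, dim)
        fused = dominant
        query = self.q(dominant)
        for aux in auxiliaries:
            key, value = self.k(aux), self.v(aux)
            attn = torch.softmax(query @ key.transpose(-2, -1) / math.sqrt(self.dim), dim=-1)
            fused = fused + attn @ value
        return fused

fused = CrossModalFusion()(torch.randn(4, 50, 128), [torch.randn(4, 50, 128) for _ in range(2)])
```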
Classification uses the fused features, prepended with [CLS]-style tokens, processed by a cross-modal Transformer and a linear prediction head.
4. Algorithmic Workflow and Pseudocode
The IF module follows a two-stage training procedure:
- Initial epochs (stage 1): only the integrity-estimation and completion modules are trained, with the final predictor frozen; the objective combines the integrity loss with the feature- and semantic-level completion losses.
- Final epochs (stage 2): all modules are updated end-to-end with the full objective, which adds the sentiment prediction loss to the stage-1 terms (a sketch of this schedule follows).
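A minimal sketch of this schedule, assuming the model exposes its predictor submodule and returns the individual loss terms in a dictionary (attribute and key names are illustrative):

```python
def training_step(model, batch, epoch: int, pretrain_epochs: int = 40):
    """Stage 1 trains only integrity estimation and completion (predictor frozen);
    stage 2 adds the prediction loss and updates all modules end-to-end."""
    stage1 = epoch <= pretrain_epochs
    for p in model.predictor.parameters():    # assumed submodule name
        p.requires_grad_(not stage1)

    out = model(batch)                        # assumed to return a dict of loss terms
    loss = out["integrity_loss"] + out["completion_loss"]
    if not stage1:
        loss = loss + out["prediction_loss"]
    return loss
```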
The operational pseudocode, paraphrased, follows the structure outlined below:
```
for modality in {language, acoustic, visual}:
    incomplete_embed = E_emb^modality(concat([E̲], masked_input[modality]))
    integrity_score  = E_ie^modality(concat([I̲], incomplete_embed))   # completeness
    shared, private  = E^s_modality(incomplete_embed), E^p_modality(incomplete_embed)

    # Surrogate construction, reconstruction, similarity/difference loss computation
    surrogate[modality] = integrity_score * incomplete_embed \
                          + (1 - integrity_score) * sum_of_shared_feats
    completed        = Decoder_modality(surrogate[modality])
    reencoded_shared = E^s_modality(completed)
    # Dual-depth losses: MSE & MI

if epoch <= pretrain_epochs:
    update integrity & completion modules only
else:
    # Adaptive fusion: dominant-modality selection, cross-attention
    dominant_modality = argmax over modalities of mean_integrity_per_modality
    processed_dom = process_through_transformers(surrogate[dominant_modality])
    fused = processed_dom
    for fusion_layer in fusion_layers:
        query = processed_dom @ Q_weight
        for aux_modality in auxiliaries:
            key   = surrogate[aux_modality] @ K_weight
            value = surrogate[aux_modality] @ V_weight
            attention = softmax(query @ key.T / sqrt(d_k))
            fused = fused + attention @ value   # attention-weighted values

    # Prediction: Transformer + linear head
    final_pred    = E_pred(concat([CLS_dom, processed_dom], [CLS_fuse, fused]))
    sentiment_out = Linear(final_pred)
    update all modules end-to-end
```
5. Hyperparameterization and Ablation Outcomes
Key operational hyperparameters are:
- Batch size of $64$, with a fixed input sequence length and hidden dimension.
- AdamW optimizer with weight decay, a cosine-annealed learning rate schedule with warm-up, and early stopping.
- Stage 1: $40$ epochs; stage 2: $110$ (MOSI) or $160$ (MOSEI) epochs.
- Separate loss weights balance the integrity, reconstruction, and prediction terms, along with additional decoder-specific coefficients (the optimizer setup is sketched after this list).
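A sketch of the reported optimizer setup (AdamW with weight decay, warm-up, and cosine annealing); the learning rate, weight decay, and step counts below are placeholder values rather than the paper's settings:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer(model, lr=1e-4, weight_decay=1e-2,
                    warmup_steps=500, total_steps=20_000):
    """AdamW with linear warm-up followed by cosine annealing (placeholder values)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup = LinearLR(opt, start_factor=0.1, total_iters=warmup_steps)
    cosine = CosineAnnealingLR(opt, T_max=total_steps - warmup_steps)
    sched = SequentialLR(opt, schedulers=[warmup, cosine], milestones=[warmup_steps])
    return opt, sched
```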
Ablation experiments at a fixed drop rate quantify the necessity of each module:
- Omitting integrity-weighted surrogates raises MAE (MOSI) from $1.1554$ to $1.1740$ and reduces F1 score.
- Removing the integrity loss lowers Acc-7.
- Removing the dual-depth reconstruction losses cuts F1 by $0.01$–$0.02$.
- Disabling the two-stage strategy degrades both MAE and classification accuracy.
Under extreme missingness (drop rate up to $0.9$), conventional fusion methods collapse to a single dominant class, whereas Senti-iFusion maintains fine-grained sentiment predictions, demonstrating resilience and reliability due to integrity-guided fusion.
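A minimal sketch of how such missingness can be simulated for evaluation, combining random frame dropping (intra-modality) with occasional whole-modality removal (inter-modality); drop probabilities and tensor layout are illustrative:

```python
import torch

def simulate_missingness(x: torch.Tensor, drop_rate: float = 0.9,
                         modality_drop: float = 0.1) -> torch.Tensor:
    """Zero out random frames of a modality and, with some probability,
    the entire sequence.  x: (batch, seq, dim)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > drop_rate).float().unsqueeze(-1)
    masked = x * keep                          # intra-modality missingness
    if torch.rand(1).item() < modality_drop:   # inter-modality missingness
        masked = torch.zeros_like(x)
    return masked

masked_visual = simulate_missingness(torch.randn(8, 50, 128), drop_rate=0.9)
```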
6. Significance, Context, and Implications
Integrity-guided Adaptive Fusion redefines multimodal fusion in incomplete and noisy data regimes through explicit evaluation of modality integrity and dynamic, attention-driven feature fusion. Its empirical advantage is supported by robust ablation results and state-of-the-art MAE, F1, and Acc-5 scores on prevalent benchmarks subjected to simulated missingness (Li et al., 21 Nov 2025). The approach enables models to continuously adapt to the best available information and recover from missing cues, which is crucial for real-world deployments in human-computer interaction, medical informatics, and any application area where multimodal signals are prone to degradation. A plausible implication is broader adoption of integrity-guided mechanisms in future architectures where the reliability of observed modalities cannot be guaranteed.