Integrity-guided Adaptive Fusion

Updated 28 November 2025
  • The paper introduces a novel mechanism that dynamically selects the dominant modality for fusion using learned integrity scores and multi-scale reconstruction losses.
  • Integrity-guided Adaptive Fusion is a framework that quantifies the completeness of input modalities and adaptively incorporates auxiliary data when primary signals are deficient.
  • It leverages cross-modal attention and a two-stage training strategy to maintain robustness and fine-grained predictions even under extreme missing data conditions.

Integrity-guided Adaptive Fusion (IF) is a mechanism within multimodal learning pipelines that dynamically selects and fuses information from multiple modalities based on their estimated integrity and reconstruction quality. In contemporary Multimodal Sentiment Analysis (MSA), where input data may suffer from uncertain or severe modality missingness, integrity-guided fusion prioritizes the most complete and trustworthy modalities during inference and enables the model to adaptively exploit auxiliary signals when the dominant input is noisy or deficient. This approach has been operationalized in the Senti-iFusion framework, which targets both inter- and intra-modality missingness and achieves state-of-the-art results in fine-grained sentiment analysis tasks (Li et al., 21 Nov 2025).

1. Design Objectives and Core Principles

Integrity-guided Adaptive Fusion is constructed to address three central challenges in multimodal fusion under missing data: (i) selecting modalities for fusion on-the-fly, based on completeness and reconstruction quality; (ii) leveraging fallback strategies that incorporate auxiliary modalities when the dominant one is compromised; and (iii) fusing representations via cross-modal attention mechanisms rather than naive concatenation or averaging. Integrity of each modality is quantified as a learned integrity score $\tilde{I}_m \in [0,1]$, with higher scores signifying more complete and reliable input. Quality, while not explicitly represented as a scalar, is enforced during training via multi-scale reconstruction and mutual-information losses, which collectively compel the recovery of semantic content consistent with the original input. This design enables the fusion process to privilege the most reliable modality for each sample or batch, while adaptively attending to auxiliary modalities in proportion to their semantic utility.

2. Mathematical Formulation and Training Objectives

Let $U = \{U_l, U_a, U_v\}$ denote the input modalities (language, acoustic, visual), subjected to random inter- and intra-modality masking to yield $\tilde{U}$. Feature extraction and integrity estimation proceed as follows:

  • Embedding encoders $\mathcal{E}_{\text{emb}}^m$ map masked inputs $\tilde{U}_m$ to incomplete embeddings $\tilde{u}_m$.
  • A 2-layer Transformer-based integrity estimator produces

$$\tilde{I}_m = \mathcal{E}_{\text{ie}}^m\big(\operatorname{Concat}(X_{\text{ie}}, \tilde{u}_m)\big)$$

trained via the loss

$$L_{\text{ie}} = \frac{1}{N} \sum_{k=1}^{N} \left\| \tilde{I}_m^k - I_m^k \right\|_2^2$$

where $I_m = 1 - (\text{fraction of masked tokens})$.
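
To make the estimator concrete, the following is a minimal sketch assuming PyTorch; the learnable token standing in for $X_{\text{ie}}$, the head sizes, and the sigmoid readout are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class IntegrityEstimator(nn.Module):
    """2-layer Transformer encoder that regresses a per-modality integrity score in [0, 1]."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Learnable token playing the role of X_ie, prepended to the incomplete embedding.
        self.ie_token = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, u_tilde: torch.Tensor) -> torch.Tensor:
        # u_tilde: (B, T, d_model) incomplete embedding of one modality.
        b = u_tilde.size(0)
        x = torch.cat([self.ie_token.expand(b, -1, -1), u_tilde], dim=1)
        return self.head(self.encoder(x)[:, 0]).squeeze(-1)  # (B,) integrity scores

def integrity_loss(pred: torch.Tensor, mask_fraction: torch.Tensor) -> torch.Tensor:
    # L_ie: MSE against the ground-truth integrity I_m = 1 - fraction of masked tokens.
    return torch.mean((pred - (1.0 - mask_fraction)) ** 2)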

Reconstruction quality is enforced using two sets of loss terms:

  • Feature-level MSE ($L_{\text{mse}}^g$) and MI ($L_{\text{mi}}^g$) losses compare completed features $\tilde{u}_m$ to their originals $u_m$.
  • Semantic-level losses ($L_{\text{mse}}^s$, $L_{\text{mi}}^s$) compare disentangled shared features $\hat{h}_m^s$ against re-encoded reconstructions.
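
A compact sketch of how these dual-depth terms could be combined is shown below, assuming PyTorch; the InfoNCE-style estimator is a stand-in for the paper's mutual-information losses, and the weight names mirror the $\lambda$ values listed in Section 5.

import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Contrastive surrogate for a mutual-information term between paired features.
    a = F.normalize(a.flatten(1), dim=-1)
    b = F.normalize(b.flatten(1), dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)

def reconstruction_loss(u_completed, u_orig, h_shared, h_shared_reenc, w):
    # Feature-level terms compare completed features to the originals;
    # semantic-level terms compare disentangled shared features to their re-encodings.
    return (w["mse_g"] * F.mse_loss(u_completed, u_orig)
            + w["mi_g"] * info_nce(u_completed, u_orig)
            + w["mse_s"] * F.mse_loss(h_shared_reenc, h_shared)
            + w["mi_s"] * info_nce(h_shared_reenc, h_shared))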

Dominant-modality selection is batch-based:

$$\mu_m = \frac{1}{B} \sum_{i=1}^{B} \tilde{I}_m^{(i)}$$

and $\mathrm{dom} = \arg\max_{m \in \{l, a, v\}} \mu_m$, with the other two modalities serving as auxiliaries $a_1, a_2$.
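
The selection rule itself is straightforward; a minimal sketch, assuming PyTorch tensors of per-sample scores keyed by modality name, is:

import torch

def select_dominant(integrity: dict) -> tuple:
    # integrity maps modality name ('l', 'a', 'v') to per-sample scores of shape (B,).
    mu = {m: scores.mean().item() for m, scores in integrity.items()}  # batch means μ_m
    dom = max(mu, key=mu.get)                                          # argmax_m μ_m
    aux = [m for m in mu if m != dom]                                  # auxiliaries a_1, a_2
    return dom, aux

# Example: language dominates this batch.
# select_dominant({"l": torch.tensor([0.9, 0.8]),
#                  "a": torch.tensor([0.4, 0.5]),
#                  "v": torch.tensor([0.6, 0.3])})   # -> ("l", ["a", "v"])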

3. Architecture and Data Flow

The architecture comprises several sequential modules:

| Stage | Function | Output |
| --- | --- | --- |
| Input masking | Simulate real-world missingness | Masked multimodal features $\tilde{U}$ |
| Embedding | Encode masked modalities | Incomplete embeddings $\tilde{u}_m$ |
| Integrity estimation | Assess completeness of each modality | Integrity scores $\tilde{I}_m$ |
| Integrity-weighted completion | Disentangle and reconstruct features | Surrogate and recovered features |
| Adaptive fusion | Select dominant modality, cross-attend | Fused representation $h_{\text{fuse}}^3$ |
| Prediction | Final Transformer + linear head | Sentiment prediction $\hat{y}$ |

The completion module builds surrogate features:

$$\hat{h}_m^{\text{sur}} = \tilde{I}_m \cdot \tilde{u}_m + (1 - \tilde{I}_m) \cdot \big(\hat{h}_{m_1}^s + \hat{h}_{m_2}^s\big)$$

and proceeds by decoding, re-encoding, and enforcing dual-depth losses.
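
A sketch of the integrity-weighted surrogate, assuming PyTorch broadcasting over time and feature dimensions (the decoder and re-encoder are treated as given modules):

import torch

def build_surrogate(integrity_m: torch.Tensor,    # (B,) integrity score Ĩ_m
                    u_tilde_m: torch.Tensor,      # (B, T, d) incomplete embedding
                    h_shared_m1: torch.Tensor,    # (B, T, d) shared features of the
                    h_shared_m2: torch.Tensor):   # two other modalities
    w = integrity_m.view(-1, 1, 1)                # broadcast Ĩ_m over time and features
    return w * u_tilde_m + (1.0 - w) * (h_shared_m1 + h_shared_m2)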

Adaptive fusion then processes the dominant modality through Transformer layers and fuses it with the auxiliary modalities using attention:

  • Queries $Q_{\text{dom}}$ come from the dominant modality; keys and values $K_m, V_m$ come from the auxiliaries, with attention weights $\gamma_m = \operatorname{softmax}\big(Q_{\text{dom}} K_m^\top / \sqrt{d_k}\big)$.
  • Fused representations updated recursively:

$$h_{\text{fuse}}^j = h_{\text{fuse}}^{j-1} + \sum_{m \in \{a_1, a_2\}} \gamma_m V_m$$
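
A minimal single-head sketch of this fusion recursion, assuming PyTorch; the per-layer projection matrices and the layer count are assumptions not fixed by the description above.

import math
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 3):
        super().__init__()
        self.q = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.k = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.v = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])

    def forward(self, h_dom: torch.Tensor, h_aux: list) -> torch.Tensor:
        # h_dom: (B, T, d) processed dominant modality; h_aux: auxiliary feature tensors.
        h_fuse = h_dom
        for q_proj, k_proj, v_proj in zip(self.q, self.k, self.v):
            q = q_proj(h_dom)                             # queries from the dominant modality
            for h_m in h_aux:                             # keys/values from each auxiliary
                k, v = k_proj(h_m), v_proj(h_m)
                gamma = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
                h_fuse = h_fuse + gamma @ v               # h_fuse^j = h_fuse^{j-1} + Σ_m γ_m V_m
        return h_fuse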

Classification uses the fused features, prepended with [CLS]-style tokens, processed by a cross-modal Transformer and a linear prediction head.
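
For concreteness, the prediction stage could look like the sketch below, assuming PyTorch; the two learnable [CLS]-style tokens, the regression-style output, and the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.cls_dom = nn.Parameter(torch.randn(1, 1, d_model))
        self.cls_fuse = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(2 * d_model, 1)              # scalar sentiment prediction ŷ

    def forward(self, h_dom: torch.Tensor, h_fuse: torch.Tensor) -> torch.Tensor:
        b, t = h_dom.size(0), h_dom.size(1)
        x = torch.cat([self.cls_dom.expand(b, -1, -1), h_dom,
                       self.cls_fuse.expand(b, -1, -1), h_fuse], dim=1)
        h = self.encoder(x)
        cls = torch.cat([h[:, 0], h[:, t + 1]], dim=-1)   # read out the two [CLS] positions
        return self.out(cls).squeeze(-1)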

4. Algorithmic Workflow and Pseudocode

The IF module follows a two-stage training procedure:

  1. Initial epochs (stage 1): only integrity estimation and completion modules are trained, freezing the final predictor. The objective is

$$L_{\text{stage1}} = \alpha \cdot L_{\text{ie}} + \beta \cdot L_{\text{rec}}$$

where $L_{\text{rec}} = L_{\text{rec}}^{\text{enc}} + L_{\text{rec}}^{\text{dec}}$.

  2. Final epochs (stage 2): all modules are updated with

$$L_{\text{stage2}} = \alpha \cdot L_{\text{ie}} + \beta \cdot L_{\text{rec}} + \sigma \cdot L_{\text{pred}}$$
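
Putting the two stages together, a small sketch of the objective switch (the weights follow the values reported in Section 5; the function name is illustrative):

def training_objective(l_ie, l_rec_enc, l_rec_dec, l_pred, stage,
                       alpha=0.9, beta=0.4, sigma=1.0):
    l_rec = l_rec_enc + l_rec_dec
    if stage == 1:
        # Stage 1: integrity estimation and completion only; the predictor is frozen.
        return alpha * l_ie + beta * l_rec
    # Stage 2: all modules trained end-to-end with the prediction loss added.
    return alpha * l_ie + beta * l_rec + sigma * l_pred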

The operational pseudocode, paraphrased, follows the structure outlined below:

for modality in {language, acoustic, visual}:
    incomplete_embed = E_emb[modality](concat([E_token], masked_input[modality]))
    integrity_score  = E_ie[modality](concat([I_token], incomplete_embed))   # completeness in [0, 1]
    shared, private  = E_s[modality](incomplete_embed), E_p[modality](incomplete_embed)
    # Surrogate construction, reconstruction, similarity/difference loss computation
    surrogate[modality] = integrity_score * incomplete_embed \
                          + (1 - integrity_score) * sum_of_other_modalities_shared_feats
    completed = Decoder[modality](surrogate[modality])
    reencoded_shared = E_s[modality](completed)
    # Dual-depth losses: feature-level and semantic-level MSE & MI

if epoch <= pretrain_epochs:
    # Stage 1: update integrity estimation & completion modules only
    update_integrity_and_completion_modules()
else:
    # Stage 2: adaptive fusion -- dominant-modality selection, cross-attention
    dominant = argmax_over_modalities(mean_integrity_per_modality)
    processed_dom = process_through_transformers(surrogate[dominant])
    fused = processed_dom
    for fusion_layer in range(num_fusion_layers):
        query = processed_dom @ Q_weight
        for aux in auxiliary_modalities:
            key   = surrogate[aux] @ K_weight
            value = surrogate[aux] @ V_weight
            attention = softmax(query @ key.T / sqrt(d_k))
            fused = fused + attention @ value
    # Prediction: cross-modal Transformer + linear head
    final_repr = E_pred(concat([CLS_dom], processed_dom, [CLS_fuse], fused))
    sentiment_out = Linear(final_repr)
    update_all_modules_end_to_end()

5. Hyperparameterization and Ablation Outcomes

Key operational hyperparameters are:

  • Input length $T=8$, hidden dimension $d=128$, batch size 64.
  • AdamW optimizer with initial learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-4}$, cosine annealing, warm-up, and early stopping.
  • Stage 1: 40 epochs; stage 2: 110 (MOSI) or 160 (MOSEI) epochs.
  • Loss weights: $\alpha=0.9$, $\beta=0.4$, $\sigma=1.0$; decoder-specific: $\lambda_{\text{mse}}^g=0.5$, $\lambda_{\text{mi}}^g=0.4$, $\lambda_{\text{mse}}^s=0.3$, $\lambda_{\text{mi}}^s=0.2$.
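
For reference, these settings can be collected in a single configuration record; the field names below are illustrative, while the values are the ones listed above.

CONFIG = {
    "seq_len": 8, "hidden_dim": 128, "batch_size": 64,
    "optimizer": "AdamW", "lr": 1e-4, "weight_decay": 1e-4,
    "schedule": "cosine annealing with warm-up and early stopping",
    "stage1_epochs": 40, "stage2_epochs": {"MOSI": 110, "MOSEI": 160},
    "loss_weights": {"alpha": 0.9, "beta": 0.4, "sigma": 1.0},
    "decoder_loss_weights": {"mse_g": 0.5, "mi_g": 0.4, "mse_s": 0.3, "mi_s": 0.2},
}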

Ablation experiments under a 50% drop rate quantify the necessity of each module:

  • Omitting integrity-weighted surrogates raises MAE on MOSI from 1.1554 to 1.1740 and reduces the F1 score.
  • Removing the integrity loss $L_{\text{ie}}$ drops Acc-7 by roughly 3%.
  • Removing the dual-depth reconstruction loss $L_{\text{rec}}^{\text{dec}}$ cuts F1 by 0.01–0.02.
  • Disabling the two-stage strategy degrades both MAE and classification accuracy.

Under extreme missingness (drop rate up to 0.9), conventional fusion methods collapse to a single dominant class, whereas Senti-iFusion maintains fine-grained sentiment predictions, demonstrating the resilience and reliability conferred by integrity-guided fusion.

6. Significance, Context, and Implications

Integrity-guided Adaptive Fusion redefines multimodal fusion in incomplete and noisy data regimes by explicit evaluation of modality integrity and dynamic attention-driven feature fusion. Its empirical advantage is bolstered by robust ablation results: state-of-the-art performance in MAE, F1, and Acc-5 metrics when tested on prevalent benchmarks subject to simulated missingness (Li et al., 21 Nov 2025). The approach enables models to continuously adapt to the best available information and recover from missing cues, which is crucial for real-world deployments in human-computer interaction, medical informatics, and any application area where multimodal signals are prone to degradation. A plausible implication is broader adoption of integrity-guided mechanisms in future architectures where reliability of observed modalities cannot be guaranteed.
