Integrity-guided Adaptive Fusion
- The paper introduces a novel mechanism that dynamically selects the dominant modality for fusion using learned integrity scores and multi-scale reconstruction losses.
- Integrity-guided Adaptive Fusion is a framework that quantifies the completeness of input modalities and adaptively incorporates auxiliary data when primary signals are deficient.
- It leverages cross-modal attention and a two-stage training strategy to maintain robustness and fine-grained predictions even under extreme missing data conditions.
Integrity-guided Adaptive Fusion (IF) is a mechanism within multimodal learning pipelines that dynamically selects and fuses information from multiple modalities based on their estimated integrity and reconstructed quality. In contemporary Multimodal Sentiment Analysis (MSA), where input data may suffer from uncertain or severe modality missingness, integrity-guided fusion prioritizes the most complete and trustworthy modalities during inference and enables the model to adaptively exploit auxiliary signals when the dominant input is noisy or deficient. This approach has been operationalized in the Senti-iFusion framework, which targets both inter- and intra-modality missingness and achieves state-of-the-art results in fine-grained sentiment analysis tasks (Li et al., 21 Nov 2025).
1. Design Objectives and Core Principles
Integrity-guided Adaptive Fusion is constructed to address three central challenges in multimodal fusion under missing data: (i) selecting modalities for fusion on-the-fly, based on completeness and reconstruction quality; (ii) leveraging fallback strategies that incorporate auxiliary modalities when the dominant one is compromised; and (iii) fusing representations via cross-modal attention mechanisms rather than naive concatenation or averaging. The integrity of each modality is quantified as a learned integrity score, with higher scores signifying more complete and reliable input. Quality, while not explicitly represented as a scalar, is enforced during training via multi-scale reconstruction and mutual-information losses, which collectively compel the recovery of semantic content consistent with the original input. This design enables the fusion process to privilege the most reliable modality for each sample or batch, while adaptively attending to auxiliary modalities in proportion to their semantic utility.
2. Mathematical Formulation and Training Objectives
Let the three input modalities (language, acoustic, visual) be subjected to random inter- and intra-modality masking, yielding incomplete inputs. Feature extraction and integrity estimation proceed as follows:
- Per-modality embedding encoders map the masked inputs to incomplete embeddings.
- A 2-layer Transformer-based integrity estimator, applied to each incomplete embedding with a prepended integrity token, produces a per-modality integrity score; it is trained with a dedicated integrity-estimation loss (a sketch of such an estimator follows this list).
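The estimator's exact parameterization is not reproduced above; the following is a minimal PyTorch sketch, assuming a learnable integrity token is prepended to the incomplete embedding (as in the pseudocode of Section 4) and a sigmoid head maps the encoded token to a score in $[0,1]$. All module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class IntegrityEstimator(nn.Module):
    """Illustrative 2-layer Transformer that maps an incomplete embedding
    sequence to a scalar integrity score in [0, 1]."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.integrity_token = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, incomplete_embed: torch.Tensor) -> torch.Tensor:
        # incomplete_embed: (batch, seq_len, dim)
        tok = self.integrity_token.expand(incomplete_embed.size(0), -1, -1)
        h = self.encoder(torch.cat([tok, incomplete_embed], dim=1))
        return torch.sigmoid(self.head(h[:, 0]))  # (batch, 1) integrity scores

scores = IntegrityEstimator()(torch.randn(8, 50, 128))  # one score per sample
```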
Reconstruction quality is enforced using two sets of loss terms:
- Feature-level MSE and mutual-information (MI) losses compare the completed features to their original, unmasked counterparts.
- Semantic-level losses compare the disentangled shared features against re-encoded reconstructions (a sketch of the feature-level terms follows this list).
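The exact mutual-information estimator is not reproduced above; the sketch below pairs the feature-level MSE term with an InfoNCE-style contrastive loss, a common stand-in for an MI lower bound. Function and variable names are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def feature_level_losses(completed: torch.Tensor, original: torch.Tensor,
                         temperature: float = 0.1):
    """MSE plus an InfoNCE-style surrogate for the mutual-information term.
    completed, original: (batch, dim) pooled feature vectors."""
    mse = F.mse_loss(completed, original)

    # InfoNCE: each completed feature should identify its own original
    # among the other samples in the batch.
    z1 = F.normalize(completed, dim=-1)
    z2 = F.normalize(original, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (batch, batch)
    targets = torch.arange(z1.size(0), device=z1.device)
    mi_loss = F.cross_entropy(logits, targets)

    return mse, mi_loss
```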
Dominant-modality selection is batch-based: the modality with the highest integrity score averaged over the batch is chosen as dominant, and the remaining modalities serve as auxiliaries (see the sketch below).
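A minimal sketch of this batch-based rule, mirroring the dominant-modality selection in the pseudocode of Section 4 (dictionary layout and names are illustrative):

```python
import torch

def pick_dominant(integrity: dict[str, torch.Tensor]) -> tuple[str, list[str]]:
    """Return the modality with the highest batch-mean integrity score
    and the remaining modalities as auxiliaries."""
    dominant = max(integrity, key=lambda m: integrity[m].mean().item())
    auxiliaries = [m for m in integrity if m != dominant]
    return dominant, auxiliaries

scores = {"language": torch.rand(64), "acoustic": torch.rand(64), "visual": torch.rand(64)}
dominant, auxiliaries = pick_dominant(scores)
```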
3. Architecture and Data Flow
The architecture comprises several sequential modules:
| Stage | Function | Output |
|---|---|---|
| Input masking | Simulate real-world missingness | Masked multimodal features |
| Embedding | Encode masked modalities | Incomplete embeddings |
| Integrity estimation | Assess completeness of each modality | Integrity scores |
| Integrity-weighted completion | Disentangle and reconstruct features | Surrogate and recovered features |
| Adaptive fusion | Select dominant modality, cross-attend | Fused representation |
| Prediction | Final Transformer+linear head | Sentiment prediction |
The completion module builds surrogate features as an integrity-weighted blend: each incomplete embedding is scaled by its integrity score and combined with the complementary-weighted sum of the shared features from all modalities, mirroring the surrogate construction in the pseudocode of Section 4. It then proceeds by decoding, re-encoding, and enforcing dual-depth losses; a sketch of the construction follows.
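A minimal sketch of the surrogate construction, following the integrity-weighted blend in the pseudocode of Section 4 (variable names and tensor shapes are illustrative):

```python
import torch

def build_surrogate(integrity_score: torch.Tensor,
                    incomplete_embed: torch.Tensor,
                    shared_feats: list[torch.Tensor]) -> torch.Tensor:
    """Blend an incomplete embedding with the summed shared features of all
    modalities, weighted by the estimated integrity score.
    integrity_score: (batch, 1); incomplete_embed, shared_feats[i]: (batch, seq, dim)."""
    shared_sum = torch.stack(shared_feats, dim=0).sum(dim=0)
    w = integrity_score.unsqueeze(-1)  # (batch, 1, 1) for broadcasting
    return w * incomplete_embed + (1.0 - w) * shared_sum

surrogate = build_surrogate(torch.rand(8, 1),
                            torch.randn(8, 50, 128),
                            [torch.randn(8, 50, 128) for _ in range(3)])
```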
Adaptive fusion then processes the dominant modality through Transformer layers and fuses it with the auxiliary modalities using attention:
- Queries are projected from the processed dominant representation, while keys and values are projected from each auxiliary surrogate; attention weights are computed by scaled dot-product attention.
- The fused representation is updated recursively across fusion layers by accumulating the attention-weighted auxiliary values (a sketch of a single fusion layer follows this list).
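A minimal sketch of a single fusion layer, assuming single-head scaled dot-product attention with one projection per role (class name, dimensions, and layer structure are illustrative):

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion layer: the dominant representation queries each
    auxiliary surrogate and accumulates the attention-weighted values."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.dim = dim

    def forward(self, dominant: torch.Tensor, auxiliaries: list[torch.Tensor]) -> torch.Tensor:
        # dominant: (batch, seq, dim); each auxiliary surrogate: (batch, seq, dim)
        fused = dominant
        query = self.q(dominant)
        for aux in auxiliaries:
            key, value = self.k(aux), self.v(aux)
            attn = torch.softmax(query @ key.transpose(-2, -1) / math.sqrt(self.dim), dim=-1)
            fused = fused + attn @ value
        return fused

fused = CrossModalFusion()(torch.randn(4, 50, 128), [torch.randn(4, 50, 128) for _ in range(2)])
```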
Classification uses the fused features, prepended with [CLS]-style tokens, processed by a cross-modal Transformer and a linear prediction head.
4. Algorithmic Workflow and Pseudocode
The IF module follows a two-stage training procedure:
- Initial epochs (stage 1): only the integrity-estimation and completion modules are trained, with the final predictor frozen; the objective combines the integrity loss with the feature- and semantic-level completion losses.
- Final epochs (stage 2): all modules are updated end-to-end with the full objective, which adds the sentiment prediction loss to the stage-1 terms (a sketch of this schedule follows).
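A minimal sketch of this schedule, assuming the model exposes its predictor submodule and returns the individual loss terms in a dictionary (attribute and key names are illustrative):

```python
def training_step(model, batch, epoch: int, pretrain_epochs: int = 40):
    """Stage 1 trains only integrity estimation and completion (predictor frozen);
    stage 2 adds the prediction loss and updates all modules end-to-end."""
    stage1 = epoch <= pretrain_epochs
    for p in model.predictor.parameters():    # assumed submodule name
        p.requires_grad_(not stage1)

    out = model(batch)                        # assumed to return a dict of loss terms
    loss = out["integrity_loss"] + out["completion_loss"]
    if not stage1:
        loss = loss + out["prediction_loss"]
    return loss
```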
The operational pseudocode, paraphrased, follows the structure outlined below:
```
for modality in {language, acoustic, visual}:
    incomplete_embed = E_emb^modality(concat([E̲], masked_input[modality]))
    integrity_score  = E_ie^modality(concat([I̲], incomplete_embed))   # completeness
    shared, private  = E^s_modality(incomplete_embed), E^p_modality(incomplete_embed)

    # Surrogate construction, reconstruction, similarity/difference loss computation
    surrogate[modality] = integrity_score * incomplete_embed \
                          + (1 - integrity_score) * sum_of_shared_feats
    completed        = Decoder_modality(surrogate[modality])
    reencoded_shared = E^s_modality(completed)
    # Dual-depth losses: MSE & MI

if epoch <= pretrain_epochs:
    update integrity & completion modules only
else:
    # Adaptive fusion: dominant-modality selection, cross-attention
    dominant_modality = argmax over modalities of mean_integrity_per_modality
    processed_dom = process_through_transformers(surrogate[dominant_modality])
    fused = processed_dom
    for fusion_layer in fusion_layers:
        query = processed_dom @ Q_weight
        for aux_modality in auxiliaries:
            key   = surrogate[aux_modality] @ K_weight
            value = surrogate[aux_modality] @ V_weight
            attention = softmax(query @ key.T / sqrt(d_k))
            fused = fused + attention @ value   # attention-weighted values

    # Prediction: Transformer + linear head
    final_pred    = E_pred(concat([CLS_dom, processed_dom], [CLS_fuse, fused]))
    sentiment_out = Linear(final_pred)
    update all modules end-to-end
```
5. Hyperparameterization and Ablation Outcomes
Key operational hyperparameters are:
- Batch size of $64$, with a fixed input sequence length and hidden dimension.
- AdamW optimizer with weight decay, a cosine-annealed learning rate schedule with warm-up, and early stopping.
- Stage 1: $40$ epochs; stage 2: $110$ (MOSI) or $160$ (MOSEI) epochs.
- Separate loss weights balance the integrity, reconstruction, and prediction terms, along with additional decoder-specific coefficients (the optimizer setup is sketched after this list).
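A sketch of the reported optimizer setup (AdamW with weight decay, warm-up, and cosine annealing); the learning rate, weight decay, and step counts below are placeholder values rather than the paper's settings:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer(model, lr=1e-4, weight_decay=1e-2,
                    warmup_steps=500, total_steps=20_000):
    """AdamW with linear warm-up followed by cosine annealing (placeholder values)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup = LinearLR(opt, start_factor=0.1, total_iters=warmup_steps)
    cosine = CosineAnnealingLR(opt, T_max=total_steps - warmup_steps)
    sched = SequentialLR(opt, schedulers=[warmup, cosine], milestones=[warmup_steps])
    return opt, sched
```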
Ablation experiments at a fixed drop rate quantify the necessity of each module:
- Omitting integrity-weighted surrogates raises MAE (MOSI) from $1.1554$ to $1.1740$ and reduces F1 score.
- Removing the integrity loss lowers Acc-7.
- Removing the dual-depth reconstruction losses cuts F1 by $0.01$–$0.02$.
- Disabling the two-stage strategy degrades both MAE and classification accuracy.
Under extreme missingness (drop rate up to $0.9$), conventional fusion methods collapse to a single dominant class, whereas Senti-iFusion maintains fine-grained sentiment predictions, demonstrating resilience and reliability due to integrity-guided fusion.
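A minimal sketch of how such missingness can be simulated for evaluation, combining random frame dropping (intra-modality) with occasional whole-modality removal (inter-modality); drop probabilities and tensor layout are illustrative:

```python
import torch

def simulate_missingness(x: torch.Tensor, drop_rate: float = 0.9,
                         modality_drop: float = 0.1) -> torch.Tensor:
    """Zero out random frames of a modality and, with some probability,
    the entire sequence.  x: (batch, seq, dim)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > drop_rate).float().unsqueeze(-1)
    masked = x * keep                          # intra-modality missingness
    if torch.rand(1).item() < modality_drop:   # inter-modality missingness
        masked = torch.zeros_like(x)
    return masked

masked_visual = simulate_missingness(torch.randn(8, 50, 128), drop_rate=0.9)
```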
6. Significance, Context, and Implications
Integrity-guided Adaptive Fusion redefines multimodal fusion in incomplete and noisy data regimes through explicit evaluation of modality integrity and dynamic, attention-driven feature fusion. Its empirical advantage is supported by robust ablation results and state-of-the-art MAE, F1, and Acc-5 scores on prevalent benchmarks subjected to simulated missingness (Li et al., 21 Nov 2025). The approach enables models to continuously adapt to the best available information and recover from missing cues, which is crucial for real-world deployments in human-computer interaction, medical informatics, and any application area where multimodal signals are prone to degradation. A plausible implication is broader adoption of integrity-guided mechanisms in future architectures where the reliability of observed modalities cannot be guaranteed.