M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis

Published 2 Apr 2026 in cs.AI and cs.CV | (2604.01667v1)

Abstract: Multi-modal fusion is of great significance in neuroscience which integrates information from different modalities and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods in brain networks, which mainly focus on structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent difference between input samples. This lack of sample adaptation hinders model's further performance. To this end, we innovatively propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike other static fusion methods, we design different mixture-of-experts (MoEs) for uni- and multi-modal representations where modules can adaptively change as input sample changes during inference. To alleviate issue of MoE where training of experts may be collapsed, we divide our method into 3 stages. We first train uni-modal encoders respectively, then pretrain single experts of MoEs before finally finetuning the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work for dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrates the superiority of M3D-BFS.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper presents a dynamic multi-stage MoE fusion mechanism that adapts to sample heterogeneity in multi-modal brain network analysis.
It employs modality-specific pretraining and gating to integrate structural and functional connectivity for robust representation learning.
Empirical evaluations on gender classification and MDD diagnosis demonstrate improved accuracy and robustness across metrics.

Introduction

The paper "M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis" (2604.01667) addresses the limitations of static fusion in existing multi-modal brain network methods, which process all input samples with identical computational pathways, thereby disregarding sample heterogeneity. The study introduces M $^3$ D-BFS, a mixture-of-experts (MoE) based multi-stage framework designed to achieve sample-adaptive fusion by dynamically selecting modality-specific and fused representations at the node level for each sample. The approach is evaluated on gender classification and major depressive disorder (MDD) diagnosis tasks, yielding substantive improvements over both static and dynamic baselines.

Figure 1: The M $^3$ D-BFS pipeline adaptively alters expert selection according to each input sample’s characteristics.

Motivation and Problem Setting

Multi-modal fusion is essential for comprehensive brain network analysis, leveraging complementary information provided by structural connectivity (SC, typically derived from dMRI) and functional connectivity (FC, typically from fMRI). However, the inherent variability in regional brain connectivity and the heterogeneity across subjects are neglected in most static graph fusion methods. This limits the ability of such frameworks to model variations at the sample or node level, resulting in suboptimal task performance in diverse clinical and demographic contexts.

Dynamic fusion paradigms, well-established in NLP and CV, have not been previously adapted effectively to graph-structured brain networks. The proposed M $^3$ D-BFS bridges this gap by using gating mechanisms and multi-expert ensembles to enable sample- and region-adaptive fusion.

Methodological Framework

Architecture and Fusion Strategy

M $^3$ D-BFS operates in three phases (Figure 2):

Uni-modal Encoder Pretraining: Independent SC/FC graph convolutional encoders are pretrained for modality-specific representation learning, addressing modality laziness and facilitating stronger marginal encoders.
Single Expert Pretraining: Each expert MLP for SC, FC, and Fusion is pretrained in isolation, using cross-entropy, contrastive, and knowledge distillation losses for robust alignment and regularization.
MoE Finetuning: The core MoE blocks are fine-tuned with sample-adaptive gating, using the pretrained expert weights as initialization. Only the gates and experts are updated, with other network components frozen.
Figure 2: Overview of M $^3$ D-BFS: Three-stage training integrates unimodal and multimodal representations via dynamic MoE fusion blocks and disentanglement objectives.

During inference, the gating network routes representations for each node (brain region) to modality-specific or fusion experts on a per-sample basis, in a manner dependent on the input SC/FC graphs.

Losses and Regularization

Distillation loss ensures that the multimodal encoders preserve information from unimodal pathways.
Contrastive alignment (InfoNCE-based) maximally aligns paired SC and FC representations.
MoE balancing loss (combining importance and load penalties) prevents expert collapse, forcing diverse expert utilization across samples and regions.
Disentanglement loss encourages orthogonality among SC, FC, and fused embeddings, enhancing the informativeness of the final joint representations for downstream prediction.

The framework uses SC graphs constructed from dMRI tractography (AAL parcellation, fiber counts as edge weights, resulting in thresholded adjacency and node feature matrices) and FC graphs obtained from fMRI correlation matrices computed similarly.

Figure 3: Construction process for multi-modal brain networks, resulting in SC and FC graphs with region-wise features.

Empirical Evaluation

Experimental Design

Experiments use two datasets:

The Human Connectome Project (HCP) for gender classification
A hospital-sourced MDD dataset (zhongdaxinxiang)

Evaluation metrics include accuracy, sensitivity, specificity, F1, and AUC. Baselines comprise unimodal GCNs and transformers (BrainNPT), static fusion methods (SVM, Random Forest), representation-level and graph-level fusion (MMGNN, AL-NEGAT, Cross-GNN), and transformer-based fusion models (RH-BrainFS, NeuroPath).

Numerical Results

M $^3$ D-BFS yields the strongest numerical performance across all metrics on both datasets, e.g., 81.21% accuracy (HCP, +2.28% over Cross-GNN) and 80.85% accuracy (MDD, +2.73% over the next-best method). The improvement is consistent across sensitivity, specificity, and AUC, indicating robust generalization and clinical reliability.

Ablation and Sensitivity Analysis

Ablation demonstrates that each training stage and regularization component is critical for optimal accuracy and stability, with the multi-modal disentanglement loss further enhancing representation quality. Hyperparameter sweeps confirm architectural robustness, with little degradation for varying expert counts or regularization weights.

Figure 4: Sensitivity analysis indicating stable accuracy performance under hyperparameter perturbations and expert number variation.

Expert Utilization Visualization

Fusion block visualizations indicate that MoEs learn non-collapsed, balanced modality utilization without a dominant expert or degeneracy, in contrast to collapsed baselines. The per-node expert selection aligns with observed sample heterogeneity in brain networks.

Figure 5: Fusion MoE blocks show balanced expert selection between SC and FC modalities across layers and samples.

Discussion and Theoretical Implications

This study establishes the efficacy of adaptive MoE fusion for brain network analysis. By integrating multiple stages of pretraining and constraint-driven optimization, M $^3$ D-BFS mitigates classic issues in multi-modal learning such as modality laziness, expert collapse, and insufficient label signal for disentangled embedding learning. Practically, the sample-specific routing of brain region features can better accommodate inter-subject variability—an essential property for translational neuroscientific and psychiatric applications. Theoretically, this framework unifies recent advances in dynamic fusion, contrastive alignment, and balanced expert training for multi-modal graphs.

Future Directions

Potential research avenues include:

Extension to settings with more than two modalities or unequal graph structure across modalities.
Adapting to tasks involving distributional shifts or demographic confounding.
Exploration of dynamic fusion with more complex expert architectures and gating topologies.

Conclusion

M $^3$ D-BFS introduces a formal and scalable mechanism for dynamic, sample-adaptive fusion in multi-modal brain network analysis, overcoming the limitations of static fusion architectures. Its strong quantitative performance and ablation clarity underscore its utility as a foundation for next-generation clinical brain network modeling and neuroinformatics research.