Dual-Pyramidal MFAM for Bipolar Disorder Diagnosis
- The paper demonstrates that the dual-pyramidal MFAM achieves up to an 11.4% improvement in balanced accuracy for bipolar disorder diagnosis by effectively fusing sMRI and fMRI features.
- The MFAM employs dedicated pyramid-based modules—P2FEM for sMRI and SFAM for fMRI—to extract hierarchical features that capture both anatomical details and spatio-temporal dynamics.
- The fusion strategy concatenates complementary embeddings without explicit gating, simplifying the model design while ensuring robust and reliable integration for classification.
The dual-pyramidal Multimodal Fusion Architecture (MFAM) is a neural network design tailored to the diagnosis of bipolar disorder using both structural MRI (sMRI) and functional MRI (fMRI) data. The framework, as detailed by Wang et al. (2024), incorporates two distinct pyramid-based feature extractors (one for sMRI, one for fMRI), followed by an explicit multimodal fusion layer and a classification head. The dual-pyramidal structure is motivated by the complementary nature of sMRI (capturing anatomical detail) and fMRI (capturing spatio-temporal dynamics), with fusion shown to yield state-of-the-art diagnostic performance compared to unimodal and non-hierarchical fusion baselines (Wang et al., 2024).
1. Dual-Pyramid Feature Extraction
Patch Pyramid Feature Extraction Module (P2FEM) for sMRI
sMRI data, represented as T₁-weighted 3D structural volumes, are processed by L = 4 successive 3D convolutional layers. Each layer employs:
- A relatively large convolution kernel,
- A stride greater than 1 for down-sampling,
- Group-wise convolutions for parameter efficiency,
- An increasing number of output channels.
At layer ℓ, X_ℓ = ReLU(BN(Conv3D_ℓ(X_{ℓ−1}))), i.e., each convolution is followed by batch normalization and ReLU. The spatial extent shrinks per layer according to the standard convolution formula D_ℓ = ⌊(D_{ℓ−1} + 2p − k) / s⌋ + 1 (and likewise for the height and width dimensions).
The flattened per-layer outputs are concatenated into a single sMRI feature vector: f_sMRI = [flatten(X₁); …; flatten(X_L)].
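As a concrete sketch of the P2FEM down-sampling path, the snippet below tracks one spatial dimension through four strided 3D convolutions using the standard output-size formula. The input size, kernel size, stride, and padding are illustrative assumptions, not values from the paper.

```python
def conv_out(dim, k, s, p):
    """Standard convolution output-size formula: floor((dim + 2p - k) / s) + 1."""
    return (dim + 2 * p - k) // s + 1

def p2fem_dims(d0=96, layers=4, k=5, s=2, p=2):
    """Track one spatial dimension through L successive strided 3D convs.

    All numeric values here are assumptions chosen for illustration.
    """
    dims = [d0]
    for _ in range(layers):
        dims.append(conv_out(dims[-1], k, s, p))
    return dims

print(p2fem_dims())  # [96, 48, 24, 12, 6] -- each layer halves the spatial extent
```

With stride 2 and "same"-style padding, each layer roughly halves every spatial dimension, which is what gives the module its pyramid shape.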
Spatio-Temporal Pyramid Feature Extraction Module (SFAM) for fMRI
The fMRI input is treated as a matrix of parcellated regional time series (T time frames across R regions). SFAM constructs a multi-level temporal pyramid. At each pyramid level s:
- Segment the series into overlapping windows of size w_s with stride t_s.
- Encode each segment with a two-layer MLP with batch normalization and ReLU.
- Apply a 1D convolution along the temporal axis to the encoded segments.
- Aggregate (mean-pool or max-pool) the segment outputs into a level-s feature vector h_s.
- Form the final fMRI feature by concatenating the pyramid levels: f_fMRI = [h₁; …; h_S].
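The windowing-and-pooling steps above can be sketched as follows. This is a minimal stand-in, not the paper's implementation: window sizes, strides, the embedding width, and the single random-projection layer (in place of the two-layer MLP and 1D convolution) are all assumptions for illustration.

```python
import numpy as np

def sfam_features(x, windows=((32, 16), (64, 32)), embed_dim=8, seed=0):
    """x: (T, R) regional time series. Returns concatenated pyramid features.

    Window sizes/strides and embed_dim are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    T, R = x.shape
    levels = []
    for w, stride in windows:
        # Segment into overlapping windows of size w with the given stride.
        segs = np.stack([x[t:t + w] for t in range(0, T - w + 1, stride)])
        flat = segs.reshape(len(segs), -1)          # flatten each (w, R) segment
        W = rng.standard_normal((flat.shape[1], embed_dim)) * 0.01
        h = np.maximum(flat @ W, 0.0)               # linear encoder + ReLU (stand-in)
        levels.append(h.mean(axis=0))               # mean-pool over segments
    return np.concatenate(levels)                   # concatenate pyramid levels

feat = sfam_features(np.random.default_rng(1).standard_normal((128, 10)))
print(feat.shape)  # (16,) = embed_dim * number of pyramid levels
```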
2. Multimodal Fusion
The sMRI vector f_sMRI and fMRI vector f_fMRI are each projected to a common embedding dimension via separate fully-connected bottleneck layers with batch normalization and ReLU: z_s = ReLU(BN(W_s f_sMRI)) and z_f = ReLU(BN(W_f f_fMRI)).
These are then concatenated, with no explicit attention or gating: z = [z_s; z_f].
This combined embedding serves as input to the downstream classifier.
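A minimal sketch of this fusion step, assuming an embedding width of 64 per modality and substituting a per-vector standardization for batch normalization (which operates across a batch):

```python
import numpy as np

def fuse(f_smri, f_fmri, d_embed=64, seed=0):
    """Project each modality vector to d_embed, then concatenate (no gating).

    d_embed and the random projection weights are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    def project(v):
        W = rng.standard_normal((v.size, d_embed)) * 0.01
        z = v @ W
        z = (z - z.mean()) / (z.std() + 1e-5)   # standardization as a BN stand-in
        return np.maximum(z, 0.0)               # ReLU

    return np.concatenate([project(f_smri), project(f_fmri)])

z = fuse(np.ones(512), np.ones(128))
print(z.shape)  # (128,) = 2 * d_embed
```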
3. Classification Strategies
Two classification heads are evaluated:
- Ours-Dense: One hidden layer MLP (with ReLU, dropout, batch norm), followed by a 2-way softmax.
- Ours-Linear: A direct linear softmax classifier, ŷ = softmax(W z + b).
Optimization employs the standard cross-entropy loss with L2 regularization: L = CE(ŷ, y) + λ‖θ‖₂².
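A hedged numpy sketch of the Ours-Linear head and its training objective (linear logits, softmax, cross-entropy plus an L2 penalty); the λ value is an assumption:

```python
import numpy as np

def linear_head_loss(W, b, z, y, lam=1e-4):
    """z: (n, d) fused embeddings, y: (n,) integer labels in {0, 1}.

    Returns cross-entropy + lam * ||W||^2 (lam is an assumed value).
    """
    logits = z @ W + b                               # (n, 2) linear head
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    ce = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return ce + lam * (W ** 2).sum()

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
loss = linear_head_loss(np.zeros((8, 2)), np.zeros(2), z, np.array([0, 1, 0, 1]))
print(round(float(loss), 4))  # 0.6931: with zero weights, CE = ln(2)
```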
4. Data Processing and Experimental Protocols
Preprocessing follows established neuroimaging conventions:
- sMRI: Skull-stripping, affine normalization to MNI space, intensity standardization, and resizing to a fixed volume size.
- fMRI: Slice-timing correction, motion realignment, nuisance regression, regional parcellation, and per-series standardization.
Experiments are conducted with five-fold cross-validation on:
- the Beijing Huilongguan clinical cohort,
- the public OpenfMRI dataset.
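The five-fold protocol can be sketched as below; the subject count of 100 is a placeholder, since the cohort sizes are not reproduced here.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Shuffle n subject indices and split them into 5 disjoint validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

folds = five_fold_indices(100)  # 100 subjects is a placeholder value
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

Each fold serves once as the validation set while the remaining four are used for training.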
Hyperparameters are tuned by grid search:
- Optimizer: Adam,
- Learning rate: halved on plateau,
- Batch size: 8,
- Weight decay: selected by grid search,
- Dropout (Dense head): 0.5,
- Epochs: up to 100 with early stopping on validation loss.
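The early-stopping rule above can be sketched as a simple loop over a validation-loss trace; the patience of 10 epochs is an assumption, as the paper's value is not given here.

```python
def early_stop_epoch(val_losses, patience=10, max_epochs=100):
    """Return the epoch at which training stops, given a validation-loss trace.

    Stops when the validation loss has not improved for `patience` epochs
    (patience=10 is an assumed value) or when max_epochs is reached.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs: stop here
    return min(len(val_losses), max_epochs) - 1

# Loss improves for 3 epochs, then plateaus: training halts well before epoch 100.
print(early_stop_epoch([1.0, 0.8, 0.7] + [0.75] * 50))  # 12
```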
5. Quantitative Results and Ablation Findings
Results from Wang et al. (2024) show that the dual-pyramidal MFAM achieves state-of-the-art balanced accuracy (BACC):
| Dataset | Baseline (Late Fusion) BACC | Ours-Linear BACC | Relative Improvement |
|---|---|---|---|
| OpenfMRI (Public) | 0.657 | 0.732 | +11.4% |
| Clinical Cohort | 0.686 | 0.766 | +11.7% |
Ablation with the linear head on OpenfMRI:
- sMRI only: BACC=0.575
- fMRI only: BACC=0.658
- sMRI+fMRI (full MFAM): BACC=0.739
These results support the claim that each feature branch captures unique, complementary information and that simple embedding concatenation suffices for effective multimodal integration.
6. Architectural Significance and Future Implications
The dual-pyramidal MFAM exemplifies an explicit fusion architecture balancing deep hierarchical representation and dimensionality control for neuroimaging applications. These results underscore the value of structured, multiscale feature extraction for both structural and functional imaging modalities. A plausible implication is that further advances may arise from more sophisticated fusion operators or domain-optimized pyramid designs. No explicit attention or gating is required to achieve strong results in this framework, supporting the efficacy of concatenated embeddings in certain multimodal medical contexts (Wang et al., 2024).