Dual-Pyramidal MFAM for Bipolar Disorder Diagnosis
- The paper demonstrates that the dual-pyramidal MFAM achieves up to an 11.4% improvement in balanced accuracy for bipolar disorder diagnosis by effectively fusing sMRI and fMRI features.
- The MFAM employs dedicated pyramid-based modules—P2FEM for sMRI and SFAM for fMRI—to extract hierarchical features that capture both anatomical details and spatio-temporal dynamics.
- The fusion strategy concatenates complementary embeddings without explicit gating, simplifying the model design while ensuring robust and reliable integration for classification.
The dual-pyramidal Multimodal Fusion Architecture (MFAM) is a neural network design tailored to the diagnosis of bipolar disorder using both structural MRI (sMRI) and functional MRI (fMRI) data. The framework, as detailed by Wang et al. (2024), incorporates two distinct pyramid-based feature extractors (one for sMRI, one for fMRI), followed by an explicit multimodal fusion layer and a classification head. The dual-pyramidal structure is motivated by the complementary nature of sMRI (capturing anatomical detail) and fMRI (capturing spatio-temporal dynamics), with fusion shown to yield state-of-the-art diagnostic performance compared to unimodal and non-hierarchical fusion baselines (Wang et al., 2024).
1. Dual-Pyramid Feature Extraction
Patch Pyramid Feature Extraction Module (P2FEM) for sMRI
sMRI data, represented as T₁-weighted 3D structural volumes, are processed by L = 4 successive 3D convolutional layers. Each layer employs:
- A relatively large convolution kernel,
- A stride greater than 1 for down-sampling,
- Group-wise convolutions for parameter efficiency,
- An increasing number of output channels.
At layer ℓ, X_ℓ = ReLU(BN(Conv3D_ℓ(X_{ℓ−1}))), i.e., each convolution is followed by batch normalization and ReLU. The spatial extent shrinks per layer according to the standard convolution formula D_ℓ = ⌊(D_{ℓ−1} + 2p − k) / s⌋ + 1 (and likewise for the height and width dimensions).
The flattened per-layer outputs are concatenated into a single sMRI feature vector: f_sMRI = [flatten(X₁); …; flatten(X_L)].
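As a concrete sketch of the P2FEM down-sampling path, the snippet below tracks one spatial dimension through four strided 3D convolutions using the standard output-size formula. The input size, kernel size, stride, and padding are illustrative assumptions, not values from the paper.

```python
def conv_out(dim, k, s, p):
    """Standard convolution output-size formula: floor((dim + 2p - k) / s) + 1."""
    return (dim + 2 * p - k) // s + 1

def p2fem_dims(d0=96, layers=4, k=5, s=2, p=2):
    """Track one spatial dimension through L successive strided 3D convs.

    All numeric values here are assumptions chosen for illustration.
    """
    dims = [d0]
    for _ in range(layers):
        dims.append(conv_out(dims[-1], k, s, p))
    return dims

print(p2fem_dims())  # [96, 48, 24, 12, 6] -- each layer halves the spatial extent
```

With stride 2 and "same"-style padding, each layer roughly halves every spatial dimension, which is what gives the module its pyramid shape.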
Spatio-Temporal Pyramid Feature Extraction Module (SFAM) for fMRI
The fMRI input is treated as a matrix of parcellated regional time series (T time frames across R regions). SFAM constructs a multi-level temporal pyramid. At each pyramid level s:
- Segment the series into overlapping windows of size w_s with stride t_s.
- Encode each segment with a two-layer MLP with batch normalization and ReLU.
- Apply a 1D convolution along the temporal axis to the encoded segments.
- Aggregate (mean-pool or max-pool) the segment outputs into a level-s feature vector h_s.
- Form the final fMRI feature by concatenating the pyramid levels: f_fMRI = [h₁; …; h_S].
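The windowing-and-pooling steps above can be sketched as follows. This is a minimal stand-in, not the paper's implementation: window sizes, strides, the embedding width, and the single random-projection layer (in place of the two-layer MLP and 1D convolution) are all assumptions for illustration.

```python
import numpy as np

def sfam_features(x, windows=((32, 16), (64, 32)), embed_dim=8, seed=0):
    """x: (T, R) regional time series. Returns concatenated pyramid features.

    Window sizes/strides and embed_dim are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    T, R = x.shape
    levels = []
    for w, stride in windows:
        # Segment into overlapping windows of size w with the given stride.
        segs = np.stack([x[t:t + w] for t in range(0, T - w + 1, stride)])
        flat = segs.reshape(len(segs), -1)          # flatten each (w, R) segment
        W = rng.standard_normal((flat.shape[1], embed_dim)) * 0.01
        h = np.maximum(flat @ W, 0.0)               # linear encoder + ReLU (stand-in)
        levels.append(h.mean(axis=0))               # mean-pool over segments
    return np.concatenate(levels)                   # concatenate pyramid levels

feat = sfam_features(np.random.default_rng(1).standard_normal((128, 10)))
print(feat.shape)  # (16,) = embed_dim * number of pyramid levels
```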
2. Multimodal Fusion
The sMRI vector f_sMRI and fMRI vector f_fMRI are each projected to a common embedding dimension via separate fully-connected bottleneck layers with batch normalization and ReLU: z_s = ReLU(BN(W_s f_sMRI)) and z_f = ReLU(BN(W_f f_fMRI)).
These are then concatenated, with no explicit attention or gating: z = [z_s; z_f].
This combined embedding serves as input to the downstream classifier.
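A minimal sketch of this fusion step, assuming an embedding width of 64 per modality and substituting a per-vector standardization for batch normalization (which operates across a batch):

```python
import numpy as np

def fuse(f_smri, f_fmri, d_embed=64, seed=0):
    """Project each modality vector to d_embed, then concatenate (no gating).

    d_embed and the random projection weights are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    def project(v):
        W = rng.standard_normal((v.size, d_embed)) * 0.01
        z = v @ W
        z = (z - z.mean()) / (z.std() + 1e-5)   # standardization as a BN stand-in
        return np.maximum(z, 0.0)               # ReLU

    return np.concatenate([project(f_smri), project(f_fmri)])

z = fuse(np.ones(512), np.ones(128))
print(z.shape)  # (128,) = 2 * d_embed
```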
3. Classification Strategies
Two classification heads are evaluated:
- Ours-Dense: One hidden layer MLP (with ReLU, dropout, batch norm), followed by a 2-way softmax.
- Ours-Linear: A direct linear softmax classifier, ŷ = softmax(W z + b).
Optimization employs the standard cross-entropy loss with L2 regularization: L = CE(ŷ, y) + λ‖θ‖₂².
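A hedged numpy sketch of the Ours-Linear head and its training objective (linear logits, softmax, cross-entropy plus an L2 penalty); the λ value is an assumption:

```python
import numpy as np

def linear_head_loss(W, b, z, y, lam=1e-4):
    """z: (n, d) fused embeddings, y: (n,) integer labels in {0, 1}.

    Returns cross-entropy + lam * ||W||^2 (lam is an assumed value).
    """
    logits = z @ W + b                               # (n, 2) linear head
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    ce = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return ce + lam * (W ** 2).sum()

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
loss = linear_head_loss(np.zeros((8, 2)), np.zeros(2), z, np.array([0, 1, 0, 1]))
print(round(float(loss), 4))  # 0.6931: with zero weights, CE = ln(2)
```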
4. Data Processing and Experimental Protocols
Preprocessing follows established neuroimaging conventions:
- sMRI: Skull-stripping, affine normalization to MNI space, intensity standardization, and resizing to a fixed volume size.
- fMRI: Slice-timing correction, motion realignment, nuisance regression, regional parcellation, and per-series standardization.
Experiments are conducted with five-fold cross-validation on:
- the Beijing Huilongguan clinical cohort,
- the public OpenfMRI dataset.
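The five-fold protocol can be sketched as below; the subject count of 100 is a placeholder, since the cohort sizes are not reproduced here.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Shuffle n subject indices and split them into 5 disjoint validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

folds = five_fold_indices(100)  # 100 subjects is a placeholder value
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

Each fold serves once as the validation set while the remaining four are used for training.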
Hyperparameters are tuned by grid search:
- Optimizer: Adam,
- Learning rate: halved on plateau,
- Batch size: 8,
- Weight decay: selected by grid search,
- Dropout (Dense head): 0.5,
- Epochs: up to 100 with early stopping on validation loss.
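The early-stopping rule above can be sketched as a simple loop over a validation-loss trace; the patience of 10 epochs is an assumption, as the paper's value is not given here.

```python
def early_stop_epoch(val_losses, patience=10, max_epochs=100):
    """Return the epoch at which training stops, given a validation-loss trace.

    Stops when the validation loss has not improved for `patience` epochs
    (patience=10 is an assumed value) or when max_epochs is reached.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs: stop here
    return min(len(val_losses), max_epochs) - 1

# Loss improves for 3 epochs, then plateaus: training halts well before epoch 100.
print(early_stop_epoch([1.0, 0.8, 0.7] + [0.75] * 50))  # 12
```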
5. Quantitative Results and Ablation Findings
Results from Wang et al. (2024) show that the dual-pyramidal MFAM achieves state-of-the-art balanced accuracy (BACC):
| Dataset | Baseline (Late Fusion) BACC | Ours-Linear BACC | Relative Improvement |
|---|---|---|---|
| OpenfMRI (Public) | 0.657 | 0.732 | +11.4% |
| Clinical Cohort | 0.686 | 0.766 | +11.7% |
Ablation with the linear head on OpenfMRI:
- sMRI only: BACC=0.575
- fMRI only: BACC=0.658
- sMRI+fMRI (full MFAM): BACC=0.739
These results support the claim that each feature branch captures unique, complementary information and that simple embedding concatenation suffices for effective multimodal integration.
6. Architectural Significance and Future Implications
The dual-pyramidal MFAM exemplifies an explicit fusion architecture balancing deep hierarchical representation and dimensionality control for neuroimaging applications. These results underscore the value of structured, multiscale feature extraction for both structural and functional imaging modalities. A plausible implication is that further advances may arise from more sophisticated fusion operators or domain-optimized pyramid designs. No explicit attention or gating is required to achieve strong results in this framework, supporting the efficacy of concatenated embeddings in certain multimodal medical contexts (Wang et al., 2024).