Dual-Pyramidal MFAM for Bipolar Disorder Diagnosis

Updated 25 February 2026
  • The paper demonstrates that the dual-pyramidal MFAM achieves up to an 11.4% improvement in balanced accuracy for bipolar disorder diagnosis by effectively fusing sMRI and fMRI features.
  • The MFAM employs dedicated pyramid-based modules—P2FEM for sMRI and SFAM for fMRI—to extract hierarchical features that capture both anatomical details and spatio-temporal dynamics.
  • The fusion strategy concatenates complementary embeddings without explicit gating, simplifying the model design while ensuring robust and reliable integration for classification.

Dual-pyramidal Multimodal Fusion Architecture (MFAM) is a neural network design tailored for the diagnosis of bipolar disorder using both structural MRI (sMRI) and functional MRI (fMRI) data. The framework, as detailed by Liu et al., incorporates two distinct pyramid-based feature extractors—one for sMRI and one for fMRI—followed by an explicit multimodal fusion layer and classification head. The dual-pyramidal structure is motivated by the complementary nature of sMRI (capturing anatomical detail) and fMRI (capturing spatio-temporal dynamics), with fusion shown to yield state-of-the-art diagnostic performance compared to unimodal and non-hierarchical fusion baselines (Wang et al., 2024).

1. Dual-Pyramid Feature Extraction

Patch Pyramid Feature Extraction Module (P2FEM) for sMRI

sMRI data, represented as T₁-weighted 3D structural volumes $X_s \in \mathbb{R}^{H \times W \times D}$ (with $H = W = 256$, $D = 188$), are processed using $L = 4$ successive 3D convolutional layers. Each layer employs:

  • A relatively large kernel size $K_\ell$,
  • Stride $S_\ell > 1$ for down-sampling,
  • Group-wise convolutions with $G_\ell$ groups for parameter efficiency,
  • Output channels $C_\ell$.

At layer $\ell$,

$$F_\ell = \text{Conv3D}_{(K_\ell, S_\ell, G_\ell), C_\ell}(F_{\ell-1}), \quad F_0 = X_s,$$

followed by batch normalization and ReLU. Assuming "same"-style padding $P_\ell = (K_\ell - 1)/2$ (consistent with the example shapes below), dimension changes per layer are determined as $H_\ell = \left\lfloor \frac{H_{\ell-1} + 2P_\ell - K_\ell}{S_\ell} \right\rfloor + 1$ (and likewise for $W_\ell$, $D_\ell$).

Example configuration:

  • $K_1 = 7,\ S_1 = 2,\ G_1 = 1,\ C_1 = 32 \rightarrow F_1 \in \mathbb{R}^{32 \times 128 \times 128 \times 94}$
  • $K_2 = 5,\ S_2 = 2,\ G_2 = 2,\ C_2 = 64 \rightarrow F_2 \in \mathbb{R}^{64 \times 64 \times 64 \times 47}$
  • $K_3 = 3,\ S_3 = 2,\ G_3 = 4,\ C_3 = 128 \rightarrow F_3 \in \mathbb{R}^{128 \times 32 \times 32 \times 24}$
  • $K_4 = 3,\ S_4 = 2,\ G_4 = 8,\ C_4 = 256 \rightarrow F_4 \in \mathbb{R}^{256 \times 16 \times 16 \times 12}$

The flattened outputs are concatenated:

$$f_s = \text{Flatten}([F_1, F_2, F_3, F_4]) \in \mathbb{R}^{M_s}$$
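The layer shapes listed above can be checked with a short script. Note that the "same"-style padding $P_\ell = (K_\ell - 1)/2$ used here is an inference from the listed output shapes, not an explicitly stated configuration:

```python
# Verify the P2FEM example shapes with the conv output-size formula
# H_out = floor((H + 2P - K) / S) + 1, assuming "same"-style padding
# P = (K - 1) // 2 (inferred from the listed shapes).

def conv3d_out(shape, k, s, p):
    """Output spatial shape of a 3D conv applied to an (H, W, D) volume."""
    return tuple((dim + 2 * p - k) // s + 1 for dim in shape)

layers = [  # (kernel K, stride S, output channels C)
    (7, 2, 32),
    (5, 2, 64),
    (3, 2, 128),
    (3, 2, 256),
]

shape = (256, 256, 188)  # input sMRI volume H x W x D
feature_sizes = []
for k, s, c in layers:
    shape = conv3d_out(shape, k, s, p=(k - 1) // 2)
    feature_sizes.append(c * shape[0] * shape[1] * shape[2])
    print(c, shape)

m_s = sum(feature_sizes)  # length of the concatenated flattened vector f_s
print("M_s =", m_s)
```

Running this reproduces the four feature-map shapes in the example configuration, and shows that the flattened concatenation $f_s$ is dominated by the first (highest-resolution) pyramid level.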

Spatio-Temporal Pyramid Feature Extraction Module (SFAM) for fMRI

The fMRI input is treated as a matrix of parcellated regional time series $P \in \mathbb{R}^{M \times N}$ (e.g., $M = 210$ frames, $N = 463$ regions). SFAM constructs a $T$-level temporal pyramid. At each level $t$:

  • Segment the series into $R_t$ overlapping windows of size $W_t$ with stride $U_t$.
  • Each segment $s_t^r \in \mathbb{R}^{W_t \times N}$ is encoded by a two-layer MLP with batch normalization and ReLU:

$$z_t^r = \text{ReLU}(\text{BN}(W_{s2}^t\, \text{ReLU}(\text{BN}(W_{s1}^t s_t^r + b_{s1}^t)) + b_{s2}^t))$$

  • A 1D convolution along the temporal axis generates

$$h_t^r = \sigma(\text{BN}(\text{Conv1D}_{(k_t, u_t)}(z_t^r)))$$

  • Aggregate (mean-pool or max-pool) the $R_t$ segment outputs:

$$f_t = \frac{1}{R_t} \sum_{r=1}^{R_t} \text{Pool}(h_t^r) \in \mathbb{R}^d$$

  • Final feature: concatenate pyramid levels

$$f_f = f_1 \,\|\, f_2 \,\|\, \cdots \,\|\, f_T \in \mathbb{R}^{T \cdot d}$$
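A single pyramid level can be sketched in numpy as below. The window size $W_t$, stride $U_t$, and feature dimension $d$ are illustrative placeholders (the paper's values are not reproduced here), batch normalization is omitted, and the temporal Conv1D step is folded into the MLP for brevity:

```python
import numpy as np

# One SFAM pyramid level with random stand-in weights; W_t, U_t, and d
# below are illustrative assumptions, not the paper's configuration.
rng = np.random.default_rng(0)

M, N = 210, 463            # frames x parcellated regions
P = rng.standard_normal((M, N))

W_t, U_t, d = 30, 15, 64   # hypothetical window size, stride, feature dim
R_t = (M - W_t) // U_t + 1 # number of overlapping windows

W1 = rng.standard_normal((W_t * N, 128)) * 0.01  # stand-in MLP weights
W2 = rng.standard_normal((128, d)) * 0.01

pooled = []
for r in range(R_t):
    seg = P[r * U_t : r * U_t + W_t]         # segment s_t^r, shape (W_t, N)
    z = np.maximum(seg.reshape(-1) @ W1, 0)  # layer 1 + ReLU (BN omitted)
    z = np.maximum(z @ W2, 0)                # layer 2 + ReLU
    pooled.append(z)

f_t = np.mean(pooled, axis=0)                # mean-pool over the R_t segments
print(R_t, f_t.shape)
```

With these placeholder values, the 210-frame series yields $R_t = 13$ overlapping segments, each reduced to a $d$-dimensional vector before level-wise pooling.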

2. Multimodal Fusion

The sMRI vector $f_s$ and the fMRI vector $f_f$ are each projected to a common embedding dimension $D$ via separate fully-connected bottleneck layers with batch normalization and ReLU:

$$u_s = \text{ReLU}(\text{BN}(W_s f_s + b_s)) \in \mathbb{R}^D$$

$$u_f = \text{ReLU}(\text{BN}(W_f f_f + b_f)) \in \mathbb{R}^D$$

These are then concatenated (no explicit attention/gating):

$$u_{\text{concat}} = u_s \,\|\, u_f \in \mathbb{R}^{2D}$$

This combined embedding serves as input to the downstream classifier.
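The fusion step can be sketched in a few lines of numpy; the feature sizes $M_s$, $M_f$, and embedding dimension $D$ below are illustrative placeholders, and batch normalization is omitted:

```python
import numpy as np

# Sketch of the fusion step: project each modality vector to a shared
# dimension D, then concatenate. Sizes here are placeholder assumptions.
rng = np.random.default_rng(1)

M_s, M_f, D = 1024, 512, 128          # placeholder feature/embedding sizes
f_s = rng.standard_normal(M_s)        # flattened sMRI pyramid features
f_f = rng.standard_normal(M_f)        # concatenated fMRI pyramid features

W_s = rng.standard_normal((D, M_s)) * 0.01
W_f = rng.standard_normal((D, M_f)) * 0.01

u_s = np.maximum(W_s @ f_s, 0)        # bottleneck + ReLU (BN omitted)
u_f = np.maximum(W_f @ f_f, 0)
u_concat = np.concatenate([u_s, u_f]) # no attention or gating, just concat
print(u_concat.shape)                 # (2D,)
```

The separate bottlenecks keep each modality's contribution at a fixed $D$ dimensions, so neither branch dominates the concatenated embedding by raw feature count.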

3. Classification Strategies

Two classification heads are evaluated:

  • Ours-Dense: a one-hidden-layer MLP (with ReLU, dropout, and batch norm) followed by a 2-way softmax.
  • Ours-Linear: a direct linear softmax classifier:

$$\hat{y} = \text{Softmax}(W_c\, u_{\text{concat}} + b_c)$$

Optimization employs the standard cross-entropy loss with $L_2$ regularization:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \sum_{k \in \{0, 1\}} y_i^{(k)} \log \hat{y}_i^{(k)} + \lambda \|W_c\|_2^2$$
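The linear head and regularized loss above amount to a few lines of numpy; the batch size, embedding dimension, and weights below are random placeholders for illustration:

```python
import numpy as np

# Minimal sketch of the linear softmax head and the regularized
# cross-entropy loss; all data and weights are random placeholders.
rng = np.random.default_rng(2)

N, D2, lam = 8, 16, 1e-5              # batch size, 2D embedding dim, lambda
U = rng.standard_normal((N, D2))      # batch of concatenated embeddings
y = np.eye(2)[rng.integers(0, 2, N)]  # one-hot labels, shape (N, 2)

W_c = rng.standard_normal((D2, 2)) * 0.1
b_c = np.zeros(2)

logits = U @ W_c + b_c
logits -= logits.max(axis=1, keepdims=True)       # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

ce = -np.mean(np.sum(y * np.log(probs), axis=1))  # cross-entropy term
loss = ce + lam * np.sum(W_c ** 2)                # + L2 penalty on W_c
print(loss)
```

Note that the $L_2$ term penalizes only the classifier weights $W_c$, matching the loss as written; weight decay in the optimizer (Section 4) additionally regularizes the rest of the network.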

4. Data Processing and Experimental Protocols

Preprocessing follows established neuroimaging conventions:

  • sMRI: skull-stripping, affine normalization to MNI space, intensity standardization, and resizing to $256 \times 256 \times 188$.
  • fMRI: slice-timing correction, motion realignment, nuisance regression, regional parcellation, and per-series standardization.

Experiments are conducted with five-fold cross-validation on:

  • the Beijing Huilongguan clinical cohort ($n = 91$),
  • the public OpenfMRI dataset ($n = 171$).

Hyperparameters are tuned by grid search:

  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$),
  • Learning rate: $1 \times 10^{-4}$ (halved on plateau),
  • Batch size: 8,
  • Weight decay: $1 \times 10^{-5}$,
  • Dropout (Dense head): 0.5,
  • Epochs: up to 100 with early stopping on validation loss.
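The "halved on plateau" schedule can be sketched in pure Python; the patience value below is an assumption, since it is not reported above:

```python
# Sketch of the halve-on-plateau learning-rate schedule; the patience
# value is an illustrative assumption, not a reported hyperparameter.
def halve_on_plateau(val_losses, lr=1e-4, patience=5):
    """Halve lr whenever the validation loss fails to improve for
    `patience` consecutive epochs."""
    best, stale = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= 0.5
                stale = 0
    return lr

# Five consecutive non-improving epochs trigger one halving
print(halve_on_plateau([1.0, 1.1, 1.1, 1.1, 1.1, 1.1]))  # 5e-05
```

Combined with early stopping on validation loss, this keeps training stable on the small cohorts ($n = 91$ and $n = 171$) used here.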

5. Quantitative Results and Ablation Findings

Results from (Wang et al., 2024) show that the dual-pyramidal MFAM achieves state-of-the-art balanced accuracy (BACC):

| Dataset | Baseline (Late Fusion) | Ours-Linear | Improvement |
|---|---|---|---|
| OpenfMRI (Public) | 0.657 | 0.732 | +11.4% |
| Clinical Cohort | 0.686 | 0.766 | +8.0% |

Ablation with the linear head on OpenfMRI:

  • sMRI only: BACC = 0.575
  • fMRI only: BACC = 0.658
  • sMRI + fMRI (full MFAM): BACC = 0.739
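Balanced accuracy is the mean of per-class recall, which is why it is preferred over raw accuracy for imbalanced diagnostic cohorts. A minimal reference implementation:

```python
# Balanced accuracy (BACC): the unweighted mean of per-class recall.
def balanced_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy example: a majority-class predictor scores 90% raw
# accuracy here but only 0.5 BACC, the chance level for two classes.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

This illustrates why a BACC of 0.739 on OpenfMRI is a meaningful margin over the unimodal branches, independent of class proportions.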

These results support the claim that each feature branch captures unique, complementary information and that simple embedding concatenation suffices for effective multimodal integration.

6. Architectural Significance and Future Implications

The dual-pyramidal MFAM exemplifies an explicit fusion architecture balancing deep hierarchical representation and dimensionality control for neuroimaging applications. These results underscore the value of structured, multiscale feature extraction for both structural and functional imaging modalities. A plausible implication is that further advances may arise from more sophisticated fusion operators or domain-optimized pyramid designs. No explicit attention or gating is required to achieve strong results in this framework, supporting the efficacy of concatenated embeddings in certain multimodal medical contexts (Wang et al., 2024).
