Mixture of Pooling-Classifier Experts (MoE)
- MoE is a neural module that integrates multiple specialized expert branches within Transformer pipelines to analyze high-dimensional fMRI connectivity data.
- It employs expert-specific attention pooling and a gating mechanism to adaptively weight contributions, thereby enhancing classification accuracy and interpretability.
- Empirical results on the ABIDE dataset demonstrate superior performance over single-expert decoders, highlighting improved sensitivity, specificity, and overall diagnostic accuracy.
A Mixture of Pooling-Classifier Experts (MoE) is a neural module designed to enable Transformers to integrate multiple specialized expert branches, each focusing on distinct patterns within high-dimensional networks such as functional connectivity (FC) matrices from functional MRI. In the context of ASDFormer, MoE provides adaptive, interpretable pooling and classification of region-of-interest (ROI) interactions, substantially improving classification accuracy and enabling the discovery of biomarkers relevant to neurodevelopmental disorders (Izadi et al., 19 Aug 2025).
1. Architectural Composition and Integration
The MoE module operates within a Transformer-based pipeline for fMRI analysis. Each subject provides an input FC matrix $X \in \mathbb{R}^{N \times N}$, where $N$ is the number of ROIs and the $i$th row $x_i$ represents the connectivity profile of ROI $i$.
A shared multilayer perceptron (MLP) maps each $x_i$ to a $d$-dimensional embedding: $$h_i^{(0)} = \mathrm{MLP}(x_i) \in \mathbb{R}^{d}$$ The sequence $(h_1^{(0)}, \dots, h_N^{(0)})$ is processed through $L$ Transformer layers, yielding contextualized tokens $h_i^{(L)}$.
The MoE decoder introduces a dimensionality-reduction MLP, producing lower-dimensional representations $z_i = \mathrm{MLP}_{\text{red}}(h_i^{(L)}) \in \mathbb{R}^{d'}$. The resulting matrix $Z \in \mathbb{R}^{N \times d'}$ ($d' < d$) is provided to $E$ expert branches, each comprising a sparse attention pooling module and an independent classifier. A gating network computes selection weights across these experts, producing a final prediction as a weighted sum of their outputs.
2. Expert Branch Mechanism and Mathematical Formalism
2.1. Expert-Specific Attention Pooling
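Each expert $e$ applies its own attention scoring MLP to each reduced ROI token $z_i$:
$$s_i^{(e)} = \mathrm{MLP}_{\text{att}}^{(e)}(z_i)$$
The top-$K_e$ ROIs for expert $e$ are selected:
$$\mathcal{T}_e = \operatorname{TopK}\big(\{s_i^{(e)}\}_{i=1}^{N},\, K_e\big)$$
A masked softmax defines the attention pool:
$$\alpha_i^{(e)} = \begin{cases} \dfrac{\exp(s_i^{(e)})}{\sum_{j \in \mathcal{T}_e} \exp(s_j^{(e)})}, & i \in \mathcal{T}_e \\ 0, & \text{otherwise} \end{cases}$$
Pooled expert embedding:
$$p_e = \sum_{i=1}^{N} \alpha_i^{(e)} z_i$$
Each expert's classifier, an independent two-layer MLP with GELU nonlinearity, maps embedding $p_e$ to class logits:
$$y_e = \mathrm{MLP}_{\text{cls}}^{(e)}(p_e) \in \mathbb{R}^{C}$$
where $C = 2$ (ASD, HC).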
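The pooling step for a single expert can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the ASDFormer implementation: the function name `topk_attention_pool`, the list-based data layout, and the tie-breaking rule are all hypothetical, and the attention scores are taken as given rather than produced by a scoring MLP.

```python
import math

def topk_attention_pool(tokens, scores, k):
    """Pool ROI tokens with a softmax restricted to the k highest-scoring ROIs.

    tokens: list of N vectors (lists of floats), the reduced ROI tokens.
    scores: list of N floats, one expert's attention score per ROI.
    k:      number of ROIs this expert keeps (hypothetical parameter name).
    Returns (pooled_vector, weights); weights is zero outside the top-k set.
    """
    n = len(tokens)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of ROIs")
    # Indices of the k largest scores; ties go to the lower index (an assumption).
    selected = sorted(range(n), key=lambda i: (-scores[i], i))[:k]
    # Numerically stable softmax over the selected scores only.
    top = max(scores[i] for i in selected)
    exps = {i: math.exp(scores[i] - top) for i in selected}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(n)]
    # Attention-weighted sum of the token vectors.
    dim = len(tokens[0])
    pooled = [sum(weights[i] * tokens[i][d] for i in range(n)) for d in range(dim)]
    return pooled, weights

# Toy example: 4 ROIs with 2-dimensional tokens, keeping the top 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
scores = [0.5, 2.0, 2.0, -1.0]
pooled, weights = topk_attention_pool(tokens, scores, k=2)
print(weights)  # [0.0, 0.5, 0.5, 0.0]
print(pooled)   # [0.5, 1.0]
```

Because the softmax is taken only over the selected indices, ROIs outside the top-k set receive exactly zero weight, which is what makes the pooled embedding attributable to a small ROI subset.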
2.2. Gating and Combination
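The set of all reduced ROI tokens $z_1, \dots, z_N$ for each subject is flattened: $$z_{\text{flat}} = \operatorname{vec}(Z) \in \mathbb{R}^{N d'}$$ where $d'$ is the reduced token dimension. A gating network $G$ computes expert logits $u = G(z_{\text{flat}}) \in \mathbb{R}^{E}$, with $E$ the number of experts, which are normalized: $$g = \operatorname{softmax}(u)$$ The final output logits are: $$\hat{y} = \sum_{e=1}^{E} g_e\, y_e$$ where $y_e$ denotes the class logits of expert $e$. This mechanism allows the model to adaptively assign influence to each expert based on global context.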
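The combination step can be illustrated with a short plain-Python sketch. It assumes the normalization is a softmax over the gating logits, takes those logits as given (the gating network itself is not modeled), and uses hypothetical names (`softmax`, `combine_experts`) that do not come from the ASDFormer code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_experts(gate_logits, expert_logits):
    """Mix per-expert class logits using softmax-normalized gating weights.

    gate_logits:   list of E floats (assumed output of a gating network).
    expert_logits: list of E lists, each holding C class logits.
    Returns (mixed_logits, gates).
    """
    if len(gate_logits) != len(expert_logits):
        raise ValueError("need one gating logit per expert")
    gates = softmax(gate_logits)
    n_classes = len(expert_logits[0])
    mixed = [
        sum(g * y[c] for g, y in zip(gates, expert_logits))
        for c in range(n_classes)
    ]
    return mixed, gates

# Two experts, two classes; equal gating logits give equal weights.
gate_logits = [0.0, 0.0]
expert_logits = [[2.0, -2.0], [0.0, 4.0]]
mixed, gates = combine_experts(gate_logits, expert_logits)
print(gates)  # [0.5, 0.5]
print(mixed)  # [1.0, 1.0]
```

Since the gating weights sum to one, the mixed logits are a convex combination of the expert logits, so a subject's gating vector can be read directly as the share of the decision attributed to each expert.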
3. Classifier Expert Heads
Each expert uses an independent MLP head for classification, mapping its pooled embedding to class logits. The MLPs typically have two layers with inner dimension 128 and GELU activations. No weights are shared across experts, which encourages specialization and facilitates the learning of complementary, non-redundant ROI subsets relevant for ASD and healthy control discrimination.
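A per-expert head of this shape can be sketched in plain Python. Only the two-layer structure, the inner dimension of 128, and the GELU activation come from the description above; the input dimension of 8, the uniform initialization, and the names `make_head` and `head_forward` are illustrative assumptions.

```python
import math
import random

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_head(in_dim, hidden_dim=128, n_classes=2, seed=0):
    """Create parameters for one expert's head (initialization is an assumption)."""
    rng = random.Random(seed)

    def init(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": init(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": init(n_classes, hidden_dim), "b2": [0.0] * n_classes,
    }

def head_forward(params, pooled):
    """Two-layer MLP: linear, GELU, linear. Returns class logits."""
    hidden = [gelu(v) for v in linear(pooled, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])

# One separately initialized head per expert, so no parameters are shared.
heads = [make_head(in_dim=8, seed=s) for s in range(2)]
pooled = [0.1 * i for i in range(8)]
logits_per_expert = [head_forward(h, pooled) for h in heads]
```

Keeping a separate parameter dictionary per expert mirrors the absence of weight sharing: the same pooled embedding generally produces different logits from each head.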
4. Training Methodology and Regularization
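The model is optimized using binary cross-entropy loss
$$\mathcal{L}_{\text{BCE}} = -\big[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\big]$$
where $y \in \{0, 1\}$ is the diagnostic label and $\hat{p}$ is the predicted probability of the positive class.
To mitigate expert collapse, an expert-load regularizer is employed, penalizing high coefficient of variation in total gating weight per expert: $$\mathcal{L}_{\text{load}} = \mathrm{CV}(\ell) = \frac{\sigma(\ell)}{\mu(\ell)}, \qquad \ell_e = \sum_{b} g_e^{(b)}$$ where $g_e^{(b)}$ is the gating weight assigned to expert $e$ for subject $b$ in the batch. The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{BCE}} + \lambda\, \mathcal{L}_{\text{load}}$$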
with 4.
Optimization uses Adam with weight decay (5) and early stopping based on validation AUROC. Dropout (6) is applied within MLPs for further regularization.
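The load regularizer described above can be sketched as the coefficient of variation of the per-expert gating mass over a batch. This is a minimal plain-Python illustration: the function names, the use of the population standard deviation, the `eps` stabilizer, and the regularization weight `lam=0.1` in the example are assumptions, not values from the source.

```python
import math

def load_cv(gate_weights, eps=1e-8):
    """Coefficient of variation of the total gating weight per expert.

    gate_weights: list of B gating vectors (one per subject), each with E entries.
    Returns std / mean of the per-expert sums; 0 means perfectly balanced load.
    """
    n_experts = len(gate_weights[0])
    load = [sum(g[e] for g in gate_weights) for e in range(n_experts)]
    mean = sum(load) / n_experts
    var = sum((v - mean) ** 2 for v in load) / n_experts  # population variance
    return math.sqrt(var) / (mean + eps)

def total_loss(bce, gate_weights, lam):
    """Classification loss plus the weighted load penalty (lam is hypothetical)."""
    return bce + lam * load_cv(gate_weights)

# Balanced batch: both experts receive the same total gating weight.
balanced = [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4]]
# Collapsed batch: almost all gating weight goes to expert 0.
collapsed = [[0.99, 0.01], [0.98, 0.02], [0.97, 0.03]]

print(load_cv(balanced))   # close to 0
print(load_cv(collapsed))  # close to 0.96
print(total_loss(0.7, collapsed, lam=0.1))
```

The penalty stays near zero when experts share the batch evenly and grows as one expert absorbs most of the gating weight, which is the collapse behavior the regularizer is meant to discourage.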
5. Empirical Performance and Comparative Analysis
On the ABIDE dataset, ASDFormer with MoE outperformed strong baselines such as Com-BrainTF, BrainNetCNN, and FBNETGEN. It also outperformed single-expert pooling-classifier decoders on the key metrics:
| Method | AUROC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| ASDFormer (MoE + Transformer) | 7\% | 8\% | 9\% | 0\% |
| Single-expert pooling | 1\% | 2\% | 3\% | 4\% |
These results demonstrate the efficacy of the MoE architecture in both overall classification and balanced sensitivity/specificity.
6. Interpretability and Biomarker Identification
The MoE structure yields direct interpretability mechanisms. The gating weights 5 indicate expert dominance (e.g., healthy controls with 6, ASD with 7). Each expert’s top-8 ROIs are determined empirically (9 for expert 1, 0 for expert 2).
The product 1 yields a signed importance score per ROI in each subject, enabling individual-level attribution. Analysis reveals that the model identifies salient connectivity features:
- Sensorimotor Network (SMN) ↔ Fronto-Parietal (FPN), Default Mode (DMN), and Limbic cross-network interactions
- DMN intra-network dysconnectivity
- Cerebellum and subcortical (CS/SB) cross-talk
These findings align with established fMRI literature on ASD and support the MoE's utility for both predictive and mechanistic biomarker discovery.
7. Context and Implementation Guidance
The modularity of the Mixture of Pooling-Classifier Experts module facilitates integration into any Transformer-based FC classifier. Code and further implementation details are provided in the ASDFormer repository. The architecture is generalizable to other domains where sparse, interpretable attention over high-dimensional feature sets is advantageous, particularly in connectomics and brain disorder classification settings (Izadi et al., 19 Aug 2025).