
Mixture of Pooling-Classifier Experts (MoE)

Updated 3 July 2026
  • MoE is a neural module that integrates multiple specialized expert branches within Transformer pipelines to analyze high-dimensional fMRI connectivity data.
  • It employs expert-specific attention pooling and a gating mechanism to adaptively weight contributions, thereby enhancing classification accuracy and interpretability.
  • Empirical results on the ABIDE dataset demonstrate superior performance over single-expert decoders, highlighting improved sensitivity, specificity, and overall diagnostic accuracy.

A Mixture of Pooling-Classifier Experts (MoE) is a neural module designed to enable Transformers to integrate multiple specialized expert branches, each focusing on distinct patterns within high-dimensional networks such as functional connectivity (FC) matrices from functional MRI. In the context of ASDFormer, MoE provides adaptive, interpretable pooling and classification of region-of-interest (ROI) interactions, substantially improving classification accuracy and enabling the discovery of biomarkers relevant to neurodevelopmental disorders (Izadi et al., 19 Aug 2025).

1. Architectural Composition and Integration

The MoE module operates within a Transformer-based pipeline for fMRI analysis. Each subject $b$ provides an input FC matrix $\mathbf X_b \in \mathbb R^{N \times N}$, where the $i$th row $\mathbf x_{b,i} \in \mathbb R^{N}$ represents the connectivity profile of ROI $i$.

A shared multilayer perceptron (MLP) maps each $\mathbf x_{b,i}$ to a $d$-dimensional embedding:

$$\mathbf z_{b,i} = \mathrm{LayerNorm}(\mathrm{MLP}(\mathbf x_{b,i}))$$

The sequence $\{\mathbf z_{b,1},\dots,\mathbf z_{b,N}\}$ is processed through $L$ Transformer layers, yielding contextualized tokens $\mathbf h_{b,i} \in \mathbb R^{d}$.
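The row-wise embedding step can be sketched in NumPy as follows. This is a minimal illustration, not the ASDFormer implementation: the hidden width, the ReLU nonlinearity, the random weights, the toy sizes, and the helper names `layer_norm` and `embed_rois` are all assumptions, and the Transformer layers that follow are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def embed_rois(fc, w1, b1, w2, b2):
    """Apply the same two-layer MLP to every ROI row, then LayerNorm."""
    hidden = np.maximum(fc @ w1 + b1, 0.0)  # ReLU is an assumption, not from the source
    return layer_norm(hidden @ w2 + b2)

rng = np.random.default_rng(0)
N, hidden_dim, d = 8, 16, 4            # toy sizes (hypothetical)
fc = rng.standard_normal((N, N))
fc = (fc + fc.T) / 2                   # symmetric, like a correlation-based FC matrix
w1 = rng.standard_normal((N, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, d)) * 0.1
b2 = np.zeros(d)

z = embed_rois(fc, w1, b1, w2, b2)     # one d-dimensional token per ROI
print(z.shape)
```

Because the MLP weights are shared across rows, the whole FC matrix is embedded with two matrix products, and each ROI becomes one token of the sequence passed to the Transformer.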

The MoE decoder introduces a dimensionality-reduction MLP, producing lower-dimensional representations $\tilde{\mathbf h}_{b,i} \in \mathbb R^{d'}$. The resulting matrix $\tilde{\mathbf H}_b \in \mathbb R^{N \times d'}$ ($d' < d$) is provided to $E$ expert branches, each comprising a sparse attention pooling module and an independent classifier. A gating network computes selection weights across these experts, producing a final prediction as a weighted sum of their outputs.

2. Expert Branch Mechanism and Mathematical Formalism

2.1. Expert-Specific Attention Pooling

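Each expert $e$ applies its own attention scoring MLP to each reduced ROI token $\tilde{\mathbf h}_{b,i}$:

$$s^{(e)}_{b,i} = \mathrm{MLP}^{(e)}_{\mathrm{att}}(\tilde{\mathbf h}_{b,i})$$

The top-$k_e$ ROIs for expert $e$ are selected:

$$\mathcal S^{(e)}_b = \operatorname{TopK}_{i}\big(s^{(e)}_{b,i},\, k_e\big)$$

A masked softmax defines the attention pool:

$$a^{(e)}_{b,i} = \frac{\mathbb 1[i \in \mathcal S^{(e)}_b]\,\exp\big(s^{(e)}_{b,i}\big)}{\sum_{j \in \mathcal S^{(e)}_b} \exp\big(s^{(e)}_{b,j}\big)}$$

The pooled expert embedding is

$$\mathbf p^{(e)}_b = \sum_{i=1}^{N} a^{(e)}_{b,i}\,\tilde{\mathbf h}_{b,i}$$

Each expert's classifier, an independent two-layer MLP with GELU nonlinearity, maps the embedding $\mathbf p^{(e)}_b$ to class logits:

$$\boldsymbol\ell^{(e)}_b = \mathrm{MLP}^{(e)}_{\mathrm{cls}}\big(\mathbf p^{(e)}_b\big) \in \mathbb R^{C}$$

where $C = 2$ (ASD, HC).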
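The sequence of steps in one expert branch (score each ROI token, keep the top-k, apply a masked softmax, pool, classify) can be sketched as below. This is an illustrative NumPy sketch under stated assumptions: the scorer is reduced to a single linear map, the weights are random, and the function name `expert_forward` and all sizes except the two-layer GELU head are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def expert_forward(tokens, w_att, k, w1, b1, w2, b2):
    """One expert: top-k masked-softmax pooling followed by a two-layer MLP head.

    tokens: (N, d') reduced ROI tokens for a single subject.
    """
    scores = tokens @ w_att                       # linear scorer (assumption)
    top = np.argsort(scores)[-k:]                 # indices of the k highest scores
    masked = np.full(scores.shape, -np.inf)       # -inf removes ROIs outside the top-k
    masked[top] = scores[top]
    weights = np.exp(masked - scores[top].max())  # exp(-inf) = 0 for masked ROIs
    weights /= weights.sum()
    pooled = weights @ tokens                     # (d',) weighted sum of kept tokens
    logits = gelu(pooled @ w1 + b1) @ w2 + b2     # (2,) class logits
    return logits, weights, pooled

rng = np.random.default_rng(0)
N, d_red, inner, k = 6, 3, 128, 2                 # toy sizes (hypothetical)
tokens = rng.standard_normal((N, d_red))
w_att = rng.standard_normal(d_red)
w1 = rng.standard_normal((d_red, inner)) * 0.1
b1 = np.zeros(inner)
w2 = rng.standard_normal((inner, 2)) * 0.1
b2 = np.zeros(2)

logits, weights, pooled = expert_forward(tokens, w_att, k, w1, b1, w2, b2)
print(np.count_nonzero(weights), logits.shape)
```

Masking with negative infinity before the softmax guarantees that exactly k ROIs receive nonzero attention, which is what makes each expert's pooling sparse and its selected ROIs directly inspectable.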

2.2. Gating and Combination

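The set of all reduced ROI tokens for each subject is flattened:

$$\mathbf f_b = \mathrm{vec}\big(\tilde{\mathbf H}_b\big) \in \mathbb R^{N d'}$$

where $\tilde{\mathbf H}_b$ is the $N \times d'$ matrix of reduced ROI tokens. A gating network $g$ computes expert logits $\mathbf u_b = g(\mathbf f_b) \in \mathbb R^{E}$, one per expert, which are normalized:

$$\alpha_{b,e} = \frac{\exp(u_{b,e})}{\sum_{e'=1}^{E} \exp(u_{b,e'})}$$

The final output logits are

$$\hat{\mathbf y}_b = \sum_{e=1}^{E} \alpha_{b,e}\,\boldsymbol\ell^{(e)}_b$$

where $\boldsymbol\ell^{(e)}_b$ denotes the class logits produced by expert $e$. This mechanism allows the model to adaptively assign influence to each expert based on global context.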
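The gating step described above (flatten the tokens, compute one logit per expert, normalize with a softmax, then mix the expert outputs) might look like the following NumPy sketch. The single linear layer used as the gating network, the random weights, the example expert logits, and the name `gate_and_combine` are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_and_combine(tokens, w_gate, b_gate, expert_logits):
    """Mix per-expert class logits with subject-specific gating weights.

    tokens: (N, d') reduced ROI tokens; expert_logits: (E, C).
    """
    flat = tokens.reshape(-1)                # (N * d',) flattened global context
    gate = softmax(flat @ w_gate + b_gate)   # (E,) gating weights, sum to 1
    return gate @ expert_logits, gate        # (C,) final logits, (E,) weights

rng = np.random.default_rng(1)
N, d_red, E, C = 6, 3, 2, 2                  # toy sizes (hypothetical)
tokens = rng.standard_normal((N, d_red))
w_gate = rng.standard_normal((N * d_red, E)) * 0.1
b_gate = np.zeros(E)
expert_logits = np.array([[1.5, -0.5],       # made-up logits from expert 1
                          [-1.0, 2.0]])      # made-up logits from expert 2

final_logits, gate = gate_and_combine(tokens, w_gate, b_gate, expert_logits)
print(gate, final_logits)
```

Since the gate sees every ROI token at once, the weighting is computed per subject, so two subjects can rely on different experts.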

3. Classifier Expert Heads

Each expert uses an independent MLP head for classification, mapping its pooled embedding to the two class logits. The MLPs typically have two layers with inner dimension 128 and GELU activations. No parameters are shared across experts, which encourages specialization and facilitates the learning of complementary, non-redundant ROI subsets relevant for discriminating ASD from healthy controls.

4. Training Methodology and Regularization

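The model is optimized using binary cross-entropy loss

$$\mathcal L_{\mathrm{BCE}} = -\frac{1}{B}\sum_{b=1}^{B}\big[y_b \log \hat p_b + (1 - y_b)\log(1 - \hat p_b)\big]$$

where $y_b \in \{0, 1\}$ is the label of subject $b$, $\hat p_b$ is the predicted probability of the positive class, and $B$ is the batch size.

To mitigate expert collapse, an expert-load regularizer is employed, penalizing a high coefficient of variation in total gating weight per expert:

$$\mathcal L_{\mathrm{load}} = \mathrm{CV}\big(G_1, \dots, G_E\big), \qquad G_e = \sum_{b=1}^{B} \alpha_{b,e}$$

where $\alpha_{b,e}$ is the gating weight of expert $e$ for subject $b$ and $\mathrm{CV}$ denotes the standard deviation divided by the mean. The total loss is

$$\mathcal L = \mathcal L_{\mathrm{BCE}} + \lambda\,\mathcal L_{\mathrm{load}}$$

with $\lambda$ = […].

Optimization uses Adam with weight decay ([…]) and early stopping based on validation AUROC. Dropout ([…]) is applied within MLPs for further regularization.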
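To make the load regularizer concrete, the sketch below computes a coefficient-of-variation penalty on the total gating weight each expert receives over a batch and adds it to a binary cross-entropy term. It is a hedged illustration: the exact form used in ASDFormer (for example CV versus squared CV) is not given here, and the value of `lam`, the toy gate matrices, and the function names are hypothetical.

```python
import numpy as np

def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy on predicted positive-class probabilities."""
    p = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def load_cv(gates, eps=1e-8):
    """Coefficient of variation of the per-expert total gating weight.

    gates: (B, E) gating weights, one row per subject.
    """
    load = gates.sum(axis=0)                 # total weight routed to each expert
    return float(load.std() / (load.mean() + eps))

# Toy batches of three subjects and two experts (made-up numbers).
balanced = np.array([[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
collapsed = np.array([[0.99, 0.01], [0.98, 0.02], [0.97, 0.03]])

probs = np.array([0.9, 0.2, 0.7])            # predicted P(positive class)
labels = np.array([1.0, 0.0, 1.0])
lam = 0.01                                   # hypothetical weighting coefficient

total = bce(probs, labels) + lam * load_cv(collapsed)
print(load_cv(balanced), load_cv(collapsed), total)
```

The penalty is near zero when experts share the batch evenly and grows as one expert absorbs most of the gating weight, which is the collapse the regularizer is meant to discourage.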

5. Empirical Performance and Comparative Analysis

On the ABIDE dataset, ASDFormer with MoE outperformed strong baselines such as Com-BrainTF, BrainNetCNN, and FBNETGEN. It also outperformed single-expert pooling-classifier decoders on key metrics:

Method                          AUROC    Accuracy   Sensitivity   Specificity
ASDFormer (MoE + Transformer)   […]%     […]%       […]%          […]%
Single-expert pooling           […]%     […]%       […]%          […]%

These results demonstrate the efficacy of the MoE architecture in both overall classification and balanced sensitivity/specificity.
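For reference, accuracy, sensitivity, and specificity follow directly from the confusion-matrix counts. The short sketch below shows the arithmetic on made-up labels, assuming ASD is coded as the positive class (1) and healthy control as 0. The data and the function name `diagnostic_metrics` are hypothetical, and AUROC is omitted because it requires continuous scores rather than hard predictions.

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

# Made-up labels for eight subjects: 1 = ASD (assumed positive), 0 = HC.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.75, 'sensitivity': 0.75, 'specificity': 0.75}
```

Reporting sensitivity and specificity alongside accuracy shows whether a classifier gains accuracy by favoring one class, which is why balanced values matter in a diagnostic setting.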

6. Interpretability and Biomarker Identification

The MoE structure yields direct interpretability mechanisms. The gating weights $\alpha_{b,e}$ indicate expert dominance (e.g., healthy controls with […], ASD with […]). Each expert's top-$k$ ROIs are determined empirically ([…] for expert 1, […] for expert 2).

The product […] yields a signed importance score per ROI in each subject, enabling individual-level attribution. Analysis reveals that the model identifies salient connectivity features:

  • Sensorimotor Network (SMN) ↔ Fronto-Parietal (FPN), Default Mode (DMN), and Limbic cross-network interactions
  • DMN intra-network dysconnectivity
  • Cerebellum and subcortical (CS/SB) cross-talk

These findings are consistent with established fMRI literature on ASD and support the MoE's utility for both prediction and mechanistic biomarker discovery.

7. Context and Implementation Guidance

The modularity of the Mixture of Pooling-Classifier Experts module facilitates integration into any Transformer-based FC classifier. Code and further implementation details are provided in the ASDFormer repository. The architecture is generalizable to other domains where sparse, interpretable attention over high-dimensional feature sets is advantageous, particularly in connectomics and brain disorder classification settings (Izadi et al., 19 Aug 2025).

