Mixture of Pooling-Classifier Experts (MoE)

Updated 3 July 2026

MoE is a neural module that integrates multiple specialized expert branches within Transformer pipelines to analyze high-dimensional fMRI connectivity data.
It employs expert-specific attention pooling and a gating mechanism to adaptively weight contributions, thereby enhancing classification accuracy and interpretability.
Empirical results on the ABIDE dataset demonstrate superior performance over single-expert decoders, highlighting improved sensitivity, specificity, and overall diagnostic accuracy.

A Mixture of Pooling-Classifier Experts (MoE) is a neural module designed to enable Transformers to integrate multiple specialized expert branches, each focusing on distinct patterns within high-dimensional networks such as functional connectivity (FC) matrices from functional MRI. In the context of ASDFormer, MoE provides adaptive, interpretable pooling and classification of region-of-interest (ROI) interactions, substantially improving classification accuracy and enabling the discovery of biomarkers relevant to neurodevelopmental disorders (Izadi et al., 19 Aug 2025).

1. Architectural Composition and Integration

The MoE module operates within a Transformer-based pipeline for fMRI analysis. Each subject $b$ provides an input FC matrix $\mathbf X_b \in \mathbb R^{N \times N}$, where the $i$th row $\mathbf x_{b,i} \in \mathbb R^{N}$ represents the connectivity profile of ROI $i$.

A shared multilayer perceptron (MLP) maps each $\mathbf x_{b,i}$ to a $d$-dimensional embedding: $\mathbf z_{b,i} = \mathrm{LayerNorm}(\mathrm{MLP}(\mathbf x_{b,i}))$. The sequence $\{\mathbf z_{b,1},\dots, \mathbf z_{b,N}\}$ is processed through $L$ Transformer layers, yielding contextualized tokens $\mathbf h_{b,i} \in \mathbb R^{d}$.

The MoE decoder introduces a dimensionality-reduction MLP, producing lower-dimensional representations $\mathbf r_{b,i} \in \mathbb R^{d'}$ with $d' < d$. The resulting matrix $\mathbf R_b$ ($\mathbf R_b \in \mathbb R^{N \times d'}$) is provided to $E$ expert branches, each comprising a sparse attention pooling module and an independent classifier. A gating network computes selection weights across these experts, producing a final prediction as a weighted sum of their outputs.

2. Expert Branch Mechanism and Mathematical Formalism

2.1. Expert-Specific Attention Pooling

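Each expert $e \in \{1,\dots,E\}$ applies its own attention scoring MLP to each reduced ROI token $\mathbf r_{b,i}$: $s^{(e)}_{b,i} = \mathrm{MLP}^{(e)}_{\mathrm{att}}(\mathbf r_{b,i})$. The top-$k_e$ ROIs for expert $e$ are selected: $\mathcal T^{(e)}_b = \mathrm{TopK}\big(\{s^{(e)}_{b,i}\}_{i=1}^{N},\, k_e\big)$. A masked softmax defines the attention pool: $\alpha^{(e)}_{b,i} = \exp(s^{(e)}_{b,i}) \big/ \sum_{j \in \mathcal T^{(e)}_b} \exp(s^{(e)}_{b,j})$ for $i \in \mathcal T^{(e)}_b$, and $\alpha^{(e)}_{b,i} = 0$ otherwise. The pooled expert embedding is $\mathbf p^{(e)}_b = \sum_{i=1}^{N} \alpha^{(e)}_{b,i}\, \mathbf r_{b,i}$. Each expert's classifier, an independent two-layer MLP with GELU nonlinearity, maps the embedding $\mathbf p^{(e)}_b$ to class logits:

$\boldsymbol\ell^{(e)}_b = \mathbf W^{(e)}_2\, \mathrm{GELU}\big(\mathbf W^{(e)}_1 \mathbf p^{(e)}_b + \mathbf b^{(e)}_1\big) + \mathbf b^{(e)}_2 \in \mathbb R^{C}$

where $C = 2$ (ASD, HC).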
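The expert-specific pooling described above (score every ROI token, keep the top-k scores, apply a softmax restricted to the kept ROIs, then take the weighted sum of tokens) can be sketched as follows. This is a minimal illustration in plain Python, not the ASDFormer implementation: the function name `topk_masked_softmax_pool`, the tie-breaking rule, and the toy scores and tokens are all assumptions made for the example.

```python
import math

def topk_masked_softmax_pool(scores, tokens, k):
    """Pool ROI tokens with a softmax restricted to the k highest-scoring ROIs.

    scores: one attention score per ROI (hypothetical output of an expert's scoring MLP)
    tokens: one feature vector per ROI
    """
    n = len(scores)
    # Indices of the k largest scores; ties go to the lower index (an assumption).
    keep = sorted(range(n), key=lambda i: (-scores[i], i))[:k]
    # Numerically stable softmax over the kept ROIs only.
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    # ROIs outside the top-k receive exactly zero attention weight.
    alpha = [exps[i] / total if i in exps else 0.0 for i in range(n)]
    dim = len(tokens[0])
    pooled = [sum(alpha[i] * tokens[i][j] for i in range(n)) for j in range(dim)]
    return alpha, pooled

# Toy example: 4 ROIs with 2-dimensional tokens, keeping the top 2.
scores = [2.0, 0.5, 2.0, -1.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [5.0, 5.0]]
alpha, pooled = topk_masked_softmax_pool(scores, tokens, k=2)
```

Here ROIs 0 and 2 share the highest score, so each receives weight 0.5 and the pooled embedding is the average of their tokens; the other two ROIs contribute nothing, which is what makes the pooling sparse and easy to inspect.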

2.2. Gating and Combination

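The set of all reduced ROI tokens for each subject is flattened: $\mathbf f_b = [\mathbf r_{b,1}; \dots; \mathbf r_{b,N}]$. A gating network $g$ computes expert logits $\mathbf u_b = g(\mathbf f_b) \in \mathbb R^{E}$, which are normalized: $\mathbf w_b = \mathrm{softmax}(\mathbf u_b)$. The final output logits are: $\hat{\mathbf y}_b = \sum_{e=1}^{E} w_{b,e}\, \boldsymbol\ell^{(e)}_b$, where $\boldsymbol\ell^{(e)}_b$ denotes the class logits of expert $e$ and $E$ is the number of experts. This mechanism allows the model to adaptively assign influence to each expert based on global context.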
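The combination step above (softmax-normalize the gate's expert logits, then take the weighted sum of the experts' class logits) is sketched below in plain Python. The function names and the toy numbers are illustrative assumptions; the gating network itself is not modeled, only what happens to its output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def combine_experts(gate_logits, expert_logits):
    """Weighted sum of per-expert class logits using softmax gate weights.

    gate_logits:   one logit per expert (hypothetical gating-network output)
    expert_logits: one list of class logits per expert
    """
    weights = softmax(gate_logits)
    n_classes = len(expert_logits[0])
    combined = [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(n_classes)
    ]
    return weights, combined

# Toy example: two experts, two classes, a gate that is indifferent between experts.
gate_logits = [0.0, 0.0]
expert_logits = [[2.0, -2.0], [0.0, 4.0]]
weights, combined = combine_experts(gate_logits, expert_logits)
```

With equal gate logits both experts receive weight 0.5, so the combined logits are the plain average of the two experts' outputs. Raising one gate logit shifts the prediction toward that expert.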

3. Classifier Expert Heads

Each expert uses an independent MLP head for classification, mapping its pooled embedding to class logits. The MLPs typically have two layers with inner dimension 128 and GELU activations. No weights are shared across experts, which encourages specialization and facilitates the learning of complementary, non-redundant ROI subsets relevant for ASD and healthy control discrimination.
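A minimal sketch of such a head, assuming a two-layer MLP with the exact (erf-based) GELU and the inner dimension of 128 mentioned above. The class name `ExpertHead`, the input dimension of 8, the uniform initialization, and the zero biases are assumptions chosen for illustration, not details from the ASDFormer code.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

class ExpertHead:
    """Two-layer MLP head owned by a single expert (no shared weights)."""

    def __init__(self, in_dim, hidden_dim=128, n_classes=2, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Simple uniform initialization; the scheme is an assumption.
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init(hidden_dim, in_dim)
        self.b1 = [0.0] * hidden_dim
        self.w2 = init(n_classes, hidden_dim)
        self.b2 = [0.0] * n_classes

    def __call__(self, pooled):
        # Hidden layer: affine map followed by GELU.
        hidden = [
            gelu(sum(w * x for w, x in zip(row, pooled)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Output layer: affine map to class logits.
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]

# Two experts, each with its own parameters (different seeds stand in for independent training).
heads = [ExpertHead(in_dim=8, seed=s) for s in range(2)]
pooled = [0.1 * i for i in range(8)]
logits = [head(pooled) for head in heads]
```

Because each head holds its own `w1`, `b1`, `w2`, and `b2`, the two experts can map the same pooled embedding to different logits, which is the property the text relies on for specialization.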

4. Training Methodology and Regularization

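The model is optimized using binary cross-entropy loss

$\mathcal L_{\mathrm{BCE}} = -\frac{1}{B} \sum_{b=1}^{B} \big[ y_b \log \hat p_b + (1 - y_b) \log(1 - \hat p_b) \big]$

where $y_b \in \{0, 1\}$ is the label of subject $b$, $\hat p_b$ is the predicted probability of the positive class, and $B$ is the batch size.

To mitigate expert collapse, an expert-load regularizer is employed, penalizing high coefficient of variation in total gating weight per expert: $\mathcal L_{\mathrm{load}} = \mathrm{CV}(\mathbf m) = \sigma(\mathbf m) / \mu(\mathbf m)$, where $m_e = \sum_{b} w_{b,e}$ is the total gating weight assigned to expert $e$. The total loss is

$\mathcal L = \mathcal L_{\mathrm{BCE}} + \lambda\, \mathcal L_{\mathrm{load}}$

with $\lambda$ a weighting coefficient for the regularizer.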
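The loss described above can be illustrated with a small plain-Python sketch: a binary cross-entropy term plus a coefficient-of-variation penalty on the total gating weight each expert receives over a batch. Several details are assumptions made for the example: the function names, the use of the population standard deviation, the unsquared form of the coefficient of variation, and the weighting value `lam=0.1`, which is a placeholder rather than the value used in the paper.

```python
import math

def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def load_cv(gate_weights):
    """Coefficient of variation of the total gating weight per expert.

    gate_weights: one list of per-expert weights for each subject in the batch.
    """
    n_experts = len(gate_weights[0])
    load = [sum(w[e] for w in gate_weights) for e in range(n_experts)]
    mean = sum(load) / n_experts
    var = sum((v - mean) ** 2 for v in load) / n_experts  # population variance (assumption)
    return math.sqrt(var) / mean

def total_loss(probs, labels, gate_weights, lam=0.1):
    """BCE plus lam times the load-balancing penalty; lam is a hypothetical value."""
    return bce(probs, labels) + lam * load_cv(gate_weights)

balanced = [[0.5, 0.5], [0.5, 0.5]]    # both experts used equally
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # every subject routed to expert 0
probs, labels = [0.5, 0.5], [1, 0]
loss_balanced = total_loss(probs, labels, balanced)
loss_collapsed = total_loss(probs, labels, collapsed)
```

With identical predictions, the collapsed routing incurs a strictly larger loss than the balanced one, so gradient descent is nudged away from solutions where a single expert absorbs all the gating weight.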

Optimization uses Adam with weight decay and early stopping based on validation AUROC. Dropout is applied within MLPs for further regularization.

5. Empirical Performance and Comparative Analysis

On the ABIDE dataset, ASDFormer with MoE outperformed strong baselines such as Com-BrainTF, BrainNetCNN, and FBNETGEN. It also outperformed single-expert pooling-classifier decoders on key metrics:

Method	AUROC (%)	Accuracy (%)	Sensitivity (%)	Specificity (%)
ASDFormer (MoE + Transformer)	[value missing]	[value missing]	[value missing]	[value missing]
Single-expert pooling	[value missing]	[value missing]	[value missing]	[value missing]

These results demonstrate the efficacy of the MoE architecture in both overall classification and balanced sensitivity/specificity.

6. Interpretability and Biomarker Identification

The MoE structure yields direct interpretability mechanisms. The gating weights indicate expert dominance, which can be compared between healthy controls and ASD subjects. Each expert's top-$k$ ROI count is determined empirically, with a separate value for expert 1 and for expert 2.

A product-form score yields a signed importance value per ROI in each subject, enabling individual-level attribution. Analysis reveals that the model identifies salient connectivity features:

Sensorimotor Network (SMN) ↔ Fronto-Parietal (FPN), Default Mode (DMN), and Limbic cross-network interactions
DMN intra-network dysconnectivity
Cerebellum and subcortical (CS/SB) cross-talk

These findings align with established fMRI literature on ASD and support the MoE's utility for both predictive and mechanistic biomarker discovery.

7. Context and Implementation Guidance

The modularity of the Mixture of Pooling-Classifier Experts module facilitates integration into any Transformer-based FC classifier. Code and further implementation details are provided in the ASDFormer repository. The architecture is generalizable to other domains where sparse, interpretable attention over high-dimensional feature sets is advantageous, particularly in connectomics and brain disorder classification settings (Izadi et al., 19 Aug 2025).


References

Izadi et al. (2025). ASDFormer: A Transformer with Mixtures of Pooling-Classifier Experts for Robust Autism Diagnosis and Biomarker Discovery.
