Graph-Guided Clustering Mixture-of-Experts CNN-GRU
- The paper introduces a novel unified framework that decodes motor execution and imagery EEG signals using advanced denoising, graph tokenization, and expert fusion techniques.
- It leverages ICA-based preprocessing, graph attention encoding, and unsupervised spectral clustering to achieve robust spatial-temporal feature extraction and cross-subject generalization.
- Empirical evaluations reveal state-of-the-art accuracies (up to 99.61%) on multiple EEG datasets, underscoring the framework's effectiveness and interpretability.
Graph-guided Clustering Mixture-of-Experts CNN-GRU (GCMCG) is a unified framework specifically designed for decoding motor execution (ME) and motor imagery (MI) electroencephalogram (EEG) signals in brain-computer interface (BCI) applications. It addresses critical challenges in EEG-based BCI, including the complex spatio-temporal structure of EEG, low signal-to-noise ratio, and the need for robust cross-subject generalization. GCMCG integrates dense preprocessing, learnable graph-based encoding, unsupervised spectral clustering for functional region decomposition, cluster-aware and global expert networks, an entropy-regularized mixture-of-experts (MoE) fusion scheme, and a multi-stage, class-balanced training regime. Its architecture supports multi-paradigm, multi-task, and cross-subject EEG decoding, yielding state-of-the-art performance across multiple public datasets and experimental protocols (Chen et al., 29 Nov 2025).
1. Preprocessing via ICA and Wavelet Threshold Denoising
The initial stage applies a robust denoising pipeline to the raw EEG X ∈ ℝ^{C×T} (C channels, T time samples):
- Independent Component Analysis (ICA): The EEG is decomposed as X = AS, where A is the mixing matrix and S contains the independent components s_i(t). FastICA is employed for unmixing.
- Kurtosis-based Rejection: Components whose kurtosis exceeds the threshold of 0.5 are discarded to suppress artifacts.
- Wavelet-Threshold Denoising: Each remaining component s_i undergoes a level-L Daubechies-4 DWT, with a non-linear (soft) shrinkage applied to the detail coefficients d: d̂ = sign(d) · max(|d| − λ, 0). Denoised components are reconstructed via the inverse DWT.
- Signal Reconstruction and Normalization: The final denoised EEG is X̂ = AŜ, with channel-wise z-score normalization yielding X̄.
This preprocessing ensures robust noise suppression and channel-level normalization, thereby improving subsequent graph and expert modeling stages.
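The per-component steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: a full pipeline would use FastICA (e.g. scikit-learn) and a Daubechies-4 DWT (e.g. PyWavelets); here only the kurtosis criterion, the soft-shrinkage operator, and the channel-wise z-scoring are shown, with all function names chosen for illustration.

```python
import numpy as np

def kurtosis(x):
    """Excess kurtosis of a 1-D signal (zero for a Gaussian)."""
    x = x - x.mean()
    return np.mean(x**4) / (np.mean(x**2) ** 2 + 1e-12) - 3.0

def soft_threshold(d, lam):
    """Non-linear shrinkage applied to wavelet detail coefficients."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def zscore_channels(X):
    """Channel-wise z-score normalization of EEG X with shape (C, T)."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-12
    return (X - mu) / sd
```

Components flagged by `kurtosis` would be zeroed before remixing; `soft_threshold` is applied level by level to the DWT detail coefficients.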
2. Graph Tokenization and Attention Encoding
A graph-based module dynamically models electrode relationships and encodes spatial topologies:
- Graph Construction: Nodes represent electrodes, with adjacency based on the 8-connected neighborhood of the 10–20 arrangement.
- Feature Initialization: Each node i is initialized with an embedding h_i ∈ ℝ^d; a binary mask allows configurable electrode dropout.
- Graph Attention Network (GAT): Multi-layer, multi-head GATs yield relational node embeddings through iterative neighbor aggregation:
- Unnormalized attention: e_ij = LeakyReLU(aᵀ [W h_i ‖ W h_j])
- Normalized attention: α_ij = exp(e_ij) / Σ_{k∈N(i)} exp(e_ik)
- Node update: h_i′ = σ(Σ_{j∈N(i)} α_ij W h_j)
Multi-head concatenation produces the final graph embedding matrix H. This stage captures spatial locality and dynamic electrode dependencies.
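A single-head version of these attention equations can be sketched in NumPy (the weight matrix W and attention vector a are illustrative placeholders; the actual model is multi-layer and multi-head):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One single-head GAT layer.
    H: (N, d) node features, A: (N, N) adjacency with self-loops,
    W: (d, d_out) projection, a: (2*d_out,) attention vector."""
    Z = H @ W                                   # projected node features
    d = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed via broadcasting
    e = leaky_relu((Z @ a[:d])[:, None] + (Z @ a[d:])[None, :])
    e = np.where(A > 0, e, -1e9)                # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = np.where(A > 0, alpha, 0.0)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over N(i)
    return np.tanh(alpha @ Z), alpha
```

Each row of `alpha` is a probability distribution over the node's neighbors, matching the normalized-attention equation above.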
3. Unsupervised Spectral Clustering of Functional Regions
An unsupervised spectral clustering algorithm decomposes the electrode graph into functionally coherent regions:
- Region Discovery: A correlation matrix R is computed from the rows of H, yielding the normalized Laplacian L = I − D^{−1/2} R D^{−1/2} with degree matrix D = diag(Σ_j R_ij).
- Dimensionality Selection: The number of regions K is adaptively selected by maximizing the eigengap in the spectrum of L.
- K-Means Clustering: The first K Laplacian eigenvectors form a matrix U; k-means on the rows of U assigns each channel to one of K clusters, representing putative "functional regions."
- Mask Generation: Soft masks M_k route the standardized EEG channels to their respective expert networks.
This step enables interpretable decomposition aligned with functional brain topography and adaptive region assignment in a subject- and paradigm-agnostic manner.
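The eigengap criterion above can be sketched as follows. This is a minimal NumPy illustration under the stated Laplacian definition, with `k_max` an assumed cap on the number of regions; the full method would follow with k-means on the leading eigenvectors:

```python
import numpy as np

def eigengap_num_clusters(S, k_max=8):
    """Pick the cluster count K by the largest eigengap of the
    normalized Laplacian. S: (C, C) symmetric nonnegative similarity."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    evals = np.sort(np.linalg.eigvalsh(L))      # ascending eigenvalues
    gaps = np.diff(evals[: k_max + 1])          # gaps between neighbors
    return int(np.argmax(gaps)) + 1             # K maximizing the gap
```

For a similarity matrix with c well-separated blocks, the Laplacian has c near-zero eigenvalues, so the largest gap sits after the c-th eigenvalue and the function returns c.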
4. Clustered and Global Expert Networks
The architecture features K + 2 expert networks, each extracting complementary spatio-temporal features:
- Cluster-specific Experts: For each region k:
- Input: Masked EEG X̄_k = M_k ⊙ X̄
- 1D convolution: kernel size 3, extracting local temporal patterns
- GRU backbone: Processes the temporal features; output z_k
- Spatial Expert: Operates on the sequence of node embeddings H, extracting spatial relational features via a GRU.
- Global Expert: Processes the full standardized EEG using the same Conv1D+GRU pipeline as regional experts.
Collectively, these expert feature vectors are concatenated for downstream fusion.
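A region expert's Conv1D+GRU pipeline can be sketched in NumPy. This is an illustrative single-layer version with hand-rolled GRU math, not the paper's PyTorch modules; weight shapes and parameter names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, kernel):
    """Per-channel 1-D convolution with 'same' padding. x: (C, T)."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    return np.stack([np.convolve(row, kernel, mode="valid") for row in xp])

def gru_last_hidden(seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a GRU over seq (T, d_in); return the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in seq:
        z = sigmoid(x @ Wz + h @ Uz)            # update gate
        r = sigmoid(x @ Wr + h @ Ur)            # reset gate
        cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
        h = (1 - z) * h + z * cand
    return h

def region_expert(x_masked, kernel, gru_params):
    """Conv1D (e.g. kernel size 3) over time, then a GRU. x_masked: (C, T)."""
    feats = conv1d_same(x_masked, kernel)       # (C, T)
    return gru_last_hidden(feats.T, *gru_params)  # expert feature vector z_k
```

The global expert would apply the same pipeline to the full standardized EEG rather than a masked region.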
5. Entropy-Regularized Mixture-of-Experts Fusion
GCMCG employs a gated MoE mechanism for adaptive feature fusion:
- Fusion Vector: All expert outputs are concatenated into f = [z_1; …; z_K; z_spat; z_glob].
- Gating Network: A feedforward projection calculates unnormalized gate activations g = MLP(f) ∈ ℝ^{K+2}, with temperature τ.
- Gate Normalization: Final weights are w = softmax(g / τ), encouraging sparse expert selection (as τ → 0).
- Entropy Regularization: An entropy-based penalty, L_H = −λ Σ_e w_e log w_e, is added to the training loss and promotes decisiveness in expert contributions.
- Fused Representation: The downstream feature is u = Σ_e w_e z_e.
This fusion model allows adaptive weighting of local and global experts, optimizing for the current input's spatial-temporal profile.
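The gating, entropy penalty, and weighted fusion can be sketched in a few lines of NumPy (an illustrative forward pass only; in the real model the gate logits come from a trained network over f):

```python
import numpy as np

def moe_fuse(expert_outs, gate_logits, tau=1.0):
    """Entropy-regularized MoE fusion sketch.
    expert_outs: (E, d) expert features, gate_logits: (E,), tau: temperature."""
    g = gate_logits / tau
    w = np.exp(g - g.max())
    w = w / w.sum()                           # softmax gate weights
    entropy = -np.sum(w * np.log(w + 1e-12))  # penalty term added to the loss
    fused = w @ expert_outs                   # weighted expert combination
    return fused, w, entropy
```

Lowering τ sharpens the gate distribution, which is exactly what the entropy penalty rewards during training.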
6. Multi-Stage Class-Weighted Training Procedure
Learning in GCMCG proceeds through a structured, three-stage process:
- Stage 1 (Pre-training): The full model is optimized using standard cross-entropy loss with uniform random sampling.
- Stage 2 (Fine-tuning for Class Balance): The backbone (GAT + experts) is frozen. A Progressively Balanced Sampler overweights minority classes, and Focal Loss, with focusing parameter γ, is employed for improved minority-class generalization.
- Stage 3 (Learnable Weight Scaling): All layers except a per-class scaling vector s are frozen. The final logit for class c is ỹ_c = s_c · y_c, rescaling the backbone logit y_c.
These training stages are designed to counteract class imbalance, promote robust cross-subject generalization, and increase model compositionality across experimental paradigms.
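The Stage-2 objective can be sketched as a standard focal loss (Lin et al.); this NumPy version is illustrative, with α an optional per-class weight vector and the specific γ left to the paper:

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0, alpha=None):
    """Focal loss sketch: down-weights easy, well-classified examples.
    probs: (N, C) softmax outputs, y: (N,) integer class labels."""
    p_t = probs[np.arange(len(y)), y]          # probability of the true class
    a_t = 1.0 if alpha is None else alpha[y]   # optional class weighting
    return np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12))
```

With γ = 0 (and no α) this reduces to ordinary cross-entropy; larger γ shrinks the contribution of confident predictions so gradients focus on hard minority-class samples.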
7. Empirical Evaluation and Ablation Insights
GCMCG was assessed across three benchmark datasets:
| Dataset | Subjects | Channels | Classes | Accuracy (Top1) |
|---|---|---|---|---|
| EEGmmidb-BCI2000 | 109 | 64 | 9 (ME/MI) | 86.60% |
| BCI-IV 2a | 9 | 22 | 4 (MI) | 98.57% |
| M3CV | 106 | 64 | 3 (ME) | 99.61% |
Further findings on EEGmmidb-BCI2000 include a Macro-Recall of 79.41%, Macro-Precision of 83.07%, an F1-score of 0.81, and a substantial Cohen's κ. Ablation studies confirm the importance of the graph encoder (AUC drops by 7% if removed) and of spectral clustering (its removal causes a significant decline). CNN-GRU expert backbones outperform alternative designs under fixed parameter budgets. Visualizations (t-SNE, confusion matrices, ROC curves) demonstrate class-separable, robust representations, with learned tokenizer edges concentrating on sensorimotor areas (notably C3–CP3 and C4–CP4).
8. Implementation Notes and Hyperparameter Settings
The framework is implemented in PyTorch 2.5.1 and optimized for Nvidia A100 GPUs. Training uses AdamW with a learning rate of 1e-3 (10 warmup epochs, cosine decay), entropy regularization strength λ, batch size 64, dropout 0.5 in the gating network, weight decay 1e-5, and 200 total epochs. Progressive sampling and careful gating regularization are integral to the training pipeline.
GCMCG introduces a hybrid pipeline integrating advanced signal denoising, graph-based functional region decomposition, modular CNN–GRU expertise, and entropy-regularized MoE fusion for adaptive, interpretable, and generalizable EEG decoding. Its multi-paradigm performance and comprehensive ablation analysis establish it as an effective solution framework for next-generation BCI systems (Chen et al., 29 Nov 2025).