Graph-Guided Clustering Mixture-of-Experts CNN-GRU
- The paper introduces a novel unified framework that decodes motor execution and imagery EEG signals using advanced denoising, graph tokenization, and expert fusion techniques.
- It leverages ICA-based preprocessing, graph attention encoding, and unsupervised spectral clustering to achieve robust spatial-temporal feature extraction and cross-subject generalization.
- Empirical evaluations reveal state-of-the-art accuracies (up to 99.61%) on multiple EEG datasets, underscoring the framework's effectiveness and interpretability.
Graph-guided Clustering Mixture-of-Experts CNN-GRU (GCMCG) is a unified framework specifically designed for decoding motor execution (ME) and motor imagery (MI) electroencephalogram (EEG) signals in brain-computer interface (BCI) applications. It addresses critical challenges in EEG-based BCI, including the complex spatio-temporal structure of EEG, low signal-to-noise ratio, and the need for robust cross-subject generalization. GCMCG integrates dense preprocessing, learnable graph-based encoding, unsupervised spectral clustering for functional region decomposition, cluster-aware and global expert networks, an entropy-regularized mixture-of-experts (MoE) fusion scheme, and a multi-stage, class-balanced training regime. Its architecture supports multi-paradigm, multi-task, and cross-subject EEG decoding, yielding state-of-the-art performance across multiple public datasets and experimental protocols (Chen et al., 29 Nov 2025).
1. Preprocessing via ICA and Wavelet Threshold Denoising
The initial stage applies a robust denoising pipeline to the raw EEG X ∈ ℝ^{C×T} (C channels, T time samples):
- Independent Component Analysis (ICA): The EEG is decomposed as X = AS, where A is the mixing matrix and S contains the independent components s_i(t). FastICA is employed for unmixing.
- Kurtosis-based Rejection: Components whose kurtosis exceeds the threshold of 0.5 are discarded to suppress artifacts.
- Wavelet-Threshold Denoising: Each remaining component s_i undergoes a level-L Daubechies-4 DWT, with a non-linear (soft) shrinkage applied to the detail coefficients d: d̂ = sign(d) · max(|d| − λ, 0). Denoised components are reconstructed via the inverse DWT.
- Signal Reconstruction and Normalization: The final denoised EEG is X̂ = AŜ, with channel-wise z-score normalization yielding X̄.
This preprocessing ensures robust noise suppression and channel-level normalization, thereby improving subsequent graph and expert modeling stages.
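The per-component steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: a full pipeline would use FastICA (e.g. scikit-learn) and a Daubechies-4 DWT (e.g. PyWavelets); here only the kurtosis criterion, the soft-shrinkage operator, and the channel-wise z-scoring are shown, with all function names chosen for illustration.

```python
import numpy as np

def kurtosis(x):
    """Excess kurtosis of a 1-D signal (zero for a Gaussian)."""
    x = x - x.mean()
    return np.mean(x**4) / (np.mean(x**2) ** 2 + 1e-12) - 3.0

def soft_threshold(d, lam):
    """Non-linear shrinkage applied to wavelet detail coefficients."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def zscore_channels(X):
    """Channel-wise z-score normalization of EEG X with shape (C, T)."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-12
    return (X - mu) / sd
```

Components flagged by `kurtosis` would be zeroed before remixing; `soft_threshold` is applied level by level to the DWT detail coefficients.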
2. Graph Tokenization and Attention Encoding
A graph-based module dynamically models electrode relationships and encodes spatial topologies:
- Graph Construction: Nodes represent electrodes, with adjacency based on the 8-connected neighborhood of the 10–20 arrangement.
- Feature Initialization: Each node i is initialized with an embedding h_i ∈ ℝ^d; a binary mask allows configurable electrode dropout.
- Graph Attention Network (GAT): Multi-layer, multi-head GATs yield relational node embeddings through iterative neighbor aggregation:
- Unnormalized attention: e_ij = LeakyReLU(aᵀ [W h_i ‖ W h_j])
- Normalized attention: α_ij = exp(e_ij) / Σ_{k∈N(i)} exp(e_ik)
- Node update: h_i′ = σ(Σ_{j∈N(i)} α_ij W h_j)
Multi-head concatenation produces the final graph embedding matrix H. This stage captures spatial locality and dynamic electrode dependencies.
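A single-head version of these attention equations can be sketched in NumPy (the weight matrix W and attention vector a are illustrative placeholders; the actual model is multi-layer and multi-head):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One single-head GAT layer.
    H: (N, d) node features, A: (N, N) adjacency with self-loops,
    W: (d, d_out) projection, a: (2*d_out,) attention vector."""
    Z = H @ W                                   # projected node features
    d = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed via broadcasting
    e = leaky_relu((Z @ a[:d])[:, None] + (Z @ a[d:])[None, :])
    e = np.where(A > 0, e, -1e9)                # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = np.where(A > 0, alpha, 0.0)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over N(i)
    return np.tanh(alpha @ Z), alpha
```

Each row of `alpha` is a probability distribution over the node's neighbors, matching the normalized-attention equation above.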
3. Unsupervised Spectral Clustering of Functional Regions
An unsupervised spectral clustering algorithm decomposes the electrode graph into functionally coherent regions:
- Region Discovery: A correlation matrix R is computed from the rows of H, yielding the normalized Laplacian L = I − D^{−1/2} R D^{−1/2} with degree matrix D = diag(Σ_j R_ij).
- Dimensionality Selection: The number of regions K is adaptively selected by maximizing the eigengap in the spectrum of L.
- K-Means Clustering: The first K Laplacian eigenvectors form a matrix U; k-means on the rows of U assigns each channel to one of K clusters, representing putative "functional regions."
- Mask Generation: Soft masks M_k route the standardized EEG channels to their respective expert networks.
This step enables interpretable decomposition aligned with functional brain topography and adaptive region assignment in a subject- and paradigm-agnostic manner.
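The eigengap criterion above can be sketched as follows. This is a minimal NumPy illustration under the stated Laplacian definition, with `k_max` an assumed cap on the number of regions; the full method would follow with k-means on the leading eigenvectors:

```python
import numpy as np

def eigengap_num_clusters(S, k_max=8):
    """Pick the cluster count K by the largest eigengap of the
    normalized Laplacian. S: (C, C) symmetric nonnegative similarity."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    evals = np.sort(np.linalg.eigvalsh(L))      # ascending eigenvalues
    gaps = np.diff(evals[: k_max + 1])          # gaps between neighbors
    return int(np.argmax(gaps)) + 1             # K maximizing the gap
```

For a similarity matrix with c well-separated blocks, the Laplacian has c near-zero eigenvalues, so the largest gap sits after the c-th eigenvalue and the function returns c.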
4. Clustered and Global Expert Networks
The architecture features K + 2 expert networks, each extracting complementary spatio-temporal features:
- Cluster-specific Experts: For each region k:
- Input: Masked EEG X̄_k = M_k ⊙ X̄
- 1D convolution: kernel size 3, extracting local temporal patterns
- GRU backbone: Processes the temporal features; output z_k
- Spatial Expert: Operates on the sequence of node embeddings H, extracting spatial relational features via a GRU.
- Global Expert: Processes the full standardized EEG using the same Conv1D+GRU pipeline as regional experts.
Collectively, these expert feature vectors are concatenated for downstream fusion.
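A region expert's Conv1D+GRU pipeline can be sketched in NumPy. This is an illustrative single-layer version with hand-rolled GRU math, not the paper's PyTorch modules; weight shapes and parameter names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, kernel):
    """Per-channel 1-D convolution with 'same' padding. x: (C, T)."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    return np.stack([np.convolve(row, kernel, mode="valid") for row in xp])

def gru_last_hidden(seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a GRU over seq (T, d_in); return the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in seq:
        z = sigmoid(x @ Wz + h @ Uz)            # update gate
        r = sigmoid(x @ Wr + h @ Ur)            # reset gate
        cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
        h = (1 - z) * h + z * cand
    return h

def region_expert(x_masked, kernel, gru_params):
    """Conv1D (e.g. kernel size 3) over time, then a GRU. x_masked: (C, T)."""
    feats = conv1d_same(x_masked, kernel)       # (C, T)
    return gru_last_hidden(feats.T, *gru_params)  # expert feature vector z_k
```

The global expert would apply the same pipeline to the full standardized EEG rather than a masked region.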
5. Entropy-Regularized Mixture-of-Experts Fusion
GCMCG employs a gated MoE mechanism for adaptive feature fusion:
- Fusion Vector: All expert outputs are concatenated into f = [z_1; …; z_K; z_spat; z_glob].
- Gating Network: A feedforward projection calculates unnormalized gate activations g = MLP(f) ∈ ℝ^{K+2}, with temperature τ.
- Gate Normalization: Final weights are w = softmax(g / τ), encouraging sparse expert selection (as τ → 0).
- Entropy Regularization: An entropy-based penalty, L_H = −λ Σ_e w_e log w_e, is added to the training loss and promotes decisiveness in expert contributions.
- Fused Representation: The downstream feature is u = Σ_e w_e z_e.
This fusion model allows adaptive weighting of local and global experts, optimizing for the current input's spatial-temporal profile.
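The gating, entropy penalty, and weighted fusion can be sketched in a few lines of NumPy (an illustrative forward pass only; in the real model the gate logits come from a trained network over f):

```python
import numpy as np

def moe_fuse(expert_outs, gate_logits, tau=1.0):
    """Entropy-regularized MoE fusion sketch.
    expert_outs: (E, d) expert features, gate_logits: (E,), tau: temperature."""
    g = gate_logits / tau
    w = np.exp(g - g.max())
    w = w / w.sum()                           # softmax gate weights
    entropy = -np.sum(w * np.log(w + 1e-12))  # penalty term added to the loss
    fused = w @ expert_outs                   # weighted expert combination
    return fused, w, entropy
```

Lowering τ sharpens the gate distribution, which is exactly what the entropy penalty rewards during training.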
6. Multi-Stage Class-Weighted Training Procedure
Learning in GCMCG proceeds through a structured, three-stage process:
- Stage 1 (Pre-training): The full model is optimized using standard cross-entropy loss with uniform random sampling.
- Stage 2 (Fine-tuning for Class Balance): The backbone (GAT + experts) is frozen. A Progressively Balanced Sampler overweights minority classes, and Focal Loss, with focusing parameter γ, is employed for improved minority-class generalization.
- Stage 3 (Learnable Weight Scaling): All layers except a per-class scaling vector s are frozen. The final logit for class c is ỹ_c = s_c · y_c, rescaling the backbone logit y_c.
These training stages are designed to counteract class imbalance, promote robust cross-subject generalization, and increase model compositionality across experimental paradigms.
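The Stage-2 objective can be sketched as a standard focal loss (Lin et al.); this NumPy version is illustrative, with α an optional per-class weight vector and the specific γ left to the paper:

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0, alpha=None):
    """Focal loss sketch: down-weights easy, well-classified examples.
    probs: (N, C) softmax outputs, y: (N,) integer class labels."""
    p_t = probs[np.arange(len(y)), y]          # probability of the true class
    a_t = 1.0 if alpha is None else alpha[y]   # optional class weighting
    return np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12))
```

With γ = 0 (and no α) this reduces to ordinary cross-entropy; larger γ shrinks the contribution of confident predictions so gradients focus on hard minority-class samples.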
7. Empirical Evaluation and Ablation Insights
GCMCG was assessed across three benchmark datasets:
| Dataset | Subjects | Channels | Classes | Accuracy (Top1) |
|---|---|---|---|---|
| EEGmmidb-BCI2000 | 109 | 64 | 9 (ME/MI) | 86.60% |
| BCI-IV 2a | 9 | 22 | 4 (MI) | 98.57% |
| M3CV | 106 | 64 | 3 (ME) | 99.61% |
Further findings on EEGmmidb-BCI2000 include a Macro-Recall of 79.41%, Macro-Precision of 83.07%, an F1-score of 0.81, and a substantial Cohen's κ. Ablation studies confirm the importance of the graph encoder (AUC drops by 7% if removed) and of spectral clustering (its removal causes a significant decline). CNN-GRU expert backbones outperform alternative designs under fixed parameter budgets. Visualizations (t-SNE, confusion matrices, ROC curves) demonstrate class-separable, robust representations, with learned tokenizer edges concentrating on sensorimotor areas (notably C3–CP3 and C4–CP4).
8. Implementation Notes and Hyperparameter Settings
The framework is implemented in PyTorch 2.5.1 and optimized for Nvidia A100 GPUs. Training uses AdamW with a learning rate of 1e-3 (10 warmup epochs, cosine decay), entropy regularization strength λ, batch size 64, dropout 0.5 in the gating network, weight decay 1e-5, and 200 total epochs. Progressive sampling and careful gating regularization are integral to the training pipeline.
GCMCG introduces a hybrid pipeline integrating advanced signal denoising, graph-based functional region decomposition, modular CNN–GRU expertise, and entropy-regularized MoE fusion for adaptive, interpretable, and generalizable EEG decoding. Its multi-paradigm performance and comprehensive ablation analysis establish it as an effective solution framework for next-generation BCI systems (Chen et al., 29 Nov 2025).