Graph-Guided Clustering Mixture-of-Experts CNN-GRU

Updated 7 December 2025
  • The paper introduces a novel unified framework that decodes motor execution and imagery EEG signals using advanced denoising, graph tokenization, and expert fusion techniques.
  • It leverages ICA-based preprocessing, graph attention encoding, and unsupervised spectral clustering to achieve robust spatial-temporal feature extraction and cross-subject generalization.
  • Empirical evaluations reveal state-of-the-art accuracies (up to 99.61%) on multiple EEG datasets, underscoring the framework's effectiveness and interpretability.

Graph-guided Clustering Mixture-of-Experts CNN-GRU (GCMCG) is a unified framework specifically designed for decoding motor execution (ME) and motor imagery (MI) electroencephalogram (EEG) signals in brain-computer interface (BCI) applications. It addresses critical challenges in EEG-based BCI, including the complex spatio-temporal structure of EEG, low signal-to-noise ratio, and the need for robust cross-subject generalization. GCMCG integrates dense preprocessing, learnable graph-based encoding, unsupervised spectral clustering for functional region decomposition, cluster-aware and global expert networks, an entropy-regularized mixture-of-experts (MoE) fusion scheme, and a multi-stage, class-balanced training regime. Its architecture supports multi-paradigm, multi-task, and cross-subject EEG decoding, yielding state-of-the-art performance across multiple public datasets and experimental protocols (Chen et al., 29 Nov 2025).

1. Preprocessing via ICA and Wavelet Threshold Denoising

The initial stage applies a robust denoising pipeline to the raw EEG $X \in \mathbb{R}^{C \times S}$ (with $C$ channels and $S$ time samples):

  • Independent Component Analysis (ICA): The EEG is decomposed as $X = \Phi S$, where $\Phi \in \mathbb{R}^{C \times C}$ is a mixing matrix and $S$ contains independent components $s_i$. FastICA is employed for unmixing.
  • Kurtosis-based Rejection: Components with kurtosis $< 0.5$ are discarded to suppress artifacts.
  • Wavelet-Threshold Denoising: Each remaining $s_i$ undergoes a level-$j$ Daubechies-4 DWT, with a non-linear shrinkage applied to the detail coefficients $c_D$:

$$\hat c_D = \begin{cases} c_D\left(1 - e^{(\lambda_T - c_D)/1.5}\right), & c_D > \lambda_T \\ 0, & |c_D| \leq \lambda_T \\ -c_D\left(e^{(\lambda_T + c_D)/1.5} - 1\right), & c_D < -\lambda_T \end{cases}$$

Denoised components $\hat s_i$ are reconstructed via the inverse DWT.

  • Signal Reconstruction and Normalization: The final denoised EEG is $\hat X = \Phi\,\hat S$; channel-wise z-score normalization yields $Z \in \mathbb{R}^{C \times S}$.

This preprocessing ensures robust noise suppression and channel-level normalization, thereby improving subsequent graph and expert modeling stages.
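
The piecewise shrinkage rule above can be sketched in NumPy. This is an illustrative implementation of the stated formula only; the threshold $\lambda_T$ is supplied by the caller, and the surrounding ICA/DWT machinery is omitted:

```python
import numpy as np

def wavelet_shrink(c_d: np.ndarray, lam: float) -> np.ndarray:
    """Non-linear shrinkage of DWT detail coefficients per the paper's rule.

    c_d : detail coefficients of one independent component
    lam : threshold lambda_T (e.g. a universal-threshold estimate)
    """
    out = np.zeros_like(c_d, dtype=float)   # |c_d| <= lam -> zero
    hi = c_d > lam
    lo = c_d < -lam
    out[hi] = c_d[hi] * (1.0 - np.exp((lam - c_d[hi]) / 1.5))
    out[lo] = -c_d[lo] * (np.exp((lam + c_d[lo]) / 1.5) - 1.0)
    return out

coeffs = np.array([-3.0, -0.5, 0.2, 0.9, 4.0])
shrunk = wavelet_shrink(coeffs, lam=1.0)
```

Coefficients inside $[-\lambda_T, \lambda_T]$ are zeroed, while larger coefficients are smoothly pulled toward zero, preserving sign.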

2. Graph Tokenization and Attention Encoding

A graph-based module dynamically models electrode relationships and encodes spatial topologies:

  • Graph Construction: Nodes $V = \{1,\dots,C\}$ represent electrodes, with the neighborhood $J(i)$ given by the 8-connected neighborhood of the 10–20 arrangement.
  • Feature Initialization: Each node is initialized with an embedding $\Theta^0 \in \mathbb{R}^{C \times F}$; a mask $M \in \{0,1\}^C$ allows configurable electrode dropout.
  • Graph Attention Network (GAT): Multi-layer, multi-head GATs yield relational node embeddings through iterative neighbor aggregation:
    • Unnormalized attention: $e_{ij}^t = \mathrm{LeakyReLU}\left(a^\top \left[W_\mathrm{gat}\Theta_i^{t-1} \Vert W_\mathrm{gat}\Theta_j^{t-1}\right]\right)$
    • Normalized attention: $\alpha_{ij}^t = \frac{\exp(e_{ij}^t)}{\sum_{k \in J(i)} \exp(e_{ik}^t)}$
    • Node update: $\Theta_i^t = \sigma\left(\sum_{j \in J(i)} \alpha_{ij}^t\, W_\mathrm{gat}\, \Theta_j^{t-1}\right)$

Multi-head concatenation produces the final graph embedding matrix $\Theta' \in \mathbb{R}^{C \times D}$. This stage captures spatial locality and dynamic electrode dependencies.
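
A single-head version of the three update equations can be sketched as follows. The neighborhood dictionary, weight shapes, and choice of $\sigma = \tanh$ are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def gat_layer(theta, neighbors, W, a, leak=0.2):
    """One single-head GAT update over an electrode graph (NumPy sketch).

    theta     : (C, F) node embeddings Theta^{t-1}
    neighbors : dict i -> list of neighbor indices J(i) (include i for self-loops)
    W         : (F, D) shared projection W_gat
    a         : (2*D,) attention vector
    """
    h = theta @ W                                 # projected embeddings W_gat Theta
    C, D = h.shape
    out = np.zeros((C, D))
    for i in range(C):
        js = neighbors[i]
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) over the neighborhood J(i)
        e = np.array([np.concatenate([h[i], h[j]]) @ a for j in js])
        e = np.where(e > 0, e, leak * e)
        att = np.exp(e - e.max())
        att /= att.sum()                          # softmax normalization alpha_ij
        agg = (att[:, None] * h[js]).sum(axis=0)  # weighted neighbor aggregation
        out[i] = np.tanh(agg)                     # sigma: non-linearity
    return out

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))
nbrs = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
emb = gat_layer(theta, nbrs, rng.normal(size=(3, 2)), rng.normal(size=(4,)))
```

Stacking several such layers (and concatenating heads) yields the $\Theta'$ used downstream.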

3. Unsupervised Spectral Clustering of Functional Regions

An unsupervised spectral clustering algorithm decomposes the electrode graph into functionally coherent regions:

  • Region Discovery: A correlation matrix $R$ is computed from the rows of $\Theta'$, yielding the Laplacian $L = D - R$ with $D = \mathrm{diag}(R\mathbf{1})$.

  • Dimensionality Selection: The number of regions $K$ is selected adaptively by maximizing the eigengap in the spectrum of $L$.

  • K-Means Clustering: The first $K$ Laplacian eigenvectors form the matrix $U \in \mathbb{R}^{C \times K}$; k-means assigns each channel to one of $K$ clusters, representing putative "functional regions."

  • Mask Generation: Binary masks $M_k \in \{0,1\}^C$ route standardized EEG channels to their respective expert networks.

This step enables interpretable decomposition aligned with functional brain topography and adaptive region assignment in a subject- and paradigm-agnostic manner.
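
The eigengap-based pipeline can be sketched as below. The absolute-value affinity, the cap on $K$, and the farthest-point k-means initialization are assumptions for a self-contained example, not the paper's exact choices:

```python
import numpy as np

def spectral_regions(theta, k_max=5):
    """Eigengap-based spectral clustering of EEG channels (illustrative sketch).

    theta : (C, D) graph embeddings Theta'; rows are channel features.
    Returns (labels, K) with labels in {0, ..., K-1}.
    """
    C = theta.shape[0]
    R = np.abs(np.corrcoef(theta))            # channel affinity from row correlations
    L = np.diag(R.sum(axis=1)) - R            # unnormalized Laplacian L = D - R
    w, U = np.linalg.eigh(L)                  # eigenvalues in ascending order
    gaps = np.diff(w[: min(k_max + 1, C)])
    K = max(int(np.argmax(gaps)) + 1, 2)      # largest eigengap selects K
    V = U[:, :K]                              # spectral embedding (first K eigvecs)
    # deterministic farthest-point init, then Lloyd's k-means
    centers = [V[0]]
    for _ in range(K - 1):
        d = np.min([((V - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(V[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(50):
        d = ((V[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = V[labels == k].mean(0)
    return labels, K

# two synthetic "regions": channels 0-2 track one signal, 3-5 another
rng = np.random.default_rng(0)
b1, b2 = rng.normal(size=20), rng.normal(size=20)
theta = np.vstack([b1 + 0.01 * rng.normal(size=20) for _ in range(3)]
                  + [b2 + 0.01 * rng.normal(size=20) for _ in range(3)])
labels, K = spectral_regions(theta)
```

On this toy input the eigengap recovers the two channel groups, mirroring how GCMCG partitions electrodes into functional regions.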

4. Clustered and Global Expert Networks

The architecture features $(K+2)$ expert networks, each extracting complementary spatio-temporal features:

  • Cluster-specific Experts: For each region $k$:

    • Input: Masked EEG $Z_k = M_k \odot Z$
    • 1D convolution: $U_k = \mathrm{Conv1D}(Z_k)$ (kernel size 3)
    • GRU backbone: Processes temporal features; output $v_k = h_{S'} \in \mathbb{R}^{2D}$
  • Spatial Expert: Operates on the sequence of node embeddings $\Theta'$, extracting spatial relational features via a GRU.
  • Global Expert: Processes the full standardized EEG $Z$ using the same Conv1D+GRU pipeline as the regional experts.

Collectively, these expert feature vectors are concatenated for downstream fusion.
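
The Conv1D+GRU path of one expert can be sketched in NumPy. This is a minimal unidirectional sketch (the paper's $v_k \in \mathbb{R}^{2D}$ suggests a bidirectional GRU); weight shapes and the masking step upstream are assumptions:

```python
import numpy as np

def expert_forward(z, Wc, params):
    """One cluster expert: Conv1D (kernel 3) over masked EEG, then a GRU.

    z      : (C, S) masked EEG Z_k = M_k * Z
    Wc     : (D, C, 3) conv filters mapping C channels -> D features
    params : GRU weights Wz, Wr, Wh of shape (D+H, H), plus hidden size H
    """
    C, S = z.shape
    D = Wc.shape[0]
    zp = np.pad(z, ((0, 0), (1, 1)))              # 'same' padding in time
    u = np.array([[(Wc[d] * zp[:, t:t + 3]).sum() for t in range(S)]
                  for d in range(D)])             # (D, S) feature sequence U_k
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    H = params["H"]
    h = np.zeros(H)
    for t in range(S):                            # GRU recurrence over time
        x = np.concatenate([u[:, t], h])
        zt = sig(x @ params["Wz"])                # update gate
        rt = sig(x @ params["Wr"])                # reset gate
        ht = np.tanh(np.concatenate([u[:, t], rt * h]) @ params["Wh"])
        h = (1 - zt) * h + zt * ht
    return h                                      # v_k: final hidden state

rng = np.random.default_rng(1)
C, S, D, H = 4, 10, 3, 5
z = rng.normal(size=(C, S))
Wc = rng.normal(size=(D, C, 3)) * 0.1
params = {"H": H,
          "Wz": rng.normal(size=(D + H, H)) * 0.1,
          "Wr": rng.normal(size=(D + H, H)) * 0.1,
          "Wh": rng.normal(size=(D + H, H)) * 0.1}
v_k = expert_forward(z, Wc, params)
```

The spatial and global experts follow the same pattern with $\Theta'$ and the full $Z$ as inputs, respectively.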

5. Entropy-Regularized Mixture-of-Experts Fusion

GCMCG employs a gated MoE mechanism for adaptive feature fusion:

  • Fusion Vector: All expert outputs are concatenated into $\hat v = [v_\text{spatial} \Vert v_\text{temporal} \Vert v_1 \Vert \cdots \Vert v_K] \in \mathbb{R}^{(K+2) \cdot 2D}$.
  • Gating Network: A $T$-layer feedforward projection computes unnormalized gate activations $g^t = \exp\!\left(f^t_\text{gate}(W^t_\text{gate}\, g^{t-1} + b^t_\text{gate})\right)$, with $g^0 = \hat v$.
  • Gate Normalization: The final weights are $\alpha_\text{gate} = \frac{g'}{1 + \sum_{i=1}^{K+2} g'_i}$, encouraging sparse expert selection (since $\sum_i \alpha_i < 1$).
  • Entropy Regularization: An entropy-based penalty,

$$\mathcal{L}_\text{gate} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{K+2} \alpha_{b,i} \log(\alpha_{b,i} + \varepsilon)$$

promotes decisiveness in expert contributions.

  • Fused Representation: The downstream feature is $v_\text{fused} = \sum_{i=1}^{K+2} \alpha_i \hat v_i$.

This fusion model allows adaptive weighting of local and global experts, optimizing for the current input's spatial-temporal profile.
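
A single-sample, single-layer version of this gating scheme (the paper uses a $T$-layer gate; collapsing it to one layer is a simplification) can be sketched as:

```python
import numpy as np

def moe_fuse(expert_outs, Wg, bg, eps=1e-8):
    """Gated MoE fusion with alpha = g / (1 + sum(g)) and entropy penalty.

    expert_outs : (E, D) stacked expert feature vectors (E = K + 2)
    Wg, bg      : one-layer gate parameters of shape (E*D, E) and (E,)
    """
    v_hat = expert_outs.reshape(-1)               # concatenated fusion vector
    g = np.exp(v_hat @ Wg + bg)                   # positive gate activations
    alpha = g / (1.0 + g.sum())                   # sums to < 1: sparse selection
    v_fused = (alpha[:, None] * expert_outs).sum(axis=0)
    l_gate = -(alpha * np.log(alpha + eps)).sum() # entropy regularizer term
    return v_fused, alpha, l_gate

rng = np.random.default_rng(2)
E, D = 5, 8
experts = rng.normal(size=(E, D))
Wg = rng.normal(size=(E * D, E)) * 0.05
v_fused, alpha, l_gate = moe_fuse(experts, Wg, np.zeros(E))
```

Adding $\lambda_\text{gate} \cdot \mathcal{L}_\text{gate}$ to the training loss pushes the gate toward low-entropy, decisive expert weightings.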

6. Multi-Stage Class-Weighted Training Procedure

Learning in GCMCG proceeds through a structured, three-stage process:

  • Stage 1 (Pre-training): The full model is optimized using standard cross-entropy loss with uniform random sampling.
  • Stage 2 (Fine-tuning for Class Balance): The backbone (GAT + experts) is frozen. A Progressively Balanced Sampler overweights minority classes, and Focal Loss, $\mathrm{FL}(p_t) = -(1-p_t)^\gamma \log(p_t)$ with $\gamma = 2$, is employed for improved minority-class generalization.
  • Stage 3 (Learnable Weight Scaling): All layers except a per-class scaling vector $\gamma \in \mathbb{R}^Q$ are frozen. The final logits are $\hat y = \hat W\,(\gamma \odot v_\text{fused}) + \hat b$.

These training stages are designed to counteract class imbalance, promote robust cross-subject generalization, and increase model compositionality across experimental paradigms.
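
The Stage 2 loss and Stage 3 scaling can be sketched directly. Note one hedge: the scaling in Stage 3 is applied elementwise over $v_\text{fused}$ here so the Hadamard product is well defined, since the paper states $\gamma \in \mathbb{R}^Q$ over classes:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss on the true-class probability p_t (Stage 2, gamma = 2).
    Setting gamma = 0 recovers plain cross-entropy -log(p_t)."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def stage3_logits(v_fused, W_hat, b_hat, gamma_scale):
    """Stage 3: every weight frozen except the learnable scale vector gamma;
    logits y = W_hat (gamma * v_fused) + b_hat (elementwise scaling sketch)."""
    return W_hat @ (gamma_scale * v_fused) + b_hat

# well-classified samples (high p_t) are strongly down-weighted
easy, hard = focal_loss(0.9), focal_loss(0.5)
y = stage3_logits(np.ones(4), np.eye(3, 4), np.zeros(3), np.full(4, 2.0))
```

The $(1-p_t)^\gamma$ factor shrinks the loss on easy examples, focusing gradient signal on the minority classes the sampler emphasizes.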

7. Empirical Evaluation and Ablation Insights

GCMCG was assessed across three benchmark datasets:

| Dataset | Subjects | Channels | Classes | Top-1 Accuracy |
| --- | --- | --- | --- | --- |
| EEGmmidb-BCI2000 | 109 | 64 | 9 (ME/MI) | 86.60% |
| BCI-IV 2a | 9 | 22 | 4 (MI) | 98.57% |
| M3CV | 106 | 64 | 3 (ME) | 99.61% |

Further findings on EEGmmidb-BCI2000 include a Macro-Recall of 79.41%, Macro-Precision of 83.07%, F1-score of 0.81, and Cohen's $\kappa = 0.82$. Ablation studies confirm the importance of the graph encoder (AUC drops by 7% when it is removed) and of spectral clustering (its removal causes a significant decline). CNN-GRU expert backbones outperform alternative designs under fixed parameter budgets. Visualizations (t-SNE, confusion matrices, ROC curves) demonstrate class-separable, robust representations, with learned tokenizer edges concentrating on sensorimotor areas (notably C3–CP3 and C4–CP4).

8. Implementation Notes and Hyperparameter Settings

The framework is implemented in PyTorch 2.5.1 and optimized for Nvidia A100 GPUs. Training uses AdamW with a learning rate of $10^{-3}$ (10 warmup epochs, then cosine decay), entropy regularization strength $\lambda_\text{gate} = 10^{-4}$, batch size 64, dropout 0.5 in the gating network, weight decay $10^{-5}$, and 200 total epochs. Progressive sampling and careful gating regularization are integral to the training pipeline.
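
The stated schedule (10 warmup epochs, cosine decay, 200 epochs, base rate $10^{-3}$) can be expressed as a simple function; the linear warmup shape and decay-to-zero endpoint are assumptions, as the paper's summary does not specify them:

```python
import math

def lr_schedule(epoch, base_lr=1e-3, warmup=10, total=200):
    """Linear warmup for `warmup` epochs, then cosine decay toward zero
    over the remaining epochs (sketch of the stated AdamW settings)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

In PyTorch this would typically be handed to the optimizer via `LambdaLR` or a warmup-aware cosine scheduler.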


GCMCG introduces a hybrid pipeline integrating advanced signal denoising, graph-based functional region decomposition, modular CNN–GRU expertise, and entropy-regularized MoE fusion for adaptive, interpretable, and generalizable EEG decoding. Its multi-paradigm performance and comprehensive ablation analysis establish it as an effective solution framework for next-generation BCI systems (Chen et al., 29 Nov 2025).
