EEG Foundation Models
- EEG foundation models are large-scale machine learning architectures that learn universal representations from heterogeneous EEG data using self-supervised methods.
- They integrate convolutional, transformer, and vector quantization techniques to effectively capture spatiotemporal dynamics and electrode-specific features.
- These models enable robust, transferable performance across diverse EEG tasks, bolstering clinical diagnostics, brain-computer interfaces, and cross-device analysis.
Electroencephalography (EEG) foundation models are large-scale, generalizable machine learning architectures designed to learn universal representations from diverse EEG datasets. By leveraging massive unlabeled corpora and self-supervised learning objectives, these models aim to address the marked data scarcity, heterogeneity, and variable sensor configurations inherent in EEG research and practical brain-computer interface (BCI) deployments. EEG foundation models draw from architectural paradigms established in natural language processing and computer vision, but integrate domain-specific components—such as channel adaptation, frequency–temporal feature decoupling, and neurophysiological preprocessing—to accommodate the unique statistical and structural characteristics of electrophysiological signals.
1. Model Architectures and Representation Strategies
EEG foundation model architectures typically combine convolutional feature extraction, attention mechanisms (transformers), structured state-space models, vector quantization, and explicit spatiotemporal modeling. Recent models implement these concepts in modular, highly scalable designs:
- Hybrid Encoder-Decoder Pipelines: Neuro-GPT employs a two-stage pipeline: a convolution-based EEG encoder yields chunked spatiotemporal embeddings, and a GPT-2-style transformer decoder performs autoregressive reconstruction of masked chunks (Cui et al., 2023). EEGFormer integrates transformer encoders with channel-wise discrete vector quantization, followed by a lightweight decoder, to enforce patch-level universality and interpretability (Chen et al., 11 Jan 2024).
- Tokenization and Quantization: Models such as EEGFormer and BioSerenity-E1 rely on transforming raw or preprocessed EEG into discrete tokens—summarizing multi-scale time and frequency features—through vector-quantized autoencoders. CodeBrain extends this idea with TFDual-Tokenizer, independently quantizing temporal and frequency features to expand the representation space and support cross-domain interpretability (Ma et al., 10 Jun 2025). Residual vector quantization, as in BrainOmni, enables robust tokenization across devices and modalities (Xiao et al., 18 May 2025).
- Attention Mechanisms and Structured Modeling: ALFEE separates channel-wise attention (aggregating variable electrode configurations) from temporal attention (modeling sequence evolution), using cross-attention and gated feed-forward layers (Xiong et al., 7 May 2025). CBraMod employs a criss-cross transformer that decouples spatial and temporal dependencies via parallel attention streams and implements asymmetric positional encoding with convolutional kernels (Wang et al., 10 Dec 2024); a minimal sketch of this decoupled attention appears after this list. EEGSSM in CodeBrain couples structured global FFT convolutions (for long-range dependencies) with local sliding-window attention, mimicking the small-world connectivity observed in neural substrates.
- Scale- and Region-aware Modeling: CSBrain introduces Cross-scale Spatiotemporal Tokenization (CST) and Structured Sparse Attention (SSA), stacking alternations of cross-scale token aggregation (across temporal windows and brain regions) with sparsely computed attention to capture both fine- and coarse-grained EEG dynamics (Zhou et al., 29 Jun 2025).
- Electrode-wise and Sensor-aware Design: EEGPT restructures inputs to treat each electrode as a unit, enabling seamless handling of heterogeneous datasets (up to 138 electrodes) through electrode embeddings and unified modeling (Yue et al., 14 Oct 2024). BrainOmni generalizes this idea with a Sensor Encoder that incorporates sensor position, orientation, and type to facilitate EEG-MEG joint modeling and device-agnostic analysis.
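A minimal PyTorch sketch of the decoupled spatial/temporal attention referenced in the CBraMod item above: attention runs along the electrode axis and the time-patch axis in parallel, and the two streams are merged with a residual connection. Tensor layout, module names, and the additive merge are illustrative assumptions, not CBraMod's published architecture.

```python
import torch
import torch.nn as nn

class CrissCrossEEGBlock(nn.Module):
    """Toy decoupled spatial/temporal attention block (illustrative only).

    Input: (batch, channels, patches, dim) -- one embedding per
    electrode-channel x time-patch pair.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, p, d = x.shape
        # Spatial stream: attend across electrodes at each time patch.
        xs = x.permute(0, 2, 1, 3).reshape(b * p, c, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(b, p, c, d).permute(0, 2, 1, 3)
        # Temporal stream: attend across time patches for each electrode.
        xt = x.reshape(b * c, p, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, c, p, d)
        # Merge the two parallel streams with a residual connection.
        return self.norm(x + xs + xt)
```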
2. Pretraining Methodologies and Objective Functions
The dominant training paradigm is large-scale self-supervised pretraining on massive, heterogeneous EEG (and in some cases, MEG) datasets. Notable methodologies include:
- Masked Segment Reconstruction: Neuro-GPT, EEGFormer, BioSerenity-E1, and CBraMod employ objectives similar to BERT or masked autoencoders, where a proportion of tokens or patches (typically 40–70%) is masked and the model is trained to reconstruct the masked content from context (Cui et al., 2023, Chen et al., 11 Jan 2024, Bettinardi et al., 13 Mar 2025, Wang et al., 10 Dec 2024); see the reconstruction sketch after this list.
- Autoregressive Prediction: EEGPT forgoes masking in favor of next-signal prediction with causal attention, optimizing a mean-squared-error reconstruction loss to better exploit temporal dependencies (Yue et al., 14 Oct 2024). Scaling-law experiments in that work indicate superior scaling and transfer performance relative to bidirectional masked objectives.
- Contrastive and Multi-level Objectives: LEAD introduces dual-level (sample- and subject-level) InfoNCE contrastive losses, pushing the model to align augmentations of the same sample and to cluster samples from the same subject, greatly improving subject-independent performance, especially for clinical endpoints such as Alzheimer's detection (Wang et al., 2 Feb 2025); an InfoNCE sketch follows this list.
- Multi-Task and Instruction Tuning: NeuroLM demonstrates unified instruction tuning, adapting a large language model to handle multiple EEG tasks (e.g., event classification, emotion recognition, sleep staging) via prompt-based formulations and multi-task training objectives (Jiang et al., 27 Aug 2024).
- Domain-specific Hybrid Strategies: MIRepNet combines masked token reconstruction with a concurrent supervised motor imagery (MI) classification loss, leading to paradigm-specific foundations capable of rapid adaptation with minimal target data (Liu et al., 27 Jul 2025).
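As a concrete illustration of the masked-reconstruction objective, the following framework-agnostic PyTorch sketch masks a random 50% of patch tokens and regresses the originals; the zero-fill corruption and tensor shapes are assumptions (a learned [MASK] embedding is common in practice):

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(model, patches, mask_ratio=0.5):
    """Generic masked-patch pretraining step (illustrative sketch).

    patches: (batch, num_patches, patch_dim) tensor of EEG patch tokens.
    model:   maps corrupted token sequences back to patch space.
    """
    b, n, d = patches.shape
    # Sample a Boolean mask selecting ~mask_ratio of the patches.
    mask = torch.rand(b, n, device=patches.device) < mask_ratio
    # Replace masked patches with zeros (stand-in for a [MASK] token).
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    # Reconstruct and score only the masked positions.
    recon = model(corrupted)
    return F.mse_loss(recon[mask], patches[mask])
```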
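And a sketch of the sample-level InfoNCE term used in dual-level contrastive objectives like LEAD's; the subject-level term is analogous, with subject identity defining the positives. The temperature value and in-batch negative scheme are generic assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Sample-level InfoNCE between two augmented views (sketch).

    z1, z2: (batch, dim) embeddings of two augmentations of the same
    EEG samples; row i of z1 and row i of z2 form the positive pair.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; all other pairs act as negatives.
    return F.cross_entropy(logits, labels)
```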
3. Heterogeneity, Transfer, and Adaptation
Handling heterogeneity—across subjects, recording devices, electrode montages, and tasks—remains a focal point for EEG foundation models:
- Channel Alignment and Sensor Adaptation: Models such as LEAD and MIRepNet utilize channel alignment and neurophysiologically informed channel templates, employing interpolation and spatial weighting to align data from arbitrary headsets to standard electrode sets (Wang et al., 2 Feb 2025, Liu et al., 27 Jul 2025); a minimal alignment sketch follows this list. BrainOmni's Sensor Encoder explicitly encodes device-specific metadata, ensuring cross-device and cross-modality compatibility (Xiao et al., 18 May 2025).
- Multi-Task Graph Networks: EEGPT's Task-shared Electrode Graph enables common spatial patterns to be learned and shared across tasks and datasets with variable electrode sets, supporting joint multi-task transfer and adaptation (Yue et al., 14 Oct 2024).
- Embedding and Token Interpretability: CodeBrain’s decoupled tokenizer (TFDual-Tokenizer) enables cross-domain analysis, providing interpretability into which temporal and frequency token pairs correspond to specific neural phenomena or clinical states (Ma et al., 10 Jun 2025). EEGFormer supports n-gram feature extraction, correlating discrete tokens to medically relevant EEG patterns for post-hoc interpretability (Chen et al., 11 Jan 2024).
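A minimal NumPy sketch of distance-weighted channel alignment, mapping recordings from an arbitrary headset's electrode coordinates onto a fixed template montage. The Gaussian kernel and its width `sigma` are illustrative assumptions rather than the cited papers' exact interpolation schemes:

```python
import numpy as np

def align_channels(signals, source_pos, template_pos, sigma=0.1):
    """Interpolate recordings onto a template electrode set (sketch).

    signals:      (n_source_channels, n_samples) raw EEG.
    source_pos:   (n_source_channels, 3) electrode coordinates.
    template_pos: (n_template_channels, 3) standard montage coordinates.
    Returns (n_template_channels, n_samples) aligned signals.
    """
    # Pairwise distances between template and source electrodes.
    dists = np.linalg.norm(
        template_pos[:, None, :] - source_pos[None, :, :], axis=-1
    )
    # Gaussian spatial weights, normalized per template electrode.
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ signals
```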
4. Evaluation, Generalization, and Benchmark Performance
Comprehensive evaluation is central to validating EEG foundation models. Key findings include:
- Broad Benchmark Coverage: Models such as CSBrain and ALFEE have been benchmarked on 11–16 datasets, covering tasks including motor imagery, emotion recognition, sleep staging, seizure detection, artifact classification, and abnormality detection (Zhou et al., 29 Jun 2025, Xiong et al., 7 May 2025). These models consistently outperform both specialist architectures (e.g., EEGNet, Conformer, LGGNet) and foundation model predecessors (e.g., BENDR, BIOT, LaBraM).
- Robust Low-Data and Cross-Domain Transfer: Models pretrained on large, heterogeneous corpora—especially those employing self-supervised, contrastive, or multi-task objectives—show marked gains in scenarios with limited labeled data (e.g., BioSerenity-E1: +17% AUPRC with <10% data) and can sustain generalization on external datasets, including neonatal seizure and clinical abnormality datasets (Bettinardi et al., 13 Mar 2025, Chen et al., 11 Jan 2024).
- Scaling Laws and Model Size: EEGPT and ALFEE demonstrate the scaling law trend: increases in model capacity and training dataset size yield monotonically lower pretraining losses and higher downstream task performance (Yue et al., 14 Oct 2024, Xiong et al., 7 May 2025). For EEGPT, a 1.1B parameter model is reported as the largest to date.
- Task-Specific Foundations: MIRepNet, tailored for motor imagery, outperforms both generalist and specialist models and exhibits rapid adaptation with fewer than 30 trials per subject (Liu et al., 27 Jul 2025). LEAD, focused on Alzheimer’s detection, outperforms prior methods in both sample-level and subject-level F1 scores (Wang et al., 2 Feb 2025).
5. Technical Innovations and Key Mathematical Formulations
EEG foundation models integrate several technical and mathematical innovations to capture the complexity of neural signals:
- Vector Quantization and Contrastive Losses: For an encoder output $z$ and codebook vectors $\{e_k\}_{k=1}^{K}$, quantization selects the nearest codeword, $z^q = e_{k^*}$ with $k^* = \arg\min_k \lVert z - e_k \rVert_2$, and the loss combines reconstruction with codebook regularization:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z] - e_{k^*} \rVert_2^2 + \beta \lVert z - \mathrm{sg}[e_{k^*}] \rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator (Chen et al., 11 Jan 2024). Code sketches of this and the following objectives appear after this list.
- Autoregressive and Masked Losses: Neuro-GPT applies a causal reconstruction loss of the form

$$\mathcal{L} = \sum_{i} \left\lVert \hat{c}_i - \mathcal{E}(c_i) \right\rVert_2^2,$$

where each $c_i$ is a data chunk, $\mathcal{E}$ is the encoder, and $\hat{c}_i$ is the decoder's prediction of the $i$-th chunk embedding from preceding chunks (Cui et al., 2023).
- Cross-Scale and Sparse Attention: CSBrain’s stacked Cross-scale Spatiotemporal Tokenization and Structured Sparse Attention alternately aggregate multi-scale features and compute efficient dependency structures, avoiding the quadratic complexity of dense attention (Zhou et al., 29 Jun 2025).
- Frequency-Temporal Losses: EEGM2 implements a spatiotemporal loss combining a time-domain mean absolute error with a frequency-domain spectral loss, preserving both waveform shape and spectral content of the sequence (Hong et al., 25 Feb 2025).
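To make the quantization step concrete, here is a minimal PyTorch sketch of a generic VQ layer with the stop-gradient losses above. The codebook shape, the commitment weight `beta`, and the straight-through estimator are standard VQ-VAE assumptions, not EEGFormer's exact module:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook, beta=0.25):
    """Nearest-codeword quantization with VQ losses (illustrative sketch).

    z:        (batch, dim) encoder outputs.
    codebook: (K, dim) learnable codeword matrix.
    """
    # Nearest codeword by Euclidean distance.
    dists = torch.cdist(z, codebook)            # (batch, K)
    idx = dists.argmin(dim=1)                   # discrete token ids
    z_q = codebook[idx]                         # quantized vectors
    # Codebook term (moves codewords toward encoder outputs) and
    # commitment term (keeps encoder outputs near their codewords),
    # each with a stop-gradient on the opposite side.
    codebook_loss = F.mse_loss(z.detach(), z_q)
    commit_loss = beta * F.mse_loss(z, z_q.detach())
    # Straight-through estimator: gradients bypass the argmin.
    z_q = z + (z_q - z).detach()
    return z_q, idx, codebook_loss + commit_loss
```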
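Similarly, a sketch of the causal chunk-prediction loss, assuming a generic per-chunk encoder and a causal sequence model with a hypothetical interface (it consumes the first n-1 chunk embeddings and predicts the next embedding at each position):

```python
import torch.nn.functional as F

def causal_chunk_loss(causal_model, encoder, chunks):
    """Next-chunk embedding prediction (illustrative sketch).

    chunks:       (batch, n_chunks, chunk_dim) EEG chunks.
    encoder:      maps chunks to target embeddings E(c_i).
    causal_model: predicts the next embedding from all previous ones.
    """
    targets = encoder(chunks)                 # (batch, n_chunks, d)
    preds = causal_model(targets[:, :-1])     # predictions for chunks 2..n
    return F.mse_loss(preds, targets[:, 1:])
```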
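Finally, the frequency-temporal objective can be sketched as a weighted sum of a time-domain MAE and a magnitude-spectrum loss; the trade-off weight `alpha` is an assumption, and EEGM2's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def spatiotemporal_loss(pred, target, alpha=0.5):
    """Time-domain MAE plus frequency-domain spectral loss (sketch).

    pred, target: (batch, channels, samples) reconstructed vs. true EEG.
    """
    time_loss = F.l1_loss(pred, target)
    # Compare magnitude spectra along the time axis.
    spec_loss = F.l1_loss(
        torch.fft.rfft(pred, dim=-1).abs(),
        torch.fft.rfft(target, dim=-1).abs(),
    )
    return alpha * time_loss + (1 - alpha) * spec_loss
```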
6. Applications, Impact, and Emerging Directions
EEG foundation models are poised to broadly impact neuroscience, BCI, and clinical domains:
- Clinical Diagnostics: Models such as LEAD and BioSerenity-E1 deliver substantial gains in EEG-based neurological disease diagnosis, enabling robust classification even with high subject variability and limited annotations (Wang et al., 2 Feb 2025, Bettinardi et al., 13 Mar 2025).
- BCI and Assistive Technology: Rapid low-data adaptation and robust cross-device generalization (e.g., MIRepNet, BrainOmni) facilitate real-world BCI deployment for stroke rehabilitation, assistive robotics, and cognitive monitoring (Liu et al., 27 Jul 2025, Xiao et al., 18 May 2025).
- Interpretable and Biologically-aligned Modeling: Advances in interpretability through token analysis, region-aware modeling, and biologically inspired architectural priors (e.g., small-world topologies in CodeBrain, anatomical parcellation in CSBrain) support translational neuroscience and research transparency.
- Unified Cross-Modality Modeling: BrainOmni pioneers joint modeling of EEG and MEG signals through sensor- and device-aware encoding, setting a precedent for future cross-modality or multi-modal foundation models (Xiao et al., 18 May 2025).
- Scaling and Real-Time Deployment: Efficient architectures (e.g., EEGM2 leveraging structured state-space models, lightweight ALFEE variants) enable foundation model deployment on resource-constrained BCI and wearable devices, bridging the gap between research and real-time applications (Hong et al., 25 Feb 2025, Xiong et al., 7 May 2025).
Ongoing challenges include exploring scaling laws with even larger model sizes, further improving adaptation to highly variable electrode configurations, extending architectures to support multi-modal learning (EEG, MEG, fMRI), and investigating the integration of neuromorphic priors or broader neurophysiological constraints. As foundation models become a central tool in EEG analysis, the patterns established in recent work suggest continued gains in performance, adaptability, and generalizability as pretraining corpora and architectural sophistication increase.