EEG Pre-training: Methods & Applications

Updated 9 February 2026

EEG pre-training is the process of learning general-purpose neural representations from minimally processed EEG signals using architectures like transformers, CNNs, and GNNs.
It employs methods such as masked autoencoding, contrastive learning, and autoregressive modeling to overcome challenges like non-stationarity, low signal-to-noise ratio, and channel variability.
Empirical studies report that pre-trained models improve downstream classification and regression performance and generalize better across diverse datasets and recording setups.

Electroencephalography (EEG) pre-training denotes the process of learning general-purpose neural representations from raw or minimally pre-processed EEG signals, prior to fine-tuning on labeled or domain-specific tasks. This approach leverages large volumes of unlabeled or synthetically labeled EEG data to parameterize deep neural networks—most commonly transformer-based architectures, convolutional networks, or graph neural networks—such that downstream performance on classification, regression, or generative tasks is improved, particularly under data scarcity, montage variability, and distributional shifts.

1. Rationale and Historical Evolution of EEG Pre-training

EEG signals exhibit high non-stationarity, low signal-to-noise ratio, pronounced subject and hardware variability, and channel-montage heterogeneity. The need for generalization across experimental paradigms, individuals, and recording setups makes end-to-end supervised learning brittle, especially when annotated data are limited or expensive to acquire. Early EEG pre-training solutions were inspired by transfer learning in computer vision and speech, applying supervised or unsupervised training on proxy tasks (e.g., autoencoding, contrastive prediction, or error decoding), and demonstrated that pre-trained models improved low-data performance and cross-task transfer—even when only small fractions (≤10%) of labeled data were available for fine-tuning (Behncke et al., 2018).

Subsequent advances adapted self-supervised representation learning (SSL) paradigms—specifically masked autoencoding, contrastive loss, cluster-based pseudo-labeling, graph-pretext tasks, and autoregressive modeling—to the idiosyncrasies of neurophysiological time series (Liu et al., 19 Jun 2025). The field now encompasses both generic foundation models (e.g., REVE (Ouahidi et al., 24 Oct 2025), EEGPT (Yue et al., 2024)) and task- or domain-specific pipelines tailored for medical, cognitive, or affective EEG decoding.

2. Core Architectural and Methodological Principles

A defining trait of EEG pre-training is the targeting of multiple latent structures in the signal:

Temporo-spectral Modeling: Extraction of both temporal and frequency-domain (spectral) dynamics is critical. Architectures often construct parallel views—temporal, spectral (via STFT/FFT), and spatial (via channel topology or GNNs)—and fuse these via self-attention and cross-attention (Liu et al., 19 Jun 2025, Wang et al., 2024).
Channel Adaptation: To manage variable channel layouts, models such as CRIA introduce learnable channel embeddings $E_{channel}\in\mathbb{R}^{C_{max}\times D}$, with one $D$-dimensional row per supported channel, so that a single model can be applied across datasets and recording systems (Liu et al., 19 Jun 2025).
Tokenization and Patchifying: EEG is tokenized along the time axis (fixed-length windows, e.g., $L=256$) and sometimes spatially (per-channel or per-region patches), producing sequences suitable for transformer inputs (Ouahidi et al., 24 Oct 2025, Zhang et al., 20 Jun 2025). Spectral tokenization via vector-quantized variational autoencoders (VQ-VAE) is also employed for aggressive dimensionality reduction (Bettinardi et al., 13 Mar 2025).
Information Bottleneck and Masking: High mask ratios (40–75%) and masking strategies (block, random, elementwise; spectral, spatial, or temporal) encourage networks to capture global structure and robust local features (Zhou et al., 2024, Sandino et al., 14 Nov 2025).
Pretext Objectives: Prevailing objectives include masked signal modeling (MSM), masked autoencoding (MAE), contrastive learning, pairwise relative shift prediction (PARS), autoregressive next-sample prediction, and cross-modal/graph-based reconstruction (Liu et al., 19 Jun 2025, Wang et al., 2024, Sandino et al., 14 Nov 2025, Yue et al., 2024).
Instance and Modality Adaptation: Recent frameworks model EEG and intracranial EEG (iEEG) together, unifying cross-modality representations using channel-independent transformers, frequency-domain quantization, and secondary loss heads (Zhang et al., 20 Jun 2025).
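
The tokenization and channel-adaptation ideas above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation of CRIA or any other cited model: the channel vocabulary `CHANNEL_INDEX`, the embedding width `D = 8`, the random initialization, and the helper names `patchify` and `build_tokens` are all assumptions made for the example. It shows how two recordings with different montages and lengths can share one channel-embedding table.

```python
import random

def patchify(signal, patch_len):
    """Split one channel into non-overlapping patches, dropping any ragged tail."""
    n = len(signal) // patch_len
    return [signal[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def build_tokens(recording, channel_names, channel_index, channel_table, patch_len):
    """Pair every time patch with the embedding row of the channel it came from."""
    tokens = []
    for name, signal in zip(channel_names, recording):
        emb = channel_table[channel_index[name]]  # row lookup in E_channel
        for patch in patchify(signal, patch_len):
            tokens.append((emb, patch))
    return tokens

# Hypothetical channel vocabulary (C_max = 6) and embedding width D = 8.
CHANNEL_INDEX = {"Fp1": 0, "Fp2": 1, "C3": 2, "C4": 3, "O1": 4, "O2": 5}
D = 8
rng = random.Random(0)
# Stand-in for a learnable table: here it is just randomly initialized.
CHANNEL_TABLE = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in CHANNEL_INDEX]

# Recording A: 3 channels, 1024 samples. Recording B: 2 other channels, 600 samples.
rec_a = [[rng.gauss(0.0, 1.0) for _ in range(1024)] for _ in range(3)]
rec_b = [[rng.gauss(0.0, 1.0) for _ in range(600)] for _ in range(2)]

tokens_a = build_tokens(rec_a, ["Fp1", "C3", "O1"], CHANNEL_INDEX, CHANNEL_TABLE, 256)
tokens_b = build_tokens(rec_b, ["C4", "O2"], CHANNEL_INDEX, CHANNEL_TABLE, 256)
print(len(tokens_a), len(tokens_b))
```

In a real model the patch would be projected to dimension $D$ and combined with the channel embedding (and a positional encoding) before entering the transformer; the sketch stops at the pairing step.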

3. Canonical Pre-training Paradigms and Objective Functions

A taxonomy of current EEG pre-training approaches is provided below:

Paradigm	Objective/Loss	Masking Strategy	Notable Models
Masked Autoencoder	$L_{rec} = \frac{1}{|\mathcal{M}|}\sum_{j\in \mathcal{M}}\|z_j - \hat{z}_j\|_1$ (Ouahidi et al., 24 Oct 2025)	Block/spatial/temporal	MAE-EEG (Zhou et al., 2024), REVE (Ouahidi et al., 24 Oct 2025)
Contrastive	$\mathcal{L}_{c} = -\sum_{i}\log\frac{\exp(\text{sim}(\tilde{z}_i, z_i) / \tau)}{\sum_j \exp(\text{sim}(\tilde{z}_i, z_j^-) / \tau)}$ (Wang et al., 2024)	Augment/pairwise	GEFM (Wang et al., 2024), DisGCMAE (Wei et al., 2024)
Autoregressive	$L_{AR} = \frac{1}{T}\sum_{t=1}^T\|x_t - \hat{x}_t(x_{<t})\|_2^2$ (Yue et al., 2024)	Causal (next-token)	EEGPT (Yue et al., 2024)
Spectral VQ-Masked	VQ loss + cross-entropy over codebook indices (Bettinardi et al., 13 Mar 2025)	Masked token (75%)	BioSerenity-E1 (Bettinardi et al., 13 Mar 2025)
Pairwise Shift/PARS	$L_{PARS} = \|\Theta - \hat{\Theta}\|_2^2$ (Sandino et al., 14 Nov 2025)	Masked PE pairs (80%)	PARS (Sandino et al., 14 Nov 2025)
Cross-View/Modal	Joint contrastive + MSE over masked views (Liu et al., 19 Jun 2025)	View-wise masking	CRIA (Liu et al., 19 Jun 2025), CET-MAE (Wang et al., 2024)
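
To make the first three rows of the table concrete, the following plain-Python sketch evaluates a masked L1 reconstruction loss, a contrastive (InfoNCE-style) loss, and an autoregressive squared-error loss on toy inputs. It is a sketch under stated assumptions, not code from any cited model: the function names are invented for illustration, `sim` is taken to be cosine similarity, the contrastive loss is averaged over the batch and keeps the positive in the denominator (a common variant; the table's formula sums over negatives only), and the autoregressive loss uses scalar samples and starts at the second sample so the prefix is never empty.

```python
import math

def masked_l1(targets, preds, masked_idx):
    """L_rec: L1 distance between target and prediction, averaged over masked tokens."""
    total = 0.0
    for j in masked_idx:
        total += sum(abs(a - b) for a, b in zip(targets[j], preds[j]))
    return total / len(masked_idx)

def cosine(u, v):
    """Cosine similarity, used here as the assumed sim(., .) function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, tau=0.1):
    """Contrastive loss: positives[i] matches anchors[i]; other rows act as negatives."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / tau for p in positives]
        m = max(logits)  # log-sum-exp shift for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def autoregressive_mse(x, predict):
    """L_AR: mean squared error of predicting x[t] from the causal prefix x[:t]."""
    errors = [(x[t] - predict(x[:t])) ** 2 for t in range(1, len(x))]
    return sum(errors) / len(errors)

targets = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
preds = [[1.0, 2.0], [0.5, -0.5], [2.0, 1.0]]
rec_loss = masked_l1(targets, preds, masked_idx=[1, 2])

views = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(views, views)            # each anchor matches its positive
swapped_loss = info_nce(views, views[::-1])      # positives deliberately mismatched

# A "persistence" predictor that repeats the last observed sample.
ar_loss = autoregressive_mse([0.0, 1.0, 2.0, 3.0], lambda prefix: prefix[-1])
print(rec_loss, aligned_loss, swapped_loss, ar_loss)
```

The reconstruction loss ignores unmasked positions entirely, which is what forces the encoder to infer masked content from context rather than copy its input.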

Masking is leveraged not only for data augmentation, but as an information bottleneck, regularizer, and a driver for long-range compositionality (Liu et al., 19 Jun 2025, Sandino et al., 14 Nov 2025). Losses often combine contrastive, instance-wise, view-wise, or cross-modal terms.
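
The difference between random and block masking can be illustrated with a short sketch. The helper names `random_mask` and `block_mask`, the 64-token sequence, and the block length of 8 are assumptions chosen for the example, not settings from any cited paper. Both helpers return the sorted indices of masked tokens for a requested ratio.

```python
import random

def random_mask(num_tokens, ratio, rng):
    """Mask round(num_tokens * ratio) token positions chosen independently."""
    k = round(num_tokens * ratio)
    return sorted(rng.sample(range(num_tokens), k))

def block_mask(num_tokens, ratio, block_len, rng):
    """Mask contiguous runs of block_len tokens until the target count is reached."""
    target = round(num_tokens * ratio)
    masked = set()
    while len(masked) < target:
        start = rng.randrange(0, num_tokens - block_len + 1)
        for i in range(start, start + block_len):
            if len(masked) < target:  # the final block may be truncated
                masked.add(i)
    return sorted(masked)

rng = random.Random(0)
rand_idx = random_mask(64, 0.75, rng)      # 75% of 64 tokens
block_idx = block_mask(64, 0.50, 8, rng)   # 50% of 64 tokens, in runs of up to 8
print(len(rand_idx), len(block_idx))
```

Block masking removes whole stretches of signal, so the model cannot reconstruct a masked token by interpolating its immediate neighbours; this is one way masking acts as an information bottleneck rather than as plain augmentation.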

4. Empirical Performance and Transferability

EEG pre-training enhances both efficiency and accuracy under a variety of evaluation metrics and data regimes:

Supervised Learning Under Scarcity: Pre-trained models are reported to outperform non-pretrained baselines by several percentage points in balanced accuracy (BACC), F1, AUROC, and AUPRC, especially under low-label (≤10%) and cross-domain conditions (Liu et al., 19 Jun 2025, Ouahidi et al., 24 Oct 2025, Bettinardi et al., 13 Mar 2025, Bary et al., 2024, Wang et al., 2024, Sandino et al., 14 Nov 2025). For example, CRIA achieves BACC = 0.8003 on anomaly detection and 0.5702 on multi-class event detection, outperforming baselines such as BrainBERT, BIOT, and LaBraM (Liu et al., 19 Jun 2025).
Cross-Dataset and Montage Generalization: Pre-training with variable-length/channel coding, channel-agnostic encoders, or graph-based adaptation enables transfer among datasets, montages, and acquisition setups (e.g., TUAB $\to$ CHB-MIT, across 10–20, ECoG, and SEEG) (Liu et al., 19 Jun 2025, Ouahidi et al., 24 Oct 2025, Zhang et al., 20 Jun 2025).
Ablation and Scaling Laws: Reported performance scales monotonically with model and data size: all else equal, larger pre-trained EEG foundation models (e.g., EEGPT Giant, 1.09B parameters) achieve the best results in multi-task and multi-dataset settings (Yue et al., 2024). Optimal mask ratios are task-dependent, but 40–75% masking is typically best for MAE and MSM-style objectives (Zhou et al., 2024, Bai et al., 2023).
Representation Analysis: Feature attributions and t-SNE embeddings show that pre-trained networks capture physiologically relevant dynamics (e.g., α oscillations, spatial channel interactions), facilitate subject invariance, and encode interpretable factors (e.g., error-specific high-gamma in iEEG) (Behncke et al., 2018, Liu et al., 19 Jun 2025, Wang et al., 2024).
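
Because EEG benchmarks such as anomaly or seizure detection are often class-imbalanced, results are commonly reported as balanced accuracy rather than plain accuracy. The sketch below computes balanced accuracy as the unweighted mean of per-class recall; the toy labels and the helper names are invented for illustration and do not reproduce any result cited above.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy imbalanced set: 90 "normal" (0) and 10 "abnormal" (1) recordings,
# scored against a degenerate classifier that always predicts "normal".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

acc = plain_accuracy(y_true, y_pred)      # 0.9: looks good, but is misleading
bacc = balanced_accuracy(y_true, y_pred)  # 0.5: recall 1.0 on class 0, 0.0 on class 1
print(acc, bacc)
```

A classifier that ignores the minority class scores at chance level under balanced accuracy, which is why this metric is the more informative one for comparing pre-trained and non-pretrained models on imbalanced tasks.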

5. Specialized Pre-training Variants and Domain Extensions

EEG pre-training has diversified into several specialized sub-fields:

Graph-based Pre-training: Graph neural encoders pre-trained with joint contrastive/masked autoencoding objectives (e.g., DisGCMAE) unify high- and low-density EEG through topology distillation and KL-based similarity loss, proving effective for channel-missing and cross-resolution domains (Wei et al., 2024, Wang et al., 2024).
Synthetic and Knowledge-Guided Pre-training: Frequency pretraining (FPT) on synthetically generated oscillatory signals enables learning robust bandpower filters without patient data, facilitating privacy and scalability (Grieger et al., 2024, Kommineni et al., 2024).
Multi-modal and Multi-task Pre-training: Frameworks such as MCSP perform cross-domain SSL aligning EEG, fMRI, and their respective spatio-temporal/spectral representations jointly (Wei et al., 2024). Task-specific, multi-dataset pre-training with covariance alignment realizes few- and zero-shot generalization for emotion recognition (Zhang et al., 25 Oct 2025).
Open-Ended and Language Pre-training: EEG2Text and CET-MAE integrate masked EEG and text prediction in multi-stream transformers, including hybrid contrastive and masked-reconstruction losses for brain-to-text generation (Wang et al., 2024, Liu et al., 2024).
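
As an illustration of pre-training on synthetic oscillatory signals, the sketch below generates a noisy sum of sinusoids together with a multi-hot label marking which frequency bands were used. It is one plausible pretext-data generator, not the procedure of the cited frequency-pretraining work: the band edges in `BANDS` (approximate conventional delta, theta, alpha, and beta ranges), the sampling rate, the noise level, and the function name `synth_example` are all assumptions made for the example.

```python
import math
import random

# Assumed band edges in Hz; exact boundaries vary across the literature.
BANDS = [(0.5, 4.0), (4.0, 8.0), (8.0, 13.0), (13.0, 30.0)]

def synth_example(rng, fs=128, seconds=2.0, noise_std=0.1):
    """Return (signal, label): a noisy sum of sinusoids and a multi-hot band label."""
    n = int(fs * seconds)
    present = [rng.random() < 0.5 for _ in BANDS]          # which bands to include
    signal = [rng.gauss(0.0, noise_std) for _ in range(n)]  # additive Gaussian noise
    for use_band, (lo, hi) in zip(present, BANDS):
        if not use_band:
            continue
        freq = rng.uniform(lo, hi)            # one random frequency inside the band
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            signal[t] += math.sin(2.0 * math.pi * freq * t / fs + phase)
    return signal, [int(p) for p in present]

rng = random.Random(0)
signal, label = synth_example(rng)
print(len(signal), label)
```

A network trained to predict `label` from `signal` has to learn band-selective filters, and since every example is generated on the fly, no patient recordings are needed at this stage.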

6. Limitations, Challenges, and Emerging Directions

Major limitations and open issues in EEG pre-training, as identified across the literature, include:

Channel Embedding Scalability: Learnable channel encoding tables (e.g., $E_{channel}$) scale linearly with the union of possible electrode labels and may require clustering or pruning for ultra-large systems (Liu et al., 19 Jun 2025).
Interpretability and Explainability: While feature visualizations and attention maps offer qualitative insight, formal clinical interpretability remains largely unexplored, motivating future incorporation of explainable AI modules (Liu et al., 19 Jun 2025, Wang et al., 2024).
Real-time and Resource Efficiency: Most foundation models have yet to address strict real-time or memory-constrained environments (mobile BCI, edge computing), though parameter-efficient and low-profile pipelines are emerging (Ogg et al., 2 Jun 2025).
Multimodal and Cross-species Extensions: Only a subset of frameworks address joint EEG–fMRI, multimodal BCI, or cross-species (animal–human) pre-training. Methodological generalization to noninvasive or invasive domains is ongoing (Zhang et al., 20 Jun 2025, Wei et al., 2024).
Overfitting and Masking Strategies: Randomized masking, if not regularized or learned, can lead to either underfitting or overfitting; structured or task-aware sparsification of the attention or feature space may offer further improvements (Liu et al., 19 Jun 2025, Sandino et al., 14 Nov 2025).
Data Annotation and Biases: Despite progress, annotation bottlenecks remain, and domain shifts across corpus/hardware/protocol boundaries are only partially addressed by unsupervised or knowledge-guided losses (Ouahidi et al., 24 Oct 2025, Wang et al., 2024).

Recommended extensions include scaling foundation model pre-training to multicenter datasets with >1000 participants/hours (Liu et al., 19 Jun 2025, Ouahidi et al., 24 Oct 2025), structured channel and montage hierarchies, explicit explainability, and integration with other biosignals (fMRI, eye-tracking, physiological phenotyping).

7. Synthesis and Prospective Outlook

EEG pre-training, through its combination of masked autoencoding, contrastive, autoregressive, synthetic, knowledge-guided, and cross-modal learning paradigms, has become foundational for generalizable brain decoding. Flexible adaptation to variable-length, variable-channel, and multi-domain data—exemplified by models such as CRIA (Liu et al., 19 Jun 2025), REVE (Ouahidi et al., 24 Oct 2025), EEGPT (Yue et al., 2024), BioSerenity-E1 (Bettinardi et al., 13 Mar 2025), PARS (Sandino et al., 14 Nov 2025), and GEFM (Wang et al., 2024)—now enables robust transfer across pathologies, paradigms, and populations with improved sample efficiency, training speed, and downstream convergence. The field is converging toward large-scale, open-vocabulary, multi-modal, and clinical-grade EEG foundation models, while ongoing innovation is needed in interpretability, resource efficiency, and clinical deployment.

The rich interplay among temporal, spectral, spatial, and semantic representations, combined with continual scaling of pre-training corpora, establishes EEG pre-training as a critical enabler of next-generation neurotechnological and neuroscientific discovery.