Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts

Published 18 Apr 2026 in cs.LG, cs.AI, and eess.SP | (2604.16926v1)

Abstract: Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.


Authors (3)

Summary

The paper demonstrates that gradient-based test-time adaptation often degrades EEG model performance while optimization-free methods offer enhanced stability.
It introduces NeuroAdapt-Bench, a unified pipeline that evaluates supervised fine-tuning and TTA across in-distribution and out-of-distribution EEG datasets.
Empirical results indicate that prototype-based adaptation (such as T3A) gives the most reliable improvements in balanced accuracy and class-wise calibration, particularly under out-of-distribution shift.

Systematic Evaluation of Test-Time Adaptation in EEG Foundation Models under Real-World Distribution Shifts

Motivation and Background

The deployment of EEG foundation models in clinical environments is increasingly confronted by a persistent challenge: distribution shift. EEG signals are inherently heterogeneous, varying across acquisition devices, recording sites, clinical protocols, and patient-specific physiology. This variability undermines the transferability of pretrained representations and causes substantial performance degradation when models are applied outside their training distribution. Test-time adaptation (TTA) lets a model adapt at inference to new, unlabeled target-domain data without access to source data, a crucial property under the privacy constraints and scarcity of labeled samples that are common in healthcare. Previous TTA work in vision and speech offers a variety of update mechanisms, most of them gradient-based or prototype-based, but their efficacy for EEG remains uncertain.

NeuroAdapt-Bench: Benchmark Design and Methodology

NeuroAdapt-Bench is introduced as a rigorous, reproducible pipeline for benchmarking TTA methods on EEG foundation models amidst diverse distribution shifts (Figure 1). The benchmark orchestrates a unified protocol across three stages: supervised fine-tuning on source-domain data, TTA application on unlabeled target-domain samples, and evaluation on shifted target domains. Foundation model encoders (CBraMod, REVE variants, TFM-Tokenizer) are paired with a lightweight classifier, trained on source data, and frozen during adaptation to standardize downstream effects. Three TTA approaches are evaluated: Tent (entropy minimization, gradient-based), SHOT (source-free centroid refinement, gradient-based), and T3A (optimization-free prototype adjustment), in both online (streaming) and offline (batch) adaptation regimes.

Figure 1: Overview of NeuroAdapt-Bench—fine-tuning, adaptation, and evaluation pipeline across diverse models and EEG shifts.
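To make the optimization-free option concrete, the following is a minimal NumPy sketch of T3A-style prototype adjustment. It is an illustration under stated assumptions, not the benchmark's implementation: the class name `T3ASketch`, the parameter `max_per_class`, and the bias-free linear head are all hypothetical choices made here. The idea is that the frozen head's weight rows seed one support set per class, each incoming test feature is added to the support set of its pseudo-label, only the lowest-entropy supports are kept, and predictions come from similarity to the resulting class prototypes. No gradients are computed.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

class T3ASketch:
    """Simplified T3A-style adapter (assumes a bias-free linear head)."""

    def __init__(self, head_weights, max_per_class=20):
        # head_weights: (num_classes, feature_dim), frozen after fine-tuning.
        self.head = np.asarray(head_weights, dtype=float)
        self.num_classes = self.head.shape[0]
        self.max_per_class = max_per_class
        # Seed each class's support set with its own (normalized) weight row.
        self.supports = l2_normalize(self.head)
        self.labels = np.arange(self.num_classes)
        self.entropies = entropy(softmax(self.head @ self.head.T))

    def predict(self, features):
        features = np.atleast_2d(np.asarray(features, dtype=float))
        # Pseudo-labels and confidence come from the original frozen head.
        probs = softmax(features @ self.head.T)
        self.supports = np.vstack([self.supports, l2_normalize(features)])
        self.labels = np.concatenate([self.labels, probs.argmax(axis=1)])
        self.entropies = np.concatenate([self.entropies, entropy(probs)])
        self._keep_most_confident()
        # Class prototype = normalized sum of that class's retained supports.
        prototypes = l2_normalize(np.stack([
            self.supports[self.labels == k].sum(axis=0)
            for k in range(self.num_classes)
        ]))
        return (features @ prototypes.T).argmax(axis=1)

    def _keep_most_confident(self):
        keep = []
        for k in range(self.num_classes):
            idx = np.flatnonzero(self.labels == k)
            ranked = idx[np.argsort(self.entropies[idx])]
            keep.extend(ranked[: self.max_per_class])
        keep = np.array(sorted(keep))
        self.supports = self.supports[keep]
        self.labels = self.labels[keep]
        self.entropies = self.entropies[keep]

# Toy usage: two classes, two incoming test features.
adapter = T3ASketch(np.eye(2))
predictions = adapter.predict(np.array([[2.0, 0.1], [0.1, 3.0]]))
print(predictions)
```

Calling `predict` on one window at a time corresponds to the online (streaming) regime, while passing a whole target set at once approximates the offline regime; in both cases the encoder and head weights stay untouched.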

Empirical Findings: TTA Effectiveness under Distribution Shifts

In-Distribution Adaptation

On datasets encompassed by pretraining (TUEV, TUAB), adaptation produced inconsistent gains. Gradient-based methods (Tent, SHOT) frequently degraded balanced accuracy, Cohen's $\kappa$, and weighted $F_1$ scores. T3A exhibited superior stability with moderate improvements, particularly on TUEV, indicating that optimization-free adaptation is less disruptive when source and target distributions are well aligned.

Figure 2: $\Delta_{\text{TTA}}$ balanced accuracy for TTA methods on in-distribution datasets—substantial degradation for gradient-based approaches; stable gains for T3A.
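The reported quantities can be illustrated with a short sketch. It assumes, as a reading of the notation rather than a definition taken from the paper, that $\Delta_{\text{TTA}}$ is the metric after adaptation minus the metric of the fine-tuned model without adaptation, expressed in percentage points. The function names and the toy labels below are hypothetical.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance; undefined when chance agreement is 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    observed = np.mean(y_true == y_pred)
    expected = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return float((observed - expected) / (1.0 - expected))

def delta_tta(metric, y_true, pred_before, pred_after):
    """Assumed definition: adapted minus unadapted metric, in percentage points."""
    return 100.0 * (metric(y_true, pred_after) - metric(y_true, pred_before))

# Toy imbalanced example: the unadapted model predicts only the majority class.
y_true = [0, 0, 0, 0, 1, 1]
pred_before = [0, 0, 0, 0, 0, 0]
pred_after = [0, 0, 0, 1, 1, 1]

print(delta_tta(balanced_accuracy, y_true, pred_before, pred_after))
print(cohen_kappa(y_true, pred_after))
```

Balanced accuracy is a natural headline metric for tasks such as seizure detection, where a model that always predicts the majority class can score high plain accuracy while missing every positive case.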

Out-of-Distribution and Task Shift

On datasets diverging from pretraining (CHB-MIT: seizure detection; SleepEDF-78: sleep staging), the picture changed. T3A delivered positive gains in balanced accuracy (a mean $\Delta$ of +18.9 percentage points for REVE-Base on CHB-MIT) and improved class-wise calibration. Conversely, gradient-based TTA was susceptible to negative transfer, especially under severe deviations in task and acquisition protocol. TFM-Tokenizer, a discrete tokenization model, was more robust to adaptation-induced degradation, particularly under SHOT.

Figure 3: $\Delta_{\text{TTA}}$ for TTA methods under out-of-distribution shift; T3A delivers reliable improvements where other methods degrade model performance.

Extreme Modality Shift: Ear-EEG

Under cross-modality generalization (Ear-EEG), the instability of gradient-based TTA was especially pronounced. T3A maintained modest improvements for CBraMod and REVE, reinforcing the practical advantage of prototype-based adaptation in wearable EEG scenarios.

Batch Size Ablation

Increasing adaptation batch size did not universally improve performance. Gradient-based methods modestly benefitted from larger batches, but T3A remained insensitive to batch size, validating its efficiency in streaming and low-resource settings.

Figure 4: Balanced accuracy improvements across adaptation batch sizes—no consistent positive effect from batch size scaling, especially for optimization-free T3A.
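One way to see where batch size enters a gradient-based update is a Tent-style entropy-minimization step, sketched below in NumPy. This is a deliberately reduced illustration, not Tent as published or as configured in the benchmark: real Tent updates the affine parameters of normalization layers inside the network, whereas this sketch assumes a single per-feature `scale` and `shift` applied to precomputed features in front of a frozen, bias-free linear head. All names, shapes, and the learning rate are hypothetical. The gradient is averaged over the adaptation batch, so every prediction in the batch, confident or not, moves the shared parameters.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_entropy(features, head, scale, shift):
    """Average prediction entropy over the batch (the quantity being minimized)."""
    probs = softmax((scale * features + shift) @ head.T)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

def entropy_step(features, head, scale, shift, lr=0.01):
    """One gradient-descent step on mean entropy w.r.t. scale and shift only."""
    hidden = scale * features + shift
    probs = softmax(hidden @ head.T)
    log_probs = np.log(probs + 1e-12)
    ent = -(probs * log_probs).sum(axis=1, keepdims=True)
    # d(entropy)/d(logit_j) = -p_j * (log p_j + entropy); divide by N for the mean.
    grad_logits = -probs * (log_probs + ent) / len(features)
    grad_hidden = grad_logits @ head  # backpropagate through the frozen head
    new_scale = scale - lr * (grad_hidden * features).sum(axis=0)
    new_shift = shift - lr * grad_hidden.sum(axis=0)
    return new_scale, new_shift

# Toy setup: 3 classes, 4-dimensional features, one unlabeled batch of 16.
rng = np.random.default_rng(0)
head = rng.normal(size=(3, 4))
batch = rng.normal(size=(16, 4))
scale, shift = np.ones(4), np.zeros(4)

before = mean_entropy(batch, head, scale, shift)
for _ in range(5):
    scale, shift = entropy_step(batch, head, scale, shift)
after = mean_entropy(batch, head, scale, shift)
print(before, after)
```

Lower entropy only means more confident predictions, not more correct ones, which is consistent with the degradation reported for gradient-based methods. The prototype-based alternative has no shared parameters driven by a batch-averaged loss, which fits its reported insensitivity to batch size.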

Analysis: Adaptation Strategies, Model Representations, and Clinical Implications

Optimization-free TTA (T3A) outperformed its gradient-based counterparts in stability and reliability across all shift regimes. The results show that aggressive gradient-based updates (Tent, SHOT) often perturb pretrained feature representations and degrade downstream clinical metrics. The representation scheme itself, continuous versus discrete tokenization, directly influences adaptation sensitivity, with discrete tokenization models (TFM-Tokenizer) showing lower susceptibility to performance drops.

Comparisons between online and offline TTA reveal that the update strategy (prototype adjustment versus normalization-layer adjustment) matters more for stability than the adaptation mode (online versus offline). Overall, TTA efficacy depends on both the adaptation mechanism and the representational backbone, underscoring the need for EEG-specific TTA formulations.

Implications and Future Directions

The inconsistent and sometimes detrimental impact of standard TTA on EEG foundation models, especially for gradient-based approaches, underscores the need for novel adaptation strategies tuned to EEG's nonstationarity and to clinical deployment constraints. Prototype-based, optimization-free methods prioritize representation preservation and deliver practical advantages in patient-disjoint, privacy-constrained clinical workflows. The observed dependence on representation type highlights the importance of designing TTA methods jointly with the backbone paradigm, such as tokenization-based EEG models.

Practically, robust, reproducible benchmarking protocols like NeuroAdapt-Bench are fundamental for revealing failure modes and guiding the development of reliable healthcare AI infrastructure. The scalability of T3A to streaming modalities is particularly relevant for real-time bedside monitoring and wearables.

Theoretically, future research should explore hybrid adaptation mechanisms, meta-learning strategies for rapid cross-domain calibration, and integration of domain knowledge to enhance distribution shift resilience. Addressing computational efficiency and adaptation bandwidth limitations for larger foundation models remains a critical challenge.

Conclusion

NeuroAdapt-Bench provides a comprehensive evaluation platform for TTA methods in EEG foundation models under realistic clinical distribution shifts. The empirical results show that standard TTA techniques imported from vision and speech yield unreliable improvements for EEG and can even degrade performance under both mild and severe shifts. Optimization-free approaches were consistently more stable than gradient-based TTA, suggesting that stability-centric methods are essential for EEG deployment. The study recommends prioritizing robust, minimal-update adaptation strategies, domain-specific benchmarking, and alignment between representation type and adaptation mechanism for future AI development in healthcare EEG analysis.
