Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Published 12 Dec 2025 in cs.LG, cs.CV, and q-bio.NC | (2512.11582v1)

Abstract: The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

Summary

  • The paper introduces an innovative semantic tokenizer that aggregates ROI signals into network tokens, enhancing signal-to-noise ratio and reducing domain shifts.
  • The methodology leverages a self-distillation framework with slice-masking and teacher-guided temporal regularization to align temporally distinct fMRI views.
  • Experimental results demonstrate superior linear probe accuracy on OOD clinical and cognitive tasks, validating the model's generalizability for phenotype prediction.

Brain-Semantoks: Self-Distilled Semantic Tokenization of Brain fMRI Dynamics

Motivation and Challenges in fMRI Foundation Modeling

The development of robust foundation models for fMRI time series is critical for advancing phenotype prediction in both cognitive and clinical neuroscience. Existing approaches predominantly use mask-and-reconstruct objectives on regional BOLD signals, adapting paradigms from NLP and vision. This focus on low-level representations impedes downstream generalization: the learned features remain vulnerable to noise, temporal fluctuations, and dataset/domain shift, and therefore require extensive fine-tuning. Heterogeneity in acquisition parameters and cohorts further exacerbates transfer challenges.

Methodological Innovations

Brain-Semantoks introduces a paradigm shift towards the direct learning of abstract, temporally stable, semantic representations of human brain activity. This is achieved via three central architectural and algorithmic contributions:

Semantic Tokenizer

The model departs from ROI-level tokenization, which yields long and noisy input sequences incompatible with efficient transformer-based modeling. Instead, a neuroscientifically anchored semantic tokenizer aggregates ROI signals into tokens representing canonical functional networks (e.g., Yeo networks, subcortical, cerebellar). This aggregation is performed within each network via multi-scale convolutional filter banks (Conv_std and Conv_str), enabling extraction of hierarchical temporal features over long-range patches. The output is a compressed, robust, and semantically meaningful sequence with substantially enhanced SNR and an inductive bias suited to brain dynamics.
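
A minimal sketch of how such a tokenizer could be realized in PyTorch; the network assignment, filter widths, and pooling scheme below are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Aggregate ROI time series into one token per functional network.

    Illustrative sketch: `network_assignment` is a LongTensor mapping each
    ROI to a network index (e.g., 7 Yeo networks + subcortex + cerebellum);
    filter widths and pooling are placeholder choices, not the paper's
    exact configuration.
    """

    def __init__(self, network_assignment, n_networks, d_model=192,
                 kernel_sizes=(3, 9, 27)):
        super().__init__()
        self.register_buffer("assignment", network_assignment)  # (n_rois,)
        self.n_networks = n_networks
        # Multi-scale temporal filter bank applied to each network's signal.
        self.filters = nn.ModuleList(
            nn.Conv1d(1, d_model // len(kernel_sizes), k, stride=k)
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (batch, n_rois, time) regional BOLD signals
        tokens = []
        for net in range(self.n_networks):
            # Average ROIs within the network: shared dynamics survive,
            # uncorrelated ROI noise partially cancels (higher SNR).
            sig = x[:, self.assignment == net, :].mean(dim=1, keepdim=True)
            # Each filter extracts features over long-range temporal patches;
            # mean-pool over time to get one compact token per network.
            feats = [f(sig).mean(dim=-1) for f in self.filters]
            tokens.append(torch.cat(feats, dim=-1))  # (batch, d_model)
        return torch.stack(tokens, dim=1)  # (batch, n_networks, d_model)
```

Averaging ROIs within a network before filtering is what boosts SNR: uncorrelated per-ROI noise partially cancels, while the shared network dynamics survive.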

Self-Distillation Objective with Curriculum

A self-distillation framework aligns the representations of temporally distinct yet phenotypically consistent views (temporal crops with mild corruptive augmentations) of the same fMRI scan. The teacher network, updated via an exponential moving average (EMA) of the student, provides stable targets. The loss is applied to the representations of both the global [CLS] token and the masked network tokens, and includes a coding-rate regularizer to prevent representational collapse. Critically, slice-masking, which masks entire functional or temporal slices in the student input, induces the learning of non-interpolative, high-level dependencies.
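
The sketch below illustrates this machinery in a DINO-style form, a plausible stand-in for the paper's exact objective; the momentum value, masking rate, and temperatures are assumed defaults, and the coding-rate regularizer is omitted for brevity:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights slowly track the student via an EMA
    (the momentum value is an assumed default)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def slice_mask(tokens, p=0.3):
    """Zero out entire network tokens in the student input so that the
    model must infer high-level structure rather than interpolate locally."""
    keep = torch.rand(tokens.shape[:2], device=tokens.device) > p
    return tokens * keep.unsqueeze(-1).float()

def distillation_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.05):
    """Cross-entropy between a sharpened teacher distribution and the
    student distribution over projection-head outputs (DINO-style; a
    stand-in for the paper's exact loss and coding-rate term)."""
    t = F.softmax(teacher_out / tau_t, dim=-1).detach()
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()
```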

Given the instability arising from low SNR in fMRI signals, an additional Teacher-guided Temporal Regularizer (TTR) is applied early in pretraining. This curriculum constrains the student to match temporally averaged network representations before the more complex temporal objectives are introduced. The regularizer then decays, ensuring stable, non-collapsed convergence of pretraining.
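
A sketch of how such a curriculum might be scheduled; the linear decay and MSE formulation are assumptions for illustration:

```python
import torch.nn.functional as F

def ttr_weight(step, warmup_steps=10_000):
    """Coefficient of the TTR term; linearly decays to zero
    (the paper's actual schedule may differ)."""
    return max(0.0, 1.0 - step / warmup_steps)

def ttr_loss(student_tokens, teacher_tokens_per_crop):
    """Pull student network tokens toward the teacher's temporally averaged
    representations: an easy, stable target that anchors early pretraining.

    teacher_tokens_per_crop: (n_crops, batch, n_networks, d)
    student_tokens:          (batch, n_networks, d)
    """
    target = teacher_tokens_per_crop.mean(dim=0).detach()
    return F.mse_loss(student_tokens, target)

# Combined objective during the curriculum phase (illustrative):
# loss = distillation_loss(...) + ttr_weight(step) * ttr_loss(...)
```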

Transformer Backbone

The semantic tokens are embedded and processed by a multi-layer transformer encoder with positional and network-type embeddings. A self-distillation projection head enables alignment of both masked-token and summary representations during pretraining, while downstream tasks use only the backbone embeddings.
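
An illustrative backbone in PyTorch; the dimensions, depth, and embedding scheme are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class BrainEncoder(nn.Module):
    """Transformer over semantic tokens with positional and network-type
    embeddings and a [CLS] summary token (all dimensions illustrative)."""

    def __init__(self, n_networks, d_model=256, depth=8, n_heads=8, max_len=64):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Embedding(max_len, d_model)     # temporal position
        self.net_emb = nn.Embedding(n_networks, d_model)  # functional network type
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens, pos_ids, net_ids):
        # tokens: (batch, seq, d_model); pos_ids, net_ids: (batch, seq)
        x = tokens + self.pos_emb(pos_ids) + self.net_emb(net_ids)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x)
        return x[:, 0], x[:, 1:]  # [CLS] summary, per-token embeddings
```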

Experimental Validation and Results

Datasets and Evaluation Protocol

Pretraining is performed on the largest available resting-state fMRI corpus (UK Biobank, n > 39,000 runs). Downstream generalization is assessed on strictly held-out data and on a range of OOD datasets covering demographic, clinical (ASD, MDD, schizophrenia), and cognitive phenotypes (e.g., HBN, ABIDE, LEMON, SRPBS). Task labels are binned for multi-class classification. Evaluation is performed in two modes: linear probing (frozen backbone) and full fine-tuning. All baselines use matched pretraining on UKB data and standardized evaluation pipelines.
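
The linear probing protocol reduces to fitting a linear classifier on frozen embeddings. A minimal sketch using scikit-learn, where `embed` stands in for a forward pass through the pretrained encoder returning the [CLS] representation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def linear_probe(embed, X_train, y_train, X_test, y_test):
    """Frozen-backbone evaluation: embed each scan once with the pretrained
    encoder, then fit a linear classifier on the fixed features."""
    Z_train = np.stack([embed(x) for x in X_train])
    Z_test = np.stack([embed(x) for x in X_test])
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    return balanced_accuracy_score(y_test, clf.predict(Z_test))
```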

Quantitative Performance

  • Linear Probe Accuracy: Brain-Semantoks achieves top performance on 8/9 tasks, with particularly marked improvements on out-of-distribution clinical datasets (ASD, MDD, schizophrenia). For ABIDE (ASD vs. control), balanced accuracy rises to 65.13%, versus 53.8% (BrainLM) and 52.9% (Brain-JEPA). For SRPBS schizophrenia, performance increases to 69.26%, from 57.6% for both baselines.
  • Comparison to Fully Supervised and Finetuned Models: Representations learned by Brain-Semantoks with only a frozen linear probe rival, and in many cases surpass, those of fully supervised or fine-tuned models. For UKB sex/age and cognitive prediction, linear probing consistently outperforms these alternatives, demonstrating the superior factorizability and abstraction of the learned representations.
  • Task-based fMRI (Hariri Emotion Task): Transfer to event-related fMRI yields strong results despite a temporal mismatch with pretraining (block classification accuracy up to 96.5%).
  • Scaling Laws: Systematic scaling analyses demonstrate power-law improvements in both in-domain and out-of-domain prediction accuracy with increasing pretraining set size. Gains do not saturate for OOD clinical/cognitive tasks, highlighting the criticality of data scaling for generalizable representation learning in neuroimaging (a sketch of such a power-law fit follows this list).
  • Ablation Studies: Removing the semantic tokenizer, TTR, or global distillation loss degrades accuracy and convergence. Tokenizer ablations confirm that functional network aggregation outperforms both finer (ROI-wise) and coarser spatial strategies.
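
A minimal sketch of fitting a saturating power law to accuracy-versus-data curves; the data points below are placeholders, not values from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating power law: accuracy approaches a ceiling `a` as the
# pretraining set size n grows.
def power_law(n, a, b, c):
    return a - b * np.power(n, -c)

n_runs = np.array([1_000, 5_000, 10_000, 20_000, 39_000], dtype=float)
acc = np.array([0.55, 0.59, 0.61, 0.63, 0.65])  # hypothetical accuracies

(a, b, c), _ = curve_fit(power_law, n_runs, acc, p0=(0.7, 1.0, 0.3))
print(f"fit: acc(n) = {a:.3f} - {b:.3f} * n^(-{c:.3f})")
```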

Interpretability and Robustness

Because slice-masking is applied during training, Brain-Semantoks supports in-distribution, network-specific occlusion analysis: masking all but one functional network identifies the main contributors to each phenotype, matching and sometimes challenging neuroscience priors (e.g., cerebellar dominance in MDD prediction). The learned network embedding structure aligns with established connectivity hierarchies. Transfer performance is robust to domain shifts in scanner, TR, and spatial resolution, unlike lower-level tokenization approaches.
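
A sketch of this occlusion procedure, reusing the hypothetical encoder and probe interfaces from the sketches above:

```python
import torch

@torch.no_grad()
def network_occlusion(encoder, probe, tokens, pos_ids, net_ids, n_networks):
    """Keep a single functional network visible at a time and record the
    probe's output, ranking networks by contribution to a phenotype.
    Signatures are illustrative, matching the earlier sketches."""
    scores = []
    for net in range(n_networks):
        visible = (net_ids == net).unsqueeze(-1).float()  # (batch, seq, 1)
        cls, _ = encoder(tokens * visible, pos_ids, net_ids)
        scores.append(probe(cls))  # e.g., a linear layer's phenotype logit
    return torch.stack(scores)  # (n_networks, batch, ...)
```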

Theoretical and Practical Implications

Brain-Semantoks embodies a key theoretical advance: fMRI foundation models should prioritize capturing stable, abstract, network-level signatures rather than reconstructing raw signals. This mirrors the finding in Transformer-based representation learning that semantic, low-noise tokens yield models better suited for zero-/few-shot generalization, especially under severe domain heterogeneity. By explicitly separating the learning of abstract network dynamics from ROI-specific noise, the model defines a new regime for brain dynamics modeling that supports robust transfer to diverse clinical, demographic, and cognitive prediction tasks.

Practically, the reduction in reliance on fine-tuning and domain adaptation, together with reduced hardware requirements (single GPU, <20GB), provides an accessible pathway to deploy foundation models across neuroimaging datasets with minimal additional annotation.

Future Directions

Potential extensions include:

  • Data-driven or task-conditional functional network aggregation, allowing dynamic token construction.
  • Joint pretraining on both resting-state and task-based fMRI, with explicit modeling of task-related state changes.
  • Integration with diffusion generative models to enable direct simulation or augmentation of brain dynamics.
  • Exploration of cross-modal SSL leveraging EEG/MEG or behavioral time series aligned with fMRI dynamics.
  • Application to individualized prediction across large-scale biobanks for precision psychiatry and neurology.

Conclusion

Brain-Semantoks defines a new standard for fMRI foundation models via abstract, network-level semantic tokenization and self-distillation. The approach yields significantly improved, highly generalizable representations, as evidenced by strong linear probing performance in both in-domain and challenging OOD settings. These results establish semantic SSL and domain-agnostic inductive bias as central components for the future of neuroimaging AI research.
