
Brain-Semantoks: Neural Tokenization in Cognition

Updated 16 December 2025
  • Brain-Semantoks are discrete, semantically meaningful tokens derived from neural embeddings, abstracting cognitive states into compressed, robust units.
  • They integrate subsymbolic neural dynamics with symbolic cognition via bidirectional mappings between high-dimensional representations and discrete token indices, as exemplified by the Tensor-Brain model.
  • Applications include enhanced brain decoding, neurosemantic mapping, and improved cross-modal classification, advancing both theoretical and practical neuroimaging research.

Brain-Semantoks are discrete, semantically meaningful tokens that serve as compressed, robust representations of neural and cognitive states, grounding symbols in high-dimensional neural embeddings. The Brain-Semantok paradigm unifies computational neuroscience, neurosemantic modeling, and neural foundation models by proposing that brains—and by extension artificial neural systems—encode and manipulate meaning via token-like abstractions interfacing subsymbolic dynamics and symbolic cognition. This concept has been instantiated in both theoretical and practical modeling frameworks, notably the Tensor-Brain model of perception and memory, recent self-distilled foundation models for brain dynamics, and systems that extract shared semantic tokens from fMRI using advanced machine learning.

1. Formal and Computational Foundations

At its core, the Brain-Semantok framework is rooted in the Tensor-Brain (TB) model, which posits a bidirectional architecture with two key layers (Tresp et al., 2024, Tresp et al., 2020):

  • Representation layer: a high-dimensional state vector $r(t) \in \mathbb{R}^n$ (with ensemble firing rates $\gamma_i(t)$) acts as the subsymbolic "global workspace," integrating sensory inputs and context via a recurrent evolution network.
  • Index layer: a vector $I(t) \in \{0,1\}^m$ or $\mathbb{R}^m$ encodes discrete symbolic tokens corresponding to semantic entities, predicates, and time-points.

Mappings between these layers define two central operations:

  • Bottom-up encoding: $r(t)$ is mapped probabilistically to an index $k$ (i.e., a semantok) via a softmax over the inner product of $r(t)$ with concept embeddings $a_k$: $P(Y = k \mid r(t)) = \mathrm{softmax}_k\left(a_{0k} + a_k^{\top} r(t)\right)$.
  • Top-down embodiment: activation of semantok $k$ in the index layer projects its embedding $a_k \in \mathbb{R}^n$ back to the representation layer, reinstantiating a full subsymbolic pattern.

Each semantok's embedding, its "DNA", is dynamically updated to distill a stable signature from ongoing perceptual and cognitive activity via the update rule $E_c \leftarrow \alpha E_c + (1-\alpha)\, r_s$.
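These two mappings and the embedding update can be sketched in a few lines of NumPy. This is a toy illustration on random values; the dimensions, the linear encoder, and all variable names are hypothetical, not taken from the TB implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 5                      # representation dim, number of semantoks
A = rng.normal(size=(m, n))       # concept embeddings a_k as rows (random toy values)
a0 = np.zeros(m)                  # biases a_{0k}

def bottom_up(r):
    """Map a representation-layer state r(t) to P(Y = k | r(t)) via softmax."""
    logits = a0 + A @ r
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def top_down(k):
    """Project semantok k's embedding a_k back into the representation layer."""
    return A[k]

def update_embedding(E_c, r_s, alpha=0.9):
    """EMA update distilling a stable signature: E_c <- alpha*E_c + (1-alpha)*r_s."""
    return alpha * E_c + (1 - alpha) * r_s

r = rng.normal(size=n)            # a subsymbolic state in the global workspace
p = bottom_up(r)                  # probability over semantok indices
k = int(np.argmax(p))             # winning semantok
r_reinstated = top_down(k)        # reinstated subsymbolic pattern
A[k] = update_embedding(A[k], r)  # refine that token's "DNA"
```

The EMA-style update mirrors the rule quoted above: each token's embedding drifts slowly toward the states that activate it.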

    2. Neural and Functional Interpretations

    Brain-Semantoks formalize the hypothesis that higher-level cognition operates by tokenizing distributed, noisy neural dynamics into compact representative units:

    • Symbol Grounding: Embeddings mediate between distributed sensory statistics and discrete labels, guaranteeing that symbolic processing remains anchored to measurable neural activity.
    • Categorical Perception: The loop of bottom-up sampling and top-down feedback implements robust decision boundaries in embedding space, allowing categorical distinctions (e.g., "dog" vs. "puma") to be sharpened and contextually modulated.
    • Memory Retrieval: Serial activation of semantoks and propagation of their associated embeddings reconstructs semantic and episodic memory traces (e.g., activating "dog" reinstates associated features like "black," "happy") (Tresp et al., 2024).

    This tokenization principle is evident both in models of explicit scene and language comprehension (e.g., subject-predicate-object decomposition (Tresp et al., 2020)) and in data-driven models where shared, low-dimensional semantic embeddings are extracted from distributed neural activity (Raposo et al., 2019, Efird et al., 2024).
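The serial memory retrieval described above can be illustrated as iterated symbolic sampling: activate a token, reinstate its embedding, and sample the next token by similarity. The following NumPy sketch uses random embeddings and invented dimensions; it is a conceptual toy, not the TB model's actual dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 32, 6
A = rng.normal(size=(m, n))             # semantok embeddings (random placeholders)

def step(k, temperature=1.0):
    """One retrieval step: reinstate a_k top-down, then resample a token."""
    r = A[k]                            # top-down embodiment of the current token
    logits = (A @ r) / temperature      # similarity of r to every embedding
    logits[k] = -np.inf                 # move to an associated token, not itself
    mx = np.max(logits[np.isfinite(logits)])
    p = np.exp(logits - mx)             # exp(-inf) -> 0 excludes the current token
    p /= p.sum()
    return int(rng.choice(m, p=p))

trace = [0]                             # start at, say, the "dog" semantok
for _ in range(4):
    trace.append(step(trace[-1]))       # serial unfolding of a retrieval episode
```

Each transition in `trace` stands in for reinstating one token's features and letting them cue the next token, the mechanism the TB papers describe for episodic recall.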

    3. Methods for Extracting and Modeling Brain-Semantoks

    Brain-Semantoks are operationalized via several methodologies, spanning theoretical, unsupervised, and supervised frameworks:

    a. Embedding and Clustering Paradigms

    • Contrastive multimodal models: A linear map from fMRI patterns into CLIP-like semantic spaces is trained with an InfoNCE loss; clustering the decoder weights (e.g., with an adapted DBSCAN) reveals shared decodable concepts (SDCs) as robust semantoks. Each SDC centroid then serves as a discrete semantic token that can be mapped back to anatomical ROIs (Efird et al., 2024).
    • Low-dimensional canonical embedding: Joint GCCA on multi-subject fMRI aligns idiosyncratic neural spaces into a shared low-dimensional, latent semantic space (C ≪ number of voxels), enabling cross-modal and cross-subject decoding (Raposo et al., 2019).
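As a rough illustration of the contrastive approach, the sketch below shows how an InfoNCE objective scores a linear fMRI-to-semantic-space map against CLIP-like targets. All data are random, the dimensions are invented, and no optimization is performed; a real pipeline would train W and then cluster its rows:

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, B = 200, 16, 8                 # voxels, semantic-space dim, batch size (toy)
W = rng.normal(size=(D, V)) * 0.05   # linear decoder: fMRI pattern -> semantic space
X = rng.normal(size=(B, V))          # fMRI patterns (synthetic)
Y = rng.normal(size=(B, D))          # matching CLIP-like embeddings (synthetic)

def info_nce(X, Y, W, tau=0.1):
    """One-directional InfoNCE: each decoded pattern must match its own target."""
    Z = X @ W.T
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = (Z @ Yn.T) / tau                          # B x B similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                  # -log P(correct pairing)

loss = info_nce(X, Y, W)
```

Minimizing this loss pulls each fMRI pattern toward its paired semantic embedding while pushing it away from the other items in the batch.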

    b. Self-supervised Foundation Models

    • Semantic tokenizers for dynamics: The Brain-Semantoks foundation model for fMRI time series employs a hierarchical tokenizer, aggregating ROI time series into N functional network tokens by convolutional feature extraction, temporal partitioning, and masking. A self-distillation curriculum (student–teacher EMA, masked token prediction, coding-rate regularizer) stabilizes the learned tokens. These provide robust, task-agnostic embeddings for downstream phenotype and cognition classification, achieving improved out-of-distribution generalization (Gijsen et al., 12 Dec 2025).
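A minimal sketch of the self-distillation idea, assuming a linear stand-in for the encoder: the teacher (an EMA copy of the student) sees the full token sequence, while the student sees a masked version and is scored on the masked positions. Dimensions, masking ratio, and the EMA rate are illustrative only, not the published configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 20, 8                              # token sequence length, token dim
student = rng.normal(size=(d, d)) * 0.1   # toy linear stand-in for the encoder
teacher = student.copy()                  # teacher starts as a copy of the student

def encode(W, tokens):
    """Stand-in for a transformer encoder: a single linear map."""
    return tokens @ W

tokens = rng.normal(size=(T, d))          # functional-network tokens (synthetic)
mask = np.zeros(T, dtype=bool)            # mask a fixed 30% of positions
mask[rng.choice(T, size=6, replace=False)] = True
masked = tokens.copy()
masked[mask] = 0.0

target = encode(teacher, tokens)          # teacher sees the full sequence
pred = encode(student, masked)            # student sees the masked sequence
loss = np.mean((pred[mask] - target[mask]) ** 2)  # masked-token prediction loss

# After a gradient step on the student (omitted here), the teacher tracks it by EMA:
ema = 0.99
teacher = ema * teacher + (1 - ema) * student
```

The coding-rate regularizer mentioned above is omitted; the point of the sketch is only the student–teacher EMA structure and the masked prediction target.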

    c. Neural Decoding and Brain-Language Alignment

    • Tensor-based symbol-subsymbolic mapping: The TB and related models explicitly tie semantoks and their embeddings to the dynamics of perception, memory, and symbolic reasoning, supporting both bottom-up (percept → symbol) and top-down (symbol → pattern) inference (Tresp et al., 2024, Tresp et al., 2020).
    • Empirical contrastive decoding: In fMRI/image paradigms, decoding accuracy of semantic labels or CLIP-space clusters greatly exceeds chance (e.g., 8.5% top-5 in 413-way fMRI-image retrieval), and the resultant SDCs correspond to category-selective brain regions such as FFA (face), EBA (body), V4 (color), and IPS (numerosity) (Efird et al., 2024).
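Top-k retrieval accuracy of the kind reported above can be computed as follows. The gallery and queries here are synthetic stand-ins for CLIP image embeddings and decoded fMRI queries:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 413, 16                       # gallery size (as in the 413-way setup), dim
gallery = rng.normal(size=(N, D))    # candidate image embeddings (synthetic)
queries = gallery + 0.5 * rng.normal(size=(N, D))  # noisy decoded queries

def topk_accuracy(queries, gallery, k=5):
    """Fraction of queries whose true item ranks in the top k by cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                                   # N x N cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]          # k best gallery indices per query
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

acc = topk_accuracy(queries, gallery)
chance = 5 / N                       # top-5 chance level, roughly 1.2% for N = 413
```

Comparing `acc` against `chance` is the same above-chance test applied to the real decoding results cited above.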

    4. Neurobiological and Network Distribution

    Brain-Semantoks are not localized to a single anatomical site but reflect emergent, distributed patterns:

    • Distributed semantic kernel: Increasing the number of jointly modeled participants concentrates voxel importance scores onto a smaller subset of robust, semantically critical voxels, suggesting the existence of a semantic "kernel" amid peripheral noise (Raposo et al., 2019).
    • ROI and whole-brain analysis: SDCs and other semantoks reproducibly map to specific visual, associative, or semantic regions depending on category and conceptual dimension (e.g., face↔FFA, color↔V4) (Efird et al., 2024, Liu et al., 26 Feb 2025).
    • Functional network parcellation: In foundation models, grouping ROIs into functional networks (e.g., Yeo 7-networks plus subcortex/cerebellum) and tokenizing across time is critical for capturing domain-general brain dynamics robustly (Gijsen et al., 12 Dec 2025).
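Network-level tokenization can be illustrated as simple aggregation of ROI time series by network assignment. The assignment and data below are random placeholders; real pipelines use an atlas-based parcellation:

```python
import numpy as np

rng = np.random.default_rng(5)
R, T = 100, 50                        # ROIs, time points (toy sizes)
ts = rng.normal(size=(R, T))          # ROI time series (synthetic)
# Hypothetical assignment of each ROI to one of 9 networks
# (e.g., Yeo 7 networks plus subcortex and cerebellum):
assignment = rng.integers(0, 9, size=R)

def network_tokens(ts, assignment, n_networks=9):
    """Aggregate ROI time series into one token (mean series) per network."""
    return np.stack([
        ts[assignment == k].mean(axis=0) if np.any(assignment == k)
        else np.zeros(ts.shape[1])     # guard against an empty network
        for k in range(n_networks)
    ])

tokens = network_tokens(ts, assignment)   # one time series per functional network
```

In the foundation-model setting these network tokens, rather than raw ROIs, become the units that the tokenizer further partitions in time and masks.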

    Neurosemantic evidence from large-scale studies establishes that lexical, compositional, syntactic, and semantic representations, while partially separable, jointly engage temporo-parietal, frontal, and associative cortex in a distributed fashion (Caucheteux et al., 2021).

    5. Practical and Theoretical Implications

    Brain-Semantok models have catalyzed advances in several domains:

    a. Brain Decoding and Neurosemantic Mapping

    • Downstream tasks: Learned tokens serve as features for phenotype prediction, cognitive state decoding, cross-modal classification, and retrieval, consistently outperforming raw-voxel or region-based features (Gijsen et al., 12 Dec 2025, Raposo et al., 2019).
    • Interpretability: The semantok abstraction enables interpretable tracing from high-dimensional neural signals to concept-level representations, facilitating neuroanatomically resolved analyses (e.g., identification of shared vs. individual semantic structure (Efird et al., 2024)).

    b. Symbol Grounding and Embodied Semantics

    • Symbolic cognition: The bidirectionality of index and representation layers realizes a formal mechanism for symbol grounding and the embodiment of abstract semantic tokens in distributed neural substrates (Tresp et al., 2024).
    • Memory and episodic recall: Sequential semantok activation models the serial unfolding of scene recollection, fact retrieval, and hypothesis generation as iterated symbolic sampling constrained by learned embeddings (Tresp et al., 2020).

    c. Generalization, Scaling, and Transfer

    • Self-distillation and tokenization: Foundation models show that global architectural and tokenization scaling laws continue to deliver gains as dataset size and network complexity increase, with pretraining size positively correlated with transfer performance on novel clinical and cognitive benchmarks (Gijsen et al., 12 Dec 2025).
    • Semantic bottleneck: The strong reduction in necessary latent dimension (e.g., 9–11 latent dimensions suffice for music and language classification compared to thousands of voxels) suggests a form of neural semantic compression (Raposo et al., 2019).
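The bottleneck claim, that roughly ten latent dimensions can suffice, can be illustrated with a truncated SVD (PCA-style) compression followed by nearest-centroid classification on toy two-class data. The dimensions, noise level, and classifier are invented for the sketch, not taken from the cited study:

```python
import numpy as np

rng = np.random.default_rng(6)
n_trials, n_voxels, C = 60, 500, 10   # trials, voxels, latent dimension
labels = rng.integers(0, 2, size=n_trials)
# Toy fMRI: each class has a fixed voxel pattern plus per-trial noise
signal = rng.normal(size=(2, n_voxels))
X = signal[labels] + 0.8 * rng.normal(size=(n_trials, n_voxels))

# Compress to C latent dimensions with a truncated SVD (PCA-style bottleneck)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:C].T                     # n_trials x C latent codes

# Nearest-centroid classification in the compressed space
centroids = np.stack([Z[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
train_acc = (pred == labels).mean()
```

When the class-relevant signal lives in a low-dimensional subspace, as here, classification survives a 500-to-10 compression, which is the qualitative point of the semantic-bottleneck finding.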

    6. Limitations, Extensions, and Open Problems

    Current Brain-Semantok models face several challenges and opportunities:

    • Context sensitivity: While context-dependent modulation of semantic features can be captured by backpropagation-based adjustment of input vectors, shown to align with human judgments above chance, model–human agreement remains constrained by the inherent noisiness of the data and the limited reliability of human raters (Aguirre-Celis et al., 2020).
    • Cross-modal and linguistic generalization: Embeddings grounded in one modality (e.g., vision) have not yet fully realized direct cross-modal semantic tokenization at the granularity of natural language discourse. Integration with brain-tuned speech/LLMs and multi-modal contrastive codebooks remains an ongoing research trajectory (Moussa et al., 2024, Moussa et al., 4 Jun 2025).
    • Interpretability and compositionality: While high-level brain-semantic conceptual clusters (SDCs, functional networks) are evident, the transition from distributed tokens to high-level compositional logic and reasoning is not yet analytically resolved outside of toy categorical logic models (Heller, 2019).

    Continued scaling, improved alignment, and neurophysiologically motivated architectural advances are expected to further operationalize Brain-Semantoks as both a computational and biological construct for understanding and engineering semantic cognition.
