
Brain-Semantoks: Neural Tokenization in Cognition

Updated 16 December 2025
  • Brain-Semantoks are discrete, semantically meaningful tokens derived from neural embeddings, abstracting cognitive states into compressed, robust units.
  • They integrate subsymbolic neural dynamics with symbolic cognition via bidirectional mappings between high-dimensional representations and discrete token indices, as exemplified by the Tensor-Brain model.
  • Applications include enhanced brain decoding, neurosemantic mapping, and improved cross-modal classification, advancing both theoretical and practical neuroimaging research.

Brain-Semantoks are discrete, semantically meaningful tokens that serve as compressed, robust representations of neural and cognitive states, grounding symbols in high-dimensional neural embeddings. The Brain-Semantok paradigm unifies computational neuroscience, neurosemantic modeling, and neural foundation models by proposing that brains—and by extension artificial neural systems—encode and manipulate meaning via token-like abstractions interfacing subsymbolic dynamics and symbolic cognition. This concept has been instantiated in both theoretical and practical modeling frameworks, notably the Tensor-Brain model of perception and memory, recent self-distilled foundation models for brain dynamics, and systems that extract shared semantic tokens from fMRI using advanced machine learning.

1. Formal and Computational Foundations

At its core, the Brain-Semantok framework is rooted in the Tensor-Brain (TB) model, which posits a bidirectional architecture with two key layers (Tresp et al., 2024, Tresp et al., 2020):

  • Representation layer: a high-dimensional state vector $r(t) \in \mathbb{R}^n$ (with ensemble firing rates $\gamma_i(t)$) acts as the subsymbolic "global workspace," integrating sensory inputs and context via a recurrent evolution network.
  • Index layer: a vector $I(t) \in \{0,1\}^m$ or $\mathbb{R}^m$ encodes discrete symbolic tokens corresponding to semantic entities, predicates, and time-points.

Mappings between these layers define two central operations:

  • Bottom-up encoding: $r(t)$ is mapped probabilistically to an index $k$ (i.e., a semantok) via a softmax over the inner product of $r(t)$ with concept embeddings $a_k$: $P(Y = k \mid r(t)) = \mathrm{softmax}_k\left(a_{0k} + a_k^{\top} r(t)\right)$.
  • Top-down embodiment: activation of semantok $k$ in the index layer projects its embedding $a_k \in \mathbb{R}^n$ back to the representation layer, reinstantiating a full subsymbolic pattern.

Each semantok's embedding, its "DNA", is dynamically updated to distill a stable signature from ongoing perceptual and cognitive activity via the update rule $E_c \leftarrow \alpha E_c + (1-\alpha)\, r_s$.
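These two mappings and the embedding update can be sketched in a few lines of NumPy. This is a toy illustration on random values; the dimensions, the linear encoder, and all variable names are hypothetical, not taken from the TB implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 5                      # representation dim, number of semantoks
A = rng.normal(size=(m, n))       # concept embeddings a_k as rows (random toy values)
a0 = np.zeros(m)                  # biases a_{0k}

def bottom_up(r):
    """Map a representation-layer state r(t) to P(Y = k | r(t)) via softmax."""
    logits = a0 + A @ r
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def top_down(k):
    """Project semantok k's embedding a_k back into the representation layer."""
    return A[k]

def update_embedding(E_c, r_s, alpha=0.9):
    """EMA update distilling a stable signature: E_c <- alpha*E_c + (1-alpha)*r_s."""
    return alpha * E_c + (1 - alpha) * r_s

r = rng.normal(size=n)            # a subsymbolic state in the global workspace
p = bottom_up(r)                  # probability over semantok indices
k = int(np.argmax(p))             # winning semantok
r_reinstated = top_down(k)        # reinstated subsymbolic pattern
A[k] = update_embedding(A[k], r)  # refine that token's "DNA"
```

The EMA-style update mirrors the rule quoted above: each token's embedding drifts slowly toward the states that activate it.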

    2. Neural and Functional Interpretations

    Brain-Semantoks formalize the hypothesis that higher-level cognition operates by tokenizing distributed, noisy neural dynamics into compact representative units:

    • Symbol Grounding: Embeddings mediate between distributed sensory statistics and discrete labels, guaranteeing that symbolic processing remains anchored to measurable neural activity.
    • Categorical Perception: The loop of bottom-up sampling and top-down feedback implements robust decision boundaries in embedding space, allowing categorical distinctions (e.g., "dog" vs. "puma") to be sharpened and contextually modulated.
    • Memory Retrieval: Serial activation of semantoks and propagation of their associated embeddings reconstructs semantic and episodic memory traces (e.g., activating "dog" reinstates associated features like "black," "happy") (Tresp et al., 2024).

    This tokenization principle is evident both in models of explicit scene and language comprehension (e.g., subject-predicate-object decomposition (Tresp et al., 2020)) and in data-driven models where shared, low-dimensional semantic embeddings are extracted from distributed neural activity (Raposo et al., 2019, Efird et al., 2024).
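The serial memory retrieval described above can be illustrated as iterated symbolic sampling: activate a token, reinstate its embedding, and sample the next token by similarity. The following NumPy sketch uses random embeddings and invented dimensions; it is a conceptual toy, not the TB model's actual dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 32, 6
A = rng.normal(size=(m, n))             # semantok embeddings (random placeholders)

def step(k, temperature=1.0):
    """One retrieval step: reinstate a_k top-down, then resample a token."""
    r = A[k]                            # top-down embodiment of the current token
    logits = (A @ r) / temperature      # similarity of r to every embedding
    logits[k] = -np.inf                 # move to an associated token, not itself
    mx = np.max(logits[np.isfinite(logits)])
    p = np.exp(logits - mx)             # exp(-inf) -> 0 excludes the current token
    p /= p.sum()
    return int(rng.choice(m, p=p))

trace = [0]                             # start at, say, the "dog" semantok
for _ in range(4):
    trace.append(step(trace[-1]))       # serial unfolding of a retrieval episode
```

Each transition in `trace` stands in for reinstating one token's features and letting them cue the next token, the mechanism the TB papers describe for episodic recall.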

    3. Methods for Extracting and Modeling Brain-Semantoks

    Brain-Semantoks are operationalized via several methodologies, spanning theoretical, unsupervised, and supervised frameworks:

    a. Embedding and Clustering Paradigms

    • Contrastive multimodal models: A linear map from fMRI patterns into CLIP-like semantic spaces is trained with an InfoNCE loss; clustering the decoder weights (e.g., with an adapted DBSCAN) reveals shared decodable concepts (SDCs) as robust semantoks. Each SDC centroid then serves as a discrete semantic token that can be mapped back to anatomical ROIs (Efird et al., 2024).
    • Low-dimensional canonical embedding: Joint GCCA on multi-subject fMRI aligns idiosyncratic neural spaces into a shared low-dimensional, latent semantic space (C ≪ number of voxels), enabling cross-modal and cross-subject decoding (Raposo et al., 2019).
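As a rough illustration of the contrastive approach, the sketch below shows how an InfoNCE objective scores a linear fMRI-to-semantic-space map against CLIP-like targets. All data are random, the dimensions are invented, and no optimization is performed; a real pipeline would train W and then cluster its rows:

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, B = 200, 16, 8                 # voxels, semantic-space dim, batch size (toy)
W = rng.normal(size=(D, V)) * 0.05   # linear decoder: fMRI pattern -> semantic space
X = rng.normal(size=(B, V))          # fMRI patterns (synthetic)
Y = rng.normal(size=(B, D))          # matching CLIP-like embeddings (synthetic)

def info_nce(X, Y, W, tau=0.1):
    """One-directional InfoNCE: each decoded pattern must match its own target."""
    Z = X @ W.T
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = (Z @ Yn.T) / tau                          # B x B similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                  # -log P(correct pairing)

loss = info_nce(X, Y, W)
```

Minimizing this loss pulls each fMRI pattern toward its paired semantic embedding while pushing it away from the other items in the batch.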

    b. Self-supervised Foundation Models

    • Semantic tokenizers for dynamics: The Brain-Semantoks foundation model for fMRI time series employs a hierarchical tokenizer, aggregating ROI time series into N functional network tokens by convolutional feature extraction, temporal partitioning, and masking. A self-distillation curriculum (student–teacher EMA, masked token prediction, coding-rate regularizer) stabilizes the learned tokens. These provide robust, task-agnostic embeddings for downstream phenotype and cognition classification, achieving improved out-of-distribution generalization (Gijsen et al., 12 Dec 2025).
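A minimal sketch of the self-distillation idea, assuming a linear stand-in for the encoder: the teacher (an EMA copy of the student) sees the full token sequence, while the student sees a masked version and is scored on the masked positions. Dimensions, masking ratio, and the EMA rate are illustrative only, not the published configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 20, 8                              # token sequence length, token dim
student = rng.normal(size=(d, d)) * 0.1   # toy linear stand-in for the encoder
teacher = student.copy()                  # teacher starts as a copy of the student

def encode(W, tokens):
    """Stand-in for a transformer encoder: a single linear map."""
    return tokens @ W

tokens = rng.normal(size=(T, d))          # functional-network tokens (synthetic)
mask = np.zeros(T, dtype=bool)            # mask a fixed 30% of positions
mask[rng.choice(T, size=6, replace=False)] = True
masked = tokens.copy()
masked[mask] = 0.0

target = encode(teacher, tokens)          # teacher sees the full sequence
pred = encode(student, masked)            # student sees the masked sequence
loss = np.mean((pred[mask] - target[mask]) ** 2)  # masked-token prediction loss

# After a gradient step on the student (omitted here), the teacher tracks it by EMA:
ema = 0.99
teacher = ema * teacher + (1 - ema) * student
```

The coding-rate regularizer mentioned above is omitted; the point of the sketch is only the student–teacher EMA structure and the masked prediction target.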

    c. Neural Decoding and Brain-Language Alignment

    • Tensor-based symbol-subsymbolic mapping: The TB and related models explicitly tie semantoks and their embeddings to the dynamics of perception, memory, and symbolic reasoning, supporting both bottom-up (percept → symbol) and top-down (symbol → pattern) inference (Tresp et al., 2024, Tresp et al., 2020).
    • Empirical contrastive decoding: In fMRI/image paradigms, decoding accuracy of semantic labels or CLIP-space clusters greatly exceeds chance (e.g., 8.5% top-5 in 413-way fMRI-image retrieval), and the resultant SDCs correspond to category-selective brain regions such as FFA (face), EBA (body), V4 (color), and IPS (numerosity) (Efird et al., 2024).
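Top-k retrieval accuracy of the kind reported above can be computed as follows. The gallery and queries here are synthetic stand-ins for CLIP image embeddings and decoded fMRI queries:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 413, 16                       # gallery size (as in the 413-way setup), dim
gallery = rng.normal(size=(N, D))    # candidate image embeddings (synthetic)
queries = gallery + 0.5 * rng.normal(size=(N, D))  # noisy decoded queries

def topk_accuracy(queries, gallery, k=5):
    """Fraction of queries whose true item ranks in the top k by cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                                   # N x N cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]          # k best gallery indices per query
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

acc = topk_accuracy(queries, gallery)
chance = 5 / N                       # top-5 chance level, roughly 1.2% for N = 413
```

Comparing `acc` against `chance` is the same above-chance test applied to the real decoding results cited above.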

    4. Neurobiological and Network Distribution

    Brain-Semantoks are not localized to a single anatomical site but reflect emergent, distributed patterns:

    • Distributed semantic kernel: Increasing the number of jointly modeled participants concentrates voxel importance scores onto a smaller subset of robust, semantically critical voxels, suggesting the existence of a semantic "kernel" amid peripheral noise (Raposo et al., 2019).
    • ROI and whole-brain analysis: SDCs and other semantoks reproducibly map to specific visual, associative, or semantic regions depending on category and conceptual dimension (e.g., face↔FFA, color↔V4) (Efird et al., 2024, Liu et al., 26 Feb 2025).
    • Functional network parcellation: In foundation models, grouping ROIs into functional networks (e.g., Yeo 7-networks plus subcortex/cerebellum) and tokenizing across time is critical for capturing domain-general brain dynamics robustly (Gijsen et al., 12 Dec 2025).
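Network-level tokenization can be illustrated as simple aggregation of ROI time series by network assignment. The assignment and data below are random placeholders; real pipelines use an atlas-based parcellation:

```python
import numpy as np

rng = np.random.default_rng(5)
R, T = 100, 50                        # ROIs, time points (toy sizes)
ts = rng.normal(size=(R, T))          # ROI time series (synthetic)
# Hypothetical assignment of each ROI to one of 9 networks
# (e.g., Yeo 7 networks plus subcortex and cerebellum):
assignment = rng.integers(0, 9, size=R)

def network_tokens(ts, assignment, n_networks=9):
    """Aggregate ROI time series into one token (mean series) per network."""
    return np.stack([
        ts[assignment == k].mean(axis=0) if np.any(assignment == k)
        else np.zeros(ts.shape[1])     # guard against an empty network
        for k in range(n_networks)
    ])

tokens = network_tokens(ts, assignment)   # one time series per functional network
```

In the foundation-model setting these network tokens, rather than raw ROIs, become the units that the tokenizer further partitions in time and masks.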

    Neurosemantic evidence from large-scale studies establishes that lexical, compositional, syntactic, and semantic representations, while partially separable, jointly engage temporo-parietal, frontal, and associative cortex in a distributed fashion (Caucheteux et al., 2021).

    5. Practical and Theoretical Implications

    Brain-Semantok models have catalyzed advances in several domains:

    a. Brain Decoding and Neurosemantic Mapping

    • Downstream tasks: Learned tokens serve as features for phenotype prediction, cognitive state decoding, cross-modal classification, and retrieval, consistently outperforming raw-voxel or region-based features (Gijsen et al., 12 Dec 2025, Raposo et al., 2019).
    • Interpretability: The semantok abstraction enables interpretable tracing from high-dimensional neural signals to concept-level representations, facilitating neuroanatomically resolved analyses (e.g., identification of shared vs. individual semantic structure (Efird et al., 2024)).

    b. Symbol Grounding and Embodied Semantics

    • Symbolic cognition: The bidirectionality of index and representation layers realizes a formal mechanism for symbol grounding and the embodiment of abstract semantic tokens in distributed neural substrates (Tresp et al., 2024).
    • Memory and episodic recall: Sequential semantok activation models the serial unfolding of scene recollection, fact retrieval, and hypothesis generation as iterated symbolic sampling constrained by learned embeddings (Tresp et al., 2020).

    c. Generalization, Scaling, and Transfer

    • Self-distillation and tokenization: Foundation models show that global architectural and tokenization scaling laws continue to deliver gains as dataset size and network complexity increase, with pretraining size positively correlated with transfer performance on novel clinical and cognitive benchmarks (Gijsen et al., 12 Dec 2025).
    • Semantic bottleneck: The strong reduction in necessary latent dimension (e.g., 9–11 latent dimensions suffice for music and language classification compared to thousands of voxels) suggests a form of neural semantic compression (Raposo et al., 2019).
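The bottleneck claim, that roughly ten latent dimensions can suffice, can be illustrated with a truncated SVD (PCA-style) compression followed by nearest-centroid classification on toy two-class data. The dimensions, noise level, and classifier are invented for the sketch, not taken from the cited study:

```python
import numpy as np

rng = np.random.default_rng(6)
n_trials, n_voxels, C = 60, 500, 10   # trials, voxels, latent dimension
labels = rng.integers(0, 2, size=n_trials)
# Toy fMRI: each class has a fixed voxel pattern plus per-trial noise
signal = rng.normal(size=(2, n_voxels))
X = signal[labels] + 0.8 * rng.normal(size=(n_trials, n_voxels))

# Compress to C latent dimensions with a truncated SVD (PCA-style bottleneck)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:C].T                     # n_trials x C latent codes

# Nearest-centroid classification in the compressed space
centroids = np.stack([Z[labels == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
train_acc = (pred == labels).mean()
```

When the class-relevant signal lives in a low-dimensional subspace, as here, classification survives a 500-to-10 compression, which is the qualitative point of the semantic-bottleneck finding.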

    6. Limitations, Extensions, and Open Problems

    Current Brain-Semantok models face several challenges and opportunities:

    • Context sensitivity: While context-dependent modulation of semantic features can be captured by backpropagation-based adjustment of input vectors, shown to align with human judgments above chance, model–human agreement remains constrained by the inherent noisiness of the data and the limited reliability of human raters (Aguirre-Celis et al., 2020).
    • Cross-modal and linguistic generalization: Embeddings grounded in one modality (e.g., vision) have not yet fully realized direct cross-modal semantic tokenization at the granularity of natural language discourse. Integration with brain-tuned speech/LLMs and multi-modal contrastive codebooks remains an ongoing research trajectory (Moussa et al., 2024, Moussa et al., 4 Jun 2025).
    • Interpretability and compositionality: While high-level brain-semantic conceptual clusters (SDCs, functional networks) are evident, the transition from distributed tokens to high-level compositional logic and reasoning is not yet analytically resolved outside of toy categorical logic models (Heller, 2019).

    Continued scaling, improved alignment, and neurophysiologically motivated architectural advances are expected to further operationalize Brain-Semantoks as both a computational and biological construct for understanding and engineering semantic cognition.
