
sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment

Published 20 Apr 2025 in cs.CL, cs.LG, eess.SP, and q-bio.NC | (2504.14468v1)

Abstract: Interpreting neural activity through meaningful latent representations remains a complex and evolving challenge at the intersection of neuroscience and artificial intelligence. We investigate the potential of multimodal foundation models to align invasive brain recordings with natural language. We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model, enabling sentence-level retrieval directly from brain activity. SSENSE trains a neural encoder on spectral representations of sEEG using InfoNCE loss, without fine-tuning the text encoder. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset. Despite limited data, SSENSE achieves promising results, demonstrating that general-purpose language representations can serve as effective priors for neural decoding.

Summary

This paper presents a novel approach to brain-to-language mapping through a system called SSENSE (Subject-wise sEEG-based Encoding for Sentence Retrieval), which aims to align invasive brain recordings with natural language using a contrastive learning framework. The research explores the use of single-subject stereo-electroencephalography (sEEG) signals and projects them into the sentence embedding space of a pre-trained CLIP model—a foundation model initially designed for vision-language tasks. By leveraging InfoNCE loss for training a neural encoder on spectral representations of sEEG without fine-tuning the text encoder, SSENSE achieves promising results in sentence retrieval tasks.

The motivation behind this research is the prospect of decoding mental content from brain activity, a task that multimodal foundation models like CLIP and ALIGN have significantly advanced. However, extending these models to high-temporal-resolution neural signals such as sEEG remains largely unexplored. This paper addresses that gap by proposing a framework that grounds neural data in a semantic space shared with natural language, enabling zero-shot sentence retrieval directly from brain activity.

Methods and Approach

The framework employs a contrastive learning strategy where sEEG recordings are aligned with corresponding sentence embeddings from a CLIP text encoder. Important aspects of the methodological design include:

  • sEEG Preprocessing: The raw sEEG data is transformed into time-frequency representations using superlet transforms, followed by zero-padding for standardization.

  • Neural Encoder Architecture: A modified ResNet-18 is used to encode spectrograms of the sEEG data. The model architecture is adapted to process single-channel inputs and to output embedding vectors in the same dimensionality as the sentence embeddings from CLIP (512-dimensional).

  • Data Augmentation: Strategies such as time-frequency masking and electrode channel masking are implemented to enhance model robustness, addressing the variabilities and noise in neural recordings.

  • Training and Optimization: InfoNCE loss is used to align the sEEG and text representations. Training uses the Adam optimizer and applies early stopping based on validation performance.
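
The contrastive objective above can be sketched in plain Python. This is an illustrative implementation of symmetric InfoNCE over a batch of paired (sEEG, sentence) embeddings, not the authors' code; the temperature value and the choice to average both retrieval directions are assumptions for the sketch.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce_loss(brain_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    brain_embs[i] and text_embs[i] form the positive pair; every other
    combination in the batch serves as a negative. The temperature of 0.07
    is an illustrative assumption, not a value taken from the paper.
    """
    B = [l2_normalize(v) for v in brain_embs]
    T = [l2_normalize(v) for v in text_embs]
    n = len(B)
    # Cosine-similarity logits, scaled by temperature.
    logits = [[sum(bi * ti for bi, ti in zip(b, t)) / temperature for t in T]
              for b in B]

    def cross_entropy(row, target):
        # Numerically stable -log softmax(row)[target].
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Brain-to-text retrieval direction.
    loss_bt = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-brain direction (transpose of the logit matrix).
    loss_tb = sum(cross_entropy([logits[j][i] for j in range(n)], i)
                  for i in range(n)) / n
    return 0.5 * (loss_bt + loss_tb)
```

In the full framework, only the neural encoder producing `brain_embs` receives gradients; the CLIP text encoder producing `text_embs` stays frozen.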

Experimental Evaluation

Experiments were conducted using a single-subject dataset from the Brain Treebank project, specifically analyzing sEEG recordings aligned with transcripts from the movie "Ant-Man." The analysis focused on evaluating how well the SSENSE framework could retrieve sentences from the sEEG embeddings in a zero-shot fashion. The results demonstrated that SSENSE significantly surpasses the random baseline in sentence retrieval performance across varying data-augmentation conditions.

Results and Significance

The paper provides evidence that even without task-specific fine-tuning, general-purpose language models like CLIP can serve as effective priors for decoding linguistic content from neural signals. Notably, the no-masking variant of SSENSE performed best on Recall@1, Recall@10, and Mean Reciprocal Rank (MRR). In contrast, certain masking strategies, particularly electrode masking, degraded performance, indicating the importance of preserving spatial information.
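
As a minimal sketch of how these retrieval metrics are typically computed (hypothetical helper, not code from the paper): given a similarity matrix where row i holds a query's scores against all candidate sentences and the true match sits at index i, Recall@K and MRR follow from each true match's rank.

```python
def retrieval_metrics(similarity, ks=(1, 10)):
    """Recall@K and MRR from a (queries x candidates) similarity matrix.

    Assumes the correct candidate for query i is at column i, the usual
    convention for contrastive retrieval evaluation.
    """
    n = len(similarity)
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly above the true match.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(rank)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["MRR"] = sum(1.0 / r for r in ranks) / n
    return metrics
```

A random baseline over N candidate sentences gives an expected Recall@K of roughly K/N, which is the reference point the reported gains are measured against.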

Implications and Future Directions

The findings have wide-ranging implications for cognitive neuroscience, language understanding, and brain-computer interfacing. In particular, the ability to decode semantic language content from direct brain recordings opens avenues for assistive technologies and for deepening our understanding of brain-language dynamics. Future work may scale this approach to multi-subject datasets, integrate visual stimuli, and leverage more extensive language models to further improve the robustness and accuracy of decoding. Extending the framework to real-time settings and more diverse neural datasets remains an exciting frontier for further investigation.
