Decoding speech perception from non-invasive brain recordings (2208.12266v2)
Abstract: Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now start to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG) while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech, and (iii) a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk for brain surgery.
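The core idea described in the abstract is a contrastive (CLIP-like) objective that aligns the embedding of a 3-second M/EEG window with the embedding of the speech segment the participant heard, so that decoding reduces to retrieving the best-matching segment among many candidates. The sketch below illustrates that idea in PyTorch under stated assumptions: the module sizes, kernel widths, the participant-specific 1x1 convolution, the pooling scheme, and all names (`BrainEncoder`, `clip_loss`, `topk_retrieval_accuracy`) are illustrative choices, not the authors' implementation. In the paper, the speech side comes from pretrained self-supervised speech representations (wav2vec 2.0-style); here it is simply treated as a fixed embedding tensor.

```python
# Minimal, hypothetical sketch (not the authors' code): align M/EEG embeddings with
# pretrained speech embeddings via a symmetric contrastive loss, then decode by
# nearest-neighbour retrieval over candidate speech segments.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BrainEncoder(nn.Module):
    """Toy convolutional encoder shared across participants, with a
    participant-specific 1x1 convolution to absorb sensor-layout differences.
    All dimensions below are illustrative assumptions."""

    def __init__(self, n_sensors=208, n_subjects=175, dim=256):
        super().__init__()
        self.subject_layers = nn.ModuleList(
            [nn.Conv1d(n_sensors, n_sensors, kernel_size=1) for _ in range(n_subjects)]
        )
        self.conv = nn.Sequential(
            nn.Conv1d(n_sensors, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, meg, subject_ids):
        # meg: (batch, n_sensors, time); subject_ids: iterable of ints, one per sample.
        x = torch.stack(
            [self.subject_layers[int(s)](m.unsqueeze(0)).squeeze(0)
             for m, s in zip(meg, subject_ids)]
        )
        x = self.conv(x).mean(dim=-1)          # pool over the 3 s window
        return F.normalize(self.proj(x), dim=-1)


def clip_loss(brain_emb, speech_emb, temperature=0.1):
    """Symmetric InfoNCE: each brain segment should match its own speech segment
    among all other segments in the batch, and vice versa."""
    logits = brain_emb @ speech_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


@torch.no_grad()
def topk_retrieval_accuracy(brain_emb, speech_emb, k=1):
    """Segment identification: rank all candidate speech segments for each brain
    segment and check whether the true one is in the top k."""
    ranks = (brain_emb @ speech_emb.t()).argsort(dim=-1, descending=True)
    correct = torch.arange(len(ranks), device=ranks.device).unsqueeze(1)
    return (ranks[:, :k] == correct).any(dim=-1).float().mean().item()
```

At evaluation time, the candidate pool would contain every segment of the held-out data (more than 1,000 on average, per the abstract) rather than a single training batch, and the brain encoder would be trained jointly across all participants, with only the subject-specific layer differing between them.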