Decoding speech perception from non-invasive brain recordings (2208.12266v2)
Abstract: Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now start to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG) while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech, and (iii) a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decode language from brain activity, without putting patients at risk for brain surgery.
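The core idea described in the abstract is a contrastive (CLIP-like) objective that aligns the embedding of a 3-second M/EEG window with the embedding of the speech segment the participant heard, so that decoding reduces to retrieving the best-matching segment among many candidates. The sketch below illustrates that idea in PyTorch under stated assumptions: the module sizes, kernel widths, the participant-specific 1x1 convolution, the pooling scheme, and all names (`BrainEncoder`, `clip_loss`, `topk_retrieval_accuracy`) are illustrative choices, not the authors' implementation. In the paper, the speech side comes from pretrained self-supervised speech representations (wav2vec 2.0-style); here it is simply treated as a fixed embedding tensor.

```python
# Minimal, hypothetical sketch (not the authors' code): align M/EEG embeddings with
# pretrained speech embeddings via a symmetric contrastive loss, then decode by
# nearest-neighbour retrieval over candidate speech segments.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BrainEncoder(nn.Module):
    """Toy convolutional encoder shared across participants, with a
    participant-specific 1x1 convolution to absorb sensor-layout differences.
    All dimensions below are illustrative assumptions."""

    def __init__(self, n_sensors=208, n_subjects=175, dim=256):
        super().__init__()
        self.subject_layers = nn.ModuleList(
            [nn.Conv1d(n_sensors, n_sensors, kernel_size=1) for _ in range(n_subjects)]
        )
        self.conv = nn.Sequential(
            nn.Conv1d(n_sensors, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, meg, subject_ids):
        # meg: (batch, n_sensors, time); subject_ids: iterable of ints, one per sample.
        x = torch.stack(
            [self.subject_layers[int(s)](m.unsqueeze(0)).squeeze(0)
             for m, s in zip(meg, subject_ids)]
        )
        x = self.conv(x).mean(dim=-1)          # pool over the 3 s window
        return F.normalize(self.proj(x), dim=-1)


def clip_loss(brain_emb, speech_emb, temperature=0.1):
    """Symmetric InfoNCE: each brain segment should match its own speech segment
    among all other segments in the batch, and vice versa."""
    logits = brain_emb @ speech_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


@torch.no_grad()
def topk_retrieval_accuracy(brain_emb, speech_emb, k=1):
    """Segment identification: rank all candidate speech segments for each brain
    segment and check whether the true one is in the top k."""
    ranks = (brain_emb @ speech_emb.t()).argsort(dim=-1, descending=True)
    correct = torch.arange(len(ranks), device=ranks.device).unsqueeze(1)
    return (ranks[:, :k] == correct).any(dim=-1).float().mean().item()
```

At evaluation time, the candidate pool would contain every segment of the held-out data (more than 1,000 on average, per the abstract) rather than a single training batch, and the brain encoder would be trained jointly across all participants, with only the subject-specific layer differing between them.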