Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals (2405.11459v3)
Abstract: Invasive brain-computer interfaces with electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address these issues, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices contribute significantly to this performance, namely (i) temporal modeling based on region-level tokens, using 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG), and (ii) self-supervision through discrete codex-guided mask modeling. Overall, our approach, inspired by neuroscience findings and capitalizing on region-level representations from specific brain regions, is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.
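To make the two design choices above concrete, the following is a minimal sketch, assuming a PyTorch-style implementation, of (i) fusing the sEEG channels of one region (e.g., vSMC/STG contacts) into region-level temporal tokens with a 1D depthwise convolution followed by a pointwise projection, and (ii) a BEiT-style masked prediction objective in which masked tokens are trained to predict discrete code indices from a pre-trained vector-quantized codex. All module names, hyper-parameters, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (assumed shapes/hyper-parameters), not the Du-IN release.
import torch
import torch.nn as nn


class RegionTokenizer(nn.Module):
    """Fuse one region's sEEG channels into region-level temporal tokens."""

    def __init__(self, n_channels: int, d_model: int = 256, patch_len: int = 100):
        super().__init__()
        # Depthwise 1D convolution: one temporal filter per sEEG channel.
        self.depthwise = nn.Conv1d(
            n_channels, n_channels, kernel_size=9, padding=4, groups=n_channels
        )
        # Pointwise projection mixes the region's channels into a single embedding.
        self.pointwise = nn.Conv1d(n_channels, d_model, kernel_size=1)
        # Non-overlapping temporal patches form the region-level token sequence.
        self.patch = nn.AvgPool1d(kernel_size=patch_len, stride=patch_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, time) -> tokens: (batch, n_tokens, d_model)
        h = self.pointwise(self.depthwise(x))
        return self.patch(h).transpose(1, 2)


class MaskedCodePredictor(nn.Module):
    """Predict discrete codex indices (from a frozen VQ tokenizer) for masked tokens."""

    def __init__(self, d_model: int = 256, codex_size: int = 1024, n_layers: int = 4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codex_size)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, d_model); mask: (batch, n_tokens) boolean.
        tokens = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens
        )
        return self.head(self.encoder(tokens))  # logits over codex entries


# Toy forward pass: 32 contacts, 3 s of sEEG at 1 kHz, ~40% of tokens masked.
x = torch.randn(2, 32, 3000)
tokenizer, predictor = RegionTokenizer(n_channels=32), MaskedCodePredictor()
tokens = tokenizer(x)                                    # (2, 30, 256)
mask = torch.rand(tokens.shape[:2]) < 0.4
target_codes = torch.randint(0, 1024, tokens.shape[:2])  # stand-in for frozen VQ codes
logits = predictor(tokens, mask)
loss = nn.functional.cross_entropy(logits[mask], target_codes[mask])
```

In this sketch the depthwise stage filters each contact independently while the pointwise stage performs the cross-channel fusion into a single region embedding, and the masked-prediction loss is computed only on masked positions, mirroring codebook-guided mask modeling as popularized by BEiT-style pre-training.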